I.3.8.3: Non-convergence and Cycling

Coordinate descent may not converge at all, even if the function is differentiable.

There is a nice example, due to Powell \citep{powblk}. It is somewhat surprising that Powell does not indicate, in terms of Zangwill's convergence theory, what the source of the problem is. The reason seems to be that the mathematical programming community decided, at an early stage, that linearly convergent algorithms are not interesting and/or useful. Recent developments in statistical computing suggest that this is simply not true.

Powell's example involves three variables, and the function
$$f(x,y,z)=-(xy+yz+xz)+h(x)+h(y)+h(z),$$
where $h(t)=\left((|t|-1)_+\right)^2$ with $(u)_+=\max(u,0)$, and where $C$ is the cube $\{(x,y,z):\,|x|\le 1,\,|y|\le 1,\,|z|\le 1\}$, on which the three penalty terms vanish.

The derivatives are
$$\frac{\partial f}{\partial x}=-(y+z)+h'(x),\qquad \frac{\partial f}{\partial y}=-(x+z)+h'(y),\qquad \frac{\partial f}{\partial z}=-(x+y)+h'(z),$$
with $h'(t)=2\,\mathrm{sign}(t)\,(|t|-1)_+$. In the interior of the cube the penalty derivatives vanish, which means that the only stationary point in the interior is the saddle point at $(0,0,0)$. In general at a stationary point we have $h'(x)=y+z$, $h'(y)=x+z$, and $h'(z)=x+y$, and checking the possible sign patterns shows that we must have $x=y=z=0$. The only point where the derivatives vanish is thus the saddle point at the origin. Thus the only place where there can be minima over the cube is on the surface of the cube.
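For concreteness, here is a minimal Python sketch of this function and its gradient. The helper names `powell` and `powell_grad` are of course mine, not from Powell's paper; the last two lines just confirm numerically that the origin is a stationary point and a saddle.

```python
import numpy as np

def powell(v):
    """Powell's example function: -(xy + yz + zx) plus quadratic penalties outside the cube."""
    x, y, z = v
    penalty = sum(max(abs(t) - 1.0, 0.0) ** 2 for t in (x, y, z))
    return -(x * y + y * z + z * x) + penalty

def powell_grad(v):
    """Gradient: each partial is minus the sum of the other two coordinates, plus h'."""
    x, y, z = v
    def hp(t):
        return 2.0 * np.sign(t) * max(abs(t) - 1.0, 0.0)
    return np.array([-(y + z) + hp(x), -(x + z) + hp(y), -(x + y) + hp(z)])

# The origin is stationary but is a saddle point: f decreases along the diagonal
# x = y = z = t and increases along (t, -t, 0).
print(powell_grad((0.0, 0.0, 0.0)))                       # [0. 0. 0.]
print(powell((0.1, 0.1, 0.1)), powell((0.1, -0.1, 0.0)))  # roughly -0.03 and 0.01
```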

Also for $x=y=z=t$ with $t\ge 1$ we see that $f(t,t,t)=-3t^2+3(t-1)^2=3-6t$, which is unbounded below. For $x=y=z=t$ and $-1\le t\le 1$ we find $f(t,t,t)=-3t^2$. This has its minimum of $-3$ at $t=\pm 1$, and it has a root at $t=0$.
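The diagonal computation can be checked numerically with a one-line helper; `f_diag` below is just the closed form above, evaluated in Python.

```python
# f(t,t,t) = -3t^2 + 3*max(|t| - 1, 0)^2: equals -3t^2 on [-1, 1] and 3 - 6t for t >= 1.
f_diag = lambda t: -3 * t * t + 3 * max(abs(t) - 1.0, 0.0) ** 2
print([f_diag(t) for t in (0.0, 0.5, 1.0, 2.0, 10.0)])   # [0.0, -0.75, -3.0, -9.0, -57.0]
```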

Let us apply coordinate descent. A search along the $x$-axis finds the optimum at $x=1+\tfrac12(y+z)$ if $y+z>0$ and at $x=-1+\tfrac12(y+z)$ if $y+z<0$. If $y+z=0$ the minimizer is any point in $[-1,+1]$.

This guarantees that, after the update, the partial derivative with respect to $x$ is zero. The other updates are given by symmetry. Thus, if we start from $(-1-\varepsilon,\;1+\tfrac12\varepsilon,\;-1-\tfrac14\varepsilon)$, with $\varepsilon$ some small positive number, then we generate the following sequence:
$$\begin{aligned}
&(1+\tfrac18\varepsilon,\; 1+\tfrac12\varepsilon,\; -1-\tfrac14\varepsilon),\\
&(1+\tfrac18\varepsilon,\; -1-\tfrac1{16}\varepsilon,\; -1-\tfrac14\varepsilon),\\
&(1+\tfrac18\varepsilon,\; -1-\tfrac1{16}\varepsilon,\; 1+\tfrac1{32}\varepsilon),\\
&(-1-\tfrac1{64}\varepsilon,\; -1-\tfrac1{16}\varepsilon,\; 1+\tfrac1{32}\varepsilon),\\
&(-1-\tfrac1{64}\varepsilon,\; 1+\tfrac1{128}\varepsilon,\; 1+\tfrac1{32}\varepsilon),\\
&(-1-\tfrac1{64}\varepsilon,\; 1+\tfrac1{128}\varepsilon,\; -1-\tfrac1{256}\varepsilon).
\end{aligned}$$
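As a sketch, the few lines below regenerate this sequence by exact coordinate minimization. The helper `update` is simply the minimizer formula given above; rational arithmetic, and the particular value of $\varepsilon$, are used only so that the pattern comes out exactly.

```python
from fractions import Fraction

def update(s):
    """Exact minimizer along one coordinate, where s is the sum of the other two coordinates."""
    if s > 0:
        return 1 + s / 2
    if s < 0:
        return -1 + s / 2
    return 1  # any point in [-1, +1] minimizes; pick one arbitrarily

eps = Fraction(1, 100)
x, y, z = -1 - eps, 1 + eps / 2, -1 - eps / 4
for step in range(6):
    if step % 3 == 0:
        x = update(y + z)
    elif step % 3 == 1:
        y = update(x + z)
    else:
        z = update(x + y)
    print(step + 1, x, y, z)
# The sixth point is (-1 - eps/64, 1 + eps/128, -1 - eps/256).
```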

But the sixth point is of the same form as the starting point, with $\varepsilon$ replaced by $\varepsilon/64$. Thus the algorithm will cycle indefinitely, in the limit moving around six edges of the cube and accumulating at six of its eight vertices. At these vertices the gradient of the function is bounded away from zero: two of the partials are zero, and the third is $\pm 2$. The function value there is $+1$. The other two vertices of the cube, i.e. $(1,1,1)$ and $(-1,-1,-1)$, are the ones we are looking for, because there the function value $-3$ is the global minimum over the cube. At these two points all three partials are equal, namely $-2$ at $(1,1,1)$ and $+2$ at $(-1,-1,-1)$.
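To watch the cycling numerically, one can iterate full cycles of six coordinate updates and track the overshoot beyond the cube, the gradient, and the function value. The sketch below reuses the `powell`, `powell_grad`, and `update` helpers from the earlier snippets; the starting value $\varepsilon = 10^{-3}$ is arbitrary.

```python
import numpy as np

eps = 1e-3
point = np.array([-1 - eps, 1 + eps / 2, -1 - eps / 4])
for cycle in range(4):
    for i in (0, 1, 2, 0, 1, 2):      # six exact coordinate minimizations = one full cycle
        point[i] = update(point[(i + 1) % 3] + point[(i + 2) % 3])
    print(cycle, abs(point[0]) - 1.0, np.linalg.norm(powell_grad(point)), powell(point))
# The overshoot |x| - 1 shrinks by a factor of 64 per cycle, the gradient norm stays
# close to 2, and the function value decreases to +1, far above f(1,1,1) = f(-1,-1,-1) = -3.
```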

Powell gives some additional examples which show the same sort of cycling behaviour, but are somewhat smoother.