I.3.8.3: Non-convergence and Cycling
Coordinate descent may not converge at all, even if the function is differentiable.
There is a nice example, due to Powell \citep{powblk}. It is somewhat surprising that Powell does not use Zangwill's convergence theory to indicate the source of the problem. The reason seems to be that the mathematical programming community decided at an early stage that linearly convergent algorithms are not interesting and/or useful. Recent developments in statistical computing suggest that this is simply not true.
Powell's example involves three variables, and the function
\[
f(x,y,z)=-(xy+yz+xz)+(x-1)_+^2+(-x-1)_+^2+(y-1)_+^2+(-y-1)_+^2+(z-1)_+^2+(-z-1)_+^2,
\]
where $(u)_+:=\max(u,0)$ and where $K$ is the cube $\{(x,y,z)\mid -1\leq x,y,z\leq +1\}$.
The derivatives are
\begin{align*}
\frac{\partial f}{\partial x}&=-(y+z)+2(x-1)_+-2(-x-1)_+,\\
\frac{\partial f}{\partial y}&=-(x+z)+2(y-1)_+-2(-y-1)_+,\\
\frac{\partial f}{\partial z}&=-(x+y)+2(z-1)_+-2(-z-1)_+.
\end{align*}
In the interior of the cube the penalty terms vanish, so $\nabla f(x,y,z)=-(y+z,\;x+z,\;x+y)$, which means that the only stationary point in the interior is the saddle point at $(0,0,0)$. In general at a stationary point we have $y+z=2(x-1)_+-2(-x-1)_+$ together with its two cyclic permutations; subtracting any two of these equations shows that we must have $x=y=z$, and then only $x=y=z=0$ remains. The only point where the derivatives vanish is thus the saddle point at the origin. Consequently the only place where there can be minima of $f$ over the cube is on the surface of the cube.
Also, for $x=y=z=t$ we see that $f(t,t,t)=-3t^2+3\,((|t|-1)_+)^2$, which is unbounded below as $|t|\rightarrow\infty$. For $x=y=z=t$ and $-1\leq t\leq +1$ we find $f(t,t,t)=-3t^2$. This has its minimum of $-3$ at $t=\pm 1$, and it has a root at $t=0$.
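To make this concrete, here is a small numerical sketch in Python (assuming NumPy; the helper names \texttt{pos}, \texttt{f}, and \texttt{grad} are ours, not Powell's) that evaluates the gradient at the origin, the eigenvalues of the interior Hessian, and the values of $f$ along the diagonal $x=y=z=t$.
\begin{verbatim}
# Minimal sketch of Powell's function and its gradient (assumes NumPy).
import numpy as np

def pos(u):
    """Positive part (u)_+ = max(u, 0)."""
    return np.maximum(u, 0.0)

def f(x, y, z):
    return (-(x * y + y * z + x * z)
            + pos(x - 1) ** 2 + pos(-x - 1) ** 2
            + pos(y - 1) ** 2 + pos(-y - 1) ** 2
            + pos(z - 1) ** 2 + pos(-z - 1) ** 2)

def grad(x, y, z):
    return np.array([-(y + z) + 2 * pos(x - 1) - 2 * pos(-x - 1),
                     -(x + z) + 2 * pos(y - 1) - 2 * pos(-y - 1),
                     -(x + y) + 2 * pos(z - 1) - 2 * pos(-z - 1)])

print(grad(0.0, 0.0, 0.0))        # [0. 0. 0.]: the origin is stationary
H = np.array([[0., -1., -1.],     # Hessian of f in the interior of the cube
              [-1., 0., -1.],
              [-1., -1., 0.]])
print(np.linalg.eigvalsh(H))      # eigenvalues -2, 1, 1: a saddle point
for t in (1.0, 10.0, 100.0):
    print(t, f(t, t, t))          # f(t,t,t) = 3 - 6t for t >= 1: unbounded below
\end{verbatim}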
Let us apply coordinate descent. A search along the $x$-axis finds the optimum at $x=+1+\tfrac12(y+z)$ if $y+z>0$ and at $x=-1+\tfrac12(y+z)$ if $y+z<0$. If $y+z=0$ the minimizer is any point in $[-1,+1]$.
This guarantees that the partial derivative with respect to $x$ is zero at the new point. The other updates are given by symmetry. Thus, if we start from
\[
(x,y,z)=\left(-1-\epsilon,\;+1+\tfrac{\epsilon}{2},\;-1-\tfrac{\epsilon}{4}\right),
\]
with $\epsilon$ some small positive number, then we generate the following sequence:
\begin{align*}
&\left(+1+\tfrac{\epsilon}{8},\;+1+\tfrac{\epsilon}{2},\;-1-\tfrac{\epsilon}{4}\right),\\
&\left(+1+\tfrac{\epsilon}{8},\;-1-\tfrac{\epsilon}{16},\;-1-\tfrac{\epsilon}{4}\right),\\
&\left(+1+\tfrac{\epsilon}{8},\;-1-\tfrac{\epsilon}{16},\;+1+\tfrac{\epsilon}{32}\right),\\
&\left(-1-\tfrac{\epsilon}{64},\;-1-\tfrac{\epsilon}{16},\;+1+\tfrac{\epsilon}{32}\right),\\
&\left(-1-\tfrac{\epsilon}{64},\;+1+\tfrac{\epsilon}{128},\;+1+\tfrac{\epsilon}{32}\right),\\
&\left(-1-\tfrac{\epsilon}{64},\;+1+\tfrac{\epsilon}{128},\;-1-\tfrac{\epsilon}{256}\right).
\end{align*}
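The exact coordinate updates are simple enough to verify numerically. The following sketch (plain Python; the function name \texttt{update} is ours) carries out six successive single-coordinate minimizations from the starting point above and reproduces the sequence just listed.
\begin{verbatim}
# One coordinate update: minimize f over a single coordinate, the other two
# being fixed with sum s. The minimizer is 1 + s/2 if s > 0, -1 + s/2 if
# s < 0, and any point of [-1, 1] if s = 0 (we then keep the current value).
def update(s, current):
    if s > 0:
        return 1.0 + s / 2.0
    if s < 0:
        return -1.0 + s / 2.0
    return current

eps = 0.01
x, y, z = -1 - eps, 1 + eps / 2, -1 - eps / 4
for step in range(6):
    which = step % 3          # cycle through the coordinates x, y, z
    if which == 0:
        x = update(y + z, x)
    elif which == 1:
        y = update(x + z, y)
    else:
        z = update(x + y, z)
    print(step + 1, (x, y, z))
# The sixth point is (-1 - eps/64, 1 + eps/128, -1 - eps/256), i.e. the
# starting form with eps replaced by eps/64.
\end{verbatim}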
But the sixth point is of the same form as the starting point, with $\epsilon$ replaced by $\epsilon/64$. Thus the algorithm will cycle around six edges of the cube, passing through neighborhoods of the six vertices whose coordinates do not all have the same sign. Along these edges the gradient of the function is bounded away from zero; in fact at the six vertices two of the partials are zero and the third is $\pm 2$. The function value on these edges is $+1$. The other two vertices of the cube, i.e. $(+1,+1,+1)$ and $(-1,-1,-1)$, are the ones we are looking for, because there the function value is $-3$, the minimum of $f$ over the cube. At these two points all three partials are equal to $-2$ and $+2$, respectively.
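As a check on these values, the following sketch (again Python with NumPy; it redefines $f$ and its gradient so that it runs on its own) evaluates the function and the partials at the six vertices on the cycle and at the two remaining vertices.
\begin{verbatim}
# Check function values and gradients at the vertices of the cube (assumes NumPy).
import numpy as np

pos = lambda u: np.maximum(u, 0.0)
f = lambda x, y, z: (-(x * y + y * z + x * z)
                     + sum(pos(t - 1) ** 2 + pos(-t - 1) ** 2 for t in (x, y, z)))
grad = lambda x, y, z: np.array([-(y + z) + 2 * pos(x - 1) - 2 * pos(-x - 1),
                                 -(x + z) + 2 * pos(y - 1) - 2 * pos(-y - 1),
                                 -(x + y) + 2 * pos(z - 1) - 2 * pos(-z - 1)])

cycle = [(1, 1, -1), (1, -1, -1), (1, -1, 1), (-1, -1, 1), (-1, 1, 1), (-1, 1, -1)]
for v in cycle:
    print(v, f(*v), grad(*v))   # f = 1; two partials are zero, the third is +-2
for v in [(1, 1, 1), (-1, -1, -1)]:
    print(v, f(*v), grad(*v))   # f = -3; all three partials are -2 or +2
\end{verbatim}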
Powell gives some additional examples which show the same sort of cycling behaviour, but are somewhat smoother.