I.3.8.2: Convergence to Incorrect Solutions
Convergence need not be towards a minimum, even if the function is convex. This example is an elaboration of the one in Abatzoglou and O'Donnell [1982].
Let
$$f(\alpha,\beta):=\max_{0\leq x\leq 1}|x^2-\alpha-\beta x|.$$
To compute $\min_{\alpha,\beta}f(\alpha,\beta)$ we do the usual Chebyshev calculations. If $e(x):=x^2-\alpha-\beta x$ and $(\alpha,\beta)$ is optimal, we must have $e(0)=e(1)=-e(z)$, for some $0<z<1$ and $e'(z)=0$. Moreover $e'(z)=2z-\beta$, so that $z=\beta/2$. Thus $e(0)=e(1)$ gives $-\alpha=1-\alpha-\beta$, while $e(0)=-e(z)$ gives $-\alpha=\alpha+\beta^2/4$. The solution is $\alpha=-\frac18$, $\beta=1$, $z=\frac12$, and $f(-\frac18,1)=\frac18$. Thus the best linear Chebyshev approximation to $x^2$ on the unit interval is $x-\frac18$, which has function value $\frac18$.
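As a quick numerical check of these values we can evaluate $f$ over a grid. This is a sketch of ours; the function name f below is not from ccd.R.

```r
## Chebyshev loss on a grid over the unit interval: a numerical check
## that alpha = -1/8, beta = 1 gives the minimax error 1/8.
f <- function(alpha, beta, x = seq(0, 1, length.out = 1001)) {
  max(abs(x ^ 2 - alpha - beta * x))
}
f(-1 / 8, 1)  ## 0.125, attained at x = 0, x = 1/2, and x = 1
```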
Now use coordinate descent. Start with $\beta^{(0)}=0$. Then $\alpha^{(1)}=\mathop{\mathrm{argmin}}_\alpha f(\alpha,0)=\frac12$ and $\beta^{(1)}=\mathop{\mathrm{argmin}}_\beta f(\tfrac12,\beta)=0$, because for any $\beta\neq 0$ either the minimum of $x^2-\frac12-\beta x$ drops below $-\frac12$ or its value at one exceeds $\frac12$. Thus $(\alpha^{(1)},\beta^{(1)})=(\frac12,0)$, and we have convergence after a single cycle to a point for which $f(\frac12,0)=\frac12$.
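The cycle can be reproduced numerically by grid search over each coordinate, reusing f from the previous snippet (again a sketch, with arbitrary grid resolution).

```r
## One cycle of coordinate descent starting from beta = 0, using
## grid search for each univariate minimization.
grid <- seq(-2, 2, by = 0.001)
alpha1 <- grid[which.min(sapply(grid, function(a) f(a, 0)))]      ## 0.5
beta1 <- grid[which.min(sapply(grid, function(b) f(alpha1, b)))]  ## 0
f(alpha1, beta1)  ## 0.5, four times the minimax error 1/8
```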
This example can be analyzed in more detail. First we compute the best constant (zero degree polynomial) approximation $\hat\alpha(\beta)$ to $x^2-\beta x$. The function $x^2-\beta x$ is a convex quadratic with roots at zero and $\beta$, with a minimum equal to $-\beta^2/4$ at $\beta/2$.
We start with the simple rule that the best constant approximation is the average of the maximum and the minimum on the interval. We will redo the calculations later on, using a different and more general approach.
Case A: If $\beta\leq 0$ then $x^2-\beta x$ is non-negative and increasing in the unit interval, and thus $\hat\alpha(\beta)=\frac12(1-\beta)$.
Case B: If $0\leq\beta\leq 1$ then $x^2-\beta x$ attains its minimum $-\beta^2/4$ at $\beta/2$ in the unit interval, and its maximum $1-\beta$ at one, thus $\hat\alpha(\beta)=\frac12(1-\beta)-\frac{\beta^2}{8}$.
Case C: If $1\leq\beta\leq 2$ then $x^2-\beta x$ still attains its minimum at $\beta/2$ in the unit interval, but now the maximum is at zero, and thus $\hat\alpha(\beta)=-\frac{\beta^2}{8}$.
Case D: If $\beta\geq 2$ then $x^2-\beta x$ is non-positive and decreasing in the unit interval, and thus again $\hat\alpha(\beta)=\frac12(1-\beta)$.
We can derive the same results, and more, by using a more general approach. First
$$f(\alpha,\beta)=\max\,(M(\beta)-\alpha,\;\alpha-m(\beta)),$$
where $M(\beta):=\max_{0\leq x\leq 1}(x^2-\beta x)$ and $m(\beta):=\min_{0\leq x\leq 1}(x^2-\beta x)$. Since $x^2-\beta x$ is convex in $x$, we see $M(\beta)=\max(0,1-\beta)$. Now $x^2-\beta x$ has a minimum at $\beta/2$ equal to $-\beta^2/4$. This is the minimum over the closed interval if $0\leq\beta\leq 2$, otherwise the minimum occurs at one of the boundaries. Thus
$$m(\beta)=\begin{cases}0&\text{ if }\beta\leq 0,\\ -\beta^2/4&\text{ if }0\leq\beta\leq 2,\\ 1-\beta&\text{ if }\beta\geq 2,\end{cases}$$
and $\hat\alpha(\beta)=\frac12(M(\beta)+m(\beta))$, with $f(\hat\alpha(\beta),\beta)=\frac12(M(\beta)-m(\beta))$. It follows that
$$\hat\alpha(\beta)=\begin{cases}\frac12(1-\beta)&\text{ if }\beta\leq 0,\\ \frac12(1-\beta)-\beta^2/8&\text{ if }0\leq\beta\leq 1,\\ -\beta^2/8&\text{ if }1\leq\beta\leq 2,\\ \frac12(1-\beta)&\text{ if }\beta\geq 2.\end{cases}$$
It is more complicated to compute $\hat\beta(\alpha)$, because the corresponding Chebyshev approximation problem does not satisfy the Haar condition, and the solution may not be unique.
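Before turning to $\hat\beta(\alpha)$, the map $\hat\alpha(\beta)$ is easily transcribed into R. This is a sketch; the helper name alphaHat is ours and not necessarily what ccd.R uses.

```r
## Best constant correction for given beta: the average of the maximum
## M(beta) and the minimum m(beta) of x^2 - beta * x over [0, 1].
alphaHat <- function(beta) {
  bigM <- max(0, 1 - beta)
  m <- if (beta <= 0) 0 else if (beta <= 2) -beta ^ 2 / 4 else 1 - beta
  (bigM + m) / 2
}
alphaHat(0)  ## 0.5
alphaHat(1)  ## -0.125
```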
We make the necessary calculations, starting from the left. Define $h(\beta):=f(\alpha,\beta)$, where $\alpha$ is considered fixed. For $x=0$ we have $e(0)=-\alpha$, whatever $\beta$ is, and thus $h(\beta)\geq|\alpha|$ for all $\beta$. Define $u(\beta):=M(\beta)-\alpha$ and $v(\beta):=\alpha-m(\beta)$, so that $h(\beta)=\max(u(\beta),v(\beta))$. For $\beta\leq 0$ we have
$$h(\beta)=\max(1-\beta-\alpha,\alpha).$$
Note that $u(\beta)\leq v(\beta)$ if and only if $\beta\geq 1-2\alpha$, for all such $\beta$. If $\alpha\geq\frac12$ then $h$ has a minimum equal to $\alpha$ for all $\beta$ in $[1-2\alpha,0]$. Now $1-2\alpha\leq 0$ if and only if $\alpha\geq\frac12$. Thus for $\alpha\geq\frac12$ we have
$$\min_{\beta\leq 0}h(\beta)=\alpha,$$
and since this attains the lower bound $|\alpha|$, every $\beta$ in $[1-2\alpha,0]$ is a global minimizer.
Switch to $0\leq\beta\leq 1$. For these $\beta$ we have $m(\beta)=-\beta^2/4$ and $M(\beta)=1-\beta$, so that $h(\beta)=\max(1-\beta-\alpha,\alpha+\beta^2/4)$. We have $u(\beta)\leq v(\beta)$ if and only if $\beta^2+4\beta-4(1-2\alpha)\geq 0$. The discriminant of this quadratic is $32(1-\alpha)$, which means that if $\alpha>1$ we have $h(\beta)=\alpha+\beta^2/4$ everywhere. If $\alpha\leq 1$ define $\beta_-$ and $\beta_+$ as the two roots of the quadratic. Now
$$\beta_\pm=-2\pm 2\sqrt{2(1-\alpha)}.$$
Clearly $\beta_-<0$. If $\alpha\geq\frac12$ then also $\beta_+\leq 0$ and thus $h(\beta)=\alpha+\beta^2/4$ on $[0,1]$. If $0\leq\beta_+\leq 1$ then $h$ has a minimum at $\beta_+$. Thus if $-\frac18\leq\alpha\leq\frac12$ we have
$$\min_{0\leq\beta\leq 1}h(\beta)=\alpha+\tfrac{\beta_+^2}{4}=3-\alpha-2\sqrt{2(1-\alpha)},$$
attained at the unique point $\beta_+=2(\sqrt{2(1-\alpha)}-1)$.
Next take $1\leq\beta\leq 2$, where we need $M(\beta)=\max(0,1-\beta)$, which is equal to zero for $\beta\geq 1$. Thus $h(\beta)=\max(-\alpha,\alpha+\beta^2/4)$ on this interval. If $\alpha\geq-\frac18$ then $h(\beta)=\alpha+\beta^2/4$ everywhere. If $\alpha\leq-\frac18$ define $\gamma_-$ and $\gamma_+$ as $\gamma_\pm=\pm 2\sqrt{-2\alpha}$, the solutions of $\alpha+\beta^2/4=-\alpha$. Then
$$h(\beta)=\begin{cases}-\alpha&\text{ if }\gamma_-\leq\beta\leq\gamma_+,\\ \alpha+\beta^2/4&\text{ otherwise.}\end{cases}$$
If $\gamma_+\geq 1$ then we have a minimum of $-\alpha$ at every $\beta$ in $[1,\min(2,\gamma_+)]$. Thus if $\alpha\leq-\frac18$ we find
$$\min_{1\leq\beta\leq 2}h(\beta)=-\alpha,$$
which again attains the lower bound $|\alpha|$, so these $\beta$ are global minimizers.
And finally, for $\beta\geq 2$, we get back to a boundary minimum again at the right hand side of the real line: $m(\beta)=1-\beta$, so that
$$h(\beta)=\max(-\alpha,\alpha+\beta-1).$$
We have a minimum equal to $-\alpha$ if $1-2\alpha\geq 2$, i.e. $\alpha\leq-\frac12$. In that case the minimizers in this range are all $\beta$ in $[2,1-2\alpha]$.
So, in summary,
$$\hat\beta(\alpha)=\begin{cases}{[1,1-2\alpha]}&\text{ if }\alpha\leq-\frac12,\\ {[1,2\sqrt{-2\alpha}]}&\text{ if }-\frac12\leq\alpha\leq-\frac18,\\ \{2(\sqrt{2(1-\alpha)}-1)\}&\text{ if }-\frac18\leq\alpha\leq\frac12,\\ {[1-2\alpha,0]}&\text{ if }\alpha\geq\frac12.\end{cases}$$
The minimizer is unique only for $-\frac18\leq\alpha\leq\frac12$.
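This summary transcribes directly into R. Again a sketch: the name betaHat is ours, and we represent the set of minimizers as a vector c(lower, upper), a unique minimizer being an interval of length zero.

```r
## Set of best slopes for given alpha, from the piecewise summary above.
betaHat <- function(alpha) {
  if (alpha <= -1 / 2) {
    c(1, 1 - 2 * alpha)
  } else if (alpha <= -1 / 8) {
    c(1, 2 * sqrt(-2 * alpha))
  } else if (alpha <= 1 / 2) {
    rep(2 * (sqrt(2 * (1 - alpha)) - 1), 2)  ## unique minimizer
  } else {
    c(1 - 2 * alpha, 0)
  }
}
betaHat(-1 / 8)  ## c(1, 1): the global minimizer is unique here
betaHat(1 / 2)   ## c(0, 0): the point from the single-cycle example
```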
We now have enough information to write a simple coordinate descent algorithm. Of course such an algorithm would have to include a rule to select from the set of minimizers if the minimizers are not unique. In our R implementation in ccd.R we allow for different rules. If the minimizers form an interval, we always choose the smallest point, or always the largest point, or always the midpoint, or a uniform draw from the interval. We shall see in our example that these different options have a large influence on the approximation the algorithm converges to, in fact even on what the algorithm considers to be desirable points.
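The following is a minimal sketch of such an algorithm, meant only to show the structure; it is not the code in ccd.R. It assumes the helpers alphaHat() and betaHat() from the snippets above.

```r
## Cyclic coordinate descent on f, with a selection rule for the case
## in which the minimizers over beta form an interval.
ccdSketch <- function(beta, rule = c("smallest", "largest", "midpoint", "random"),
                      itmax = 100, eps = 1e-10) {
  rule <- match.arg(rule)
  for (iter in 1:itmax) {
    alpha <- alphaHat(beta)  ## exact minimization over alpha
    int <- betaHat(alpha)    ## interval of minimizers over beta
    betaNew <- switch(rule,
                      smallest = min(int),
                      largest = max(int),
                      midpoint = mean(int),
                      random = runif(1, min(int), max(int)))
    if (abs(betaNew - beta) < eps) break
    beta <- betaNew
  }
  x <- seq(0, 1, length.out = 1001)  ## report f at the final point
  c(alpha = alpha, beta = beta, f = max(abs(x ^ 2 - alpha - beta * x)))
}
ccdSketch(0)              ## stops at once: alpha = 0.5, beta = 0, f = 0.5
ccdSketch(2, "smallest")  ## reaches the solution: alpha = -0.125, beta = 1
```

Note how the smallest rule, started at $\beta=2$, happens to end in the correct solution, while every rule started at $\beta=0$ stops immediately at $f=\frac12$.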
We give the function that transforms $\beta^{(k)}$ into $\beta^{(k+1)}$, for each of the four different selection rules, in Figure 1.
The function is in red, the line $\beta^{(k+1)}=\beta^{(k)}$ in blue. Thus over most of the region of interest the algorithm does not change the slope $\beta$, which means it converges in a single iteration to an incorrect solution. It needs more iterations only for the midpoint and random selection rules if started outside the interval $[0,1]$.
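A plot in the style of Figure 1 can be produced along the following lines, reusing the helpers above (a sketch; only the midpoint rule is drawn here). For every rule the red curve coincides with the blue line on $[0,1]$, so each of those points is a fixed point of the update.

```r
## One-cycle update map beta -> betaNew for a given selection rule,
## plotted against the identity line as in Figure 1.
betaUpdate <- function(beta, rule) {
  int <- betaHat(alphaHat(beta))
  switch(rule,
         smallest = min(int),
         largest = max(int),
         midpoint = mean(int),
         random = runif(1, min(int), max(int)))
}
beta <- seq(-2, 3, length.out = 501)
plot(beta, sapply(beta, betaUpdate, rule = "midpoint"), type = "l",
     col = "red", xlab = "beta", ylab = "update")
abline(0, 1, col = "blue")  ## fixed points lie on this line
```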