Gauss-Newton Majorization

The least squares loss function was defined earlier. It has the form
$$\sigma(\theta)=\frac{1}{2}(y-f(\theta))'W(y-f(\theta)),$$
where $W$ is an $n\times n$ positive semi-definite matrix of weights, and we minimize over $\theta\in\Theta\subseteq\mathbb{R}^p$.

Now, writing $r(\theta)=y-f(\theta)$ for the residuals and $G(\theta)=\mathcal{D}f(\theta)$ for the $n\times p$ Jacobian of $f$,
$$\mathcal{D}\sigma(\theta)=-G(\theta)'Wr(\theta),$$
and
$$\mathcal{D}^2\sigma(\theta)=G(\theta)'WG(\theta)-\sum_{i=1}^n\{Wr(\theta)\}_i\,\mathcal{D}^2f_i(\theta).$$
The structure of the Hessian suggests that we define
$$A(\theta)=G(\theta)'WG(\theta),\qquad B(\theta)=\sum_{i=1}^n\{Wr(\theta)\}_i\,\mathcal{D}^2f_i(\theta).$$
We have $\mathcal{D}^2\sigma(\theta)=A(\theta)-B(\theta)$. Note that $A(\theta)$ is positive semi-definite for all $\theta$.
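The decomposition of the Hessian into $A(\theta)-B(\theta)$ can be checked numerically. The sketch below uses an illustrative model $f_i(\theta)=\theta_1 e^{\theta_2 x_i}$ with made-up data and diagonal weights (all assumptions, not from the text), and compares $A(\theta)-B(\theta)$ against a finite-difference Hessian of $\sigma$:

```python
import numpy as np

# Illustrative model f_i(theta) = theta[0] * exp(theta[1] * x_i); the data,
# weights, and evaluation point are arbitrary choices for this check.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.2, 1.9, 3.1, 4.8, 7.6])
w = np.array([1.0, 1.0, 2.0, 1.0, 0.5])          # diagonal W

def f(theta):
    return theta[0] * np.exp(theta[1] * x)

def sigma(theta):
    r = y - f(theta)
    return 0.5 * np.sum(w * r ** 2)

def jac(theta):
    # n x 2 Jacobian G(theta) of f
    e = np.exp(theta[1] * x)
    return np.column_stack([e, theta[0] * x * e])

def hess_fi(theta, i):
    # 2 x 2 Hessian of the single coordinate f_i
    e = np.exp(theta[1] * x[i])
    return np.array([[0.0, x[i] * e],
                     [x[i] * e, theta[0] * x[i] ** 2 * e]])

theta = np.array([1.3, 0.85])
r = y - f(theta)
G = jac(theta)
A = G.T @ (w[:, None] * G)                        # A(theta) = G'WG
B = sum(w[i] * r[i] * hess_fi(theta, i) for i in range(len(x)))

# Central finite differences recover D^2 sigma(theta)
h = 1e-4
H = np.zeros((2, 2))
for j in range(2):
    for k in range(2):
        def shifted(sj, sk):
            t = theta.copy()
            t[j] += sj * h
            t[k] += sk * h
            return sigma(t)
        H[j, k] = (shifted(1, 1) - shifted(1, -1)
                   - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)

print(np.allclose(H, A - B, atol=1e-3))           # Hessian equals A - B
print(np.min(np.linalg.eigvalsh(A)) >= 0)         # A is positive semi-definite
```

The two printed checks confirm, for this example, that $\mathcal{D}^2\sigma(\theta)=A(\theta)-B(\theta)$ and that $A(\theta)$ is positive semi-definite.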

In the classical Gauss-Newton method we make the approximation $\mathcal{D}^2\sigma(\theta)\approx A(\theta)$, and the corresponding iterative algorithm is
$$\theta^{(k+1)}=\theta^{(k)}-A(\theta^{(k)})^{-1}\mathcal{D}\sigma(\theta^{(k)}).$$
If the algorithm converges to $\theta^\star$, and the residuals $r_i(\theta^\star)$ are small, then the least squares loss function will be small, and $B(\theta^\star)$ will be small as well. Since the iteration matrix at $\theta^\star$ is
$$\mathcal{M}(\theta^\star)=I-A(\theta^\star)^{-1}\mathcal{D}^2\sigma(\theta^\star)=A(\theta^\star)^{-1}B(\theta^\star),$$
we can expect rapid convergence when $B(\theta^\star)$ is small. But convergence is not guaranteed, and consequently we need safeguards. Majorization provides one such safeguard.
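A minimal sketch of the Gauss-Newton iteration, on the same illustrative exponential model $f_i(\theta)=\theta_1 e^{\theta_2 x_i}$ (an assumption, not from the text). The data are generated without noise, so the residuals vanish at the solution, $B(\theta^\star)=0$, and convergence is fast:

```python
import numpy as np

# Noiseless data from theta = (1.5, 0.9): a zero-residual toy problem
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 1.5 * np.exp(0.9 * x)
w = np.ones_like(x)                      # unit weights, W = I

def f(theta):
    return theta[0] * np.exp(theta[1] * x)

def jac(theta):
    e = np.exp(theta[1] * x)
    return np.column_stack([e, theta[0] * x * e])

theta = np.array([1.4, 0.8])             # starting point near the solution
for _ in range(100):
    r = y - f(theta)
    G = jac(theta)
    A = G.T @ (w[:, None] * G)           # A(theta), the Gauss-Newton matrix
    g = -G.T @ (w * r)                   # gradient D-sigma(theta)
    step = np.linalg.solve(A, g)
    theta = theta - step                 # theta_{k+1} = theta_k - A^{-1} g
    if np.linalg.norm(step) < 1e-12:
        break

# theta converges to (1.5, 0.9) on this zero-residual problem
```

Note that no safeguard is used here; from a poor starting point this same loop can diverge, which is the motivation for the majorization safeguard below.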

If we can find $K$ such that $K\succeq\mathcal{D}^2\sigma(\theta)$ for all $\theta$, then
$$\sigma(\theta)\leq\sigma(\xi)+(\theta-\xi)'\mathcal{D}\sigma(\xi)+\frac{1}{2}(\theta-\xi)'K(\theta-\xi)$$
for all $\theta$ and $\xi$. The quadratic on the right majorizes $\sigma$ at $\xi$, and minimizing it gives the update $\theta^{(k+1)}=\theta^{(k)}-K^{-1}\mathcal{D}\sigma(\theta^{(k)})$, which guarantees that the loss never increases.
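A sketch of the majorized update on the same toy problem, assuming a suitable bound $K$ is available. Here $K=A(\theta^{(0)})+\kappa I$ with a generous $\kappa$ is used as a crude empirical bound on the region the iterates traverse; it is not a proven global majorizer, merely enough to illustrate the monotone behavior:

```python
import numpy as np

# Same illustrative exponential model with noiseless data
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 1.5 * np.exp(0.9 * x)
w = np.ones_like(x)

def f(theta):
    return theta[0] * np.exp(theta[1] * x)

def sigma(theta):
    r = y - f(theta)
    return 0.5 * np.sum(w * r ** 2)

def grad(theta):
    e = np.exp(theta[1] * x)
    G = np.column_stack([e, theta[0] * x * e])
    return -G.T @ (w * (y - f(theta)))

theta = np.array([1.4, 0.8])
e = np.exp(theta[1] * x)
G0 = np.column_stack([e, theta[0] * x * e])
# K = A(theta_0) + kappa * I, held fixed over all iterations;
# kappa = 200 is an ad hoc choice for this example
K = G0.T @ (w[:, None] * G0) + 200.0 * np.eye(2)

losses = [sigma(theta)]
for _ in range(2000):
    theta = theta - np.linalg.solve(K, grad(theta))  # minimize the majorizer
    losses.append(sigma(theta))

# The loss sequence is nonincreasing -- the hallmark of majorization --
# at the price of slower (linear) convergence than unsafeguarded Gauss-Newton
print(all(l1 <= l0 + 1e-12 for l0, l1 in zip(losses, losses[1:])))
```

The contrast with the previous sketch is the point: the fixed matrix $K$ buys guaranteed descent, while Gauss-Newton's varying $A(\theta)$ buys speed without a guarantee.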

Note: Use Nesterov's Gauss-Newton paper 03/13/15