ALSOS | Block Relaxation Algorithms in Statistics -- Part I

I.5.2.1: ALSOS

ALSOS algorithms are ALS algorithms in which one or more of the blocks defines transformations of variables. $f(x,z)=\sum_{j=1}^m\sum_{\ell=1}^mw_{j\ell}g_j(x_1,\cdots,x_p,z)g_\ell(x_1,\cdots,x_p,z).$

Suppose we have $n$ observations on two sets of variables $x_i$ and $y_i.$ We want to fit a model of the form $F_\theta(\Phi(x_i))\approx G_\xi(\Psi(y_i)),$ where the unknowns are the structural parameters $\theta$ and $\xi$ and the transformations $\Phi$ and $\Psi.$ In ALS we measure loss-of-fit by $\sigma(\theta,\xi,\Phi,\Psi)= \sum_{i=1}^n[F_\theta(\Phi(x_i))-G_\xi(\Psi(y_i))]^2.$ We minimize this loss function by starting with initial estimates for the transformations. We first minimize over the structural parameters, keeping the transformations fixed at their current values, and then minimize over the transformations, keeping the structural parameters fixed at their new values. These two minimizations are alternated, which produces a nonincreasing sequence of loss function values, bounded below by zero, and thus convergent. This is a version of the trivial convergence theorem.
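The monotonicity argument can be written out as a chain of inequalities. The iteration superscript $(k)$ is notation introduced here for illustration, and we assume both minima are attained. If $(\theta^{(k+1)},\xi^{(k+1)})$ minimizes $\sigma$ over the structural parameters for fixed $(\Phi^{(k)},\Psi^{(k)}),$ and $(\Phi^{(k+1)},\Psi^{(k+1)})$ minimizes $\sigma$ over the transformations for fixed $(\theta^{(k+1)},\xi^{(k+1)}),$ then

$$\sigma(\theta^{(k)},\xi^{(k)},\Phi^{(k)},\Psi^{(k)})\geq\sigma(\theta^{(k+1)},\xi^{(k+1)},\Phi^{(k)},\Psi^{(k)})\geq\sigma(\theta^{(k+1)},\xi^{(k+1)},\Phi^{(k+1)},\Psi^{(k+1)})\geq 0.$$

Note that this only shows convergence of the sequence of loss function values. It does not, by itself, say anything about convergence of the parameters or the transformations.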

The first ALS example is due to Kruskal \cite{krus}. We have a factorial ANOVA, with, say, two factors, and we minimize $\sigma(\phi,\mu,\alpha,\beta)= \sum_{i=1}^n\sum_{j=1}^m[\phi(y_{ij})-(\mu+\alpha_i+\beta_j)]^2.$ Kruskal required $\phi$ to be monotonic. Minimizing loss over $\mu,\alpha,\beta$ for fixed $\phi$ is just an analysis of variance; minimizing loss over $\phi$ for fixed $\mu,\alpha,\beta$ is a monotone regression. Some normalization requirement is also needed to exclude trivial zero solutions, such as $\phi=0$ with all structural parameters equal to zero.
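The two alternating steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kruskal's original procedure: the function names (`pava`, `normalize`, `additive_fit`, `monotone_anova`) are hypothetical, the table is complete with distinct values $y_{ij},$ $\phi$ is taken to be nondecreasing, and the normalization (transformed values centered with unit sum of squares) is one possible choice among several. Monotone regression is done with the pool-adjacent-violators method.

```python
import math

def pava(values):
    """Least-squares nondecreasing fit to a sequence (pool-adjacent-violators)."""
    blocks = []  # each block stores [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Pool while the previous block mean exceeds the last block mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]

def normalize(values):
    """Center to mean zero and scale to unit sum of squares (assumed normalization)."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def additive_fit(z):
    """Main-effects fit mu + alpha_i + beta_j for a complete two-way table z."""
    n, m = len(z), len(z[0])
    mu = sum(map(sum, z)) / (n * m)
    alpha = [sum(row) / m - mu for row in z]
    beta = [sum(row[j] for row in z) / n - mu for j in range(m)]
    return [[mu + alpha[i] + beta[j] for j in range(m)] for i in range(n)]

def monotone_anova(y, iterations=20):
    """Alternate an ANOVA step and a monotone regression step; return z and losses."""
    n, m = len(y), len(y[0])
    # Cells sorted by increasing y; phi must be nondecreasing along this order.
    order = sorted(((i, j) for i in range(n) for j in range(m)),
                   key=lambda c: y[c[0]][c[1]])
    z = [[0.0] * m for _ in range(n)]

    def store(flat):
        for (i, j), v in zip(order, flat):
            z[i][j] = v

    store(normalize([y[i][j] for i, j in order]))  # start from phi = identity
    losses = []
    for _ in range(iterations):
        fit = additive_fit(z)  # ANOVA step for fixed phi
        losses.append(sum((z[i][j] - fit[i][j]) ** 2 for i, j in order))
        # Monotone regression step for fixed mu, alpha, beta, then renormalize.
        store(normalize(pava([fit[i][j] for i, j in order])))
    return z, losses

# Illustrative data: an additive table after an exponential distortion.
a, b = [0.0, 1.0, 2.0], [0.0, 0.3, 0.6]
y = [[math.exp(ai + bj) for bj in b] for ai in a]
z, losses = monotone_anova(y)
print(losses[0], losses[-1])
```

Projecting on the cone of monotone vectors and then rescaling solves the normalized monotone regression problem, so each step cannot increase the loss, and the recorded `losses` are nonincreasing. With tied values of $y_{ij}$ one would also have to choose how to treat ties, which this sketch does not address.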

This general idea was extended by De Leeuw, Young, and Takane around 1975 to $\sigma(\phi;\psi_1,\cdots,\psi_m)= \sum_{i=1}^n[\phi(y_i)-\sum_{j=1}^m\psi_j(x_{ij})]^2.$ This ALSOS work, in the period 1975-1980, is summarized in \cite{forrest}. Subsequent work, culminating in the book by Gifi \cite{gifi}, generalized this to ALSOS versions of principal component analysis, path analysis, canonical analysis, discriminant analysis, MANOVA, and so on. The classes of transformations over which loss was minimized were usually step-functions, splines, monotone functions, or low-degree polynomials.

To illustrate the use of more than two sets in ALS, consider $\sigma(\psi_1,\cdots,\psi_m;\alpha,\beta)= \sum_{i=1}^n\sum_{j=1}^m (\psi_j(x_{ij})- \sum_{s=1}^p\alpha_{is}\beta_{js})^2.$ This is principal component analysis (or partial singular value decomposition) with optimal scaling. We can now cycle over three sets: the transformations $\psi_j,$ the component scores $\alpha_{is},$ and the component loadings $\beta_{js}.$ In the case of monotone transformations this alternates monotone regression with two linear least squares problems.
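The three-block cycle for the principal component example can be written in matrix form. The notation is introduced here for illustration only: $Z$ is the $n\times m$ matrix with elements $z_{ij}=\psi_j(x_{ij})$ and columns $z_j,$ $A$ is the $n\times p$ matrix of component scores $\alpha_{is},$ and $B$ is the $m\times p$ matrix of component loadings $\beta_{js},$ with row $j$ written as the column vector $b_j.$ The loss is then $\|Z-AB'\|^2.$ Assuming $A'A$ and $B'B$ are nonsingular, and assuming each transformed variable is normalized (say, centered with unit sum of squares) to exclude trivial solutions, one cycle is

$$A\leftarrow ZB(B'B)^{-1},\qquad B\leftarrow Z'A(A'A)^{-1},\qquad z_j\leftarrow\mathop{\rm argmin}_{z\in\mathcal{K}_j\cap\mathcal{S}}\|z-Ab_j\|^2,\quad j=1,\cdots,m.$$

Here $\mathcal{K}_j$ is the cone of vectors that are monotone with $x_j,$ and $\mathcal{S}$ is the set of normalized vectors. The first two updates are the linear least squares problems, and the third is a normalized monotone regression for each variable separately, because the loss is a sum of $m$ terms, one for each column of $Z.$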