I.3.1: Introduction
The history of block relaxation methods is complicated, because many special cases were proposed before the general idea became clear. I make no claim here to be even remotely complete, but I will try to mention at least most of the general papers that were important along the way.
It makes sense to distinguish the coordinate descent methods from the more general block methods. Coordinate descent methods have the major advantage that they lead to one-dimensional optimization problems, which are generally much easier to handle than multidimensional ones.
We start our history with iterative methods for linear systems. Even there the history is complicated, but it has been ably reviewed by, among others, Forsythe [1953], Young [1990], Saad and Van der Vorst [2000], Benzi [2009], and Axelsson [2010]. The origins are in 19th century German mathematics, starting perhaps with a letter from Gauss to his student Gerling on December 26, 1823. See Forsythe [1950] for a translation. To quote Gauss: "I recommend this method to you for imitation. You will hardly ever again eliminate directly, at least not when you have more than 2 unknowns. The indirect procedure can be done while half asleep, or while thinking about other things." For discussion of subsequent contributions by Jacobi [1845], Seidel [1874], Von Mises and Pollaczek-Geiringer [1929], we refer to the excellent historical overviews, and to the monumental textbooks by Varga [1962] and Young [1971].
The next step in our history is the quadratic programming method proposed by Hildreth [1957]. Coordinate descent is applied to the dual program, which is a simple quadratic problem with non-negativity constraints, originating from the Lagrange multipliers for the primal problem. Because the constraints are separable in the dual problem the technique can easily handle a large numbers of inequality constraints and can easily be parallelized. Hildreth already considered the non-cyclic greedy and random versions of coordinate descent. A nice historical overview of Hildreth's method and its various extensions is in Dax [2003].
Coordinate relaxation for convex functions, not necessarily quadratic, was introduced by D'Esopo [1959] in an important paper, followed by influential papers of Schechter [1962],[1968],[1970]. The D'Esopo paper actually has an early version of Zangwill's general convergence theory, applied to functions that are convex in each variable separably are are minimized under separable bound constraints.
Ortega and Rheinboldt \cite{orrh67}, \cite{orrh67b}, Elkin [1968] , C\'ea [1968],[1970], C\'ea and Glowinski[1973], and Auslender \cite{ausnum},\cite{ausmart}. Many of these papers present the method as a nonlinear generalization of the Gauss-Seidel method of solving a system of linear equations.
Modern papers on block-relaxation are by Abatzoglou and O'Donnell [1982] and by Bezdek et al. \cite{bezdekea}.
So many more now
In Statistics .. Statistical applications to mixed linear models, with the parameters describing the mean structure collected in one block and the parameters describing the dispersion collected in the second block, are in Oberhofer and Kmenta \cite{oberkment}. Applications to exponential family likelihood functions, cycling over the canonical parameters, are in Jensen et al. \cite{jejola}.