In *A Few Useful Things to Know about Machine Learning*, Pedro Domingos decomposed machine learning into three components:
- Representation, as the core, is the general (mathematical) model that a computer can handle.
- Evaluation is the criterion. An evaluation function (also called objective function, cost function or scoring function) is needed to distinguish good classifiers from bad ones.
- Optimization aims to find the parameters that optimize the evaluation function, i.e. $$ \arg\min_{\theta\in \Theta} f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\min f(\theta)\},\quad\text{or}\quad \arg\max_{\theta\in \Theta}f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\max f(\theta)\}. $$
The objective function to be minimized is also called the cost function.
Evaluation is always attached to optimization; an evaluation function that cannot be optimized is not a good evaluation in machine learning.
- https://www.wikiwand.com/en/Mathematical_optimization
- https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
- http://www.cs.cmu.edu/~pradeepr/convexopt/
- An interactive tutorial to numerical optimization
- Patrick Louis' RECENT CONFERENCE TALKS on optimization
- Northwestern University Open Text Book on Process Optimization
- Introductory course on non-smooth optimisation
- CS4540: Simple Algorithms
- CS 798: Convexity and Optimization
- Survival Guide for Students of Optimization, Dianne P. O'Leary, September 2017
- https://nlopt.readthedocs.io/en/latest/
The proofs of convergence or complexity are often based on the convex case, where the objective function as well as the constraint set is convex; such a problem is called a convex optimization problem.
By the way, the name numerical optimization
means that the variables or parameters to estimate are in numeric format, which is far from performance optimization in concept.
The first-order optimality condition is a necessary condition for unconstrained optimization problems if the objective function is differentiable: if $x^{\ast}$ is a local minimizer of $f$, then $\nabla f(x^{\ast})=0$.
Wotao Yin wrote a summary on First-order methods and operator splitting for optimization:
First-order methods are described and analyzed with gradients or subgradients, while second-order methods use second-order derivatives or their approximations.
During the 70s–90s the majority of the optimization community focused on second-order methods since they are more efficient for those problems that have the sizes of interest at that time. Beginning around fifteen years ago, however, the demand to solve ever larger problems started growing very quickly. Many large problems are further complicated by non-differentiable functions and constraints. Because simple first-order and classic second-order methods are ineffective or infeasible for these problems, operator splitting methods regained their popularity.
Operators are used to develop algorithms and analyze them for a wide spectrum of problems including optimization problems, variational inequalities, and differential equations. Operator splitting is a set of ideas that generate algorithms through decomposing a problem that is too difficult as a whole into two or more smaller and simpler subproblems. During the decomposition, complicated structures like non-differentiable functions, constraint sets, and distributed problem structures end up in different subproblems and thus can be handled elegantly. We believe ideas from operator splitting provide the most effective means to handle such complicated structures for computational problem sizes of modern interest.
- The world of optimization
- Gradient flow and gradient descent
- Accelerated gradient descent
- Accelerated gradient flow
- Stochastic gradient flow
- Aristotle vs. Newton
- Advanced non-smooth optimization
Each iteration of a line search method computes a search direction $p^k$ and then decides how far to move along it: $x^{k+1}=x^{k}+\alpha_k p^k$, where the positive scalar $\alpha_k$ is called the step length.
Gradient descent and its variants are used to find a local solution of the unconstrained optimization problem
$$\min_{x\in\mathbb{R}^n} f(x)$$
where $f:\mathbb{R}^n\to\mathbb{R}$ is differentiable.
It is not difficult to observe that the negative gradient $-\nabla f(x)$ is a descent direction whenever $\nabla f(x)\neq 0$.
Let $x^{k+1}=\arg\min_{x} f(x^k) + (x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{\|x-x^k\|}_2^2$; we obtain $x^{k+1} = x^{k}-\frac{1}{s_k}{\nabla}_{x} f(x^k)$. However, the constant $s_k$ is not known in advance and must be chosen with care.
And the general gradient descent methods are given by
$$x^{k+1}=x^{k}-\alpha_{k}{\nabla}_{x} f(x^k),$$
where $\alpha_k$ is the step length (called the learning rate in machine learning). Unrolling the recursion,
$$ x^{k+1}=\fbox{$x^{k}$}-\alpha_{k}{\nabla}_{x} f(x^k) = \fbox{$x^{k-1}-\alpha_{k-1}\nabla f(x^{k-1})$}-\alpha_{k}{\nabla}_{x} f(x^k) = \cdots = x^1-\sum_{n=1}^{k}\alpha_n{\nabla}_{x} f(x^n). $$
- http://59.80.44.100/www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
- http://wiki.fast.ai/index.php/Gradient_Descent
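As a minimal sketch of this iteration (assuming only that the gradient is available as a Python callable; `gradient_descent` and its arguments are illustrative names, not from any library):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Fixed-step gradient descent: x^{k+1} = x^k - alpha * grad f(x^k).

    grad: callable returning the gradient at a point.
    x0: starting point (NumPy array).
    alpha: fixed step size (learning rate).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stationarity-based stopping criterion
            break
        x = x - alpha * g
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_star = gradient_descent(lambda x: x, x0=np.array([3.0, -4.0]))
```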
There are many ways to choose a proper step size or learning rate sequence $\{\alpha_k\}$, such as a fixed constant, a diminishing sequence, or (exact or inexact) line search rules.
The first-order Taylor approximation of $f(x+v)$ around $x$ is $f(x+v)\approx f(x)+\nabla f(x)^T v$.
The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of $f$ at $x$ in the direction $v$.
We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x_{nsd}=\arg\min\{\nabla f(x)^T v \mid {\|v\|}= 1\}.$$
It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is unnormalized:
$$\Delta x_{sd}={\|\nabla f(x)\|}_{\ast}\,\Delta x_{nsd}$$
where ${\|\cdot\|}_{\ast}$ denotes the dual norm.
(We say ‘a’ steepest descent direction because there can be multiple minimizers.)
Algorithm: Steepest descent method
- given a starting point $x \in \operatorname{dom} f$.
- repeat
  1. Compute steepest descent direction $\Delta x_{sd}$.
  2. Line search. Choose $t$ via backtracking or exact line search.
  3. Update. $x := x + t \Delta x_{sd}$.
- until stopping criterion is satisfied.
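For concreteness, here is a small sketch of backtracking line search under the usual Armijo condition (parameters `alpha`, `beta` and the function names are my own, not from any particular text):

```python
import numpy as np

def backtracking(f, grad, x, direction, alpha=0.3, beta=0.8):
    """Backtracking line search: shrink t until the Armijo condition
    f(x + t d) <= f(x) + alpha * t * grad(x)^T d holds."""
    t = 1.0
    fx, g = f(x), grad(x)
    while f(x + t * direction) > fx + alpha * t * g.dot(direction):
        t *= beta
    return t

# Usage with the negative gradient as (steepest) descent direction:
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x = np.array([2.0, -1.0])
t = backtracking(f, grad, x, -grad(x))
```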
If the variable is constrained in a convex set $D$, one classical method is the conditional gradient descent method, or Frank-Wolfe algorithm.
Algorithm: Frank–Wolfe algorithm
- given a starting point $x \in \operatorname{dom} f$.
- repeat
  1. Find $s^k$: $s^k =\arg\min_{v}\{\nabla f(x^{k})^T v\mid v\in D\}$.
  2. Set $t=\frac{2}{k+2}$ or $t=\arg\min\{f(x^k+t(s^k - x^k))\mid t\in [0, 1]\}$.
  3. Update. $x^{k+1}= x^k + t(s^k-x^k)$, $k\leftarrow k+1$.
- until stopping criterion is satisfied.
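A minimal sketch over the probability simplex, where the linear subproblem has a closed form (the minimizer is the vertex with the most negative gradient coordinate); all names are illustrative:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, max_iter=500):
    """Frank-Wolfe on the probability simplex {x >= 0, sum x = 1}.
    The linear subproblem argmin_{v in simplex} grad^T v is solved by
    putting all mass on the coordinate with the smallest gradient."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # vertex minimizing the linearization
        t = 2.0 / (k + 2.0)            # standard open-loop step size
        x = x + t * (s - x)            # stays in the simplex by convexity
    return x

# Example: minimize ||x - c||^2 over the simplex.
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda x: 2 * (x - c), np.ones(3) / 3)
```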
- The subtle differences between gradient descent and steepest descent (梯度下降法和最速下降法的细微差别)
- An Introduction to Conditional Gradient
- Frank–Wolfe algorithm
- Greedy Algorithms, Frank-Wolfe and Friends - A modern perspective
- https://sites.google.com/site/nips13greedyfrankwolfe/
- Revisiting Frank-Wolfe
- Nonsmooth Frank-Wolfe algorithm and Coresets
Some variants of gradient descent methods are not line search methods. For example, the heavy ball method or momentum method iterates
$$x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}),$$
where the momentum coefficient $\rho_k\in[0,1]$.
The Nesterov accelerated gradient method at the $k$th step is given by:
$$x^{k}=y^{k-1}-\alpha_{k}\nabla_{x}f(y^{k-1}),\qquad y^{k}=x^{k}+\rho_{k}(x^{k}-x^{k-1}),$$
where the momentum coefficient $\rho_k\in[0,1)$ and the gradient is evaluated at the extrapolated point $y^{k-1}$ rather than at $x^{k-1}$.
They are called inertial gradient methods or accelerated gradient methods, and they appear in several different forms.
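A short sketch of Nesterov's accelerated gradient with a constant step size and the common $\rho_k=(k-1)/(k+2)$ momentum schedule (a sketch of one standard form, not Nesterov's original presentation):

```python
import numpy as np

def nesterov(grad, x0, alpha=0.1, max_iter=500):
    """Nesterov accelerated gradient with momentum rho_k = (k-1)/(k+2)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, max_iter + 1):
        rho = (k - 1.0) / (k + 2.0)
        y = x + rho * (x - x_prev)          # look-ahead (inertial) point
        x_prev, x = x, y - alpha * grad(y)  # gradient step taken at y
    return x

x = nesterov(lambda x: x, np.array([5.0, -3.0]))
```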
(Photo: Yurii Nesterov, inventor of the Nesterov accelerated gradient method.)
- ORF523: Nesterov’s Accelerated Gradient Descent
- Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization
- Revisiting Nesterov’s Acceleration
- A short proof for Nesterov’s momentum
- Nemirovski’s acceleration
- https://zhuanlan.zhihu.com/p/41263068
- https://zhuanlan.zhihu.com/p/35692553
- https://zhuanlan.zhihu.com/p/35323828
- On Gradient-Based Optimization: Accelerated, Asynchronous, Distributed and Stochastic
- Nesterov Accelerated Gradient and Momentum
- Why Momentum Really Works
- http://www.optimization-online.org/DB_FILE/2018/11/6938.pdf
- https://www.mat.univie.ac.at/~neum/glopt/mss/MasAi02.pdf
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/09-acceleration.pdf
- WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
- https://www.fromthegenesis.com/gradient-descent-part1/
- https://www.fromthegenesis.com/gradient-descent-part-2/
- Deep Learning From Scratch IV: Gradient Descent and Backpropagation
- https://ee227c.github.io/notes/ee227c-lecture08.pdf
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
The conjugate gradient method comes from methods for solving symmetric positive definite linear systems $Ax=b$.
The idea is to make the search directions $A$-orthogonal (conjugate) instead of orthogonal: two vectors $u$ and $v$ are conjugate with respect to $A$ if $u^{T}Av=0$.
Given the Krylov subspace $\mathcal{K}_k=\operatorname{span}\{b, Ab, \dots, A^{k-1}b\}$,
the CG algorithm (among others) generates the Krylov sequence, with the following properties:
- $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right)$, but ${\|r\|}$ can increase;
- $x^{(n)}=x^{\star}$ (i.e., $x^{\star} \in \mathcal{K}_{n}$) even when $\mathcal{K}_{n} \neq \mathbf{R}^{n}$;
- $x^{(k)}=p_{k}(A) b$, where $p_{k}$ is a polynomial with $\operatorname{deg} p_{k}<k$.
There is a two-term recurrence $$\fbox{$x^{k+1} = x^k + \alpha_k r^{k} + \beta_k (x^{k}-x^{k-1})$}$$ where $r^k=b-Ax^k$ is the residual.
$$
\begin{aligned}
& x :=0,\quad r :=b,\quad \rho_{0} :={\|r\|}^2 \\
& \text{for } k=1, \ldots, N_{\max}: \\
&\qquad \text{quit if } \sqrt{\rho_{k-1}} \leq \epsilon\, {\|b\|}_{2} \\
&\qquad \text{if } k=1 \text{ then } p:=r \text{ else } p := r+\tfrac{\rho_{k-1}}{\rho_{k-2}}\,p \\
&\qquad w := Ap \\
&\qquad \alpha :=\frac{\rho_{k-1}}{w^{T} p} \\
&\qquad x := x+\alpha p \\
&\qquad r := r-\alpha w \\
&\qquad \rho_k := {\|r\|}^2
\end{aligned}
$$
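A compact NumPy sketch of this linear CG iteration (for a symmetric positive definite matrix; a sketch rather than a production solver):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A by linear CG."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b.copy()                 # residual r = b - A x (x = 0 initially)
    p = r.copy()                 # first search direction
    rho = r.dot(r)
    for _ in range(max_iter):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        w = A.dot(p)
        alpha = rho / w.dot(p)
        x += alpha * p
        r -= alpha * w
        rho_new = r.dot(r)
        p = r + (rho_new / rho) * p   # A-conjugate direction update
        rho = rho_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)     # approximate solution of A x = b
```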
The preconditioned conjugate gradient method applies CG to the transformed system $M^{-1}Ax=M^{-1}b$
with a preconditioner $M\approx A$ that is cheap to invert.
In this view, gradient-type methods transform a multivariate optimization into a sequence of univariate (line search) optimizations.
Here is an outline of the nonlinear CG method:
- $d^{0}=r^{0}=-f^{\prime}(x^0)$;
- Find $\alpha^{i}$ that minimizes $f(x^{i}+\alpha^{i} d^{i})$, and set $x^{i+1}=x^{i} + \alpha_{i} d^{i}$ and $r^{i+1}=-f^{\prime}(x^{i+1})$;
- Compute $\beta_{i+1}=\frac{\left<r^{i+1}, r^{i+1}\right>}{\left<r^{i}, r^{i}\right>}$ (Fletcher–Reeves) or $\beta_{i+1}=\max\left\{\frac{\left<r^{i+1}, r^{i+1}-r^{i}\right>}{\left<r^{i}, r^{i}\right>}, 0\right\}$ (Polak–Ribière), and set $d^{i+1}=r^{i+1}+\beta_{i+1}d^{i}$.
- https://wiki.seg.org/wiki/The_conjugate_gradient_method
- https://stanford.edu/class/ee364b/lectures/conj_grad_slides.pdf
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- http://www.cs.umd.edu/users/oleary/cggg/historyindex.html
- A BRIEF INTRODUCTION TO THE CONJUGATE GRADIENT METHOD
NEWTON’S METHOD and QUASI-NEWTON METHODS are classified as variable metric methods.
They are also used to find solutions of unconstrained optimization problems, i.e., $\min_x f(x)$. If the objective function is twice continuously differentiable, its curvature can be exploited.
Newton's method is one of the fixed-point methods used to solve nonlinear equation systems.
It is given by
$$
x^{k+1}=\arg\min_{x}f(x^k)+(x-x^k)^T \nabla_x f(x^k)+\frac{1}{2\alpha_{k+1}}(x-x^k)^T H(x^k)(x-x^k)
=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x}f(x^{k}),
$$
where $H(x^k)=\nabla_x^2 f(x^k)$ is the Hessian matrix of $f$ at $x^k$ and $\alpha_{k+1}$ is the step length.
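A minimal damped-Newton sketch (assuming callable gradient and Hessian, and solving the linear system rather than forming the inverse; all names are illustrative):

```python
import numpy as np

def newton(grad, hess, x0, alpha=1.0, max_iter=50, tol=1e-10):
    """Damped Newton's method: x <- x - alpha * H(x)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess(x), g)  # solve H d = g; avoid explicit inverse
        x = x - alpha * step
    return x

# Example: minimize f(x) = x1^4 + x2^2.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
x = newton(grad, hess, np.array([1.0, 1.0]))
```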
- https://brilliant.org/wiki/newton-raphson-method/
- https://www.shodor.org/unchem/math/newton/
- http://fourier.eng.hmc.edu/e176/lectures/NM/node21.html
- https://www.cup.uni-muenchen.de/ch/compchem/geom/nr.html
- http://web.stanford.edu/class/cme304/docs/newton-type-methods.pdf
In maximum likelihood estimation, the objective function is the log-likelihood function, i.e.
$$
\ell(\theta)=\sum_{i=1}^{n}\log{P(x_i|\theta)}
$$
where $\theta$ is the unknown parameter and $x_1,\dots,x_n$ are the observed samples.
And the Fisher scoring algorithm is given by
$$
\theta^{k+1}=\theta^{k}+\alpha_{k}J^{-1}(\theta^{k})\nabla_{\theta} \ell(\theta^{k})
$$
where $J(\theta^{k})$ is the Fisher information matrix evaluated at $\theta^{k}$ and $\alpha_k$ is the step length.
See http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/scoring.pdf or https://wiseodd.github.io/techblog/2018/03/11/fisher-information/.
Fisher scoring algorithm is regarded as an example of Natural Gradient Descent in information geometry as shown in https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/ and https://www.zhihu.com/question/266846405.
Quasi-Newton methods, like steepest descent, require only the gradient of the objective function to be supplied at each iterate. By measuring the changes in gradients, they construct a model of the objective function that is good enough to produce super-linear convergence. The improvement over steepest descent is dramatic, especially on difficult problems. Moreover, since second derivatives are not required, quasi-Newton methods are sometimes more efficient than Newton’s method.[^11]
In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0.
In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems.
In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.
One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) $B$ does not need to be inverted from scratch at each step.
The unknown $x_k$ is updated by applying the Newton-type step computed from the current approximate Hessian $B_k$:
- $\Delta x_{k}=-\alpha_{k} B_{k}^{-1}\nabla f(x_{k})$, with $\alpha$ chosen to satisfy the Wolfe conditions;
- $x_{k+1}=x_{k}+\Delta x_{k}$;
- The gradient is computed at the new point, $\nabla f(x_{k+1})$, and $y_{k}=\nabla f(x_{k+1})-\nabla f(x_{k})$ is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.
A key property of the BFGS and DFP updates is that if $B_k$ is positive definite and the curvature condition $y_k^{T}\Delta x_k>0$ holds, then $B_{k+1}$ is also positive definite.
For example, the most popular update formulas are:
Method | Update of $B_{k+1}$
---|---
DFP | $\left(I-\frac{y_k \Delta x_k^T}{y_k^T \Delta x_k}\right) B_k \left(I-\frac{\Delta x_k y_k^T}{y_k^T \Delta x_k}\right)+\frac{y_k y_k^T}{y_k^T \Delta x_k}$
BFGS | $B_k+\frac{y_k y_k^T}{y_k^T \Delta x_k}-\frac{B_k \Delta x_k \Delta x_k^T B_k}{\Delta x_k^T B_k \Delta x_k}$
SR1 | $B_k+\frac{(y_k-B_k \Delta x_k)(y_k-B_k \Delta x_k)^T}{(y_k-B_k \Delta x_k)^T \Delta x_k}$
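A small BFGS sketch maintaining the inverse-Hessian approximation $H_k$ directly (with a fixed unit step rather than the Wolfe line search used in practice; all names are illustrative):

```python
import numpy as np

def bfgs(grad, x0, max_iter=100, tol=1e-8):
    """BFGS maintaining an approximation H of the *inverse* Hessian."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H, I = np.eye(n), np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        s = -H.dot(g)               # quasi-Newton step (unit step length here)
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = s.dot(y)
        if sy > 1e-12:              # curvature condition keeps H positive definite
            rho = 1.0 / sy
            H = (I - rho * np.outer(s, y)).dot(H).dot(I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize 0.5 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[1.0, 0.2], [0.2, 0.5]])
b = np.array([1.0, 1.0])
x = bfgs(lambda x: A.dot(x) - b, np.zeros(2))
```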
- Quasi-Newton methods in Wikipedia page
- http://59.80.44.98/www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
- http://fa.bianp.net/teaching/2018/eecs227at/quasi_newton.html
Consider the gradient iteration form $$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$
which can be written as
$$
x^{k+1}=x^{k}- D_k \nabla f(x^k)
$$
where $D_k=\alpha_k I$ plays the role of a (very crude) approximation of the inverse Hessian.
Let $s^{k-1}=x^{k}-x^{k-1}$ and $y^{k-1}=\nabla f(x^{k})-\nabla f(x^{k-1})$. Mimicking the secant condition, the Barzilai-Borwein method chooses $\alpha_k$ to minimize ${\|\frac{1}{\alpha}s^{k-1}-y^{k-1}\|}_2^2$, which gives $\alpha_k=\frac{(s^{k-1})^{T} s^{k-1}}{(s^{k-1})^{T} y^{k-1}}$.
By symmetry, we may minimize ${\|s^{k-1}-\alpha y^{k-1}\|}_2^2$ instead, which gives $\alpha_k=\frac{(s^{k-1})^{T} y^{k-1}}{(y^{k-1})^{T} y^{k-1}}$.
In short, the iteration formula of the Barzilai-Borwein method is given by
$$
x^{k+1}=x^{k}-\alpha_k \nabla f(x^k)
$$
where $\alpha_k$ is one of the two step sizes above.
It is easy to see that this method requires no matrix computations and no line searches (except perhaps for the very first step size).
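A short sketch of gradient descent with the first BB step size (assuming only a gradient callable; illustrative names):

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=0.1, max_iter=200, tol=1e-10):
    """Gradient descent with Barzilai-Borwein step alpha = (s^T s)/(s^T y)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev       # the first step uses a fixed step size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s, y = x - x_prev, g - g_prev
        alpha = s.dot(s) / s.dot(y)    # BB1 step size from the secant condition
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

x = barzilai_borwein(lambda x: x - np.array([1.0, 2.0]), np.zeros(2))
```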
- https://mp.weixin.qq.com/s/G9HH29b2-VBnk_Sqze8pDg
- http://www.math.ucla.edu/~wotaoyin/math273a/slides/
- http://bicmr.pku.edu.cn/~wenzw/courses/WenyuSun_YaxiangYuan_BB.pdf
- https://www.math.lsu.edu/~hozhang/papers/cbb.pdf
The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a dense $n\times n$ approximation, which is prohibitive for large problems; the limited-memory BFGS (L-BFGS) method instead keeps only a short history of the most recent $m$ pairs $(s_k, y_k)$ and reconstructs matrix-vector products from them.
- https://www.wikiwand.com/en/Limited-memory_BFGS
- On the limited memory BFGS method for large scale optimization
- Numerical Optimization: Understanding L-BFGS
- Unconstrained optimization: L-BFGS and CG
- Wikipedia page on Newton Method
- Newton-Raphson Visualization (1D)
- Newton-Raphson Visualization (2D)
- Using Gradient Descent for Optimization and Learning
- ftp://lsec.cc.ac.cn/pub/home/yyx/papers/siamjna1987.pdf
Natural gradient descent solves the optimization problem $\min_{\theta} L(\theta)$ while taking the Riemannian metric of the parameter manifold into account; the update is
$$\theta^{t+1}=\theta^{t}-\alpha_t G^{-1}(\theta^t)\nabla_{\theta} L(\theta^t),$$
where $G(\theta^t)$ is the Riemannian metric tensor (for statistical models, the Fisher information matrix) at the point $\theta^t$.
Natural gradient descent for manifolds corresponding to
exponential families can be implemented as a first-order method through mirror descent.
(Photo: Shun-ichi Amari, originator of information geometry.)
- Natural gradient descent and mirror descent
- Online Natural Gradient as a Kalman Filter
- https://www.zhihu.com/question/266846405
- 2016 PKU Mini-Course: Information Geometry
- Information Geometry and Natural Gradients
- Tutorial on Information Geometry and Algebraic Statistics
- True Asymptotic Natural Gradient Optimization
- Shun-ichi Amari's CV in RIKEN
- Energetic Natural Gradient Descent
- Natural gradients and K-FAC
- http://www.divergence-methods.org/
- http://www.deeplearningpatterns.com/doku.php?id=natural_gradient_descent
- NATURAL GRADIENTS AND STOCHASTIC VARIATIONAL INFERENCE
- On optimization algorithms (谈谈优化算法) - an article by 郑思座 - 知乎
- Accelerating Natural Gradient with Higher-Order Invariance
Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.
$$
\min_{p\in\mathbb{R}^n} m_k(p)=f_k+ g_k^T p+\frac{1}{2}p^T B_k p, \quad s.t.\ {\|p\|}\leq {\Delta}_k,
$$
where $f_k=f(x^k)$, $g_k=\nabla f(x^k)$, $B_k$ is an approximation of the Hessian, and $\Delta_k>0$ is the trust-region radius.
- https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods
- Concise Complexity Analyses for Trust-Region Methods
- https://www.nmr-relax.com/manual/Trust_region_methods.html
- http://lsec.cc.ac.cn/~yyx/worklist.html
Expectation-Maximization algorithm, popularly known as the EM algorithm has become a standard piece in the statistician’s repertoire. It is used in incomplete-data problems or latent-variable problems such as Gaussian mixture model in maximum likelihood estimation. The basic principle behind the EM is that instead of performing a complicated optimization, one augments the observed data with latent data to perform a series of simple optimizations.
It is also really popular among Bayesian statisticians.
Let $Y_{obs}$ denote the observed data, $Z$ the latent (missing) data, and $\theta$ the unknown parameter; write $\ell(\theta|Y_{obs}, Z)$ for the complete-data log-likelihood.
Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step).
Specifically, let $\theta^{(t)}$ be the current estimate. The E-step computes
$$Q(\theta|\theta^{(t)})=\mathbb{E}\left(\ell(\theta|Y_{obs}, Z)\mid Y_{obs},\theta^{(t)}\right),$$
and the M-step is to maximize $Q$ with respect to $\theta$: $\theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)})$.
- https://www.wikiwand.com/en/Expectation%E2%80%93maximization_algorithm
- http://cs229.stanford.edu/notes/cs229-notes8.pdf
- What is the significance of the EM algorithm? (EM算法存在的意义是什么?) - an answer by 史博 - 知乎
(Diagram of the EM algorithm.)
Each iteration of the generalized EM algorithm also consists of an expectation step (E-step) and a maximization step (M-step).
Specifically, the M-step only requires finding some $\theta^{(t+1)}$ with $Q(\theta^{(t+1)}|\theta^{(t)})\geq Q(\theta^{(t)}|\theta^{(t)})$; it increases, rather than maximizes, the conditional expectation.
See more in the book *The EM Algorithm and Extensions*, 2nd Edition, by Geoffrey McLachlan and Thriyambakam Krishnan.
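A tiny EM sketch for a two-component univariate Gaussian mixture (variances fixed to 1 for brevity; the function and its initialization are illustrative, not from the book):

```python
import numpy as np

def em_gmm_2(x, max_iter=100):
    """EM for a 2-component Gaussian mixture with unit variances."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialization
    pi = np.array([0.5, 0.5])
    for _ in range(max_iter):
        # E-step: responsibilities p(z = j | x_i, theta^(t))
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) * pi[None, :]
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta | theta^(t)) in closed form
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
mu, pi = em_gmm_2(x)
```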
- Majorization Methods
- Tangential Majorization
- Quadratic Majorization
- Sharp Majorization
- Using Higher Derivatives
- AN ASSEMBLY AND DECOMPOSITION APPROACH FOR CONSTRUCTING SEPARABLE MINORIZING FUNCTIONS IN A CLASS OF MM ALGORITHMS
Quadratic Lower Bound
We will focus on projected gradient descent and some of its non-Euclidean generalizations in order to solve simply constrained optimization problems.
If not specified, all these methods are aimed to solve convex optimization problem with explicit constraints, i.e. $$ \arg\min_{x\in\mathbb{S}}f(x) $$
where $\mathbb{S}\subset\mathbb{R}^n$ is convex. At a solution $x^{\ast}$, any feasible direction is not a descent direction; that is, the variational inequality
$$\left<\nabla f(x^{\ast}), x-x^{\ast}\right>\geq 0,\quad \forall x\in\mathbb{S}$$
holds, and it is the optimality condition of the constrained optimization problem.
We say the gradient of a convex function is a monotone operator because
$$\left<\nabla f(x)-\nabla f(y), x-y\right>\geq 0.$$
Projected gradient descent has two steps:
$$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{(Descent)} $$
$$ x^{k+1} = Proj_{\mathbb{S}}(z^{k+1})=\arg\min_{x\in \mathbb{S}}{\|x-z^{k+1}\|}^{2} \qquad\text{(Projection)} $$
or in the compact form
$$ x^{k+1} = \arg\min_{x\in \mathbb{S}}{\|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))\|}^{2} = \arg\min_{x}\left\{{\|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))\|}^{2} +\delta_{\mathbb{S}}(x) \right\}, $$
where $\delta_{\mathbb{S}}$ is the indicator function of the set $\mathbb{S}$.
Each projection is itself an optimization problem, so the iterative points satisfy the optimality conditions; this also restricts the projection method to the case where the projection is available or simple to compute. It is natural to search for a better descent step or projection step.
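A minimal sketch of projected gradient descent onto a box constraint, where the projection is just clipping (names are illustrative):

```python
import numpy as np

def projected_gradient(grad, project, x0, alpha=0.1, max_iter=500):
    """Projected gradient descent: descend, then project back onto S."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        z = x - alpha * grad(x)   # descent step
        x = project(z)            # projection step
    return x

# Example: minimize ||x - c||^2 over the box [0, 1]^n.
c = np.array([1.5, -0.3, 0.7])
project = lambda x: np.clip(x, 0.0, 1.0)
x = projected_gradient(lambda x: 2 * (x - c), project, np.zeros(3))
```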
The following links are recommended if you are interested in more theoretical proof:
For a non-differentiable but convex function such as the absolute value function, the subgradient replaces the gradient.
Mirror descent can be regarded as the non-Euclidean generalization of projected gradient descent, obtained by replacing the squared Euclidean distance in the projection with a Bregman divergence.
A Bregman divergence is induced by a convex smooth function $h$:
$$B_h(x,y)=h(x)-h(y)-\left<\nabla h(y), x-y\right>,$$
where $\left<\cdot,\cdot\right>$ is the inner product.
The function $h$ is called the distance-generating (or mirror) function. $B_h(x,y)$ is convex in $x$ for fixed $y$.
Especially, when $h(x)=\frac{1}{2}{\|x\|}_2^2$, the Bregman divergence reduces to the squared Euclidean distance $B_h(x,y)=\frac{1}{2}{\|x-y\|}_2^2$.
A wonderful introduction to Bregman divergence is Meet the Bregman Divergences by Mark Reid at http://mark.reid.name/blog/meet-the-bregman-divergences.html.
The Bregman projection of $y$ onto a convex set $C$ is $x^{\ast}=\arg\min_{x\in C}B_h(x,y)$.
A generalized Pythagorean theorem holds: for convex $C$, its Bregman projection $x^{\ast}$ of $y$, and any $x\in C$,
$$B_h(x,y)\geq B_h(x,x^{\ast})+B_h(x^{\ast},y).$$
Mirror descent is given in the projection form:
$$\nabla h(z^{k+1})=\nabla h(x^{k})-\alpha_k\nabla f(x^k),\qquad x^{k+1}=\arg\min_{x\in\mathbb{S}}B_h(x,z^{k+1}).$$
In another compact form, mirror gradient can be described in the proximal form:
$$
x^{k+1} = \arg\min_{x\in\mathbb{S}} { f(x^k) + \left<g^k, x-x^k\right> + \frac{1}{\alpha_k} B_h(x,x^k)}\tag {1}
$$
with $g^k=\nabla f(x^k)$ and step size $\alpha_k>0$.
Note that the next iteration point $x^{k+1}$ depends only on the current point $x^k$, the gradient $g^k$, and the Bregman divergence $B_h$.
By the optimal conditions of equation (1), the original "mirror" form of mirror gradient method is described as
$$
\nabla h(x^{k+1}) = \nabla h(x^k) - \alpha_k \nabla f(x^k)
= \nabla h(x^1) - \sum_{n=1}^{k}\alpha_n \nabla f(x^n), \quad x\in \mathbb{S},
$$
where the convex function $h$ "mirrors" the iterates between the primal space and the dual (gradient) space via $\nabla h$.
One special method is called entropic mirror descent (exponentiated gradient descent)
when the Bregman divergence is induced by the negative entropy $h(x)=\sum_i x_i\ln x_i$
and the constraint set $\mathbb{S}$ is the probability simplex.
The entropic descent method at step $k$ reads:
$$ x_{i}^{k+1} = \frac{x_i^{k}\exp(-\alpha\, {\nabla f(x^k)}_{i})}{\sum_{j=1}^{n} x_j^{k}\exp(-\alpha\, {\nabla f(x^k)}_{j})}, \quad i=1,2,\dots, n. $$
It is obvious that entropic descent methods have a coordinate-wise update formula. What is more, the update can be rewritten as $$x^{k+1}=\frac{x^{1}\exp(-\alpha\sum_{n=1}^{k} \nabla f(x^n))}{\prod_{n=1}^{k}\left<x^n, \exp(-\alpha \nabla f(x^n))\right>}\propto x^{1}\exp\left(-\alpha\sum_{n=1}^{k} \nabla f(x^n)\right).$$
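A short sketch of this entropic (exponentiated gradient) update on the simplex — a multiplicative step followed by normalization (names are illustrative):

```python
import numpy as np

def entropic_descent(grad, x0, alpha=0.1, max_iter=500):
    """Mirror descent with the negative-entropy mirror map on the simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = x * np.exp(-alpha * grad(x))  # coordinate-wise multiplicative step
        x = w / w.sum()                   # renormalize onto the simplex
    return x

c = np.array([0.1, 0.6, 0.3])
x = entropic_descent(lambda x: 2 * (x - c), np.ones(3) / 3)
```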
The Multiplicative Weights Update method is closely related to the entropic descent method. See more on the following link list.
- The Divergence Methods Web Site (under construction)
- Bregman Divergence and Mirror Descent, Xinhua Zhang(张歆华)
- CS 294 / Stat 260, Fall 2014: Learning in Sequential Decision Problems
- ELE522: Large-Scale Optimization for Data Science , Yuxin Chen, Princeton University, Fall 2019
- Mirror Descent and the Multiplicative Weight Update Method, CS 435, 201, Nisheeth Vishnoi
- Mirror descent: first order methods under a unified framework (统一框架下的first order methods)
- ORF523: Mirror Descent, part I/II
- ORF523: Mirror Descent, part II/II
- Thibaut Lienart: MIrror Descent
- Sinkhorn Algorithm as a Special Case of Stochastic Mirror Descent
- https://web.stanford.edu/class/cs229t/2017/Lectures/mirror-descent.pdf
- https://www.cs.ubc.ca/labs/lci/mlrg/slides/mirrorMultiLevel.pdf
- https://konstmish.github.io/
Recall that the projected gradient method converts the constrained problem into an unconstrained but nonsmooth one via the indicator function.
The proximal mapping (or prox-operator) of a convex function $h$ is defined as
$$prox_h(x)=\arg\min_{u}\left\{h(u)+\frac{1}{2}{\|x-u\|}_2^2\right\}.$$
Consider the unconstrained problem with cost function split in two components, $\min_x g(x)+h(x)$, where
- ${g}$ is convex and differentiable;
- $\mathbf{h}$ is closed, convex, and possibly non-differentiable, while $prox_h(x)$ is inexpensive.
The proximal gradient algorithm iterates
$$x^{k+1}=prox_{\alpha_k h}\left(x^{k}-\alpha_k\nabla g(x^{k})\right),$$
and the projected gradient method is the special case of the proximal gradient method with $h=\delta_{\mathbb{S}}$.
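A concrete sketch for $h(x)=\lambda{\|x\|}_1$, whose prox is soft-thresholding (this is essentially the ISTA iteration for the lasso; function names are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(A, b, lam, alpha, max_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient."""
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T.dot(A.dot(x) - b)                        # gradient of smooth part
        x = soft_threshold(x - alpha * g, alpha * lam)   # prox (backward) step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
x = proximal_gradient(A, b, lam=0.5, alpha=1.0 / np.linalg.norm(A, 2) ** 2)
```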
And it is natural to consider a more general algorithm by replacing the squared Euclidean distance in definition of proximal mapping
with a Bregman distance:
$$
\arg\min_{u}\left\{h(u)+\frac{1}{2} {\|x-u\|}_2^2\right\}\to \arg\min_{u}\left\{h(u)+ B(x,u)\right\},
$$
so that the primary proximal gradient methods are modified to the Bregman version,
which is called the Bregman proximal gradient method.
- A collection of proximity operators implemented in Matlab and Python.
- L. Vandenberghe ECE236C (Spring 2019): 4. Proximal gradient method
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf
- Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization
- A unified framework for Bregman proximal methods: subgradient, gradient, and accelerated gradient schemes
- Proximal Algorithms, N. Parikh and S. Boyd
- For more (non exhaustive) informations regarding the proximity operator and the associated proximal algorithms
Proximal and Projected Newton Methods
A fundamental difference between gradient descent and Newton's method is that the latter also iteratively minimizes quadratic approximations, but these use the local Hessian of the function in question. The proximal Newton method can also be applied to such composite optimization problems: $$ y^{k} = prox_{H_{k-1}}\left(x^{k-1}-H_{k-1}^{-1}\nabla g(x^{k-1})\right),\qquad x^{k} = x^{k-1}+t_{k}(y^{k}-x^{k-1}). $$
- useful when $f$ and $g$ have inexpensive prox-operators;
- $x^k$ converges to a solution of $0 \in \partial f (x) + \partial g(x)$ (if a solution exists);
- not symmetric in $f$ and $g$.
The iteration can be written as a fixed-point iteration $x^{k+1}=F(x^{k})$ for a suitable (firmly nonexpansive) operator $F$.
Now, let us consider the simple convex optimization problem $\min_x f(x)$; the proximal point algorithm iterates $x^{k+1}=prox_{\lambda f}(x^{k})$.
Since each step is a proximal step, it is proved that the sequence $\{x^k\}$ is Fejér monotone, i.e., $\|x^{k+1}-x^{\ast}\|\leq \|x^{k}-x^{\ast}\|$ for every minimizer $x^{\ast}$.
- http://maths.nju.edu.cn/~hebma/Talk/VI-PPA.pdf
- https://lostella.github.io/software/
- The proximal point method revisited
The optimal condition of the linearly constrained convex optimization is characterized as a mixed monotone variational inequality:
Prediction Step: With given $v^k$, find a predictor $\tilde{v}^{k}$ satisfying
$$(v-\tilde v^{k})^{T} F(\tilde v^{k})\geq (v-\tilde v^{k})^{T} Q(v^{k}-\tilde v^{k}),\quad \forall v\in\Omega.$$
Correction Step: The new iterate $v^{k+1}$ is obtained by
$$v^{k+1}=v^{k}-M(v^{k}-\tilde v^{k}).$$
Convergence Conditions: For the matrices $Q$ and $M$ above, it suffices that $H:=QM^{-1}$ is symmetric positive definite and $G:=Q^{T}+Q-M^{T}HM$ is positive semidefinite.
- https://www.seas.upenn.edu/~spater/assets/papers/conference/c_2019_paternain_et_al.pdf
- http://export.arxiv.org/pdf/1709.05850
- https://www.hindawi.com/journals/aaa/2013/845459/
- https://arxiv.org/pdf/1709.05850.pdf
Now, let us consider the simple linearly constrained convex optimization problem $\min\{f(x)\mid Ax=b\}$.
The related variational inequality of the saddle point of the Lagrangian function is: find $w^{\ast}=(x^{\ast},\lambda^{\ast})\in\Omega$ such that $(w-w^{\ast})^{T}F(w^{\ast})\geq 0$ for all $w\in\Omega$.
For given $w^k$, the customized proximal point algorithm produces the new iterate by one proximal step applied to this variational inequality.
- https://link.springer.com/article/10.1007/s10589-013-9616-x
- https://core.ac.uk/display/23878067
- http://www.optimization-online.org/DB_FILE/2017/03/5922.pdf
while it can also be written in the following form: $$ x^{+}=\arg\min_{x}\left\{\frac{1}{1_C(x)}\cdot {\|x-x^0\|}_2^2\right\} $$
where $$ 1_C(x)= \begin{cases} 1, & \text{if} \quad x \in C,\\ 0, & \text{otherwise}. \end{cases} $$ How can we generalize this form into the proximal form? And what is the difference with the original additive proximal operator?
If $x\notin C$, the factor $\frac{1}{1_C(x)}$ is infinite, so the minimizer is forced to lie in $C$.
How can we generalize the function $\frac{1}{1_C(x)}$? One multiplicative choice is
$$ x^{+} = \arg\min_{x}\left\{\exp[\delta_C(x)]\cdot \frac{1}{2}{\|x-x^0\|}_2^2\right\} = \arg\min_{x}\log\left\{\exp[\delta_C(x)]\cdot \frac{1}{2}{\|x-x^0\|}_2^2\right\} = \arg\min_{x} \left\{\delta_C(x)+\log({\|x-x^0\|}_2^2)\right\}. $$
The projected gradient method converts the constrained problem into an unsmooth unconstrained problem; the barrier and penalty methods below convert it into smooth ones:

constrained problem | unconstrained problem | desired properties
---|---|---
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\delta_{\mathbb{S}}(x)$ (indicator) | unsmooth, inexpensive
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\frac{1}{t}\phi(x)$ (barrier) | smooth, inexpensive
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\frac{\mu}{2}Q(x)$ (penalty) | smooth, inexpensive
Recall the indicator function of constrained set in projected methods:
$$
\delta_{\mathbb{S}}(x)=
\begin{cases}
0, & \text{if} \quad x \in \mathbb{S};\\
\infty, & \text{otherwise},
\end{cases}
$$
there is a dramatic leap in value at the boundary of $\mathbb{S}$.
The basic idea of the barrier method is to approximate the indicator function by some function $f$ that is:
- convex and differentiable;
- finite ($f(x)< \infty$) if $x\in\mathbb S$;
- such that for every point on the boundary, $f(x)\to\infty$ as $x\to\partial\mathbb S$.
For example, the barrier method in Convex Optimization: Fall 2018 pushes the inequality constraints of the problem into the objective in a smooth way, reducing the problem to an equality-constrained one.
Consider now the following minimization problem:
$$
\min_{x} f(x) \\
\text{subject to} \quad h_n(x) \leq 0,\ n=1,2,\cdots, N, \\
Ax=b,
$$
where $f$ and the $h_n$ are convex and the equality constraints are affine.
When we ignore equality constraints, the problem above can be written by incorporating the inequalities with indicator functions, which in turn can be approximated by the log-barrier function as follows $$\min_{x} f(x)+\sum_{n=1}^{N}\mathbb{I}_{\{h_n(x)\leq 0\}}\approx \min_{x} f(x)+\frac{1}{t}\phi(x)$$
for large $t>0$, where $\phi(x)=-\sum_{n=1}^{N}\log(-h_n(x))$ is the log-barrier function.
In the context of the equality-constrained problem
$$\min_x f(x)\quad s.t.\quad c_i(x)=0,\ i\in\mathcal{E},$$
the quadratic penalty function is
$$Q(x;\mu)=f(x)+\frac{\mu}{2}\sum_{i\in\mathcal{E}}c_i^2(x)$$
with penalty parameter $\mu>0$. By driving $\mu\to\infty$, we penalize the constraint violations with increasing severity.
Quadratic Penalty Method
- Given $\mu_0 > 0$, a nonnegative sequence $\{\tau_k\}$ with $\tau_k\to 0$, and a starting point $x^s_0$;
- for $k = 0, 1, 2, \ldots$:
  - Find an approximate minimizer $x^k=\arg\min_{x}Q(x; \mu_k)$, starting at $x^s_k$ and terminating when ${\|\nabla_x Q(x;\mu_k)\|}_2\leq \tau_k$;
  - if the final convergence test is satisfied, stop with approximate solution $x^k$;
  - Choose a new penalty parameter $\mu_{k+1} > \mu_{k}$;
  - Choose a new starting point $x^s_{k+1}$;
- end (for)
- Line Search Procedures for the Logarithmic Barrier Function
- Penalty method
- http://s-mat-pcs.oulu.fi/~keba/Optimointi/OP_penalty_engl.pdf
- https://www.me.utexas.edu/~jensen/ORMM/supplements/units/nlp_methods/const_opt.pdf
- http://users.jyu.fi/~jhaka/opt/TIES483_constrained_indirect.pdf
The main ideas of path following by predictor–corrector and piecewise-linear methods, and their application in the direction of homotopy methods and nonlinear eigenvalue problems are reviewed. Further new applications to areas such as polynomial systems of equations, linear eigenvalue problems, interior methods for linear programming, parametric programming and complex bifurcation are surveyed. Complexity issues and available software are also discussed.
- Continuation and path following
- https://nisheethvishnoi.files.wordpress.com/2018/05/lecture71.pdf
- http://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/16-primal-dual.pdf
- L. Vandenberghe EE236C (Spring 2016): 17. Path-following methods
Lagrangian duality is another way to solve the constrained optimization problem $\min_x f(x)$ subject to $Ax=b$.
The barrier or penalty function methods add some terms to the objective function to enforce the constraints; they do not take the optimality conditions into consideration, although they work.
Define the Lagrangian $L(x,\lambda)=f(x)+\lambda^{T}(Ax-b)$. If $x$ is feasible, then $L(x,\lambda)=f(x)$ for every $\lambda$, so $\max_{\lambda}L(x,\lambda)=f(x)$ on the feasible set.
In another direction, we want to obtain a lower bound of the primal optimum. By the definition, the dual function $g(\lambda)=\min_{x}L(x,\lambda)$ satisfies $g(\lambda)\leq f(x^{\ast})$.
Its dual problem is in the following form:
$$\max_{\lambda}\min_{x} L(x,\lambda).$$
Note that
$$
\min_{x} L(x, \lambda)\leq L(x,\lambda)\leq \max_{\lambda}L(x,\lambda)
$$
implies weak duality: $\max_{\lambda}\min_{x} L(x,\lambda)\leq\min_{x}\max_{\lambda}L(x,\lambda)$.
And note that a necessary condition for an extremum is that the gradients vanish: $$ \frac{\partial L(x,\lambda)}{\partial x}= \frac{\partial f(x)}{\partial x}+ A^{T}\lambda = 0, \qquad \frac{\partial L(x,\lambda)}{\partial \lambda} = Ax-b=0. $$
Dual ascent takes advantage of these properties:
1. $x^{k+1}=\arg\min_{x} L(x,\lambda^{k})$;
2. ${\lambda}^{k+1}= {\lambda}^{k}+\alpha_k(Ax^{k+1}-b)$.
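A tiny dual ascent sketch for an equality-constrained quadratic, where the $x$-minimization has a closed form (function and variable names are illustrative):

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.1, max_iter=500):
    """Dual ascent for min 0.5 x^T P x + q^T x  s.t.  A x = b.
    The Lagrangian minimizer in x solves P x = -(q + A^T lam)."""
    lam = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x = np.linalg.solve(P, -(q + A.T.dot(lam)))  # argmin_x L(x, lam)
        lam = lam + alpha * (A.dot(x) - b)           # gradient ascent on the dual
    return x, lam

P = np.diag([2.0, 2.0]); q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
x, lam = dual_ascent(P, q, A, b)
```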
Entropy minimization algorithms: $$ x^{k+1}\in \arg\min_{x}\left\{f(x)+\frac{1}{c_k} \sum_{i=1}^{n} x_{i}\left(\ln\Big(\frac{x_i}{x^{k}_{i}}\Big)-1\right)\right\}. $$
- Duality
- https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture7.pdf
- https://www.svm-tutorial.com/2016/09/duality-lagrange-multipliers/
- https://www.cs.jhu.edu/~svitlana/papers/non_refereed/optimization_1.pdf
- Constrained Optimization and Lagrange Multiplier Methods
- https://zhuanlan.zhihu.com/p/50823110
- The proximal augmented Lagrangian method for nonsmooth composite optimization
Exponential Augmented Lagrangian Method
A special case for the convex problem $$ \text{minimize}\quad f(x),\quad s.t.\ g_1(x) \leq 0,\ g_2(x) \leq 0,\ \cdots,\ g_r(x) \leq 0,\ x\in X $$
is the exponential augmented Lagrangian method.
It consists of unconstrained minimizations
$$ x^{k}\in \arg\min_{x\in X}\left\{f(x)+\frac{1}{c_k} \sum_{j=1}^{r} {\mu}^{k}_{j}\exp(c_k g_j(x))\right\} $$ followed by the multiplier iterations $$ {\mu}^{k+1}_{j} = {\mu}^{k}_{j}\exp(c_k g_j(x^k)). $$
If the constraints are more complex, the KKT theorem may be necessary.
- An exponential augmented Lagrangian method with second order convergence
- On the convergence of the exponential multiplier method for convex programming
Generalized Lagrangian function
Minimize
- A generalized Lagrangian function and multiplier method
- Generalized Lagrange Function and Generalized Weak Saddle Points for a Class of Multiobjective Fractional Optimal Control Problems
- https://blog.csdn.net/shayashi/article/details/82529816
- https://suzyahyah.github.io/calculus/optimization/2018/04/07/Lagrange-Multiplier.html
- Lagrange Multipliers without Permanent Scarring
KKT condition
Given the general problem $$ \begin{aligned} \min_{x \in \mathbb{R}^{n}}\ & f(x) \\ \text {subject to }\ & h_{i}(x) \leq 0, \quad i=1, \ldots, m \\ & \ell_{j}(x)=0, \quad j=1, \ldots, r \end{aligned} $$
We define the Lagrangian: $$L(x,u,v)=f(x)+\sum_{i=1}^{m}u_{i}h_{i}(x)+\sum_{j=1}^{r}v_{j}\ell_{j}(x).$$
The Karush-Kuhn-Tucker conditions or KKT conditions are: $$\begin{aligned} &\bullet\quad 0 \in \partial f(x)+\sum_{i=1}^{m} u_{i} \partial h_{i}(x)+\sum_{j=1}^{r} v_{j} \partial \ell_{j}(x) &\text{(stationarity)} \\ &\bullet\quad u_{i} \cdot h_{i}(x)=0 \text { for all } i &\text{(complementary slackness)} \\ &\bullet\quad h_{i}(x) \leq 0,\ \ell_{j}(x)=0 \text { for all } i, j &\text{(primal feasibility)} \\ &\bullet\quad u_{i} \geq 0 \text{ for all } i &\text{(dual feasibility)} \end{aligned}$$
I learnt this theorem in a graduate-level course on functional analysis.
- https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
- Nonlinear Programming by Robert M. Freund
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
Conjugate Duality
The dual problem transforms the original primal problem into a new optimization problem.
- Primal-dual hybrid gradient,
- The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression ,
- A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration
- https://web.maths.unsw.edu.au/~gyli/papers/ljl-conjugate-revised-final-18-11-10.pdf
- Conjugate Duality and Optimization
- LECTURE 5: CONJUGATE DUALITY
- Duality Theory of Constrained Optimization by Robert M. Freund
- Gauge optimization, duality, and applications
- https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
- http://www.mit.edu/~mitter/publications/6_conjugate_convex_IS.pdf
Primal-dual fixed point algorithm and Primary Dual Hybrid Gradient
- A primal-dual fixed point algorithm for multi-block convex minimization
- A primal–dual fixed point algorithm for convex separable minimization
- A Unified Primal-Dual Algorithm Framework Based on Bregman Iteration
- Proximal ADMM
- PDHG
In the following context we will discuss the optimization problem with linear constraints:
$$
\arg\min_{x} f(x) \\
s.t.\ Ax = b,
$$
where $f$ is convex and $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$.
The alternating direction method of multipliers is called ADMM for short.
It is aimed at solving the following convex optimization problem:
$$
\min F(x,y) = f(x)+g(y) \\
\text{subject to }\quad Ax+By =b,
$$
where $f$ and $g$ are convex functions.
Define the augmented Lagrangian: $$ L_{\beta}(x, y, \lambda)=f(x)+g(y) - \lambda^{T}(Ax + By -b)+ \frac{\beta}{2}{\|Ax + By - b\|}_{2}^{2}. $$
Augmented Lagrangian method at step $k$:
1. $(x^{k+1}, y^{k+1})=\arg\min_{x\in\mathbf{X},y\in\mathbf{Y}}L_{\beta}(x, y,\lambda^{\color{aqua}{k}});$
2. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
ADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
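A small ADMM sketch for the lasso, the classic example from Boyd's ADMM monograph, with $f(x)=\frac12{\|Ax-b\|}^2$, $g(y)=\lambda{\|y\|}_1$ and constraint $x-y=0$ (a sketch in scaled dual form, not tuned code):

```python
import numpy as np

def admm_lasso(A, b, lam, beta=1.0, max_iter=300):
    """ADMM for min 0.5||Ax-b||^2 + lam||y||_1  s.t.  x - y = 0."""
    n = A.shape[1]
    x = np.zeros(n); y = np.zeros(n); u = np.zeros(n)  # u: scaled dual variable
    AtA, Atb = A.T.dot(A), A.T.dot(b)
    L = np.linalg.cholesky(AtA + beta * np.eye(n))     # factor once, reuse
    for _ in range(max_iter):
        # x-update: quadratic subproblem, closed form via Cholesky solves
        rhs = Atb + beta * (y - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # y-update: soft thresholding (prox of the l1 norm)
        v = x + u
        y = np.sign(v) * np.maximum(np.abs(v) - lam / beta, 0.0)
        # dual update
        u = u + x - y
    return y

rng = np.random.default_rng(1)
A = rng.normal(size=(40, 15)); b = rng.normal(size=40)
y = admm_lasso(A, b, lam=0.1)
```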
The convergence proof of ADMM in convex optimization can be reduced to verifying the stability of a dynamical system, or can be based on the optimality-condition variational inequality, as in On the O(1/t) convergence rate of alternating direction method.
Linearized ADMM
Note that the $x$ subproblem of ADMM requires minimizing $f(x) + \frac{\beta}{2}{\|Ax + By^k - b -\frac{1}{\beta}\lambda^k\|}_2^2$ over $\mathbf{X}$.
However, this subproblem does not always have a closed-form solution because of the
general structure of the matrix $A$. In that case we may linearize the quadratic term
at $x^k$ and add a proximal term, which yields the linearized ADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}} f(x) + \beta(A x)^T (A x^k + B y^k - b -\frac{1}{\beta}\lambda^k)+ \frac{r}{2}{\|x - x^k\|}_2^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}})$,
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

For given $\beta>0$, the proximal parameter $r$ is chosen so that $rI-\beta A^{T}A$ is positive definite.
We can also linearize the $y$ subproblem:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^k, \lambda^k)$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \beta (By)^{T}(Ax^{\color{red}{k+1}} + B y^k - b -\frac{1}{\beta}\lambda^k) + \frac{r}{2}{\|y - y^k\|}^2$,
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
For given $\mu\in(0,1)$, the dual variable can also be updated twice per iteration.
Taking a dual step after each primal block gives the symmetric ADMM:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})$,
2. $\lambda^{k + \frac{1}{2}} = \lambda^{k} - \mu\beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k}}-b)$,
3. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k+\frac{1}{2}}})$,
4. $\lambda^{k+1} = \lambda^{\color{red}{k+\frac{1}{2}}} - \mu\beta (A x^{\color{red}{k+1}} + B y^{\color{red}{k+1}}-b)$.
A particular case of ADMM is also called the split Bregman method. Bregman ADMM replaces the quadratic penalty function with a Bregman divergence:
$$
L_{\beta}^{\phi}(x, y, \lambda)=f(x)+g(y) - \lambda^{T}(Ax + By - b) + \frac{\beta}{2} B_{\phi}(b- Ax, By),
$$
where $B_{\phi}$ is the Bregman divergence induced by the convex function $\phi$.
BADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{\phi}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{\phi}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
Relaxed, Inertial and Fast ADMM
Family of relaxed A-ADMM algorithms for the above problem, with relaxation parameter $\alpha_k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{\|Ax- y^k+ \lambda^k\|}^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{\|\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-y+\lambda^k\|}^2$,
3. $\lambda^{k+1} = \lambda^{k} +\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-y^{\color{red}{k+1}}$.
Family of relaxed, inertial ADMM algorithms for the above problem, where the damping constant is $r \geq 3$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{\|Ax-\hat y^k+\lambda^k\|}^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{\|\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-y+\lambda^k\|}^2$,
3. $\lambda^{k+1} = \hat\lambda^{k} +\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-y^{\color{red}{k+1}}$,
4. $\gamma_{k+1}=\frac{k}{k+r}$,
5. $\hat\lambda^{k+1}=\lambda^{k+1}+\gamma_{k+1}(\lambda^{k+1}-\lambda^{k})$,
6. $\hat y^{k+1}=y^{k+1}+\gamma_{k+1}(y^{k+1}-y^{k})$.
Fast ADMM:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,\hat y^{\color{aqua}{k}},\hat\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \hat\lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b);$
4. $\alpha_{k+1}=\frac{1+\sqrt{1+4\alpha_k^2}}{2};$
5. $\hat y^{k+1}=y^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(y^{k+1}-y^{k});$
6. $\hat \lambda^{k+1}=\lambda^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(\lambda^{k+1}-\lambda^{k}).$
- FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
- The Classical Augmented Lagrangian Method and Nonexpansiveness
- Relative-error inertial-relaxed inexact versions of Douglas-Rachford and ADMM splitting algorithms
- Relax, and Accelerate: A Continuous Perspective on ADMM
- https://www.semanticscholar.org/author/Guilherme-Fran%C3%A7a/145512630
- An explicit rate bound for over-relaxed ADMM
Firstly we consider the following optimization problem with three blocks:
$$\min f_1(x)+f_2(y)+f_3(z)\quad s.t.\quad A_1x+A_2y+A_3z=b,\ x\in\mathbb{X},\ y\in\mathbb{Y},\ z\in\mathbb{Z}.$$
We define its augmented Lagrangian as
$$L^3_{\beta}(x,y,z,\lambda)=f_1(x)+f_2(y)+f_3(z)-\lambda^{T}(A_1x+A_2y+A_3z-b)+\frac{\beta}{2}{\|A_1x+A_2y+A_3z-b\|}_2^2.$$
Particularly, we first consider the case when the functions $f_1$, $f_2$ and $f_3$ are convex.
It is natural and computationally beneficial to extend the original ADMM directly to solve the general n-block problem. A counter-example shows that this method diverges.
And Professor Bingsheng He, who taught me this section in his class, and his coauthors proposed some schemes for this problem based on his unified framework for convex optimization and monotone variational inequalities.
The parallel splitting augmented Lagrangian method (abbreviated to PSALM) is described as follows:
1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x,y^k,z^k,\lambda^k)\mid x\in\mathbb{X}\}$;
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}},y,z^k,\lambda^k)\mid y\in\mathbb{Y}\}$;
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}},y^{\color{yellow}{k}},z,\lambda^k)\mid z\in\mathbb{Z}\}$;
4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.
We can add one more correction step $$ v^{k+1} := v^{k}-\alpha(v^k - \tilde v^{k}),\quad\alpha\in (0, 2 -\sqrt{2}), $$
where $\tilde v^{k}$ denotes the predictor produced by the steps above and $v$ collects $(y,z,\lambda)$.
Another approach is to add regularization terms:
1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x, y^k, z^k, \lambda^k)\mid x\in\mathbb{X}\}$,
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}}, y, z^k,\lambda^k)+\color{red}{\frac{\tau}{2}\beta{\|A_2(y-y^k)\|}^{2}}\mid y\in\mathbb{Y}\}$,
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}}, y^{\color{yellow}{k}}, z, \lambda^k)+\color{blue}{\frac{\tau}{2}\beta{\|A_3(z - z^k)\|}^{2}}\mid z\in\mathbb{Z}\}$,
4. $\lambda^{k+1} = {\lambda}^{k} - \beta(A_1 x^{k+1} + A_2 y^{k+1} + A_3 z^{k+1}-b)$,

where $\tau$ is a positive regularization parameter.
- http://scis.scichina.com/en/2018/122101.pdf
- http://maths.nju.edu.cn/~hebma/slides/17C.pdf
- http://maths.nju.edu.cn/~hebma/slides/18C.pdf
- https://link.springer.com/article/10.1007/s10107-014-0826-5
Davis-Yin three-operator splitting:
If the three-block scheme is instead applied in a Gauss-Seidel fashion, updating each block with the latest values (marked in green), the iteration reads:
1. $x^{k+1}=\arg\min_{x}\{L^3_{\beta}(\color{green}{x},y^k,z^k,\lambda^k)\mid x\in\mathbb{X}\}$;
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{k+1},\color{green}{y},z^k,\lambda^k)\mid y\in\mathbb{Y}\}$;
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{k+1},y^{k+1},\color{green}{z},\lambda^k)\mid z\in\mathbb{Z}\}$;
4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$,

where the green symbol marks the block currently being updated.
- Three-Operator Splitting
- A Three-Operator Splitting Scheme and its Optimization Applications
- Three-Operator Splitting and its Optimization Applications
- A Three-Operator Splitting Scheme and its Applications
Randomly Permuted ADMM: given initial values at round $k$:
- Primal update:
  - Pick a permutation $\sigma$ of $\{1,\ldots, n\}$ uniformly at random;
  - For $i = 1,2,\cdots, n$, compute $x^{k+1}_{\sigma(i)}$ by
$$ x^{k+1}_{\sigma(i)}=\arg\min_{x_{\sigma(i)}} L(x^{k+1}_{\sigma(1)},\cdots, x^{k+1}_{\sigma(i-1)}, x_{\sigma(i)}, x^{k}_{\sigma(i+1)},\cdots\mid \lambda^{k}). $$
- Dual update: update the dual variable by
$${\lambda}^{k+1}={\lambda}^{k}-\mu\left(\sum_{i=1}^{n}A_i x_i^{k+1} -b\right).$$
- Randomly Permuted ADMM
- On the Expected Convergence of Randomly Permuted ADMM
- On the Efficiency of Random Permutation for ADMM and Coordinate Descent
- Multi-Block ADMM and its Convergence Random Permutation Helps-A talk by Ye
dlADMM
$$
\begin{array}{l}
\text{Problem 1.} \\
\min_{W_{l}, b_{l}, z_{l}, a_{l}} R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \\
\text {s.t. } z_{l}=W_{l} a_{l-1}+b_{l}\ (l=1, \cdots, L),\quad a_{l}=f_{l}\left(z_{l}\right)\ (l=1, \cdots, L-1)
\end{array}
$$
where $W_l$, $b_l$, $z_l$ and $a_l$ are the weights, biases, pre-activations and activations of layer $l$, $f_l$ is the activation function, $R$ is the loss on the output layer and $\Omega_l$ is a regularizer.
Rather than solving Problem 1 directly, we can relax Problem 1 by adding an $\ell_2$ penalty to address the nonlinear constraints:
$$
\begin{array}{l}
\text{Problem 2.} \\
\begin{aligned}
\min_{W_{l}, b_{l}, z_{l}, a_{l}} F(\boldsymbol{W}, \boldsymbol{b}, \boldsymbol{z}, \boldsymbol{a})=\ & R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \\
&+\underbrace{(\nu / 2) \sum_{l=1}^{L-1}\left(\left\|z_{l}-W_{l} a_{l-1}-b_{l}\right\|_{2}^{2}+\left\|a_{l}-f_{l}\left(z_{l}\right)\right\|_{2}^{2}\right)}_{\ell_2\ \text{penalty}} \\
\text { s.t. }\ & z_{L}=W_{L} a_{L-1}+b_{L}
\end{aligned}
\end{array}
$$
where $\mathbf{W}=\{W_{l}\}_{l=1}^{L}$, $\mathbf{b}=\{b_{l}\}_{l=1}^{L}$, $\mathbf{z}=\{z_{l}\}_{l=1}^{L}$, $\mathbf{a}=\{a_{l}\}_{l=1}^{L-1}$, and $\nu>0$ is a tuning parameter.
Compared with Problem 1, Problem 2 has only a linear constraint $z_{L}=W_{L} a_{L-1}+b_{L}$, so ADMM-style updates apply.
- dlADMM: Deep Learning Optimization via Alternating Direction Method of Multipliers
- Deep Learning Optimization via Alternating Direction Method of Multiplier
- http://maths.nju.edu.cn/~hebma/
- http://stanford.edu/~boyd/admm.html
- A General Analysis of the Convergence of ADMM
- Distributed computing for statistical learning problems via ADMM (用ADMM实现统计学习问题的分布式计算)
- https://www.wikiwand.com/en/Augmented_Lagrangian_method
- Convex optimization: the ADMM (Alternating Direction Method of Multipliers) algorithm (凸优化:ADMM交替方向乘子算法)
- Splitting methods and ADMM, Thibaut Lienart
- Split Bregman
- Accelerated Bregman operator splitting with backtracking
- Splitting Algorithms, Modern Operator Theory, and Applications (17w5030)
- 17w5030 Workshop on Splitting Algorithms, Modern Operator Theory, and Applications
The methods such as ADMM, proximal gradient methods do not optimize the cost function directly. For example, we want to minimize the following cost function $$ f(x,y) = g(x) + h(y) $$
with or without constraints.
Specially, if the cost function is additively separable, the problem decouples into independent subproblems.
ADMM and proximal gradient methods split the cost function into 2 blocks, of which one is differentiable and smooth while the other may not be differentiable. In other words, we can use them to solve some non-smooth optimization problems. However, what if no constraints apply to the optimization problem?
Coordinate descent is aimed to minimize the following cost function $$ f(x) = g(x) +\sum_{i}^{n} h_i(x_i) $$
where $g$ is convex and differentiable and each $h_i$ is convex. At step $k$, coordinate descent updates one coordinate at a time:
- $x_{1}^{k} \in\arg\min_{x_1}f(x_1, x_2^{k-1}, x_3^{k-1}, \dots, x_n^{k-1});$
- $x_{2}^{k} \in\arg\min_{x_2}f(x_1^{\color{red}{k}}, x_2,x_3^{k-1},\dots, x_n^{k-1});$
- $x_{3}^{k} \in\arg\min_{x_3}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3,\dots, x_n^{k-1});$
- $\vdots$
- $x_{n}^{k} \in\arg\min_{x_n}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3^{\color{red}{k}},\dots, x_n).$
It can be extended to block coordinate descent (BCD) if the variables $\{x_1,\dots,x_n\}$ are partitioned into blocks that are updated in turn.
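A tiny cyclic coordinate descent sketch for a smooth quadratic, where each one-dimensional subproblem is solved exactly (names are illustrative):

```python
import numpy as np

def coordinate_descent_quadratic(A, b, max_iter=100):
    """Cyclic coordinate descent for f(x) = 0.5 x^T A x - b^T x
    (A symmetric positive definite). Each coordinate minimization is
    exact: x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        for i in range(n):
            residual = b[i] - A[i].dot(x) + A[i, i] * x[i]
            x[i] = residual / A[i, i]   # exact 1-D minimizer in coordinate i
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent_quadratic(A, b)
```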
- http://bicmr.pku.edu.cn/conference/opt-2014/index.html
- https://calculus.subwiki.org/wiki/Additively_separable_function
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
- http://bicmr.pku.edu.cn/~wenzw/opt2015/multiconvex_BCD.pdf
- http://pages.cs.wisc.edu/~swright/LPS/sjw-abcr-v3.pdf
- MS72: Recent Progress in Coordinate-wise Descent Methods
A linear program (LP) is an optimization problem in which the objective function is linear in the unknowns and the constraints consist of linear equalities and linear inequalities.
The standard problem is
$$\min_{x} c^{T} x\quad s.t.\quad \mathbf{A}x=b,\ x\geq 0.$$
Here $x$ is an $n$-dimensional column vector, $c$ is an $n$-dimensional cost vector, $\mathbf{A}$ is an $m\times n$ matrix, and $b$ is an $m$-dimensional column vector.
Fundamental Theorem of Linear Programming. Given a linear program in standard form where
$\mathbf A$ is an$m × n$ matrix of rank$m$ ,
- if there is a feasible solution, there is a basic feasible solution;
- if there is an optimal feasible solution, there is an optimal basic feasible solution.
Linear programming is a constrained convex optimization problem. It is the simplest case of constrained optimization in theory; nonetheless, it is useful in many cases.
If there were no constraints, a nonzero linear objective function would be unbounded.
It is a unified principle that we optimize an objective function via sequentially optimizing surrogate functions such as EM, ADMM.
It is obvious that the choice of optimization method relies on the objective function. A surrogate loss transforms the original problem into a sequence of easier subproblems; such a surrogate is also called a merit function.
A good surrogate function should:
- Approximate the objective function.
- Be easy to optimize.
This always leads to the convexification technique
where the surrogate function is convex. For example, we can rewrite gradient descent in the following form
$$
x^{k+1}=\arg\min_{x} \left\{f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}{\|x-x^k\|}_2^2\right\}.
$$
In Newton’s method, we approximate the objective with a quadratic surrogate of the form: $$ Q_k(x) = f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}(x-x^k)^{T}H_k(x-x^k). $$
Note that the Hessian matrix $H_k$ is required to be symmetric positive definite so that $Q_k$ is a convex surrogate.
Note that momentum methods can be rewritten in the surrogate function form: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) = \arg\min_{x}\left\{f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{2\alpha_k} {\|x-x^k-\rho_k(x^k-x^{k-1})\|}_2^2\right\}. $$
It is natural to replace the squared function with some non-negative function such as mirror gradient methods $$ x^{k+1} = \arg\min_{x} { f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{\alpha_k} B(x,x^k)}. $$
More generally, auxiliary function methods minimize a sequence of surrogates $G_k(x)=f(x)+g_k(x)$; see the section on auxiliary-function methods below.
- http://fa.bianp.net/teaching/2018/eecs227at/
- http://fa.bianp.net/teaching/2018/eecs227at/newton.html
- http://faculty.uml.edu/cbyrne/NBHSeminar2015.pdf
There are, however, some difficulties in constructing surrogates:
- If the cost function is not smooth or differentiable, such as the absolute value function, the gradient is not available, so it is a problem to construct a convex surrogate function without the gradient;
- On the other hand, there is no unified principle to construct a convex surrogate function if we know more information on higher order derivatives.
Practically, we rarely meet pure black box models; rather, we know something about the structure of the underlying problems. One possible strategy is:
- approximate nonsmooth objective function by a smooth function
- optimize smooth approximation instead (using, e.g., Nesterov’s accelerated method).
A convex function $f$ is called $(\alpha,\beta)$-smoothable if, for any $\mu>0$, there exists a convex approximation $f_{\mu}$ such that
- $f_{\mu}(x) \leq f(x) \leq f_{\mu}(x) + \beta\mu,\ \forall x$;
- $f_{\mu}(x)$ is $\frac{\alpha}{\mu}$-smooth.
The Moreau envelope (or Moreau-Yosida regularization) of a convex
function $f$ is defined as
$$M_{\mu f}(x)=\min_{u}\left\{f(u)+\frac{1}{2\mu}{\|x-u\|}_2^2\right\}.$$
Minimizing $f$ and minimizing $M_{\mu f}$ are equivalent, and the
Moreau envelope is differentiable with gradient $\nabla M_{\mu f}(x)=\frac{1}{\mu}\left(x- prox_{\mu f}(x)\right)$.
This means $$ prox_{\mu f} (x)=x-\mu \nabla M_{\mu f}(x), $$
i.e., $prox_{\mu f}(x)$ is a gradient step for minimizing the smoothed function $M_{\mu f}$.
The Fenchel conjugate
of a function $h$ is defined as $h^{\ast}(y)=\sup_{x}\{\left<x, y\right>-h(x)\}$.
- Smoothing for non-smooth optimization, ELE522: Large-Scale Optimization for Data Science
- Smoothing, EE236C (Spring 2013-14)
- Smoothing and First Order Methods: A Unified Framework
Charles Byrne gives a unified treatment of some iterative optimization algorithms such as auxiliary function methods.
(Photos: Charles Byrne, young and recent.) He is a featured author of CRC Press and a professor at the University of Massachusetts Lowell (UML).
Auxiliary-Function Methods
minimize the function
$$
G_k(x) = f(x) + g_k(x)
$$
over $x\in C$, where the auxiliary terms satisfy $g_k(x)\geq 0$ and $g_k(x^{k-1})=0$.
Auxiliary-Function Methods(AF methods) include
- Barrier- and Penalty-Function Algorithms such as sequential unconstrained minimization (SUM) methods, interior-point methods exterior-point methods;
- Proximal Minimization Algorithms such as majorization minimization.
And an AF method is said to be in the SUMMA class if the SUMMA inequality holds: $$G_k(x)-G_k(x^{k})\geq g_{k+1}(x)\geq 0,\quad \forall x\in C.$$
Proximal minimization algorithms using Bregman distances, called here PMAB, minimize the function $$G_k(x)=f(x)+B_h(x, x^{k-1}).$$
The forward-backward splitting (FBS) method minimizes the sum $f(x)=f_1(x)+f_2(x)$ of two convex functions, where $f_2$ is differentiable.
It is equivalent to minimizing $$ G_k(x) = f(x) + \frac{1}{2\gamma} {\|x-x^{k-1}\|}_2^2-B(x, x^{k-1}), $$
where $B(x, x^{k-1})$ is a suitable Bregman distance.
Alternating Minimization (AM) can be regarded as a coordinate optimization method with 2 blocks:
1. $p^{n+1}=\arg\min_{p\in P} f(p,q^n)$,
2. $q^{n+1}=\arg\min_{q\in Q} f(p^{n+1},q)$,
and AM converges to the infimum when the Five-Point Property holds:
$$
f(p, q) + f(p, q^{n-1}) \geq f(p, q^{n}) + f(p^n, q^{n-1}).
$$
For each $p$ in the set $P$, define $q(p)\in Q$ as a minimizer of $f(p,q)$ over $q\in Q$.
See Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem or Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent for more information on AF methods.
Expectation-Maximization can be classified as an AF method; at each iteration:
1. $Q(\theta|\theta^{(t)})=\mathbb{E}(\ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)})$;
2. $\theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)})$.
The E-step constructs the auxiliary surrogate and the M-step optimizes it.
- Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
- Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent
- A unified treatment of some iterative algorithms in signal processing and image reconstruction
- Convergence Rate of Expectation-Maximization
Optimization using surrogate models applied to Gaussian Process models (Kriging).
- Surrogate loss function
- Divergences, surrogate loss functions and experimental design
- Surrogate Regret Bounds for Proper Losses
- Bregman Divergences and Surrogates for Learning
- Meet the Bregman Divergences
- Some Theoretical Properties of an Augmented Lagrangian Merit Function
- https://people.eecs.berkeley.edu/~wainwrig/stat241b/lec11.pdf
- http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-machine-learning/
The methods discussed in the book Block Relaxation Methods in Statistics
are special cases of what we shall call block-relaxation methods, although other names such as decomposition or nonlinear Gauss-Seidel or ping-pong or seesaw methods have also been used.
In a block relaxation method we minimize a real-valued function of several variables by partitioning the variables into blocks. We choose initial values for all blocks, and then minimize over one of the blocks, while keeping all other blocks fixed at their current values. We then replace the values of the active block by the minimizer, and proceed by choosing another block to become active. An iteration of the algorithm steps through all blocks in turn, each time keeping the non-active blocks fixed at current values, and each time replacing the active blocks by solving the minimization subproblems. If there are more than two blocks there are different ways to cycle through the blocks. If we use the same sequence of active blocks in each iteration then the block method is called cyclic.
In the special case in which blocks consist of only one coordinate we speak of the coordinate relaxation method or the coordinate descent (or CD) method. If we are maximizing then it is coordinate ascent (or CA). The cyclic versions are CCD and CCA.
Augmentation Methods
Augmentation and Decomposition Methods Note: augmentation duality. $$ h(y)=\min_{x\in\mathcal{X}} g(x,y) $$ then $$ \min_{x\in\mathcal{X}}f(x)=\min_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}}g(x,y)=\min_{y\in\mathcal{Y}}\min_{x\in\mathcal{X}}g(x,y)=\min_{y\in\mathcal{Y}}h(y). $$
Alternating Conditional Expectations
The alternating descent conditional gradient method
A ubiquitous prior in modern statistical signal processing asserts that an observed signal is the noisy measurement of a few weighted sources. In other words, compared to the entire dictionary of possible sources, the set of sources actually present is sparse. In many cases of practical interest the sources are parameterized and the measurement of multiple weighted sources is linear in their individual measurements.
As a concrete example, consider the idealized task of identifying the aircraft that lead to an observed radar signal. The sources are the aircraft themselves, and each is parameterized by, perhaps, its position and velocity relative to the radar detectors. The sparse inverse problem is to recover the number of aircraft present, along with each of their parameters.
Convex Relaxations
Quadratic Majorization
A quadratic majorizer of $f$ at $y$ is a quadratic function $g$ such that $g(x)\geq f(x)$ for all $x$ and $g(y)=f(y)$.
- https://pratikac.github.io/
- Block Relaxation Methods in Statistics by Jan de Leeuw
- Deep Relaxation: partial differential equations for optimizing deep neural networks
- Deep Relaxation tutorials
- CS 369S: Hierarchies of Integer Programming Relaxations
- Convex Relaxations and Integrality Gaps
- LP/SDP Hierarchies Reading Group
- Proofs, beliefs, and algorithms through the lens of sum-of-squares
- Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
- Relaxation and Decomposition Methods for Mixed Integer Nonlinear Programming
In order for primal-dual methods to be applicable to a constrained minimization problem, it is necessary that restrictive convexity conditions are satisfied. A nonconvex problem can be convexified and transformed into one which can be solved with the aid of primal-dual methods.
- Convexification and Global Optimization in Continuous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software, and Applications
- Convexification and Global Optimization of Nonlinear Programs
- Convexification Procedure and Decomposition Methods for Nonconvex Optimization Problem
- Conic Optimization Theory: Convexification Techniques and Numerical Algorithms
- Convexification of polynomial optimization problems by means of monomial patterns
- On convexification/optimization of functionals including an $\ell^2$-misfit term
- Lossless Convexification of Nonconvex Control Bound and Pointing Constraints of the Soft Landing Optimal Control Problem
- A General Class of Convexification Transformation for the Noninferior Frontier of a Multiobjective Program
- Implementation of a Convexification Technique for Signomial Functions
- Sequential quadratic programming
- A method to convexify functions via curve evolution
The fixed point algorithm is initially designed to find approximate solutions of the equation
$$f(x)=0. \tag{1}$$
In this method, we first rewrite equation (1) in the form
$$x=g(x) \tag{2}$$
in such a way that any solution of equation (2), which is a fixed point of the function $g$, is a solution of equation (1). For example, we may take $g(x)=x-f(x)$.
Then we use the iterative scheme:
- Give the initial point $x^{0}$;
- Compute the recursive procedure $x^{n+1}=g(x^n)$, $n=1,2,\ldots$
So finally we obtain a sequence $\{x^n\}$ which, we hope, converges to a fixed point $x^{\ast}$ of $g$.
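A tiny fixed-point iteration sketch (here for $g(x)=\cos x$, whose fixed point is the Dottie number; the helper name is my own):

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- g(x) until successive iterates are close."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = fixed_point(math.cos, 1.0)   # approx 0.739085 (the Dottie number)
```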
The order of convergence is defined as the constant $p$ such that $\lim_{n\to \infty}\frac{\| x^{n+1}-x^{\ast}\|}{\| x^{n}-x^{\ast}\|^p}=C<\infty$, provided $\lim_{n\to\infty}x^{n}=x^{\ast}$.
name | order of convergence
---|---
sublinear | $p=1$, $C=1$
linear | $p=1$, $C\in(0,1)$
superlinear | $p=1$, $C=0$ (or $p>1$)
quadratic | $p=2$, $C<\infty$
Solving the nonlinear equation
$$f(x)=0,\quad x\in\mathbb{R},$$
there are many methods for determining the root $x^{\ast}$, several of which are sketched below.
- Numerical methods: Solving nonlinear equations
- Newton, Chebyshev, and Halley Basins of Attraction; A Complete Geometric Approach by Bart D. Stewart, Department of Mathematics, United States Military Academy
- Solutions of Equations in One Variable Fixed-Point Iteration II
Bisection method
Compute the midpoint of an interval $[a, b]$ with $f(a)f(b)<0$, keep the half-interval on which $f$ changes sign, and repeat until the interval is short enough.
Regula falsi (false position) method
Like bisection, but the midpoint is replaced by the secant intersection $c=\frac{af(b)-bf(a)}{f(b)-f(a)}$, and the sub-interval on which $f$ changes sign is kept.
Secant method
The $n$th iterate is given by $x^{n+1}=x^{n}-f(x^{n})\frac{x^{n}-x^{n-1}}{f(x^{n})-f(x^{n-1})}$, i.e., the derivative in Newton's method is replaced by a difference quotient.
Newton’s method: $x^{n+1}=x^{n}-\frac{f(x^{n})}{f^{\prime}(x^{n})}$.
Steffensen’s Method
Steffensen’s method is a modified Newton’s method that avoids derivatives: the derivative is replaced by $h(x)=\frac{f(x+f(x))-f(x)}{f(x)}$, giving $x^{n+1}=x^{n}-\frac{f(x^{n})}{h(x^{n})}$ while retaining quadratic convergence.
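A compact sketch of two of these root-finding iterations, bisection and the secant method (function names are my own):

```python
def bisection(f, a, b, tol=1e-12):
    """Halve the bracketing interval [a, b] with f(a) * f(b) < 0."""
    while b - a > tol:
        c = 0.5 * (a + b)
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
    return 0.5 * (a + b)

def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: approximate the derivative by a difference quotient."""
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

f = lambda x: x ** 2 - 2.0
print(bisection(f, 0.0, 2.0), secant(f, 1.0, 2.0))  # both approx sqrt(2)
```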
- http://www.fyzikazeme.sk/mainpage/stud_mat/nm/lecture2.pdf
- On the development of Steffensen’s method and applications to Hammerstein equations
- An improvement of Steffensen's method for solving nonlinear equations by N. Gattal1, and A. S. Chibi
Muller's Method
Muller’s method is a generalization of the secant method, in the sense that it does not require the derivative of the function.
It is an iterative method that requires three starting points $(x_0, f(x_0))$, $(x_1, f(x_1))$ and $(x_2, f(x_2))$; a parabola is constructed through the three points, and the root of the parabola closest to $x_2$ is taken as the next iterate.
- http://mathworld.wolfram.com/MullersMethod.html
- http://mathfaculty.fullerton.edu/mathews/n2003/mullersmethod/MullersMethodProof.pdf
- Muller's Method Wiki
Chebyshev Method
- http://mathworld.wolfram.com/ChebyshevIteration.html
- http://www.sam.math.ethz.ch/~mhg/pub/Cheby-02ParComp.pdf
Halley's Methods
- Newton's versus Halley's Method: A Dynamical Systems Approach
- MODIFIED HALLEY’S METHOD FOR SOLVING NONLINEAR FUNCTIONS WITH CONVERGENCE OF ORDER SIX AND EFFICIENCY INDEX 1.8171
- http://benisrael.net/NEWTON-MONTHLY.pdf
Aitken’s $\Delta^2$ acceleration
- Let $\{p_{n}\}$ be generated by a method which has linear convergence;
- Having $p_{0}, p_{1}$ and $p_{2}$, compute $\hat{p}_{0}=p_{0}-\frac{\left(p_{1}-p_{0}\right)^{2}}{p_{2}-2 p_{1}+p_{0}}$; then for $n=1,2,\ldots$:
  - compute $p_{n+2}$;
  - compute $\hat{p}_{n}=p_{n}-\frac{\left(p_{n+1}-p_{n}\right)^{2}}{p_{n+2}-2 p_{n+1}+p_{n}}$;
  - the algorithm terminates with $p \approx \hat{p}_{n}$ if $\left|\hat{p}_{n}-\hat{p}_{n-1}\right|<\epsilon$.
- http://mathfaculty.fullerton.edu/mathews/n2003/AitkenSteffensenMod.html
- 2.6 - Accelerating Convergence Aitken’s Delta squared Method
- Aitken’s $\Delta^2$ method extended
- Higher Order Aitken Extrapolation with Application to Converging and Diverging Gauss-Seidel Iterations
- AN ACCELERATION TECHNIQUE FOR SLOWLY CONVERGENT FIXED POINT ITERATIVE METHODS
- https://www.math.usm.edu/lambers/mat460/fall09/lecture13.pdf
- Fixed-Point Iteration
- Lecture 8 : Fixed Point Iteration Method, Newton’s Method
- 2.2 Fixed-Point Iteration
Homotopy Continuation Methods
Homotopy Methods
transform a hard problem into a simpler one with easily calculated zeros, and then gradually deform this simpler problem into the original one, computing the zeros of the intervening problems and eventually ending with a zero of the original problem.
The homotopy method (continuation method, successive loading method) can be used to generate a good starting value for an iterative solver.
Suppose one wishes to obtain a solution to a system of $n$ nonlinear equations in $n$ variables, say $F(x)=0$.
Since we assume that such a priori knowledge is not available, the iteration will often fail, because poor starting values are likely to be chosen.
We construct a parameter-dependent function (a homotopy), for example
$$H(x, t)=tF(x)+(1-t)G(x),$$
where $G$ is a simple function with a known zero, so that $H(x, 0)=G(x)$ and $H(x, 1)=F(x)$; the zeros of $H(\cdot, t)$ are then tracked as $t$ moves from $0$ to $1$.
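For intuition, here is a one-dimensional sketch of this construction: deform $G(x)=x-1$ (known zero) into $F(x)=x^{3}-3x+1$, tracking the zero with a few Newton corrector steps at each value of $t$. The functions, step counts and starting point are all illustrative choices.

```python
def F(x):  return x**3 - 3*x + 1
def dF(x): return 3*x**2 - 3
def G(x):  return x - 1.0

def track(x=1.0, steps=50, newton_iters=3):
    """Follow the zero of H(x, t) = t F(x) + (1-t) G(x) as t goes 0 -> 1."""
    for k in range(1, steps + 1):
        t = k / steps
        for _ in range(newton_iters):   # corrector: Newton's method on H(., t)
            H  = t * F(x) + (1 - t) * G(x)
            dH = t * dF(x) + (1 - t) * 1.0
            x -= H / dH
    return x

x = track()
print(x, F(x))  # a zero of F, reached by continuation from the zero of G
```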
- http://homepages.math.uic.edu/~jan/
- http://people.bu.edu/fdc/H-topy.pdf
- 3.10: Homotopy Method
- CHAPTER 2: Numerical homotopy continuation
- https://blog.csdn.net/jbb0523/article/details/52460408
- Homotopy Continuation, Max Buot (CMU)&Donald Richards (PSU)
- On the Link Between Gaussian Homotopy Continuation and Convex Envelopes
- HOPE: A Homotopy Optimization Method for Protein Structure Prediction
- Solving inequality Constrained Optimization Problems by Differential Homotopy Continuation Methods
- PHCpack: a general-purpose solver for polynomial systems by homotopy continuation
In high dimensional spaces the situation is a little different. Fixed point iteration, as well as the fixed point itself, arises in many cases such as [https://arxiv.org/pdf/1511.06393.pdf].
The contraction mapping theorem (Banach fixed-point theorem) guarantees that the iteration converges to the unique fixed point whenever $g$ is a contraction.
Now we rewrite the necessary condition $\nabla f(x)=0$ of unconstrained optimization problems in the fixed-point form
$$x = x - \alpha M \nabla f(x),$$
where $M$ is the identity matrix, a metric (mirror) matrix, the inverse Hessian $[\nabla^2 f(x)]^{-1}$, or an approximation of it.
These correspond to gradient descent, mirror gradient methods, Newton's methods and quasi-Newton's methods, respectively.
And the projected (sub)gradient methods are in the fixed-point iterative form: $$ x = Proj_{\mathbb{S}}(x-\alpha \nabla f(x)) $$
The mirror gradient and proximal gradient methods take the same form, with the projection operator replaced by a mirror map or a proximal operator.
Expectation maximization can also be viewed as a fixed point iteration, as can the transition of a Markov chain.
(Figures omitted: fixed point iterations, from Formulations to overcome the divergence of iterative method of fixed-point in nonlinear equations solution.)
- https://www.wikiwand.com/en/Fixed-point_theorem
- FixedPoint: A suite of acceleration algorithms with Application
- Books on Fixed Point Theory
- Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
- Acceleration methods TIES594 PDE-solvers, Lecture 14, 6.5.2015, Olli Mali
- The SUNNonlinearSolver_FixedPoint implementation¶
The mapping $\mathrm T: X\to X$ with block components $T_1,\dots,T_m$ induces a Jacobi-style iteration $x^{k+1}=\mathrm T(x^k)$, updating all block components of $x$ simultaneously. The Gauss-Seidel mapping $S$ instead updates the blocks one at a time, each using the most recently updated values of the preceding blocks. An equivalent definition of $S$ is given by the equation
$$S_i(x)=T_{i}(S_1(x), \dots, S_{i-1}(x), x_i, \dots, x_m),$$
and the Gauss-Seidel algorithm based on the mapping $\mathrm T$ is the fixed point iteration $x^{k+1}=S(x^k)$.
Consider the linear system $Ax=b$.
Given a vector $b$ and a matrix $A$, the $\ell_1$ regularized least squares problem is
$$\min_{x}\frac{1}{2}{\|Ax-b\|}_2^2+\lambda{\|x\|}_1.$$
The $\ell_1$ penalty is less sensitive to outliers and obtains much more sparse solutions (as opposed to the $\ell_2$ penalty).
It is clear that the absolute value function is not smooth or differentiable everywhere even though the objective function is convex.
Iterative Shrinkage-Thresholding Algorithms (ISTA) for this $\ell_1$ regularized problem iterate
$$x^{k+1}=\mathcal{T}_{\lambda t}\left(x^k - t A^{T}(Ax^k-b)\right),$$
where $t$ is an appropriate step size and $\mathcal{T}_{\alpha}$ is the shrinkage (soft-thresholding) operator defined componentwise by ${\mathcal{T}_{\alpha}(x)}_i=({|x_i|-\alpha})_{+}\operatorname{sgn}(x_i)$.
FISTA with constant stepsize
- $x^{k}= p_{L}(y^k)$ computed as in ISTA;
- $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;
- $y^{k+1}=x^k+\frac{t_k -1}{t_{k+1}}(x^k-x^{k-1})$.
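A compact NumPy sketch of the ISTA/FISTA steps above, applied to a random LASSO instance (all problem data is illustrative):

```python
import numpy as np

def soft_threshold(v, tau):
    """Shrinkage operator: componentwise (|v| - tau)_+ sign(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, iters=200):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)  # p_L(y)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)                   # momentum
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A, x_true = rng.standard_normal((40, 100)), np.zeros(100)
x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(fista(A, b, lam=0.5)[:8], 2))  # the sparse support is recovered
```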
- A Fast Iterative Shrinkage Algorithm for Convex Regularized Linear Inverse Problems
- https://pylops.readthedocs.io/en/latest/gallery/plot_ista.html
- ORF523: ISTA and FISTA
This leads to the operator splitting methods analyzed by Wotao Yin and others.
- ORIE 6326: Convex Optimization Operator Splitting
- Monotone Operator Splitting Methods
- A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization
- Operator Splitting Methods for Convex Optimization Analysis and Implementation
- Some Operator Splitting Methods for Convex Optimization
- FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
- https://www.math.ucla.edu/~wotaoyin/math285j.18f/
- Fixed-Point Continuation for $\ell_1$-Minimization: Methodology and Convergence@https://epubs.siam.org/doi/abs/10.1137/070698920
- Given an existing optimization procedure $M(f, x)$, previous iterates $x^{1},x^{2}, \cdots ,x^{k}$, and a new proposed guess $x^P = M(f, x^{k})$;
- Find the step $x^{k+1}$ using the information at $x^{1},x^{2}, \cdots ,x^{k}, x^P$.
- Discover acceleration
- The zen of gradient descent
- Introduction to Optimization Theory MS&E213 / CS269O - Spring 2017 Chapter 4 Acceleration
- Algorithms, Nature, and Society
- https://damienscieur.com/sections/paper.html
- Generalized Framework for Nonlinear Acceleration
- Nonlinear Acceleration of Stochastic Algorithms
Relaxation and inertia
- A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
- RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIONS
- Weak Convergence of a Relaxed and Inertial Hybrid Projection-Proximal Point Algorithm for Maximal Monotone Operators in Hilbert Space
- FIRE: Fast Inertial Relaxation Engine for Optimization on All Scales
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
- Online Relaxation Method for Improving Linear Convergence Rates of the ADMM
Minimal polynomial extrapolation accelerates a vector sequence by exploiting the minimal polynomial of the iteration matrix with respect to the error vector.
- Efficient implementation of minimal polynomial and reduced rank extrapolation methods
- https://core.ac.uk/download/pdf/82614502.pdf
- Minimal polynomial and reduced rank extrapolation methods are related
- http://www.cs.technion.ac.il/~asidi/
- https://www.di.ens.fr/~aspremon/PDF/FOCM17.pdf
- https://simons.berkeley.edu/sites/default/files/docs/8821/alexsimons17.pdf
- Nonlinear Schwarz iterations with Reduced Rank Extrapolation
- A BLOCK RECYCLED GMRES METHOD WITH INVESTIGATIONS INTO ASPECTS OF SOLVER PERFORMANCE
Let $T$ be a nonexpansive mapping and consider, for a fixed anchor point $x^0$, the Halpern iteration
(named after Benjamin Halpern, who introduced it):
$$x^{k+1}={\lambda}_{k} x^0 + (1-{\lambda}_{k}) T(x^k)$$
with ${\lambda}_k\in(0,1)$.
It is proved that the sequence converges to a fixed point of $T$ under suitable conditions on ${\lambda}_k$, such as ${\lambda}_k=\frac{1}{k+1}$.
Krasnosel'skii-Mann (KM, or averaged) iterations
update
$$x^{k+1}=(1-{\alpha}_k)x^k+{\alpha}_k T(x^k),\quad {\alpha}_k\in[0,1],$$
for the fixed point problem (2). Specially, the Krasnosel'skii-Mann theorem asserts convergence to a fixed point of a nonexpansive $T$ provided $\sum_{k}{\alpha}_k(1-{\alpha}_k)=\infty$.
In 1953, Mann defined an iterative method: $$ x^{k+1}=M(x^k, \alpha_k , T)=(1-\alpha_k)x^k+ {\alpha}_k T(x^k),\quad \alpha_k \in [0,1),\ \text{satisfying}\ \sum_{k=1}^{\infty}{\alpha}_k = \infty. $$
The sequence
$$
x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(x^k), \quad 0 < a\leq {\alpha}_k \leq b < 1
$$
(additionally $\sum_{k=1}^{\infty}{\alpha}_k = \infty$) is called a modified Mann iteration.
Take ${\alpha_k, \beta_k}$ two sequences in $[0, 1]$ satisfying
$$
\sum_{k=1}^{\infty}{\alpha}_k {\beta}_k = \infty, \quad \lim_{k\to\infty}\beta_k =0, \quad 0 \leq {\alpha}_k \leq \beta_k \leq 1.
$$
Then the sequence
$$x^{k+1}=(1-{\alpha}_k)x^k+{\alpha}_k T(y^k),\quad y^{k}=(1-{\beta}_k)x^k+{\beta}_k T(x^k)$$
is called the Ishikawa iteration.
Hojjat Afshari and Hassen Aydi propose another Mann type iteration: $$ x^{k+1} = {\alpha}_k x^k + {\beta}_k T(x^k) + {\gamma}_k T^{2}(x^k) $$
where ${\alpha}_k + {\beta}_k + {\gamma}_k = 1$, ${\alpha}_k, {\beta}_k, {\gamma}_k \in [0,1)$ for all $k\geq 1$, $\sum_{k=1}^{\infty}(1-\alpha_k)=\infty$, $\sum_{k=1}^{\infty}\gamma_k < \infty$.
If the Mann type iteration converges and $T$ is continuous, then the limit is a fixed point of $T$.
- Krasnoselskii-Mann method for non-self mappings
- http://www.krasnoselskii.iitp.ru/papereng.pdf
- Strong convergence of the modified Mann iterative method for strict pseudo-contractions
- Some results on a modified Mann iterative scheme in a reflexive Banach space
- Some results about Krasnosel'skiĭ-Mann iteration process
- Convergence theorems for inertial KM-type algorithms
- Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings
There is an acceleration framework of fixed point iterations for the problem (2) called Anderson acceleration or regularized nonlinear acceleration:
- Set $F_k = (h_{k-m_k}, \dots, h_k)$ where $h_i=g(x_i)-x_i$;
- Determine $\alpha^{k}=(\alpha_0^{k},\dots, \alpha_{m_k}^{k})^{T}$ that solves $\min_{\alpha^{k}}{\|F_k\alpha^k\|}_2^2$ s.t. $\sum_{i=0}^{m_k}\alpha_i^{k}=1$;
- Set $x_{k+1}=\sum_{i=0}^{m_k} \alpha_i^{k} g(x_{k - m_k + i})$.
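A hedged sketch of these three steps for $x=g(x)$: the constrained least squares for $\alpha$ is solved through its KKT system, with `lstsq` guarding against the frequent ill-conditioning of $F_k$; the map $g=\cos$ and the window size $m$ are illustrative choices.

```python
import numpy as np

def anderson(g, x0, m=3, iters=30, tol=1e-12):
    xs, gs = [x0], [g(x0)]
    for _ in range(iters):
        mk = min(m, len(xs) - 1)
        hs = [gs[j] - xs[j] for j in range(len(xs) - 1 - mk, len(xs))]
        if np.linalg.norm(hs[-1]) < tol:          # residual h_k = g(x_k) - x_k
            break
        F = np.column_stack(hs)
        n = F.shape[1]
        # KKT system of: min ||F a||^2  s.t.  sum(a) = 1
        K = np.block([[2 * F.T @ F, np.ones((n, 1))],
                      [np.ones((1, n)), np.zeros((1, 1))]])
        rhs = np.concatenate([np.zeros(n), [1.0]])
        a = np.linalg.lstsq(K, rhs, rcond=None)[0][:n]
        x_new = np.column_stack(gs[-n:]) @ a      # x_{k+1} = sum_i a_i g(x_i)
        xs.append(x_new)
        gs.append(g(x_new))
    return xs[-1]

x = anderson(np.cos, np.ones(4))
print(x, np.linalg.norm(np.cos(x) - x))  # close to the fixed point of cos
```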
It may be interesting to introduce some Bregman divergence into this framework.
It is proved that Anderson acceleration converges if the fixed point mapping is contractive.
- Anderson acceleration for fixed point iterations
- Anderson Acceleration
- MS142 Anderson Acceleration and Applications
- Convergence Analysis For Anderson Acceleration
- Comments on "Anderson Acceleration, Mixing and Extrapolation"
- Globally Convergent Type-I Anderson Acceleration for Non-Smooth Fixed-Point Iterations
- A proof that Anderson acceleration improves the convergence rate in linearly converging fixed point methods (but not in those converging quadratically)
The Baillon-Haddad theorem provides an important link between convex optimization and fixed-point iteration: it proves that if the gradient of a convex and continuously differentiable function is nonexpansive, then it is actually firmly nonexpansive.
- The Baillon-Haddad Theorem Revisited
- A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
- RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIONS
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
- Iterative Convex Optimization Algorithms; Part One: Using the Baillon–Haddad Theorem
- Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix
A correlation matrix is symmetric, has unit diagonal, and is positive semidefinite. Frequently, asynchronous or missing observations lead to the obtained matrix being indefinite.
A standard way to correct an invalid correlation matrix, by which we mean a real, symmetric indefinite matrix with unit diagonal, is to replace it by the nearest correlation matrix in the Frobenius norm, that is, by the solution of the problem $$\min\{ {\|A - X\|}_F : X \text{ is a correlation matrix} \},$$ where ${\|A\|}_F^2=\sum_{ij}a_{ij}^2$.
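Below is a small NumPy sketch of the alternating projections method for this problem, with the Dykstra-style correction used in Higham's algorithm; the input matrix is an illustrative indefinite example.

```python
import numpy as np

def proj_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return V @ np.diag(np.maximum(w, 0)) @ V.T

def proj_unit_diag(A):
    """Project onto the symmetric matrices with unit diagonal."""
    B = A.copy()
    np.fill_diagonal(B, 1.0)
    return B

def nearest_correlation(A, iters=100):
    Y, dS = A.copy(), np.zeros_like(A)
    for _ in range(iters):
        R = Y - dS              # Dykstra's correction for the PSD projection
        X = proj_psd(R)
        dS = X - R
        Y = proj_unit_diag(X)
    return Y

A = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.3],
              [0.7, 0.3, 1.0]])  # symmetric, unit diagonal, but indefinite
print(np.round(nearest_correlation(A), 4))
```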
- The Nearest Correlation Matrix
- Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix
- https://github.com/higham/anderson-accel-ncm
- Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms Talk
- Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms
- Accelerating the EM Algorithm for Mixture-density Estimation
- http://nhenderstat.com/research/
- Anderson Accelerated Douglas-Rachford Splitting
- https://ctk.math.ncsu.edu/TALKS/Anderson.pdf
- Using Anderson Acceleration to Accelerate the Convergence of Neutron Transport Calculations with Anisotropic Scattering
As shown before, these acceleration schemes are based on linear combinations of the last iterates.
The question is: why a linear combination?
Why not some other extrapolation of the last updated values?
- Alternating Anderson-Richardson method: An efficient alternative to preconditioned Krylov methods for large, sparse linear systems
- Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems
- REGULARIZED NONLINEAR ACCELERATION
- Regularized Nonlinear Acceleration@lids.mit.edu
- Regularized Nonlinear Acceleration@simons.berkeley
- Regularized Nonlinear Acceleration by Damien Scieur
- Regularized nonlinear acceleration, Mathematical Programming
- http://spars2017.lx.it.pt/index_files/papers/SPARS2017_Paper_16.pdf
- https://github.com/windows7lover/RegularizedNonlinearAcceleration
- https://damienscieur.com/
- Objective acceleration for unconstrained optimization, Optimization methods and software conference 2017, Havana
- Objective acceleration for unconstrained optimization: code
- Objective acceleration for unconstrained optimization by Asbjørn Nilsen Riseth
- http://julianlsolvers.github.io/Optim.jl/stable/#algo/ngmres/
- http://spars2017.lx.it.pt/
In my opinion, however, it is best to start from the necessary condition for optima in non-convex optimization. Another question is how to generalize fixed point iteration to stochastic gradient methods.
The principle of feedback is simple: compare the output with the desired value and use the error to correct the input.
A PID controller continuously calculates an error value $e(t)$ as the difference between a desired setpoint and a measured process variable, and applies a correction
$$u(t)=K_p e(t)+K_i\int_0^{t}e(\tau)\mathrm{d}\tau + K_d \frac{\mathrm{d}e(t)}{\mathrm{d}t},$$
where $K_p$, $K_i$ and $K_d$ are nonnegative coefficients, so the control signal is a sum of three terms:
- the P-term (which is proportional to the error);
- the I-term (which is proportional to the integral of the error);
- and the D-term (which is proportional to the derivative of the error).
The controller can also be parameterized as
$$u(t)=K_p \left(e(t)+\frac{1}{T_i}\int_0^{t}e(\tau)\mathrm{d}\tau + T_d \frac{\mathrm{d}e(t)}{\mathrm{d}t}\right),$$
where $T_i$ is called the integral time and $T_d$ the derivative time.
The proportional part acts on the present value of the error, the integral represents an average of past errors, and the derivative can be interpreted as a prediction of future errors based on linear extrapolation.
- http://www.scholarpedia.org/article/Optimal_control
- EE365: Stochastic Control Spring Quarter 2014
- PID Theory Explained
- Chapter 8: PID Control
- CDS 101/110 -- Fall 2004 Analysis and Design of Feedback Systems
Methods | Recursion | Integration |
---|---|---|
Gradient Descent | $x^{k+1}=x^{k}-{\alpha}_k \nabla f(x^k)$ | |
Momentum Methods | $x^{k+1}=x^{k}-{\alpha}_k \nabla f(x^k)+\rho_k(x^k-x^{k-1})$ | |
Nesterov's Gradient Methods | $x^{k+1}=y^{k}-{\alpha}_k \nabla f(y^k),\ y^{k}=x^k+\frac{k-1}{k+2}(x^k-x^{k-1})$ | |
Newton's Methods | $x^{k+1}=x^{k}-{\alpha}_k [\nabla^2 f(x^k)]^{-1}\nabla f(x^k)$ | |
Mirror Gradient Methods | $x^{k+1}=\arg\min_{x}\{\left<\nabla f(x^k), x\right>+\frac{1}{{\alpha}_k}B(x, x^k)\}$ | |
By viewing the gradient as the error signal, plain gradient descent uses only the present gradient, like a P (proportional) controller.
We rewrite the formula of the momentum method as
$$x^{t+1}=x^{t}-r\sum_{i=0}^{t}{\alpha}^{t-i}g(x^i).$$
One can see that the update of parameters relies on both the present gradient and the integral of past gradients.
The only difference from a PI controller is that there is a decay factor ${\alpha}\in(0,1)$ on the past gradients.
PID optimizer
The proposed PID optimizer updates the parameter $x$ by
- $V^{t+1}=\alpha V^t -r g(x^t)$;
- $D^{t+1}=\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))$;
- $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$.
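A toy sketch of these updates on a badly scaled quadratic; the hyperparameters $r$, $\alpha$ and $K_d$ are illustrative choices.

```python
import numpy as np

def pid_optimize(grad, x0, r=0.1, alpha=0.9, Kd=0.5, iters=200):
    x = x0.copy()
    V, D = np.zeros_like(x0), np.zeros_like(x0)
    g_prev = grad(x0)
    for _ in range(iters):
        g = grad(x)
        V = alpha * V - r * g                        # I-term with decay (momentum)
        D = alpha * D + (1 - alpha) * (g - g_prev)   # D-term: change of gradient
        x = x + V + Kd * D
        g_prev = g
    return x

# f(x) = 0.5 x^T diag(1, 10) x, so grad f(x) = diag(1, 10) x; minimizer is 0.
grad = lambda x: np.array([1.0, 10.0]) * x
print(pid_optimize(grad, np.array([1.0, 1.0])))  # close to the origin
```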
- CVPR 2018 | A new idea for accelerating model convergence (control theory + deep learning) (in Chinese)
- A PID controller approach for stochastic optimization of deep networks (in Chinese)
- A PID Controller Approach for Stochastic Optimization of Deep Networks
- Proportional Integral Distributed Optimization for Dynamic Network Topologies
- Proportional-Integral Distributed Optimization
- Continuous-time Proportional-Integral Distributed Optimization for Networked Systems
- A Control Perspective for Centralized and Distributed Convex Optimization
- https://arxiv.org/pdf/1905.03468
- Feedback-Feedforward Control Approach to Distributed Optimization
(Figures omitted: a sample recurrence relation; the idea of successive approximations.)
- Accelerated Optimization in the PDE Framework: Formulations for the Manifold of Diffeomorphism
- Integration Methods and Accelerated Optimization Algorithms
We will focus on the optimization methods in the form of fixed point iteration and dynamical systems. The task is to minimize the following function: $$ f(x), x\in\mathbb{R}^p, \quad\nabla f(x) = g(x), \quad\nabla^2 f(x)= H(x). $$
Name | Iteration | ODE |
---|---|---|
Gradient descent | $x^{k+1}=x^{k}-\alpha_k g(x^k)$ | $\dot{x}(t)=-g(x(t))$ (gradient flow) |
Newton's method | $x^{k+1}=x^{k}-H(x^k)^{-1} g(x^k)$ | $\dot{x}(t)=-H(x(t))^{-1} g(x(t))$ |
Just as more points determine higher-order terms in Newton interpolation, the dynamics of accelerated gradient methods are expected to correspond to higher-order differential equations.
Some acceleration methods are discretizations of the dynamical system with Asymptotic Vanishing Damping, as Hedy Attouch calls it:
$$\ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla \Phi (x(t)) = 0,\tag{AVD}$$
where $\Phi$ is a convex function and the coefficient $\frac{\alpha}{t}$ of the damping term vanishes as $t\to\infty$.
The fast minimization properties of the trajectories of the second-order evolution equation were also studied by Hedy's group in 2016: $$ \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla^2 \Phi (x(t))\dot{x}(t) + \nabla \Phi (x(t)) =0\tag{HDD} $$
When it comes to the numerical solution of differential equations, the task is to find discrete approximations to the solutions of these equations.
- https://perso.math.univ-toulouse.fr/spot/resumes/
- Fast convex optimization via inertial dynamics with Hessian driven damping
- A proximal-Newton method for monotone inclusions in Hilbert spaces with complexity $O(1/k^2)$
- https://www.researchgate.net/profile/Hedy_Attouch
- https://www.ljll.math.upmc.fr/~plc/sestri/
- Hedy Attouch at https://biography.omicsonline.org
- Rate of convergence of the Nesterov accelerated gradient method in the subcritical case $\alpha\leq 3$
- A Dynamical Approach to an Inertial Forward-Backward Algorithm for Convex Minimization
- An Inertial Proximal Method for Maximal Monotone Operators via Discretization of a Nonlinear Oscillator with Damping
- THE HEAVY BALL WITH FRICTION METHOD, I. THE CONTINUOUS DYNAMICAL SYSTEM: GLOBAL EXPLORATION OF THE LOCAL MINIMA OF A REAL-VALUED FUNCTION BY ASYMPTOTIC ANALYSIS OF A DISSIPATIVE DYNAMICAL SYSTEM
- Viscosity Solutions of Minimization Problems
- A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity $O(\frac{1}{n^2})$
- FAST CONVERGENCE OF INERTIAL DYNAMICS AND ALGORITHMS WITH ASYMPTOTIC VANISHING DAMPING
It is difficult to generalize these methods to stochastic cases.
There is a wonderful summary DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION given by Michael I. Jordan at ICM 2018.
Some new connections between dynamical systems and optimization have been found.
- Variational and Dynamical Perspectives On Learning and Optimization
- Continuous and Discrete Dynamics For Online Learning and Convex Optimization
- DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION
- On Symplectic Optimization
- A variational perspective on accelerated methods in optimization
- A Dynamical Systems Perspective on Nesterov Acceleration
- Generalized Momentum-Based Methods: A Hamiltonian Perspective
- https://people.eecs.berkeley.edu/~jordan/optimization.html
Weijie J. Su (joint with Bin Shi, Simon Du, and Michael Jordan) introduced a set of high-resolution differential equations to model, analyze, interpret, and design accelerated optimization methods.
- A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights
- Acceleration via Symplectic Discretization of High-Resolution Differential Equations
- Understanding the Acceleration Phenomenon via High-Resolution Differential Equations
- Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization
- Analysis of Hamilton-Jacobi Equation: Optimization, Dynamics and Control - Part II of II
- Optimization via the Hamilton-Jacobi-Bellman Method: Theory and Applications by Navin Khaneja, Harvard & KITP
- GRADIENT FLOW DYNAMICS
- Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem
- The dynamics of Lagrange and Hamilton
- Optimization and Dynamical Systems
- Direct Runge-Kutta Discretization Achieves Acceleration
- The Physical systems Behind Optimization Algorithms
- Integration Methods and Optimization Algorithms
ADMM and Dynamics
- A Dynamical Systems Perspective on Nonsmooth Constrained Optimization
- https://kgatsis.github.io/learning_for_control_workshop_CDC2018/assets/slides/Vidal_CDC18.pdf
- ADMM and Accelerated ADMM as Continuous Dynamical Systems
- ADMM, Accelerated-ADMM, and Continuous Dynamical Systems, Talk @DIMACS
- Relax, and Accelerate: A Continuous Perspective on ADMM
- http://people.ee.duke.edu/~lcarin/Xuejun12.11.2015.pdf
- ESAIM: Control, Optimization and Calculus of Variations (ESAIM: COCV)
- MCT'03 Louisiana Conference on Mathematical Control Theory
- International Workshop “Variational Analysis and Applications”, ERICE, August 28 – September 5, 2018
- Games, Dynamics and Optimization, 2019
- System dynamics & optimization
- Introduction to Dynamical Systems by John K. Hunter, Department of Mathematics, University of California at Davis
GRADIENTS AND FLOWS: CONTINUOUS OPTIMIZATION APPROACHES TO THE MAXIMUM FLOW PROBLEM
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root finding problem, where the function is represented as an expected value.
Assume that we have a function $M(\theta)$ and a constant $\alpha$ such that the equation $M(\theta)=\alpha$ has a unique root $\theta^{\ast}$. It is assumed that while we cannot directly observe the function $M(\theta)$, we can instead obtain measurements of a random variable $N(\theta)$ with $\mathbb{E}[N(\theta)]=M(\theta)$. The scheme then iterates
$$\theta_{n+1}=\theta_{n}+a_{n}(\alpha - N(\theta_{n})).\tag{RM}$$
Robbins and Monro proved (Theorem 2) that $\theta_{n}$ converges to $\theta^{\ast}$ in $L^2$ (hence in probability), provided $a_n>0$, $\sum_{n} a_n=\infty$ and $\sum_{n} a_n^2<\infty$.
The basic ideas of the Robbins–Monro scheme can be readily modified to provide successive approximations for the minimum (or maximum) of a unimodal regression function, as was shown by Kiefer and Wolfowitz (1952), who introduced a recursive scheme of the form
$$\theta_{n+1}= {\theta}_{n} - a_{n}\Delta(x_{n})\tag{KW}$$
to find the minimum $\theta^{\ast}$ of $M$.
During the nth stage of the Kiefer–Wolfowitz scheme, observations are taken at the two points ${\theta}_{n}\pm c_n$, and $\Delta(x_n)$ is the corresponding finite-difference estimate of the derivative.
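A minimal simulation of the Robbins–Monro scheme: the regression function, the noise level and the steps $a_n=1/n$ are illustrative choices.

```python
import random

M = lambda theta: 2.0 * theta - 3.0                  # true function, root at 1.5
N = lambda theta: M(theta) + random.gauss(0.0, 0.5)  # noisy observation of M

alpha, theta = 0.0, 0.0
for n in range(1, 10001):
    a_n = 1.0 / n                     # sum a_n = infinity, sum a_n^2 < infinity
    theta = theta + a_n * (alpha - N(theta))         # the (RM) update
print(theta)                          # close to the root theta* = 1.5
```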
- 5 The Stochastic Approximation Algorithm
- Chapter 15: Introduction to Stochastic Approximation Algorithms
- Dynamics of Stochastic Approximation
- Stochastic Approximation by Tze Leung Lai
- A Multivariate Stochastic Approximation Procedure
- The Robbins–Monro Stochastic Approximation Approach to a Discrimination Problem
- Stochastic approximation: invited paper
- Stochastic Approximation and Recursive Algorithms and Applications
- Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
Stochastic gradient descent
belongs to stochastic optimization, which can be considered a generalization of gradient descent.
Stochastic gradient descent takes advantage of a stochastic or estimated gradient to replace the true gradient in gradient descent. The iteration uses a stochastic gradient but may not be descent, so the name stochastic gradient methods may be more appropriate for such methods. They date back to stochastic approximation in statistics.
It aims to solve the finite-sum optimization problem, i.e.
$$
\arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}f(\theta|x_i)
$$
where $x_i$, $i=1,\dots,n$, are the data samples and $\theta$ is the parameter to estimate.
The difficulty lies in the large sample size $n$: evaluating the full gradient requires a pass over all $n$ samples at every iteration.
The stochastic gradient method is defined as
$$
\theta^{k+1}=\theta^{k}-\alpha_{k}\frac{1}{m}\sum_{j=1}^{m}\nabla f(\theta^{k}| x_{j}^{\prime})
$$
where $x_{j}^{\prime}$, $j=1,\dots,m$, are drawn at random from the samples $\{x_i\}_{i=1}^{n}$ and $m$ is the batch size, $m\ll n$.
When $1< m \ll n$, it is in fact mini-batch gradient descent.
There are fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.
A heuristic proposal for avoiding the manual choice and for modifying the learning rate while the learning task runs is the bold driver (BD) method[^14].
The learning rate increases exponentially if successive steps reduce the objective function, and decreases rapidly if an "accident" is encountered (the objective function increases):
$$
{\alpha}_{k+1}=
\begin{cases}
\rho\, {\alpha}_{k}, & \text{if } f(x^{k+1}) < f(x^{k}), \\
{\sigma}^{m}\, {\alpha}_{k}, & \text{otherwise},
\end{cases}
$$
where $\rho$ is slightly greater than 1 (typically $\rho=1.1$), $\sigma$ is significantly smaller than 1 (typically $\sigma=0.5$), and $m$ is the number of consecutive failed steps.
Besides the learning rate or step length, there is yet another hyperparameter: the batch size $m$ used at each iteration.
The fact that the sample size is far larger than the dimension of the parameter is what makes a small batch a cheap yet useful estimate of the full gradient.
- It is simplest to set the step length to a constant, such as ${\alpha}_k=3\times 10^{-3}, \forall k$.
- There are decay schemes, i.e. the step length ${\alpha}_k$ diminishes, such as ${\alpha}_k=\frac{\alpha}{k}$, where $\alpha$ is a constant.
- And another strategy is to tune the step length adaptively, such as AdaGrad or ADAM.
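Putting the pieces together, here is a short sketch of mini-batch SGD with a decaying step length on synthetic least squares; the batch size, the schedule and the data are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 1000, 5, 32                     # sample size, dimension, batch size
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta = np.zeros(p)
for k in range(1, 2001):
    idx = rng.choice(n, size=m, replace=False)       # random mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / m  # stochastic gradient
    alpha_k = 0.1 / np.sqrt(k)                       # decaying step length
    theta -= alpha_k * grad
print(np.linalg.norm(theta - theta_true))            # small estimation error
```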
(Figure omitted: the differences of gradient descent and stochastic gradient descent.)
We can see some examples of the advantages of incremental methods, such as the estimation of a mean.
Given $n$ samples $x_1,\dots,x_n$, their mean $\bar{x}_{n}=\frac{1}{n}\sum_{i=1}^{n}x_i$ can be updated incrementally as $\bar{x}_{n}=\bar{x}_{n-1}+\frac{1}{n}(x_n-\bar{x}_{n-1})$, so a new sample only requires one cheap update rather than a full recomputation.
Another example is the Newton interpolation formula in numerical analysis. The task is to fit a function via polynomials, given some points of the function.
The Newton form is incremental: if another point is added, only one extra term has to be computed while all previously computed coefficients remain unchanged.
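The running-mean example in code; the update touches each new sample once and never revisits the old ones.

```python
def running_mean(samples):
    """Incremental mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n
    return mean

data = [2.0, 4.0, 6.0, 8.0]
print(running_mean(data))  # 5.0, identical to sum(data) / len(data)
```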
See the following links for more information on stochastic gradient descent.
- https://www.wikiwand.com/en/Stochastic_gradient_descent
- http://ruder.io/optimizing-gradient-descent/
- https://zhuanlan.zhihu.com/p/22252270
- http://fa.bianp.net/teaching/2018/eecs227at/stochastic_gradient.html
- A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
- A look at SGD from a physicist's perspective - Part 1
- A look at SGD from a physicists's perspective - Part 2, Bayesian Deep Learning
- A look at SGD from a physicists's perspective - Part 3, Langevin Dynamics and Applications
- Convergence Analysis of Gradient Descent Stochastic Algorithms
- Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
- Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
- Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
- Convex Relaxations of Convolutional Neural Nets
- The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
- Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent
- The Multiplicative Noise in Stochastic Gradient Descent: Data-Dependent Regularization, Continuous and Discrete Approximation
- http://blavatnikawards.org/honorees/profile/leon-bottou/
- https://leon.bottou.org/projects/sgd
- https://leon.bottou.org/research/stochastic
- https://leon.bottou.org/papers/bottou-bousquet-2008
- Large-Scale Machine Learning with Stochastic Gradient Descent
- Trade off of Machine Learning
- The Tradeoffs of Large-scale Learning
- https://sites.google.com/view/panos-toulis/implicit-sgd
- http://dustintran.com/blog/on-asymptotic-convergence-of-averaged-sgd
- https://github.com/ptoulis/implicit-sgd
ADAM
combines adaptive step size strategies with momentum methods in some sense. It is widely used in training deep learning models.
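A hedged sketch of the ADAM update rule (following Kingma and Ba's formulas, with the default hyperparameters suggested in the paper) on a badly scaled quadratic:

```python
import numpy as np

def adam(grad, x0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, iters=5000):
    x = x0.copy()
    m = np.zeros_like(x)   # first moment (momentum of the gradient)
    v = np.zeros_like(x)   # second moment (per-coordinate adaptive scaling)
    for t in range(1, iters + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad = lambda x: np.array([1.0, 100.0]) * x    # badly scaled quadratic
print(adam(grad, np.array([1.0, 1.0])))        # close to the minimizer 0
```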
- Adam: A Method for Stochastic Optimization
- On the convergence of Adam and Beyond
- http://ruder.io/deep-learning-optimization-2017/
(Figure omitted: the differences of stochastic gradient descent and its variants.)
- VR-SGD
- http://ranger.uta.edu/~heng/CSE6389_15_slides/SGD2.pdf
- Variance reduction techniques for stochastic optimization
- Stochastic Variance-Reduced Optimization for Machine Learning Part I
- Stochastic Variance-Reduced Optimization for Machine Learning Part II
- Laplacian Smoothing Gradient Descent
- Entropy SGD
- https://github.com/tdozat/Optimization
- https://zhuanlan.zhihu.com/p/25473305
Stochastic optimization problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.
We have developed a unified framework that covers all of these books.
We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition function and the objective function, where we search over policies (functions) for making decisions.
We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two classes, creating four classes that cover all of the solution approaches.
Beginning around ten years ago, single-threaded CPU performance stopped improving significantly, due to physical limitations; it is the number of cores in each machine that continues to grow. Today we can buy 8-core phones, 64-core workstations, and 2.5k-core GPUs at affordable prices. On the other hand, most of our algorithms are still single-threaded, and because of that, their running time is about the same as it was ten years ago and will stay so in the future, say, ten or twenty years. To develop faster algorithms, especially for those large-scale problems that arise in signal processing, medical imaging, and machine learning, it is inevitable to consider parallel computing.
Machine learning, especially deep learning, requires more powerful distributed and decentralized optimization methods.
Large scale supervised machine learning methods, which are based on gradient to improve their performance, need online near-real-time feedback from massive users.
- Math 285J: A First Course on Large-Scale Optimization Methods
- http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
- Data Science meets Optimization
- Distributed optimization for control and learning
- ME 555: Distributed Optimization
- https://www.euro-online.org/websites/dso/
- https://labs.criteo.com/2014/09/poh-part-3-distributed-optimization/
- Convex and Distributed Optimization
- (749g) Accelerated Parallel Alternating Method of Multipliers (ADMM) for Distributed Optimization
- 8th IEEE Workshop Parallel / Distributed Computing and Optimization (PDCO 2018)
- Distributed Optimization and Control
- ADOPT: Asynchronous Distributed Constraint Optimization with Quality Guarantees
- On Distributed Optimization in Networked Systems
- Distributed Optimization and Control using Operator Splitting Methods
- Foundations of Distributed and Large Scale Computing Optimization
- Distributed Optimization of Large-Scale Complex Networks
- Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
- NOVEL GRADIENT-TYPE OPTIMIZATION ALGORITHMS FOR EXTREMELY LARGE-SCALE NONSMOOTH CONVEX OPTIMIZATION
- Projects: Structure Exploitation in Large-Scale Non-Convex Optimisation
- A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm
- http://ecee.colorado.edu/marden/files/dist-opt-journal.pdf
- Hemingway: Modeling Distributed Optimization Algorithms
- http://principlesofoptimaldesign.org/
It is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master.
The loss function of Elastic-SGD couples the local parameters $x^p$ of the workers to the center variable $\tilde{x}$ through a quadratic (elastic) penalty:
$$\min_{x^1,\dots,x^P,\,\tilde{x}}\ \sum_{p=1}^{P}\left( f(x^p)+\frac{\rho}{2}{\|x^p-\tilde{x}\|}_2^2\right).$$
Parle exploits the phenomenon of wide minima that has been shown to improve generalization performance of deep networks and trains multiple “replicas” of a network that are coupled to each other using attractive potentials. It requires infrequent communication with the parameter server and is well-suited to singlemachine-multi-GPU as well as distributed settings.
The method replaces the loss function by a smoother loss called local entropy.
Parle solves for all the replicas jointly, coupled to each other through the attractive potentials.
- Parle: parallelizing stochastic gradient descent
- Unraveling the mysteries of stochastic gradient descent on deep neural networks
Asynchronous Stochastic Gradient (shared memory):
- Each thread repeatedly performs the following updates:
  - For $t = 1, 2, \cdots$:
    - Randomly pick an index $i$;
    - $x\leftarrow x - \nabla f_i(x)$.

Main trick: in shared memory systems, every thread can access the same parameter $x$.
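A toy Hogwild!-style sketch of this trick: several Python threads update one shared NumPy vector without any lock, so updates may overwrite each other; the least squares problem and all constants are illustrative choices.

```python
import threading
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 10))
w_true = rng.standard_normal(10)
y = X @ w_true
w = np.zeros(10)                         # the shared parameter, no lock

def worker(seed, steps=4000, lr=0.01):
    local = np.random.default_rng(seed)  # thread-local randomness
    for _ in range(steps):
        i = local.integers(len(X))       # randomly pick an index i
        g = (X[i] @ w - y[i]) * X[i]     # gradient of f_i at the shared w
        w -= lr * g                      # unsynchronized in-place update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(w - w_true))        # small despite the races
```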
Asynchronous Accelerated Stochastic Gradient Descent:
- Asynchronous Stochastic Gradient Descent with Delay Compensation
- Asynchronous Decentralized Parallel Stochastic Gradient Descent
- https://github.com/AbduElturki/Asynchronous-Stochastic-Gradient-Descent
- Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates
- Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization
Hogwild! allows processors to access shared memory with the possibility of overwriting each other's work.
- https://www.podc.org/data/podc2018/podc2018-tutorial-alistarh.pdf
- https://www.math.ucla.edu/~wotaoyin/math285j.18
- http://seba1511.net/dist_blog/
- Distributed Nesterov-like Gradient Algorithms
- Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks
- Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions
- https://nali.seas.harvard.edu/
- http://users.isr.ist.utl.pt/~jxavier/
- Synchronized Parallel Coordinate Descent
- DISTRIBUTED ASYNCHRONOUS COMPUTATION OF FIXED POINTS
- An Inertial Parallel and Asynchronous Fixed-Point Iteration for Convex Optimization
There are examples such as Forward-Backward, Douglas-Rachford,... for finding a zero of a sum of maximally monotone operators or for minimizing a sum of convex functions.
- For $n=0,1,\cdots$:
  - for $i=1, \cdots, m$: $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_i(x_{1,n},\cdots, x_{m, n})+a_{i, n}-x_{i, n})$

where
- $x_0, a_n\in \mathbb H$ and $\mathbb H$ is a separable real Hilbert space,
- $\epsilon_{n}$ is a random variable in $\{0,1\}^m\setminus\{\mathbf{0}\}$,
- $\lambda_n\in (0, 1)$ with $\liminf \lambda_n>0$ and $\limsup \lambda_n<1$,
- the mapping $\mathrm T:\mathbb H\to \mathbb H$, i.e. $x\mapsto (T_1x, \cdots, T_i x, \cdots, T_m x)$, is a nonexpansive operator.

- For $n=0, 1, \cdots$:
  - $y_n =\mathrm R_n x_n + b_n$;
  - for $i=1, \cdots, m$: $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_{i, n}(y_n)+a_{i, n}-x_{i, n})$
- https://www.ljll.math.upmc.fr/~plc/sestri/pesquet2014.pdf
- https://arxiv.org/abs/1404.7536
- https://arxiv.org/abs/1406.6404
- https://arxiv.org/abs/1406.5429
- https://arxiv.org/abs/1706.00088
- Stochastic Block-Coordinate Fixed Point Iterations with Applications to Splitting
- Stochastic Quasi-Fejer Block-Coordinate Fixed Point Iterations with Random Sweeping
- LINEAR CONVERGENCE OF STOCHASTIC BLOCK-COORDINATE FIXED POINT ALGORITHMS
- ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Update
- Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Linearly constrained stochastic convex optimization is given by
$$
\min_{x,y}\mathbb{E}_{\vartheta}[F(x,\vartheta)]+h(y),\quad \text{s.t.}\ Ax+By = b,\ x\in\mathbb{X},\ y\in\mathbb{Y},
$$
where typically the expectation $\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ is some loss function and $h$ is a regularizer.
The first problem is that the distribution of $\vartheta$ is unknown or the expectation is intractable, so the objective can only be accessed through stochastic (sub)gradients.
Modified Augmented Lagrangian
Linearize $f(x):=\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ at $x_k$ with a stochastic gradient $g_k$ and add a proximal term:
$$ L_{\beta}^{k}(x,y,\lambda):= f(x_k)+\left<x - x_k, g_k\right>+h(y)-\left< \lambda, Ax+By-b\right>+\frac{\beta}{2}{\|Ax + By-b\|}_2^2 +\frac{1}{2\eta_k}{\|x-x_{k}\|}^2 $$
where $\eta_k$ is a step size. Each iteration then performs:
- $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{k}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
- $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{k}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
- $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
- Stochastic ADMM
- Accelerated Variance Reduced Stochastic ADMM
- Towards optimal stochastic ADMM or the talk in ICML
- V-Cycle or Double Sweep ADMM
- https://arxiv.org/abs/1903.01786
Let us assume that the function $F$ is separable: $F(x)=\sum_{i=1}^{n}f_i(x_i)$.
We can reformulate the problem to get $$ \min_{x} F(x) + g(y), \ s.t.\quad x=y $$
where $y$ is a duplicate (consensus) copy of $x$.
We can define an augmented Lagrangian $$ L_{\beta}(x, y, \lambda) = F(x)+g(y) - \lambda^{T}(x-y) + \frac{\beta}{2}{\|x-y\|}_{2}^{2} = \sum_{i=1}^{n} \left[f_i(x_i) - {\lambda}_i(x_i -y_i)+\frac{\beta}{2}(x_i - y_i)^2\right] + g(y) = \sum_{i=1}^{n}L_{(\beta,i)}(x_i, y, \lambda_i). $$
We can split the optimization over the blocks $x_i$ and update them in parallel (a small numeric sketch follows this list):
- $x_i^{k+1} =\arg\min_{x_i} L_{(\beta,i)}(x_i, y^{k}, \lambda_i^{k}),\quad i=1,2,\cdots, n;$
- $y^{k+1} =\arg\min_{y} L_{\beta}(x^{\color{green}{k+1}}, y, \lambda^{k});$
- $\lambda^{k+1}=\lambda^k - \beta (x^{\color{green}{k+1}} - y^{\color{green}{k+1}}).$

Then the optimization over $x$ decomposes into $n$ independent subproblems that can be carried out on different processors.
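A compact sketch of this splitting for the illustrative choices $f_i(x_i)=\frac{1}{2}(x_i-b_i)^2$ and $g=\lambda{\|\cdot\|}_1$, where every $x_i$-update is independent and closed-form:

```python
import numpy as np

def consensus_admm(b, lam=0.5, beta=1.0, iters=100):
    x = np.zeros_like(b)
    y = np.zeros_like(b)
    lmbd = np.zeros_like(b)
    for _ in range(iters):
        # x_i-update, separable: argmin 0.5(x_i-b_i)^2 - lmbd_i x_i + beta/2 (x_i-y_i)^2
        x = (b + lmbd + beta * y) / (1.0 + beta)
        # y-update: soft thresholding from the l1 term
        v = x - lmbd / beta
        y = np.sign(v) * np.maximum(np.abs(v) - lam / beta, 0.0)
        # dual update
        lmbd = lmbd - beta * (x - y)
    return y

print(consensus_admm(np.array([2.0, 0.3, -1.0])))  # soft-thresholded b
```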
- http://users.isr.ist.utl.pt/~jmota/DADMM/
- http://repository.ust.hk/ir/Record/1783.1-66353
- https://ieeexplore.ieee.org/document/7039496
- DISTRIBUTED ASYNCHRONOUS OPTIMIZATION WITH THE ALTERNATING DIRECTION METHOD OF MULTIPLIERS
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers by S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
- Asynchronous Distributed ADMM for Consensus Optimization
- Notes on Distributed ADMM
Monotone operator splitting methods, which originated in the late 1970’s in the context of partial differential equations, have started to be highly effective for modeling and solving a wide range of data analysis and processing problems, in particular high-dimensional statistical data analysis.
Operator splitting is to decompose one complicated operator (procedure) into some simple operators (procedures). For example, ADMM splits the max-min operator of the augmented Lagrangian $$ \arg\min_{x,y}\max_{\lambda} L_{\beta}(x,y\mid \lambda) $$ into 3 operators: $$ \arg\min_{x}L_{\beta}(x,y\mid \lambda) \circ \arg\min_{y}L_{\beta}(x,y\mid \lambda) \circ \arg\max_{\lambda} L_{\beta}(x,y\mid \lambda). $$
They are really some block relaxation techniques.
- https://web.stanford.edu/class/ee364b/lectures/monotone_split_slides.pdf
- Operator Splitting by Professor Udell @ORIE 6326: Convex Optimization
- A note on the equivalence of operator splitting methods
- Splitting methods for monotone operators with applications to parallel optimization
- Operator Splitting Methods for Fast MPC
- https://staffportal.curtin.edu.au/staff/profile/view/Jie.Sun/
- Operator Splitting Methods in Data Analysis
- https://www.samsi.info/programs-and-activities/research-workshops/operator-splitting-methods-data-analysis/
- http://idda.cuhk.edu.cn/zh-hans/page/10297
- Random monotone operators and application to Stochastic Optimization
- ORQUESTRA - Distributed Optimization and Control of Large Scale Water Delivery Systems
- Ray is a fast and simple framework for building and running distributed applications.
- CoCoA: A General Framework for Communication-Efficient Distributed Optimization
- Federated Optimization: Distributed Machine Learning for On-Device Intelligence
As shown in Principle of Optimal Design, non-gradient methods are classified into 3 classes:
Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. One class of methods constructs a linear or quadratic model of the objective function and defines the next iterate by seeking to minimize this model inside a trust region. We pay particular attention to these model-based approaches because they are related to the unconstrained minimization methods described in earlier chapters. Other widely used DFO methods include the simplex-reflection method of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated annealing.
Heuristic methods will be introduced in computational intelligence, as well as Bayesian Optimization.
Let us start with an example and suppose that we want to solve $$ \arg\min_{x}f(x)=x^{2}+4\sin(2x).\tag{1} $$
The objective function is non-convex, with several local minima.
Another insightful example is to minimize the following cost function:
$$
x^{2}+4\sin(2x) - 1001\, \color{red}{\mathbb{I}_{\sqrt{2}}}(x) \tag{2}
$$
where the last part $\color{red}{\mathbb{I}_{\sqrt{2}}}(x)$ is equal to 1 when $x=\sqrt{2}$ and to 0 otherwise, so the global minimizer sits at an isolated point that no amount of local gradient information can reveal.
- Optimization and Assumptions @ Freemind
- Test functions for optimization
Another related method is graduated optimization
, which is a global optimization technique that attempts to solve a difficult optimization problem by initially solving a greatly simplified problem, and progressively transforming that problem (while optimizing) until it is equivalent to the difficult optimization problem. Further, when certain conditions exist, it can be shown to find an optimal solution to the final problem in the sequence. These conditions are:
- The first optimization problem in the sequence can be solved given the initial starting point.
- The locally convex region around the global optimum of each problem in the sequence includes the point that corresponds to the global optimum of the previous problem in the sequence.
(Figure omitted: graduated optimization.)
- https://www.wikiwand.com/en/Numerical_continuation
- Multi-Resolution Methods and Graduated Non-Convexity
In stochastic gradient descent, the estimated gradient is a partial sum of the population gradient, so it is necessary to compute the gradient of the sample function.
The Kiefer-Wolfowitz algorithm is the gradient-free version of stochastic gradient descent.
It is a recursive scheme to approximate the minimum or maximum of the form
$$
x^{k+1} =x^{k} - a_n \Delta(x^k)
$$
During the n-th stage, observations are taken at the two points $x^{k}\pm c_k$, and $\Delta(x^k)$ is the corresponding finite-difference estimate of the gradient.
- http://pawel.sawicz.eu/tag/kiefer-wolfowitz/
- A companion to the Kiefer-Wolfowitz algorithm
- Archive for Kiefer-Wolfowitz algorithm
- Strong Convergence of a Stochastic Approximation Algorithm
- Almost Sure Approximations to the Robbins-Monro and Kiefer-Wolfowitz Processes with Dependent Noise
- A Kiefer–Wolfowitz Algorithm with Randomized Differences
- Stochastic Approximation by Tze Leung Lai
Multi-Level Optimization is to optimize a related cheap surrogate function $\hat{f}$ in place of the expensive objective $F$.
Multi-level optimization methods repeat three steps:
- Cheap minimization of the modified $\hat{f}$: $$y^{k}=\arg\min_{x\in \mathbb{R}^p} \hat{f}(x) + \left<v_k, x\right>.$$
- Use $y^{k}$ to give a descent direction: $$x^{k+1} = x^k -a_k(x^k - y^k).$$
- Set $v_k$ to satisfy first-order coherence: $$v_{k+1}=\frac{L_f}{L_F} F^{\prime}(x^{k+1}) - f^{\prime}(x^{k+1}).$$
Panos Parpas
- Multilevel Optimization Methods: Convergence and Problem Structure
- A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
- Multiresolution Algorithms for Faster Optimization in Machine Learning
The simplified scheme of work for the multilevel optimization
procedure can be represented as follows.
- Solving the optimization problem using a surrogate model. For this purpose, the method of indirect optimization based on the self-organization (IOSO) is used. This method allows finding the single solution for single-objective optimization or the Pareto-optimal set of solutions for multi-objective problems.
- For the obtained solution the indicators of efficiency are updated using the high-fidelity analysis tools.
- The adjustment of a current search region is performed.
- The adjustment of the surrogate model is performed. Depending upon the particular features of the applied mathematical simulation, the adjustment procedure can be performed using several approaches. One such approach involves constructing non-linear corrective dependencies. This includes evaluation of the results obtained with different fidelity analysis tools. The other possible approach is application of nonlinear estimation of surrogate model internal parameters.
- Replacement of the surrogate model by the adjusted one and the return to step (1).
- Multilevel Optimization: Convergence Theory, Algorithms and Application to Derivative-Free Optimization
- multilevel optimization iosotech
- OptCom: A Multi-Level Optimization Framework for the Metabolic Modeling and Analysis of Microbial Communities
Roberto Battiti and Mauro Brunato explain how reactive search optimization, the reactive affine shaker (RAS), and memetic algorithms work for optimization problems without gradients in Machine Learning plus Intelligent Optimization.
- Curriculum-based course timetabling solver; uses Tabu Search or Simulated Annealing
- Machine Learning plus Intelligent Optimization by Roberto Battiti and Mauro Brunato
- Reactive Search and Intelligent Optimization
- Zeroth-Order Method for Distributed Optimization With Approximate Projections
- Derivative-Free Optimization (DFO)
- Derivative Free Optimization / Optimisation sans Gradient
- Delaunay-based derivative-free optimization via global surrogates, part I: linear constraints
- Delaunay-based derivative-free optimization via global surrogates, part II: convex constraints
- https://projects.coin-or.org/Dfo
- https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
- http://adl.stanford.edu/aa222/Lecture_Notes_files/chapter6_gradfree.pdf
- https://code.fb.com/ai-research/nevergrad/
- https://github.com/facebookresearch/nevergrad
- https://www.kthohr.com/optimlib.html
- https://www.infoq.com/presentations/black-box-optimization
There are more topics on optimization, such as this site, and more courses on optimization:
- Convex Optimization (Fall 2018) by Zaiwen Wen
- Courses (2010-) at Shanghai Jiao Tong University by Xiaoqun Zhang
- Optimisation: Master's degree in Modelling for Science and Engineering
- A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization by Wotao Yin
- Math 273C: Numerical Optimization by Wotao Yin
- Math 273, Section 1, Fall 2009: Optimization, Calculus of Variations, and Control Theory
- ECE236B - Convex Optimization (Winter Quarter 2018-19) by Prof. L. Vandenberghe, UCLA
- EECS 127 / 227AT: Optimization Models and Applications — Fall 2018 Instructors: A. Bayen, L. El Ghaoui.
- IEOR 262B: Mathematical Programming II by Professor Javad Lavaei, UC Berkeley
- https://web.stanford.edu/~boyd/teaching.html
- EE364b - Convex Optimization II
- Convex Optimization: Fall 2018 Machine Learning 10-725
- Algorithms for Convex Optimization
- Optimization by Vector Space Methods
- EE 227C (Spring 2018), Convex Optimization and Approximation
- https://ocw.mit.edu/courses/sloan-school-of-management/15-093j-optimization-methods-fall-2009/
- http://www.optimization-online.org/
- http://convexoptimization.com/
- More Optimization Online Links
- TUTORIALS AND BOOKS at http://plato.asu.edu/sub/tutorials.html.
- Provable Nonconvex Methods/Algorithms
- https://optimization.mccormick.northwestern.edu/index.php/Main_Page
- Non-convex Optimization for Machine Learning
- GLOBAL OPTIMALITY CONDITIONS FOR DEEP NEURAL NETWORKS
- http://www.vision.jhu.edu/assets/HaeffeleCVPR17.pdf
- https://www.springer.com/us/book/9783319314822
- https://core.ac.uk/display/83849901
- https://core.ac.uk/display/73408878
- https://zhuanlan.zhihu.com/p/51514687
- http://math.cmu.edu/~hschaeff/research.html
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EECS127/index.html
- https://blogs.princeton.edu/imabandit/
- http://www.probabilistic-numerics.org/research/index.html
- https://people.eecs.berkeley.edu/~brecht/eecs227c.html
- https://neos-guide.org/content/optimization-under-uncertainty
- Optimization and Gradient Descent on Riemannian Manifolds
- Approximation Theory & Convex Optimization
- https://people.maths.ox.ac.uk/riseth/
- https://www.aritradutta.com/
- http://sites.utexas.edu/mokhtari/
- https://www.di.ens.fr/~aspremon/