In *A Few Useful Things to Know about Machine Learning*, Pedro Domingos decomposed machine learning into three components:
- Representation, as the core, is the general (mathematical) model that a computer can handle.
- Evaluation is the criterion. An evaluation function (also called objective function, cost function or scoring function) is needed to distinguish good classifiers from bad ones.
- Optimization aims to find the parameters that optimize the evaluation function, i.e. $$ \arg\min_{\theta\in \Theta} f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\min f(\theta)\},\quad\text{or}\quad \arg\max_{\theta\in \Theta}f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\max f(\theta)\}. $$
The objective function to be minimized is also called the cost function.
Evaluation is always attached to optimization; an evaluation function that cannot be optimized is not a good evaluation in machine learning.
- https://www.wikiwand.com/en/Mathematical_optimization
- https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf
- http://www.cs.cmu.edu/~pradeepr/convexopt/
- An interactive tutorial to numerical optimization
- Patrick Louis' RECENT CONFERENCE TALKS on optimization
- Northwestern University Open Text Book on Process Optimization
- Introductory course on non-smooth optimisation
- CS4540: Simple Algorithms
- CS 798: Convexity and Optimization
- Survival Guide for Students of Optimization, Dianne P. O'Leary, September 2017
- https://nlopt.readthedocs.io/en/latest/
The proofs of convergence or complexity are often based on the convex case, where the objective function as well as the constraint set is convex; such a problem is called a convex optimization problem.
By the way, the name numerical optimization
means that the variables or parameters to estimate are in numeric format, which is far from performance optimization in concept.
The first-order optimality condition is a necessary condition for unconstrained optimization problems if the objective function is differentiable: if $x^{\ast}$ is a local minimizer of $f$, then $\nabla f(x^{\ast})=0$.
Wotao Yin wrote a summary on First-order methods and operator splitting for optimization:
First-order methods are described and analyzed with gradients or subgradients, while second-order methods use second-order derivatives or their approximations.
During the 70s–90s the majority of the optimization community focused on second-order methods since they are more efficient for those problems that have the sizes of interest at that time. Beginning around fifteen years ago, however, the demand to solve ever larger problems started growing very quickly. Many large problems are further complicated by non-differentiable functions and constraints. Because simple first-order and classic second-order methods are ineffective or infeasible for these problems, operator splitting methods regained their popularity.
Operators are used to develop algorithms and analyze them for a wide spectrum of problems including optimization problems, variational inequalities, and differential equations. Operator splitting is a set of ideas that generate algorithms through decomposing a problem that is too difficult as a whole into two or more smaller and simpler subproblems. During the decomposition, complicated structures like non-differentiable functions, constraint sets, and distributed problem structures end up in different subproblems and thus can be handled elegantly. We believe ideas from operator splitting provide the most effective means to handle such complicated structures for computational problem sizes of modern interest.
- The world of optimization
- Gradient flow and gradient descent
- Accelerated gradient descent
- Accelerated gradient flow
- Stochastic gradient flow
- Aristotle vs. Newton
- Advanced non-smooth optimization
Each iteration of a line search method computes a search direction $p^k$ and then decides how far to move along it: $x^{k+1}=x^{k}+\alpha_k p^k$, where the positive scalar $\alpha_k$ is called the step length.
Gradient descent and its variants are used to find a local solution of the unconstrained optimization problem
$$\min_{x\in\mathbb{R}^n} f(x)$$
where $f:\mathbb{R}^n\to\mathbb{R}$ is differentiable.
It is not difficult to observe that the negative gradient $-\nabla f(x)$ is a descent direction whenever $\nabla f(x)\neq 0$.
Let $x^{k+1}=\arg\min_{x} f(x^k) + (x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{\|x-x^k\|}_2^2$; we obtain $x^{k+1} = x^{k}-\frac{1}{s_k}{\nabla}_{x} f(x^k)$. However, the constant $s_k$ is not known in advance and must be chosen with care.
And the general gradient descent methods are given by
$$x^{k+1}=x^{k}-\alpha_{k}{\nabla}_{x} f(x^k),$$
where $\alpha_k$ is the step length (called the learning rate in machine learning). Unrolling the recursion,
$$ x^{k+1}=\fbox{$x^{k}$}-\alpha_{k}{\nabla}_{x} f(x^k) = \fbox{$x^{k-1}-\alpha_{k-1}\nabla f(x^{k-1})$}-\alpha_{k}{\nabla}_{x} f(x^k) = \cdots = x^1-\sum_{n=1}^{k}\alpha_n{\nabla}_{x} f(x^n). $$
- http://59.80.44.100/www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
- http://wiki.fast.ai/index.php/Gradient_Descent
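As a minimal sketch of this iteration (assuming only that the gradient is available as a Python callable; `gradient_descent` and its arguments are illustrative names, not from any library):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Fixed-step gradient descent: x^{k+1} = x^k - alpha * grad f(x^k).

    grad: callable returning the gradient at a point.
    x0: starting point (NumPy array).
    alpha: fixed step size (learning rate).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stationarity-based stopping criterion
            break
        x = x - alpha * g
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_star = gradient_descent(lambda x: x, x0=np.array([3.0, -4.0]))
```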
There are many ways to choose a proper step size or learning rate sequence $\{\alpha_k\}$, such as a fixed constant, a diminishing sequence, or (exact or inexact) line search rules.
The first-order Taylor approximation of $f(x+v)$ around $x$ is $f(x+v)\approx f(x)+\nabla f(x)^T v$.
The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of $f$ at $x$ in the direction $v$.
We define a normalized steepest descent direction (with respect to the norm $\|\cdot\|$) as
$$\Delta x_{nsd}=\arg\min\{\nabla f(x)^T v \mid {\|v\|}= 1\}.$$
It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is unnormalized:
$$\Delta x_{sd}={\|\nabla f(x)\|}_{\ast}\,\Delta x_{nsd}$$
where ${\|\cdot\|}_{\ast}$ denotes the dual norm.
(We say ‘a’ steepest descent direction because there can be multiple minimizers.)
Algorithm: Steepest descent method
- given a starting point $x \in \operatorname{dom} f$.
- repeat
  1. Compute steepest descent direction $\Delta x_{sd}$.
  2. Line search. Choose $t$ via backtracking or exact line search.
  3. Update. $x := x + t \Delta x_{sd}$.
- until stopping criterion is satisfied.
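For concreteness, here is a small sketch of backtracking line search under the usual Armijo condition (parameters `alpha`, `beta` and the function names are my own, not from any particular text):

```python
import numpy as np

def backtracking(f, grad, x, direction, alpha=0.3, beta=0.8):
    """Backtracking line search: shrink t until the Armijo condition
    f(x + t d) <= f(x) + alpha * t * grad(x)^T d holds."""
    t = 1.0
    fx, g = f(x), grad(x)
    while f(x + t * direction) > fx + alpha * t * g.dot(direction):
        t *= beta
    return t

# Usage with the negative gradient as (steepest) descent direction:
f = lambda x: 0.5 * x.dot(x)
grad = lambda x: x
x = np.array([2.0, -1.0])
t = backtracking(f, grad, x, -grad(x))
```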
If the variable is constrained in a convex set $D$, one classical method is the conditional gradient descent method, or Frank-Wolfe algorithm.
Algorithm: Frank–Wolfe algorithm
- given a starting point $x \in \operatorname{dom} f$.
- repeat
  1. Find $s^k$: $s^k =\arg\min_{v}\{\nabla f(x^{k})^T v\mid v\in D\}$.
  2. Set $t=\frac{2}{k+2}$ or $t=\arg\min\{f(x^k+t(s^k - x^k))\mid t\in [0, 1]\}$.
  3. Update. $x^{k+1}= x^k + t(s^k-x^k)$, $k\leftarrow k+1$.
- until stopping criterion is satisfied.
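A minimal sketch over the probability simplex, where the linear subproblem has a closed form (the minimizer is the vertex with the most negative gradient coordinate); all names are illustrative:

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, max_iter=500):
    """Frank-Wolfe on the probability simplex {x >= 0, sum x = 1}.
    The linear subproblem argmin_{v in simplex} grad^T v is solved by
    putting all mass on the coordinate with the smallest gradient."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # vertex minimizing the linearization
        t = 2.0 / (k + 2.0)            # standard open-loop step size
        x = x + t * (s - x)            # stays in the simplex by convexity
    return x

# Example: minimize ||x - c||^2 over the simplex.
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda x: 2 * (x - c), np.ones(3) / 3)
```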
- The subtle differences between gradient descent and steepest descent (梯度下降法和最速下降法的细微差别)
- An Introduction to Conditional Gradient
- Frank–Wolfe algorithm
- Greedy Algorithms, Frank-Wolfe and Friends - A modern perspective
- https://sites.google.com/site/nips13greedyfrankwolfe/
- Revisiting Frank-Wolfe
- Nonsmooth Frank-Wolfe algorithm and Coresets
Some variants of gradient descent methods are not line search methods. For example, the heavy ball method or momentum method iterates
$$x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}),$$
where the momentum coefficient $\rho_k\in[0,1]$.
The Nesterov accelerated gradient method at the $k$th step is given by:
$$x^{k}=y^{k-1}-\alpha_{k}\nabla_{x}f(y^{k-1}),\qquad y^{k}=x^{k}+\rho_{k}(x^{k}-x^{k-1}),$$
where the momentum coefficient $\rho_k\in[0,1)$ and the gradient is evaluated at the extrapolated point $y^{k-1}$ rather than at $x^{k-1}$.
They are called inertial gradient methods or accelerated gradient methods, and they appear in several different forms.
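A short sketch of Nesterov's accelerated gradient with a constant step size and the common $\rho_k=(k-1)/(k+2)$ momentum schedule (a sketch of one standard form, not Nesterov's original presentation):

```python
import numpy as np

def nesterov(grad, x0, alpha=0.1, max_iter=500):
    """Nesterov accelerated gradient with momentum rho_k = (k-1)/(k+2)."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, max_iter + 1):
        rho = (k - 1.0) / (k + 2.0)
        y = x + rho * (x - x_prev)          # look-ahead (inertial) point
        x_prev, x = x, y - alpha * grad(y)  # gradient step taken at y
    return x

x = nesterov(lambda x: x, np.array([5.0, -3.0]))
```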
(Photo: Yurii Nesterov, inventor of the Nesterov accelerated gradient method.)
- ORF523: Nesterov’s Accelerated Gradient Descent
- Nesterov’s Accelerated Gradient Descent for Smooth and Strongly Convex Optimization
- Revisiting Nesterov’s Acceleration
- A short proof for Nesterov’s momentum
- Nemirovski’s acceleration
- https://zhuanlan.zhihu.com/p/41263068
- https://zhuanlan.zhihu.com/p/35692553
- https://zhuanlan.zhihu.com/p/35323828
- On Gradient-Based Optimization: Accelerated, Asynchronous, Distributed and Stochastic
- Nesterov Accelerated Gradient and Momentum
- Why Momentum Really Works
- http://www.optimization-online.org/DB_FILE/2018/11/6938.pdf
- https://www.mat.univie.ac.at/~neum/glopt/mss/MasAi02.pdf
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/09-acceleration.pdf
- WHAT IS GRADIENT DESCENT IN MACHINE LEARNING?
- https://www.fromthegenesis.com/gradient-descent-part1/
- https://www.fromthegenesis.com/gradient-descent-part-2/
- Deep Learning From Scratch IV: Gradient Descent and Backpropagation
- https://ee227c.github.io/notes/ee227c-lecture08.pdf
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
The conjugate gradient method comes from methods for solving symmetric positive definite linear systems $Ax=b$.
The idea is to make the search directions $A$-orthogonal (conjugate) instead of orthogonal: two vectors $u$ and $v$ are conjugate with respect to $A$ if $u^{T}Av=0$.
Given the Krylov subspace $\mathcal{K}_k=\operatorname{span}\{b, Ab, \dots, A^{k-1}b\}$,
the CG algorithm (among others) generates the Krylov sequence, with the following properties:
- $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right)$, but ${\|r\|}$ can increase;
- $x^{(n)}=x^{\star}$ (i.e., $x^{\star} \in \mathcal{K}_{n}$) even when $\mathcal{K}_{n} \neq \mathbf{R}^{n}$;
- $x^{(k)}=p_{k}(A) b$, where $p_{k}$ is a polynomial with $\operatorname{deg} p_{k}<k$.
There is a two-term recurrence $$\fbox{$x^{k+1} = x^k + \alpha_k r^{k} + \beta_k (x^{k}-x^{k-1})$}$$ where $r^k=b-Ax^k$ is the residual.
$$
\begin{aligned}
& x :=0,\quad r :=b,\quad \rho_{0} :={\|r\|}^2 \\
& \text{for } k=1, \ldots, N_{\max}: \\
&\qquad \text{quit if } \sqrt{\rho_{k-1}} \leq \epsilon\, {\|b\|}_{2} \\
&\qquad \text{if } k=1 \text{ then } p:=r \text{ else } p := r+\tfrac{\rho_{k-1}}{\rho_{k-2}}\,p \\
&\qquad w := Ap \\
&\qquad \alpha :=\frac{\rho_{k-1}}{w^{T} p} \\
&\qquad x := x+\alpha p \\
&\qquad r := r-\alpha w \\
&\qquad \rho_k := {\|r\|}^2
\end{aligned}
$$
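A compact NumPy sketch of this linear CG iteration (for a symmetric positive definite matrix; a sketch rather than a production solver):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A by linear CG."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b.copy()                 # residual r = b - A x (x = 0 initially)
    p = r.copy()                 # first search direction
    rho = r.dot(r)
    for _ in range(max_iter):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        w = A.dot(p)
        alpha = rho / w.dot(p)
        x += alpha * p
        r -= alpha * w
        rho_new = r.dot(r)
        p = r + (rho_new / rho) * p   # A-conjugate direction update
        rho = rho_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)     # approximate solution of A x = b
```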
The preconditioned conjugate gradient method applies CG to the transformed system $M^{-1}Ax=M^{-1}b$
with a preconditioner $M\approx A$ that is cheap to invert.
In this view, gradient-type methods transform a multivariate optimization into a sequence of univariate (line search) optimizations.
Here is an outline of the nonlinear CG method:
- $d^{0}=r^{0}=-f^{\prime}(x^0)$;
- Find $\alpha^{i}$ that minimizes $f(x^{i}+\alpha^{i} d^{i})$, and set $x^{i+1}=x^{i} + \alpha_{i} d^{i}$ and $r^{i+1}=-f^{\prime}(x^{i+1})$;
- Compute $\beta_{i+1}=\frac{\left<r^{i+1}, r^{i+1}\right>}{\left<r^{i}, r^{i}\right>}$ (Fletcher–Reeves) or $\beta_{i+1}=\max\left\{\frac{\left<r^{i+1}, r^{i+1}-r^{i}\right>}{\left<r^{i}, r^{i}\right>}, 0\right\}$ (Polak–Ribière), and set $d^{i+1}=r^{i+1}+\beta_{i+1}d^{i}$.
- https://wiki.seg.org/wiki/The_conjugate_gradient_method
- https://stanford.edu/class/ee364b/lectures/conj_grad_slides.pdf
- An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- http://www.cs.umd.edu/users/oleary/cggg/historyindex.html
- A BRIEF INTRODUCTION TO THE CONJUGATE GRADIENT METHOD
NEWTON’S METHOD and QUASI-NEWTON METHODS are classified as variable metric methods.
They are also used to find solutions of unconstrained optimization problems, i.e., $\min_x f(x)$. If the objective function is twice continuously differentiable, its curvature can be exploited.
Newton's method is one of the fixed-point methods used to solve nonlinear equation systems.
It is given by
$$
x^{k+1}=\arg\min_{x}f(x^k)+(x-x^k)^T \nabla_x f(x^k)+\frac{1}{2\alpha_{k+1}}(x-x^k)^T H(x^k)(x-x^k)
=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x}f(x^{k}),
$$
where $H(x^k)=\nabla_x^2 f(x^k)$ is the Hessian matrix of $f$ at $x^k$ and $\alpha_{k+1}$ is the step length.
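A minimal damped-Newton sketch (assuming callable gradient and Hessian, and solving the linear system rather than forming the inverse; all names are illustrative):

```python
import numpy as np

def newton(grad, hess, x0, alpha=1.0, max_iter=50, tol=1e-10):
    """Damped Newton's method: x <- x - alpha * H(x)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess(x), g)  # solve H d = g; avoid explicit inverse
        x = x - alpha * step
    return x

# Example: minimize f(x) = x1^4 + x2^2.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
x = newton(grad, hess, np.array([1.0, 1.0]))
```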
- https://brilliant.org/wiki/newton-raphson-method/
- https://www.shodor.org/unchem/math/newton/
- http://fourier.eng.hmc.edu/e176/lectures/NM/node21.html
- https://www.cup.uni-muenchen.de/ch/compchem/geom/nr.html
- http://web.stanford.edu/class/cme304/docs/newton-type-methods.pdf
In maximum likelihood estimation, the objective function is the log-likelihood function, i.e.
$$
\ell(\theta)=\sum_{i=1}^{n}\log{P(x_i|\theta)}
$$
where $\theta$ is the unknown parameter and $x_1,\dots,x_n$ are the observed samples.
And the Fisher scoring algorithm is given by
$$
\theta^{k+1}=\theta^{k}+\alpha_{k}J^{-1}(\theta^{k})\nabla_{\theta} \ell(\theta^{k})
$$
where $J(\theta^{k})$ is the Fisher information matrix evaluated at $\theta^{k}$ and $\alpha_k$ is the step length.
See http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/scoring.pdf or https://wiseodd.github.io/techblog/2018/03/11/fisher-information/.
Fisher scoring algorithm is regarded as an example of Natural Gradient Descent in information geometry as shown in https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/ and https://www.zhihu.com/question/266846405.
Quasi-Newton methods, like steepest descent, require only the gradient of the objective function to be supplied at each iterate. By measuring the changes in gradients, they construct a model of the objective function that is good enough to produce super-linear convergence. The improvement over steepest descent is dramatic, especially on difficult problems. Moreover, since second derivatives are not required, quasi-Newton methods are sometimes more efficient than Newton’s method.[^11]
In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0.
In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems.
In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.
One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) $B$ does not need to be inverted from scratch at each step.
The unknown $x_k$ is updated by applying the Newton-type step computed from the current approximate Hessian $B_k$:
- $\Delta x_{k}=-\alpha_{k} B_{k}^{-1}\nabla f(x_{k})$, with $\alpha$ chosen to satisfy the Wolfe conditions;
- $x_{k+1}=x_{k}+\Delta x_{k}$;
- The gradient is computed at the new point, $\nabla f(x_{k+1})$, and $y_{k}=\nabla f(x_{k+1})-\nabla f(x_{k})$ is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.
A key property of the BFGS and DFP updates is that if $B_k$ is positive definite and the curvature condition $y_k^{T}\Delta x_k>0$ holds, then $B_{k+1}$ is also positive definite.
For example, the most popular update formulas are:
Method | Update of $B_{k+1}$
---|---
DFP | $\left(I-\frac{y_k \Delta x_k^T}{y_k^T \Delta x_k}\right) B_k \left(I-\frac{\Delta x_k y_k^T}{y_k^T \Delta x_k}\right)+\frac{y_k y_k^T}{y_k^T \Delta x_k}$
BFGS | $B_k+\frac{y_k y_k^T}{y_k^T \Delta x_k}-\frac{B_k \Delta x_k \Delta x_k^T B_k}{\Delta x_k^T B_k \Delta x_k}$
SR1 | $B_k+\frac{(y_k-B_k \Delta x_k)(y_k-B_k \Delta x_k)^T}{(y_k-B_k \Delta x_k)^T \Delta x_k}$
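A small BFGS sketch maintaining the inverse-Hessian approximation $H_k$ directly (with a fixed unit step rather than the Wolfe line search used in practice; all names are illustrative):

```python
import numpy as np

def bfgs(grad, x0, max_iter=100, tol=1e-8):
    """BFGS maintaining an approximation H of the *inverse* Hessian."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H, I = np.eye(n), np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        s = -H.dot(g)               # quasi-Newton step (unit step length here)
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = s.dot(y)
        if sy > 1e-12:              # curvature condition keeps H positive definite
            rho = 1.0 / sy
            H = (I - rho * np.outer(s, y)).dot(H).dot(I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize 0.5 x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[1.0, 0.2], [0.2, 0.5]])
b = np.array([1.0, 1.0])
x = bfgs(lambda x: A.dot(x) - b, np.zeros(2))
```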
- Quasi-Newton methods in Wikipedia page
- http://59.80.44.98/www.seas.ucla.edu/~vandenbe/236C/lectures/qnewton.pdf
- http://fa.bianp.net/teaching/2018/eecs227at/quasi_newton.html
Consider the gradient iteration form $$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$
which can be written as
$$
x^{k+1}=x^{k}- D_k \nabla f(x^k)
$$
where $D_k=\alpha_k I$ plays the role of a (very crude) approximation of the inverse Hessian.
Let $s^{k-1}=x^{k}-x^{k-1}$ and $y^{k-1}=\nabla f(x^{k})-\nabla f(x^{k-1})$. Mimicking the secant condition, the Barzilai-Borwein method chooses $\alpha_k$ to minimize ${\|\frac{1}{\alpha}s^{k-1}-y^{k-1}\|}_2^2$, which gives $\alpha_k=\frac{(s^{k-1})^{T} s^{k-1}}{(s^{k-1})^{T} y^{k-1}}$.
By symmetry, we may minimize ${\|s^{k-1}-\alpha y^{k-1}\|}_2^2$ instead, which gives $\alpha_k=\frac{(s^{k-1})^{T} y^{k-1}}{(y^{k-1})^{T} y^{k-1}}$.
In short, the iteration formula of the Barzilai-Borwein method is given by
$$
x^{k+1}=x^{k}-\alpha_k \nabla f(x^k)
$$
where $\alpha_k$ is one of the two step sizes above.
It is easy to see that this method requires no matrix computations and no line searches (except perhaps for the very first step size).
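A short sketch of gradient descent with the first BB step size (assuming only a gradient callable; illustrative names):

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=0.1, max_iter=200, tol=1e-10):
    """Gradient descent with Barzilai-Borwein step alpha = (s^T s)/(s^T y)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev       # the first step uses a fixed step size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s, y = x - x_prev, g - g_prev
        alpha = s.dot(s) / s.dot(y)    # BB1 step size from the secant condition
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

x = barzilai_borwein(lambda x: x - np.array([1.0, 2.0]), np.zeros(2))
```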
- https://mp.weixin.qq.com/s/G9HH29b2-VBnk_Sqze8pDg
- http://www.math.ucla.edu/~wotaoyin/math273a/slides/
- http://bicmr.pku.edu.cn/~wenzw/courses/WenyuSun_YaxiangYuan_BB.pdf
- https://www.math.lsu.edu/~hozhang/papers/cbb.pdf
The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a dense $n\times n$ approximation, which is prohibitive for large problems; the limited-memory BFGS (L-BFGS) method instead keeps only a short history of the most recent $m$ pairs $(s_k, y_k)$ and reconstructs matrix-vector products from them.
- https://www.wikiwand.com/en/Limited-memory_BFGS
- On the limited memory BFGS method for large scale optimization
- Numerical Optimization: Understanding L-BFGS
- Unconstrained optimization: L-BFGS and CG
- Wikipedia page on Newton Method
- Newton-Raphson Visualization (1D)
- Newton-Raphson Visualization (2D)
- Using Gradient Descent for Optimization and Learning
- ftp://lsec.cc.ac.cn/pub/home/yyx/papers/siamjna1987.pdf
Natural gradient descent solves the optimization problem $\min_{\theta} L(\theta)$ while taking the Riemannian metric of the parameter manifold into account; the update is
$$\theta^{t+1}=\theta^{t}-\alpha_t G^{-1}(\theta^t)\nabla_{\theta} L(\theta^t),$$
where $G(\theta^t)$ is the Riemannian metric tensor (for statistical models, the Fisher information matrix) at the point $\theta^t$.
Natural gradient descent for manifolds corresponding to
exponential families can be implemented as a first-order method through mirror descent.
(Photo: Shun-ichi Amari, originator of information geometry.)
- Natural gradient descent and mirror descent
- Online Natural Gradient as a Kalman Filter
- https://www.zhihu.com/question/266846405
- 2016 PKU Mini-Course: Information Geometry
- Information Geometry and Natural Gradients
- Tutorial on Information Geometry and Algebraic Statistics
- True Asymptotic Natural Gradient Optimization
- Shun-ichi Amari's CV in RIKEN
- Energetic Natural Gradient Descent
- Natural gradients and K-FAC
- http://www.divergence-methods.org/
- http://www.deeplearningpatterns.com/doku.php?id=natural_gradient_descent
- NATURAL GRADIENTS AND STOCHASTIC VARIATIONAL INFERENCE
- On optimization algorithms (谈谈优化算法) - an article by 郑思座 - 知乎
- Accelerating Natural Gradient with Higher-Order Invariance
Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.
$$
\min_{p\in\mathbb{R}^n} m_k(p)=f_k+ g_k^T p+\frac{1}{2}p^T B_k p, \quad s.t.\ {\|p\|}\leq {\Delta}_k,
$$
where $f_k=f(x^k)$, $g_k=\nabla f(x^k)$, $B_k$ is an approximation of the Hessian, and $\Delta_k>0$ is the trust-region radius.
- https://optimization.mccormick.northwestern.edu/index.php/Trust-region_methods
- Concise Complexity Analyses for Trust-Region Methods
- https://www.nmr-relax.com/manual/Trust_region_methods.html
- http://lsec.cc.ac.cn/~yyx/worklist.html
Expectation-Maximization algorithm, popularly known as the EM algorithm has become a standard piece in the statistician’s repertoire. It is used in incomplete-data problems or latent-variable problems such as Gaussian mixture model in maximum likelihood estimation. The basic principle behind the EM is that instead of performing a complicated optimization, one augments the observed data with latent data to perform a series of simple optimizations.
It is also really popular among Bayesian statisticians.
Let $Y_{obs}$ denote the observed data, $Z$ the latent (missing) data, and $\theta$ the unknown parameter; write $\ell(\theta|Y_{obs}, Z)$ for the complete-data log-likelihood.
Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step).
Specifically, let $\theta^{(t)}$ be the current estimate. The E-step computes
$$Q(\theta|\theta^{(t)})=\mathbb{E}\left(\ell(\theta|Y_{obs}, Z)\mid Y_{obs},\theta^{(t)}\right),$$
and the M-step is to maximize $Q$ with respect to $\theta$: $\theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)})$.
- https://www.wikiwand.com/en/Expectation%E2%80%93maximization_algorithm
- http://cs229.stanford.edu/notes/cs229-notes8.pdf
- What is the significance of the EM algorithm? (EM算法存在的意义是什么?) - an answer by 史博 - 知乎
(Diagram of the EM algorithm.)
Each iteration of the generalized EM algorithm also consists of an expectation step (E-step) and a maximization step (M-step).
Specifically, the M-step only requires finding some $\theta^{(t+1)}$ with $Q(\theta^{(t+1)}|\theta^{(t)})\geq Q(\theta^{(t)}|\theta^{(t)})$; it increases, rather than maximizes, the conditional expectation.
See more in the book *The EM Algorithm and Extensions*, 2nd Edition, by Geoffrey McLachlan and Thriyambakam Krishnan.
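A tiny EM sketch for a two-component univariate Gaussian mixture (variances fixed to 1 for brevity; the function and its initialization are illustrative, not from the book):

```python
import numpy as np

def em_gmm_2(x, max_iter=100):
    """EM for a 2-component Gaussian mixture with unit variances."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialization
    pi = np.array([0.5, 0.5])
    for _ in range(max_iter):
        # E-step: responsibilities p(z = j | x_i, theta^(t))
        dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2) * pi[None, :]
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta | theta^(t)) in closed form
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
mu, pi = em_gmm_2(x)
```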
- Majorization Methods
- Tangential Majorization
- Quadratic Majorization
- Sharp Majorization
- Using Higher Derivatives
- AN ASSEMBLY AND DECOMPOSITION APPROACH FOR CONSTRUCTING SEPARABLE MINORIZING FUNCTIONS IN A CLASS OF MM ALGORITHMS
Quadratic Lower Bound
We will focus on projected gradient descent and some of its non-Euclidean generalizations in order to solve simply constrained optimization problems.
If not specified, all these methods are aimed to solve convex optimization problem with explicit constraints, i.e. $$ \arg\min_{x\in\mathbb{S}}f(x) $$
where $\mathbb{S}\subset\mathbb{R}^n$ is convex. At a solution $x^{\ast}$, any feasible direction is not a descent direction; that is, the variational inequality
$$\left<\nabla f(x^{\ast}), x-x^{\ast}\right>\geq 0,\quad \forall x\in\mathbb{S}$$
holds, and it is the optimality condition of the constrained optimization problem.
We say the gradient of a convex function is a monotone operator because
$$\left<\nabla f(x)-\nabla f(y), x-y\right>\geq 0.$$
Projected gradient descent has two steps:
$$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{(Descent)} $$
$$ x^{k+1} = Proj_{\mathbb{S}}(z^{k+1})=\arg\min_{x\in \mathbb{S}}{\|x-z^{k+1}\|}^{2} \qquad\text{(Projection)} $$
or in the compact form
$$ x^{k+1} = \arg\min_{x\in \mathbb{S}}{\|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))\|}^{2} = \arg\min_{x}\left\{{\|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))\|}^{2} +\delta_{\mathbb{S}}(x) \right\}, $$
where $\delta_{\mathbb{S}}$ is the indicator function of the set $\mathbb{S}$.
Each projection is itself an optimization problem, so the iterative points satisfy the optimality conditions; this also restricts the projection method to the case where the projection is available or simple to compute. It is natural to search for a better descent step or projection step.
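A minimal sketch of projected gradient descent onto a box constraint, where the projection is just clipping (names are illustrative):

```python
import numpy as np

def projected_gradient(grad, project, x0, alpha=0.1, max_iter=500):
    """Projected gradient descent: descend, then project back onto S."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        z = x - alpha * grad(x)   # descent step
        x = project(z)            # projection step
    return x

# Example: minimize ||x - c||^2 over the box [0, 1]^n.
c = np.array([1.5, -0.3, 0.7])
project = lambda x: np.clip(x, 0.0, 1.0)
x = projected_gradient(lambda x: 2 * (x - c), project, np.zeros(3))
```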
The following links are recommended if you are interested in more theoretical proof:
For a non-differentiable but convex function such as the absolute value function, the subgradient replaces the gradient.
Mirror descent can be regarded as the non-Euclidean generalization of projected gradient descent, obtained by replacing the squared Euclidean distance in the projection with a Bregman divergence.
A Bregman divergence is induced by a convex smooth function $h$:
$$B_h(x,y)=h(x)-h(y)-\left<\nabla h(y), x-y\right>,$$
where $\left<\cdot,\cdot\right>$ is the inner product.
The function $h$ is called the distance-generating (or mirror) function. $B_h(x,y)$ is convex in $x$ for fixed $y$.
Especially, when $h(x)=\frac{1}{2}{\|x\|}_2^2$, the Bregman divergence reduces to the squared Euclidean distance $B_h(x,y)=\frac{1}{2}{\|x-y\|}_2^2$.
A wonderful introduction to Bregman divergence is Meet the Bregman Divergences by Mark Reid at http://mark.reid.name/blog/meet-the-bregman-divergences.html.
The Bregman projection of $y$ onto a convex set $C$ is $x^{\ast}=\arg\min_{x\in C}B_h(x,y)$.
A generalized Pythagorean theorem holds: for convex $C$, its Bregman projection $x^{\ast}$ of $y$, and any $x\in C$,
$$B_h(x,y)\geq B_h(x,x^{\ast})+B_h(x^{\ast},y).$$
Mirror descent is given in the projection form:
$$\nabla h(z^{k+1})=\nabla h(x^{k})-\alpha_k\nabla f(x^k),\qquad x^{k+1}=\arg\min_{x\in\mathbb{S}}B_h(x,z^{k+1}).$$
In another compact form, mirror gradient can be described in the proximal form:
$$
x^{k+1} = \arg\min_{x\in\mathbb{S}} { f(x^k) + \left<g^k, x-x^k\right> + \frac{1}{\alpha_k} B_h(x,x^k)}\tag {1}
$$
with $g^k=\nabla f(x^k)$ and step size $\alpha_k>0$.
Note that the next iteration point $x^{k+1}$ depends only on the current point $x^k$, the gradient $g^k$, and the Bregman divergence $B_h$.
By the optimal conditions of equation (1), the original "mirror" form of mirror gradient method is described as
$$
\nabla h(x^{k+1}) = \nabla h(x^k) - \alpha_k \nabla f(x^k)
= \nabla h(x^1) - \sum_{n=1}^{k}\alpha_n \nabla f(x^n), \quad x\in \mathbb{S},
$$
where the convex function $h$ "mirrors" the iterates between the primal space and the dual (gradient) space via $\nabla h$.
One special method is called entropic mirror descent (exponentiated gradient descent)
when the Bregman divergence is induced by the negative entropy $h(x)=\sum_i x_i\ln x_i$
and the constraint set $\mathbb{S}$ is the probability simplex.
The entropic descent method at step $k$ reads:
$$ x_{i}^{k+1} = \frac{x_i^{k}\exp(-\alpha\, {\nabla f(x^k)}_{i})}{\sum_{j=1}^{n} x_j^{k}\exp(-\alpha\, {\nabla f(x^k)}_{j})}, \quad i=1,2,\dots, n. $$
It is obvious that entropic descent methods have a coordinate-wise update formula. What is more, the update can be rewritten as $$x^{k+1}=\frac{x^{1}\exp(-\alpha\sum_{n=1}^{k} \nabla f(x^n))}{\prod_{n=1}^{k}\left<x^n, \exp(-\alpha \nabla f(x^n))\right>}\propto x^{1}\exp\left(-\alpha\sum_{n=1}^{k} \nabla f(x^n)\right).$$
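A short sketch of this entropic (exponentiated gradient) update on the simplex — a multiplicative step followed by normalization (names are illustrative):

```python
import numpy as np

def entropic_descent(grad, x0, alpha=0.1, max_iter=500):
    """Mirror descent with the negative-entropy mirror map on the simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = x * np.exp(-alpha * grad(x))  # coordinate-wise multiplicative step
        x = w / w.sum()                   # renormalize onto the simplex
    return x

c = np.array([0.1, 0.6, 0.3])
x = entropic_descent(lambda x: 2 * (x - c), np.ones(3) / 3)
```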
The Multiplicative Weights Update method is closely related to the entropic descent method. See more on the following link list.
- The Divergence Methods Web Site (under construction)
- Bregman Divergence and Mirror Descent, Xinhua Zhang(张歆华)
- CS 294 / Stat 260, Fall 2014: Learning in Sequential Decision Problems
- ELE522: Large-Scale Optimization for Data Science , Yuxin Chen, Princeton University, Fall 2019
- Mirror Descent and the Multiplicative Weight Update Method, CS 435, 201, Nisheeth Vishnoi
- Mirror descent: first order methods under a unified framework (统一框架下的first order methods)
- ORF523: Mirror Descent, part I/II
- ORF523: Mirror Descent, part II/II
- Thibaut Lienart: MIrror Descent
- Sinkhorn Algorithm as a Special Case of Stochastic Mirror Descent
- https://web.stanford.edu/class/cs229t/2017/Lectures/mirror-descent.pdf
- https://www.cs.ubc.ca/labs/lci/mlrg/slides/mirrorMultiLevel.pdf
- https://konstmish.github.io/
Recall that the projected gradient method converts the constrained problem into an unconstrained but nonsmooth one via the indicator function.
The proximal mapping (or prox-operator) of a convex function $h$ is defined as
$$prox_h(x)=\arg\min_{u}\left\{h(u)+\frac{1}{2}{\|x-u\|}_2^2\right\}.$$
Consider the unconstrained problem with cost function split in two components, $\min_x g(x)+h(x)$, where
- ${g}$ is convex and differentiable;
- $\mathbf{h}$ is closed, convex, and possibly non-differentiable, while $prox_h(x)$ is inexpensive.
The proximal gradient algorithm iterates
$$x^{k+1}=prox_{\alpha_k h}\left(x^{k}-\alpha_k\nabla g(x^{k})\right),$$
and the projected gradient method is the special case of the proximal gradient method with $h=\delta_{\mathbb{S}}$.
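A concrete sketch for $h(x)=\lambda{\|x\|}_1$, whose prox is soft-thresholding (this is essentially the ISTA iteration for the lasso; function names are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(A, b, lam, alpha, max_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient."""
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        g = A.T.dot(A.dot(x) - b)                        # gradient of smooth part
        x = soft_threshold(x - alpha * g, alpha * lam)   # prox (backward) step
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
x = proximal_gradient(A, b, lam=0.5, alpha=1.0 / np.linalg.norm(A, 2) ** 2)
```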
And it is natural to consider a more general algorithm by replacing the squared Euclidean distance in definition of proximal mapping
with a Bregman distance:
$$
\arg\min_{u}\left\{h(u)+\frac{1}{2} {\|x-u\|}_2^2\right\}\to \arg\min_{u}\left\{h(u)+ B(x,u)\right\},
$$
so that the primary proximal gradient methods are modified to the Bregman version,
which is called the Bregman proximal gradient method.
- A collection of proximity operators implemented in Matlab and Python.
- L. Vandenberghe ECE236C (Spring 2019): 4. Proximal gradient method
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture18.pdf
- Accelerated Bregman Proximal Gradient Methods for Relatively Smooth Convex Optimization
- A unified framework for Bregman proximal methods: subgradient, gradient, and accelerated gradient schemes
- Proximal Algorithms, N. Parikh and S. Boyd
- For more (non exhaustive) informations regarding the proximity operator and the associated proximal algorithms
Proximal and Projected Newton Methods
A fundamental difference between gradient descent and Newton's method is that the latter also iteratively minimizes quadratic approximations, but these use the local Hessian of the function in question. The proximal Newton method can also be applied to such composite optimization problems: $$ y^{k} = prox_{H_{k-1}}\left(x^{k-1}-H_{k-1}^{-1}\nabla g(x^{k-1})\right),\qquad x^{k} = x^{k-1}+t_{k}(y^{k}-x^{k-1}). $$
- useful when $f$ and $g$ have inexpensive prox-operators;
- $x^k$ converges to a solution of $0 \in \partial f (x) + \partial g(x)$ (if a solution exists);
- not symmetric in $f$ and $g$.
The iteration can be written as a fixed-point iteration $x^{k+1}=F(x^{k})$ for a suitable (firmly nonexpansive) operator $F$.
Now, let us consider the simple convex optimization problem $\min_x f(x)$; the proximal point algorithm iterates $x^{k+1}=prox_{\lambda f}(x^{k})$.
Since each step is a proximal step, it is proved that the sequence $\{x^k\}$ is Fejér monotone, i.e., $\|x^{k+1}-x^{\ast}\|\leq \|x^{k}-x^{\ast}\|$ for every minimizer $x^{\ast}$.
- http://maths.nju.edu.cn/~hebma/Talk/VI-PPA.pdf
- https://lostella.github.io/software/
- The proximal point method revisited
The optimal condition of the linearly constrained convex optimization is characterized as a mixed monotone variational inequality:
Prediction Step: With given $v^k$, find a predictor $\tilde{v}^{k}$ satisfying
$$(v-\tilde v^{k})^{T} F(\tilde v^{k})\geq (v-\tilde v^{k})^{T} Q(v^{k}-\tilde v^{k}),\quad \forall v\in\Omega.$$
Correction Step: The new iterate $v^{k+1}$ is obtained by
$$v^{k+1}=v^{k}-M(v^{k}-\tilde v^{k}).$$
Convergence Conditions: For the matrices $Q$ and $M$ above, it suffices that $H:=QM^{-1}$ is symmetric positive definite and $G:=Q^{T}+Q-M^{T}HM$ is positive semidefinite.
- https://www.seas.upenn.edu/~spater/assets/papers/conference/c_2019_paternain_et_al.pdf
- http://export.arxiv.org/pdf/1709.05850
- https://www.hindawi.com/journals/aaa/2013/845459/
- https://arxiv.org/pdf/1709.05850.pdf
Now, let us consider the simple linearly constrained convex optimization problem $\min\{f(x)\mid Ax=b\}$.
The related variational inequality of the saddle point of the Lagrangian function is: find $w^{\ast}=(x^{\ast},\lambda^{\ast})\in\Omega$ such that $(w-w^{\ast})^{T}F(w^{\ast})\geq 0$ for all $w\in\Omega$.
For given $w^k$, the customized proximal point algorithm produces the new iterate by one proximal step applied to this variational inequality.
- https://link.springer.com/article/10.1007/s10589-013-9616-x
- https://core.ac.uk/display/23878067
- http://www.optimization-online.org/DB_FILE/2017/03/5922.pdf
while it can also be written in the following form: $$ x^{+}=\arg\min_{x}\left\{\frac{1}{1_C(x)}\cdot {\|x-x^0\|}_2^2\right\} $$
where $$ 1_C(x)= \begin{cases} 1, & \text{if} \quad x \in C,\\ 0, & \text{otherwise}. \end{cases} $$ How can we generalize this form into the proximal form? And what is the difference with the original additive proximal operator?
If $x\notin C$, the factor $\frac{1}{1_C(x)}$ is infinite, so the minimizer is forced to lie in $C$.
How can we generalize the function $\frac{1}{1_C(x)}$? One multiplicative choice is
$$ x^{+} = \arg\min_{x}\left\{\exp[\delta_C(x)]\cdot \frac{1}{2}{\|x-x^0\|}_2^2\right\} = \arg\min_{x}\log\left\{\exp[\delta_C(x)]\cdot \frac{1}{2}{\|x-x^0\|}_2^2\right\} = \arg\min_{x} \left\{\delta_C(x)+\log({\|x-x^0\|}_2^2)\right\}. $$
The projected gradient method converts the constrained problem into an unsmooth unconstrained problem; the barrier and penalty methods below convert it into smooth ones:

constrained problem | unconstrained problem | desired properties
---|---|---
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\delta_{\mathbb{S}}(x)$ (indicator) | unsmooth, inexpensive
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\frac{1}{t}\phi(x)$ (barrier) | smooth, inexpensive
$\min_{x\in\mathbb{S}} f(x)$ | $\min_x f(x)+\frac{\mu}{2}Q(x)$ (penalty) | smooth, inexpensive
Recall the indicator function of constrained set in projected methods:
$$
\delta_{\mathbb{S}}(x)=
\begin{cases}
0, & \text{if} \quad x \in \mathbb{S};\\
\infty, & \text{otherwise},
\end{cases}
$$
there is a dramatic leap in value at the boundary of $\mathbb{S}$.
The basic idea of the barrier method is to approximate the indicator function by some function $f$ that is:
- convex and differentiable;
- finite ($f(x)< \infty$) if $x\in\mathbb S$;
- such that for every point on the boundary, $f(x)\to\infty$ as $x\to\partial\mathbb S$.
For example, the barrier method in Convex Optimization: Fall 2018 pushes the inequality constraints of the problem into the objective in a smooth way, reducing the problem to an equality-constrained one.
Consider now the following minimization problem:
$$
\min_{x} f(x) \\
\text{subject to} \quad h_n(x) \leq 0,\ n=1,2,\cdots, N, \\
Ax=b,
$$
where $f$ and the $h_n$ are convex and the equality constraints are affine.
When we ignore equality constraints, the problem above can be written by incorporating the inequalities with indicator functions, which in turn can be approximated by the log-barrier function as follows $$\min_{x} f(x)+\sum_{n=1}^{N}\mathbb{I}_{\{h_n(x)\leq 0\}}\approx \min_{x} f(x)+\frac{1}{t}\phi(x)$$
for large $t>0$, where $\phi(x)=-\sum_{n=1}^{N}\log(-h_n(x))$ is the log-barrier function.
In the context of the equality-constrained problem
$$\min_x f(x)\quad s.t.\quad c_i(x)=0,\ i\in\mathcal{E},$$
the quadratic penalty function is
$$Q(x;\mu)=f(x)+\frac{\mu}{2}\sum_{i\in\mathcal{E}}c_i^2(x)$$
with penalty parameter $\mu>0$. By driving $\mu\to\infty$, we penalize the constraint violations with increasing severity.
Quadratic Penalty Method
- Given $\mu_0 > 0$, a nonnegative sequence $\{\tau_k\}$ with $\tau_k\to 0$, and a starting point $x^s_0$;
- for $k = 0, 1, 2, \ldots$:
  - Find an approximate minimizer $x^k=\arg\min_{x}Q(x; \mu_k)$, starting at $x^s_k$ and terminating when ${\|\nabla_x Q(x;\mu_k)\|}_2\leq \tau_k$;
  - if the final convergence test is satisfied, stop with approximate solution $x^k$;
  - Choose a new penalty parameter $\mu_{k+1} > \mu_{k}$;
  - Choose a new starting point $x^s_{k+1}$;
- end (for)
- Line Search Procedures for the Logarithmic Barrier Function
- Penalty method
- http://s-mat-pcs.oulu.fi/~keba/Optimointi/OP_penalty_engl.pdf
- https://www.me.utexas.edu/~jensen/ORMM/supplements/units/nlp_methods/const_opt.pdf
- http://users.jyu.fi/~jhaka/opt/TIES483_constrained_indirect.pdf
The main ideas of path following by predictor–corrector and piecewise-linear methods, and their application in the direction of homotopy methods and nonlinear eigenvalue problems are reviewed. Further new applications to areas such as polynomial systems of equations, linear eigenvalue problems, interior methods for linear programming, parametric programming and complex bifurcation are surveyed. Complexity issues and available software are also discussed.
- Continuation and path following
- https://nisheethvishnoi.files.wordpress.com/2018/05/lecture71.pdf
- http://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/16-primal-dual.pdf
- L. Vandenberghe EE236C (Spring 2016): 17. Path-following methods
Lagrangian duality is another way to solve the constrained optimization problem $\min_x f(x)$ subject to $Ax=b$.
The barrier or penalty function methods add some terms to the objective function to enforce the constraints; they do not take the optimality conditions into consideration, although they work.
Define the Lagrangian $L(x,\lambda)=f(x)+\lambda^{T}(Ax-b)$. If $x$ is feasible, then $L(x,\lambda)=f(x)$ for every $\lambda$, so $\max_{\lambda}L(x,\lambda)=f(x)$ on the feasible set.
In another direction, we want to obtain a lower bound of the primal optimum. By the definition, the dual function $g(\lambda)=\min_{x}L(x,\lambda)$ satisfies $g(\lambda)\leq f(x^{\ast})$.
Its dual problem is in the following form:
$$\max_{\lambda}\min_{x} L(x,\lambda).$$
Note that
$$
\min_{x} L(x, \lambda)\leq L(x,\lambda)\leq \max_{\lambda}L(x,\lambda)
$$
implies weak duality: $\max_{\lambda}\min_{x} L(x,\lambda)\leq\min_{x}\max_{\lambda}L(x,\lambda)$.
And note that a necessary condition for an extremum is that the gradients vanish: $$ \frac{\partial L(x,\lambda)}{\partial x}= \frac{\partial f(x)}{\partial x}+ A^{T}\lambda = 0, \qquad \frac{\partial L(x,\lambda)}{\partial \lambda} = Ax-b=0. $$
Dual ascent takes advantage of these properties:
1. $x^{k+1}=\arg\min_{x} L(x,\lambda^{k})$;
2. ${\lambda}^{k+1}= {\lambda}^{k}+\alpha_k(Ax^{k+1}-b)$.
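A tiny dual ascent sketch for an equality-constrained quadratic, where the $x$-minimization has a closed form (function and variable names are illustrative):

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha=0.1, max_iter=500):
    """Dual ascent for min 0.5 x^T P x + q^T x  s.t.  A x = b.
    The Lagrangian minimizer in x solves P x = -(q + A^T lam)."""
    lam = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x = np.linalg.solve(P, -(q + A.T.dot(lam)))  # argmin_x L(x, lam)
        lam = lam + alpha * (A.dot(x) - b)           # gradient ascent on the dual
    return x, lam

P = np.diag([2.0, 2.0]); q = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
x, lam = dual_ascent(P, q, A, b)
```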
Entropy minimization algorithms: $$ x^{k+1}\in \arg\min_{x}\left\{f(x)+\frac{1}{c_k} \sum_{i=1}^{n} x_{i}\left(\ln\Big(\frac{x_i}{x^{k}_{i}}\Big)-1\right)\right\}. $$
- Duality
- https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EE227A/lecture7.pdf
- https://www.svm-tutorial.com/2016/09/duality-lagrange-multipliers/
- https://www.cs.jhu.edu/~svitlana/papers/non_refereed/optimization_1.pdf
- Constrained Optimization and Lagrange Multiplier Methods
- https://zhuanlan.zhihu.com/p/50823110
- The proximal augmented Lagrangian method for nonsmooth composite optimization
Exponential Augmented Lagrangian Method
A special case for the convex problem $$ \text{minimize}\quad f(x),\quad s.t.\ g_1(x) \leq 0,\ g_2(x) \leq 0,\ \cdots,\ g_r(x) \leq 0,\ x\in X $$
is the exponential augmented Lagrangian method.
It consists of unconstrained minimizations
$$ x^{k}\in \arg\min_{x\in X}\left\{f(x)+\frac{1}{c_k} \sum_{j=1}^{r} {\mu}^{k}_{j}\exp(c_k g_j(x))\right\} $$ followed by the multiplier iterations $$ {\mu}^{k+1}_{j} = {\mu}^{k}_{j}\exp(c_k g_j(x^k)). $$
If the constraints are more complex, the KKT theorem may be necessary.
- An exponential augmented Lagrangian method with second order convergence
- On the convergence of the exponential multiplier method for convex programming
Generalized Lagrangian function
Minimize
- A generalized Lagrangian function and multiplier method
- Generalized Lagrange Function and Generalized Weak Saddle Points for a Class of Multiobjective Fractional Optimal Control Problems
- https://blog.csdn.net/shayashi/article/details/82529816
- https://suzyahyah.github.io/calculus/optimization/2018/04/07/Lagrange-Multiplier.html
- Lagrange Multipliers without Permanent Scarring
KKT condition
Given the general problem $$ \begin{aligned} \min_{x \in \mathbb{R}^{n}}\ & f(x) \\ \text {subject to }\ & h_{i}(x) \leq 0, \quad i=1, \ldots, m \\ & \ell_{j}(x)=0, \quad j=1, \ldots, r \end{aligned} $$
We define the Lagrangian: $$L(x,u,v)=f(x)+\sum_{i=1}^{m}u_{i}h_{i}(x)+\sum_{j=1}^{r}v_{j}\ell_{j}(x).$$
The Karush-Kuhn-Tucker conditions or KKT conditions are: $$\begin{aligned} &\bullet\quad 0 \in \partial f(x)+\sum_{i=1}^{m} u_{i} \partial h_{i}(x)+\sum_{j=1}^{r} v_{j} \partial \ell_{j}(x) &\text{(stationarity)} \\ &\bullet\quad u_{i} \cdot h_{i}(x)=0 \text { for all } i &\text{(complementary slackness)} \\ &\bullet\quad h_{i}(x) \leq 0,\ \ell_{j}(x)=0 \text { for all } i, j &\text{(primal feasibility)} \\ &\bullet\quad u_{i} \geq 0 \text{ for all } i &\text{(dual feasibility)} \end{aligned}$$
I learnt this theorem in a graduate-level course on functional analysis.
- https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
- Nonlinear Programming by Robert M. Freund
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf
Conjugate Duality
The dual problem transforms the original primal problem into a new optimization problem.
- Primal-dual hybrid gradient,
- The Complexity of Primal-Dual Fixed Point Methods for Ridge Regression ,
- A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration
- https://web.maths.unsw.edu.au/~gyli/papers/ljl-conjugate-revised-final-18-11-10.pdf
- Conjugate Duality and Optimization
- LECTURE 5: CONJUGATE DUALITY
- Duality Theory of Constrained Optimization by Robert M. Freund
- Gauge optimization, duality, and applications
- https://mitmgmtfaculty.mit.edu/rfreund/educationalactivities/
- http://www.mit.edu/~mitter/publications/6_conjugate_convex_IS.pdf
Primal-dual fixed point algorithm and Primary Dual Hybrid Gradient
- A primal-dual fixed point algorithm for multi-block convex minimization
- A primal–dual fixed point algorithm for convex separable minimization
- A Unified Primal-Dual Algorithm Framework Based on Bregman Iteration
- Proximal ADMM
- PDHG
In the following context we will discuss the optimization problem with linear constraints:
$$
\arg\min_{x} f(x) \\
s.t.\ Ax = b,
$$
where $f$ is convex and $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$.
The alternating direction method of multipliers is called ADMM for short.
It is aimed at solving the following convex optimization problem:
$$
\min F(x,y) = f(x)+g(y) \\
\text{subject to }\quad Ax+By =b,
$$
where $f$ and $g$ are convex functions.
Define the augmented Lagrangian: $$ L_{\beta}(x, y, \lambda)=f(x)+g(y) - \lambda^{T}(Ax + By -b)+ \frac{\beta}{2}{\|Ax + By - b\|}_{2}^{2}. $$
Augmented Lagrangian method at step $k$:
1. $(x^{k+1}, y^{k+1})=\arg\min_{x\in\mathbf{X},y\in\mathbf{Y}}L_{\beta}(x, y,\lambda^{\color{aqua}{k}});$
2. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
ADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
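A small ADMM sketch for the lasso, the classic example from Boyd's ADMM monograph, with $f(x)=\frac12{\|Ax-b\|}^2$, $g(y)=\lambda{\|y\|}_1$ and constraint $x-y=0$ (a sketch in scaled dual form, not tuned code):

```python
import numpy as np

def admm_lasso(A, b, lam, beta=1.0, max_iter=300):
    """ADMM for min 0.5||Ax-b||^2 + lam||y||_1  s.t.  x - y = 0."""
    n = A.shape[1]
    x = np.zeros(n); y = np.zeros(n); u = np.zeros(n)  # u: scaled dual variable
    AtA, Atb = A.T.dot(A), A.T.dot(b)
    L = np.linalg.cholesky(AtA + beta * np.eye(n))     # factor once, reuse
    for _ in range(max_iter):
        # x-update: quadratic subproblem, closed form via Cholesky solves
        rhs = Atb + beta * (y - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # y-update: soft thresholding (prox of the l1 norm)
        v = x + u
        y = np.sign(v) * np.maximum(np.abs(v) - lam / beta, 0.0)
        # dual update
        u = u + x - y
    return y

rng = np.random.default_rng(1)
A = rng.normal(size=(40, 15)); b = rng.normal(size=40)
y = admm_lasso(A, b, lam=0.1)
```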
The convergence proof of ADMM in convex optimization can be reduced to verifying the stability of a dynamical system, or can be based on the optimality-condition variational inequality, as in On the O(1/t) convergence rate of alternating direction method.
Linearized ADMM
Note that the $x$ subproblem of ADMM requires minimizing $f(x) + \frac{\beta}{2}{\|Ax + By^k - b -\frac{1}{\beta}\lambda^k\|}_2^2$ over $\mathbf{X}$.
However, this subproblem does not always have a closed-form solution because of the
general structure of the matrix $A$. In that case we may linearize the quadratic term
at $x^k$ and add a proximal term, which yields the linearized ADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}} f(x) + \beta(A x)^T (A x^k + B y^k - b -\frac{1}{\beta}\lambda^k)+ \frac{r}{2}{\|x - x^k\|}_2^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}})$,
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

For given $\beta>0$, the proximal parameter $r$ is chosen so that $rI-\beta A^{T}A$ is positive definite.
We can also linearize the $y$ subproblem:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^k, \lambda^k)$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \beta (By)^{T}(Ax^{\color{red}{k+1}} + B y^k - b -\frac{1}{\beta}\lambda^k) + \frac{r}{2}{\|y - y^k\|}^2$,
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
For given $\mu\in(0,1)$, the dual variable can also be updated twice per iteration.
Taking a dual step after each primal block gives the symmetric ADMM:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})$,
2. $\lambda^{k + \frac{1}{2}} = \lambda^{k} - \mu\beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k}}-b)$,
3. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k+\frac{1}{2}}})$,
4. $\lambda^{k+1} = \lambda^{\color{red}{k+\frac{1}{2}}} - \mu\beta (A x^{\color{red}{k+1}} + B y^{\color{red}{k+1}}-b)$.
A particular case of ADMM is also called the split Bregman method. Bregman ADMM replaces the quadratic penalty function with a Bregman divergence:
$$
L_{\beta}^{\phi}(x, y, \lambda)=f(x)+g(y) - \lambda^{T}(Ax + By - b) + \frac{\beta}{2} B_{\phi}(b- Ax, By),
$$
where $B_{\phi}$ is the Bregman divergence induced by the convex function $\phi$.
BADMM at step $k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{\phi}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{\phi}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
Relaxed, Inertial and Fast ADMM
Family of relaxed A-ADMM algorithms for the above problem, with relaxation parameter $\alpha_k$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{\|Ax- y^k+ \lambda^k\|}^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{\|\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-y+\lambda^k\|}^2$,
3. $\lambda^{k+1} = \lambda^{k} +\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-y^{\color{red}{k+1}}$.
Family of relaxed, inertial ADMM algorithms for the above problem, where the damping constant is $r \geq 3$:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{\|Ax-\hat y^k+\lambda^k\|}^2$,
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{\|\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-y+\lambda^k\|}^2$,
3. $\lambda^{k+1} = \hat\lambda^{k} +\alpha_k Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-y^{\color{red}{k+1}}$,
4. $\gamma_{k+1}=\frac{k}{k+r}$,
5. $\hat\lambda^{k+1}=\lambda^{k+1}+\gamma_{k+1}(\lambda^{k+1}-\lambda^{k})$,
6. $\hat y^{k+1}=y^{k+1}+\gamma_{k+1}(y^{k+1}-y^{k})$.
Fast ADMM:
1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,\hat y^{\color{aqua}{k}},\hat\lambda^{\color{aqua}{k}});$
2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \hat\lambda^{\color{aqua}{k}});$
3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b);$
4. $\alpha_{k+1}=\frac{1+\sqrt{1+4\alpha_k^2}}{2};$
5. $\hat y^{k+1}=y^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(y^{k+1}-y^{k});$
6. $\hat \lambda^{k+1}=\lambda^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(\lambda^{k+1}-\lambda^{k}).$
- FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
- The Classical Augmented Lagrangian Method and Nonexpansiveness
- Relative-error inertial-relaxed inexact versions of Douglas-Rachford and ADMM splitting algorithms
- Relax, and Accelerate: A Continuous Perspective on ADMM
- https://www.semanticscholar.org/author/Guilherme-Fran%C3%A7a/145512630
- An explicit rate bound for over-relaxed ADMM
Firstly we consider the following optimization problem with three blocks:
$$\min f_1(x)+f_2(y)+f_3(z)\quad s.t.\quad A_1x+A_2y+A_3z=b,\ x\in\mathbb{X},\ y\in\mathbb{Y},\ z\in\mathbb{Z}.$$
We define its augmented Lagrangian as
$$L^3_{\beta}(x,y,z,\lambda)=f_1(x)+f_2(y)+f_3(z)-\lambda^{T}(A_1x+A_2y+A_3z-b)+\frac{\beta}{2}{\|A_1x+A_2y+A_3z-b\|}_2^2.$$
Particularly, we first consider the case when the functions $f_1$, $f_2$ and $f_3$ are convex.
It is natural and computationally beneficial to extend the original ADMM directly to solve the general n-block problem. A counter-example shows that this method diverges.
And Professor Bingsheng He, who taught me this section in his class, and his coauthors proposed some schemes for this problem based on his unified framework for convex optimization and monotone variational inequalities.
The parallel splitting augmented Lagrangian method (abbreviated to PSALM) is described as follows:
1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x,y^k,z^k,\lambda^k)\mid x\in\mathbb{X}\}$;
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}},y,z^k,\lambda^k)\mid y\in\mathbb{Y}\}$;
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}},y^{\color{yellow}{k}},z,\lambda^k)\mid z\in\mathbb{Z}\}$;
4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.
We can add one more correction step $$ v^{k+1} := v^{k}-\alpha(v^k - \tilde v^{k}),\quad\alpha\in (0, 2 -\sqrt{2}), $$
where $\tilde v^{k}$ denotes the predictor produced by the steps above and $v$ collects $(y,z,\lambda)$.
Another approach is to add regularization terms:
1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x, y^k, z^k, \lambda^k)\mid x\in\mathbb{X}\}$,
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}}, y, z^k,\lambda^k)+\color{red}{\frac{\tau}{2}\beta{\|A_2(y-y^k)\|}^{2}}\mid y\in\mathbb{Y}\}$,
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}}, y^{\color{yellow}{k}}, z, \lambda^k)+\color{blue}{\frac{\tau}{2}\beta{\|A_3(z - z^k)\|}^{2}}\mid z\in\mathbb{Z}\}$,
4. $\lambda^{k+1} = {\lambda}^{k} - \beta(A_1 x^{k+1} + A_2 y^{k+1} + A_3 z^{k+1}-b)$,

where $\tau$ is a positive regularization parameter.
- http://scis.scichina.com/en/2018/122101.pdf
- http://maths.nju.edu.cn/~hebma/slides/17C.pdf
- http://maths.nju.edu.cn/~hebma/slides/18C.pdf
- https://link.springer.com/article/10.1007/s10107-014-0826-5
Davis-Yin three-operator splitting:
If the three-block scheme is instead applied in a Gauss-Seidel fashion, updating each block with the latest values (marked in green), the iteration reads:
1. $x^{k+1}=\arg\min_{x}\{L^3_{\beta}(\color{green}{x},y^k,z^k,\lambda^k)\mid x\in\mathbb{X}\}$;
2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{k+1},\color{green}{y},z^k,\lambda^k)\mid y\in\mathbb{Y}\}$;
3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{k+1},y^{k+1},\color{green}{z},\lambda^k)\mid z\in\mathbb{Z}\}$;
4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$,

where the green symbol marks the block currently being updated.
- Three-Operator Splitting
- A Three-Operator Splitting Scheme and its Optimization Applications
- Three-Operator Splitting and its Optimization Applications
- A Three-Operator Splitting Scheme and its Applications
Randomly Permuted ADMM: given initial values at round $k$:
- Primal update:
  - Pick a permutation $\sigma$ of $\{1,\ldots, n\}$ uniformly at random;
  - For $i = 1,2,\cdots, n$, compute $x^{k+1}_{\sigma(i)}$ by
$$ x^{k+1}_{\sigma(i)}=\arg\min_{x_{\sigma(i)}} L(x^{k+1}_{\sigma(1)},\cdots, x^{k+1}_{\sigma(i-1)}, x_{\sigma(i)}, x^{k}_{\sigma(i+1)},\cdots\mid \lambda^{k}). $$
- Dual update: update the dual variable by
$${\lambda}^{k+1}={\lambda}^{k}-\mu\left(\sum_{i=1}^{n}A_i x_i^{k+1} -b\right).$$
- Randomly Permuted ADMM
- On the Expected Convergence of Randomly Permuted ADMM
- On the Efficiency of Random Permutation for ADMM and Coordinate Descent
- Multi-Block ADMM and its Convergence Random Permutation Helps-A talk by Ye
dlADMM
$$
\begin{array}{l}
\text{Problem 1.} \\
\min_{W_{l}, b_{l}, z_{l}, a_{l}} R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \\
\text {s.t. } z_{l}=W_{l} a_{l-1}+b_{l}\ (l=1, \cdots, L),\quad a_{l}=f_{l}\left(z_{l}\right)\ (l=1, \cdots, L-1)
\end{array}
$$
where $W_l$, $b_l$, $z_l$ and $a_l$ are the weights, biases, pre-activations and activations of layer $l$, $f_l$ is the activation function, $R$ is the loss on the output layer and $\Omega_l$ is a regularizer.
Rather than solving Problem 1 directly, we can relax Problem 1 by adding an $\ell_2$ penalty to address the nonlinear constraints:
$$
\begin{array}{l}
\text{Problem 2.} \\
\begin{aligned}
\min_{W_{l}, b_{l}, z_{l}, a_{l}} F(\boldsymbol{W}, \boldsymbol{b}, \boldsymbol{z}, \boldsymbol{a})=\ & R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \\
&+\underbrace{(\nu / 2) \sum_{l=1}^{L-1}\left(\left\|z_{l}-W_{l} a_{l-1}-b_{l}\right\|_{2}^{2}+\left\|a_{l}-f_{l}\left(z_{l}\right)\right\|_{2}^{2}\right)}_{\ell_2\ \text{penalty}} \\
\text { s.t. }\ & z_{L}=W_{L} a_{L-1}+b_{L}
\end{aligned}
\end{array}
$$
where $\mathbf{W}=\{W_{l}\}_{l=1}^{L}$, $\mathbf{b}=\{b_{l}\}_{l=1}^{L}$, $\mathbf{z}=\{z_{l}\}_{l=1}^{L}$, $\mathbf{a}=\{a_{l}\}_{l=1}^{L-1}$, and $\nu>0$ is a tuning parameter.
Compared with Problem 1, Problem 2 has only a linear constraint $z_{L}=W_{L} a_{L-1}+b_{L}$, so ADMM-style updates apply.
- dlADMM: Deep Learning Optimization via Alternating Direction Method of Multipliers
- Deep Learning Optimization via Alternating Direction Method of Multiplier
- http://maths.nju.edu.cn/~hebma/
- http://stanford.edu/~boyd/admm.html
- A General Analysis of the Convergence of ADMM
- Distributed computing for statistical learning problems via ADMM (用ADMM实现统计学习问题的分布式计算)
- https://www.wikiwand.com/en/Augmented_Lagrangian_method
- Convex optimization: the ADMM (Alternating Direction Method of Multipliers) algorithm (凸优化:ADMM交替方向乘子算法)
- Splitting methods and ADMM, Thibaut Lienart
- Split Bregman
- Accelerated Bregman operator splitting with backtracking
- Splitting Algorithms, Modern Operator Theory, and Applications (17w5030)
- 17w5030 Workshop on Splitting Algorithms, Modern Operator Theory, and Applications
The methods such as ADMM, proximal gradient methods do not optimize the cost function directly. For example, we want to minimize the following cost function $$ f(x,y) = g(x) + h(y) $$
with or without constraints.
Specially, if the cost function is additively separable, the problem decouples into independent subproblems.
ADMM and proximal gradient methods split the cost function into 2 blocks, of which one is differentiable and smooth while the other may not be differentiable. In other words, we can use them to solve some non-smooth optimization problems. However, what if no constraints apply to the optimization problem?
Coordinate descent is aimed to minimize the following cost function $$ f(x) = g(x) +\sum_{i}^{n} h_i(x_i) $$
where $g$ is convex and differentiable and each $h_i$ is convex. At step $k$, coordinate descent updates one coordinate at a time:
- $x_{1}^{k} \in\arg\min_{x_1}f(x_1, x_2^{k-1}, x_3^{k-1}, \dots, x_n^{k-1});$
- $x_{2}^{k} \in\arg\min_{x_2}f(x_1^{\color{red}{k}}, x_2,x_3^{k-1},\dots, x_n^{k-1});$
- $x_{3}^{k} \in\arg\min_{x_3}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3,\dots, x_n^{k-1});$
- $\vdots$
- $x_{n}^{k} \in\arg\min_{x_n}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3^{\color{red}{k}},\dots, x_n).$
It can be extended to block coordinate descent (BCD) if the variables $\{x_1,\dots,x_n\}$ are partitioned into blocks that are updated in turn.
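A tiny cyclic coordinate descent sketch for a smooth quadratic, where each one-dimensional subproblem is solved exactly (names are illustrative):

```python
import numpy as np

def coordinate_descent_quadratic(A, b, max_iter=100):
    """Cyclic coordinate descent for f(x) = 0.5 x^T A x - b^T x
    (A symmetric positive definite). Each coordinate minimization is
    exact: x_i = (b_i - sum_{j != i} A_ij x_j) / A_ii."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        for i in range(n):
            residual = b[i] - A[i].dot(x) + A[i, i] * x[i]
            x[i] = residual / A[i, i]   # exact 1-D minimizer in coordinate i
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent_quadratic(A, b)
```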
- http://bicmr.pku.edu.cn/conference/opt-2014/index.html
- https://calculus.subwiki.org/wiki/Additively_separable_function
- https://www.cs.cmu.edu/~ggordon/10725-F12/slides/25-coord-desc.pdf
- http://bicmr.pku.edu.cn/~wenzw/opt2015/multiconvex_BCD.pdf
- http://pages.cs.wisc.edu/~swright/LPS/sjw-abcr-v3.pdf
- MS72: Recent Progress in Coordinate-wise Descent Methods
A linear program (LP) is an optimization problem in which the objective function is linear in the unknowns and the constraints consist of linear equalities and linear inequalities.
The standard problem is
$$\min_{x} c^{T} x\quad s.t.\quad \mathbf{A}x=b,\ x\geq 0.$$
Here $x$ is an $n$-dimensional column vector, $c$ is an $n$-dimensional cost vector, $\mathbf{A}$ is an $m\times n$ matrix, and $b$ is an $m$-dimensional column vector.
Fundamental Theorem of Linear Programming. Given a linear program in standard form where
$\mathbf A$ is an$m × n$ matrix of rank$m$ ,
- if there is a feasible solution, there is a basic feasible solution;
- if there is an optimal feasible solution, there is an optimal basic feasible solution.
Linear programming is a constrained convex optimization problem. It is the simplest case of constrained optimization in theory; nonetheless, it is useful in many cases.
If there were no constraints, a nonzero linear objective function would be unbounded.
It is a unified principle that we optimize an objective function via sequentially optimizing surrogate functions such as EM, ADMM.
It is obvious that the choice of optimization method relies on the objective function. A surrogate loss transforms the original problem into a sequence of easier subproblems; such a surrogate is also called a merit function.
A good surrogate function should:
- Approximate the objective function.
- Be easy to optimize.
This always leads to the convexification technique
where the surrogate function is convex. For example, we can rewrite gradient descent in the following form
$$
x^{k+1}=\arg\min_{x} \left\{f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}{\|x-x^k\|}_2^2\right\}.
$$
In Newton’s method, we approximate the objective with a quadratic surrogate of the form: $$ Q_k(x) = f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}(x-x^k)^{T}H_k(x-x^k). $$
Note that the Hessian matrix $H_k$ is required to be symmetric positive definite so that $Q_k$ is a convex surrogate.
Note that momentum methods can be rewritten in the surrogate function form: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) = \arg\min_{x}\left\{f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{2\alpha_k} {\|x-x^k-\rho_k(x^k-x^{k-1})\|}_2^2\right\}. $$
It is natural to replace the squared function with some non-negative function such as mirror gradient methods $$ x^{k+1} = \arg\min_{x} { f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{\alpha_k} B(x,x^k)}. $$
More generally, auxiliary function methods minimize a sequence of surrogates $G_k(x)=f(x)+g_k(x)$; see the section on auxiliary-function methods below.
- http://fa.bianp.net/teaching/2018/eecs227at/
- http://fa.bianp.net/teaching/2018/eecs227at/newton.html
- http://faculty.uml.edu/cbyrne/NBHSeminar2015.pdf
There are, however, some difficulties in constructing surrogates:
- If the cost function is not smooth or differentiable, such as the absolute value function, the gradient is not available, so it is a problem to construct a convex surrogate function without the gradient;
- On the other hand, there is no unified principle to construct a convex surrogate function if we know more information on higher order derivatives.
Practically, we rarely meet pure black box models; rather, we know something about the structure of the underlying problems. One possible strategy is:
- approximate nonsmooth objective function by a smooth function
- optimize smooth approximation instead (using, e.g., Nesterov’s accelerated method).
A convex function $f$ is called $(\alpha,\beta)$-smoothable if, for any $\mu>0$, there exists a convex approximation $f_{\mu}$ such that
- $f_{\mu}(x) \leq f(x) \leq f_{\mu}(x) + \beta\mu,\ \forall x$;
- $f_{\mu}(x)$ is $\frac{\alpha}{\mu}$-smooth.
The Moreau envelope (or Moreau-Yosida regularization) of a convex
function $f$ is defined as
$$M_{\mu f}(x)=\min_{u}\left\{f(u)+\frac{1}{2\mu}{\|x-u\|}_2^2\right\}.$$
Minimizing $f$ and minimizing $M_{\mu f}$ are equivalent, and the
Moreau envelope is differentiable with gradient $\nabla M_{\mu f}(x)=\frac{1}{\mu}\left(x- prox_{\mu f}(x)\right)$.
This means $$ prox_{\mu f} (x)=x-\mu \nabla M_{\mu f}(x), $$
i.e., $prox_{\mu f}(x)$ is a gradient step for minimizing the smoothed function $M_{\mu f}$.
The Fenchel conjugate
of a function $h$ is defined as $h^{\ast}(y)=\sup_{x}\{\left<x, y\right>-h(x)\}$.
- Smoothing for non-smooth optimization, ELE522: Large-Scale Optimization for Data Science
- Smoothing, EE236C (Spring 2013-14)
- Smoothing and First Order Methods: A Unified Framework
Charles Byrne gives a unified treatment of some iterative optimization algorithms such as auxiliary function methods.
(Photos: Charles Byrne, young and recent.) He is a featured author of CRC Press and a professor at the University of Massachusetts Lowell (UML).
Auxiliary-Function Methods
minimize the function
$$
G_k(x) = f(x) + g_k(x)
$$
over $x\in C$, where the auxiliary terms satisfy $g_k(x)\geq 0$ and $g_k(x^{k-1})=0$.
Auxiliary-Function Methods(AF methods) include
- Barrier- and Penalty-Function Algorithms such as sequential unconstrained minimization (SUM) methods, interior-point methods exterior-point methods;
- Proximal Minimization Algorithms such as majorization minimization.
And an AF method is said to be in the SUMMA class if the SUMMA inequality holds: $$G_k(x)-G_k(x^{k})\geq g_{k+1}(x)\geq 0,\quad \forall x\in C.$$
Proximal minimization algorithms using Bregman distances, called here PMAB, minimize the function $$G_k(x)=f(x)+B_h(x, x^{k-1}).$$
The forward-backward splitting (FBS) method minimizes the sum $f(x)=f_1(x)+f_2(x)$ of two convex functions, where $f_2$ is differentiable.
It is equivalent to minimizing $$ G_k(x) = f(x) + \frac{1}{2\gamma} {\|x-x^{k-1}\|}_2^2-B(x, x^{k-1}), $$
where $B(x, x^{k-1})$ is a suitable Bregman distance.
Alternating Minimization (AM) can be regarded as a coordinate optimization method with 2 blocks:
1. $p^{n+1}=\arg\min_{p\in P} f(p,q^n)$,
2. $q^{n+1}=\arg\min_{q\in Q} f(p^{n+1},q)$,
and AM converges to the infimum when the Five-Point Property holds:
$$
f(p, q) + f(p, q^{n-1}) \geq f(p, q^{n}) + f(p^n, q^{n-1}).
$$
For each $p$ in the set $P$, define $q(p)\in Q$ as a minimizer of $f(p,q)$ over $q\in Q$.
See Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem or Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent for more information on AF methods.
Expectation-Maximization can be classified as an AF method; at each iteration:
1. $Q(\theta|\theta^{(t)})=\mathbb{E}(\ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)})$;
2. $\theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)})$.
The E-step constructs the auxiliary surrogate and the M-step optimizes it.
- Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
- Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent
- A unified treatment of some iterative algorithms in signal processing and image reconstruction
- Convergence Rate of Expectation-Maximization
Optimization using surrogate models applied to Gaussian Process models (Kriging).
- Surrogate loss function
- Divergences, surrogate loss functions and experimental design
- Surrogate Regret Bounds for Proper Losses
- Bregman Divergences and Surrogates for Learning
- Meet the Bregman Divergences
- Some Theoretical Properties of an Augmented Lagrangian Merit Function
- https://people.eecs.berkeley.edu/~wainwrig/stat241b/lec11.pdf
- http://fa.bianp.net/blog/2014/surrogate-loss-functions-in-machine-learning/
The methods discussed in the book Block Relaxation Methods in Statistics
are special cases of what we shall call block-relaxation methods, although other names such as decomposition or nonlinear Gauss-Seidel or ping-pong or seesaw methods have also been used.
In a block relaxation method we minimize a real-valued function of several variables by partitioning the variables into blocks. We choose initial values for all blocks, and then minimize over one of the blocks, while keeping all other blocks fixed at their current values. We then replace the values of the active block by the minimizer, and proceed by choosing another block to become active. An iteration of the algorithm steps through all blocks in turn, each time keeping the non-active blocks fixed at current values, and each time replacing the active blocks by solving the minimization subproblems. If there are more than two blocks there are different ways to cycle through the blocks. If we use the same sequence of active blocks in each iteration then the block method is called cyclic.
In the special case in which blocks consist of only one coordinate we speak of the coordinate relaxation method or the coordinate descent (or CD) method. If we are maximizing then it is coordinate ascent (or CA). The cyclic versions are CCD and CCA.
Augmentation Methods
Augmentation and Decomposition Methods Note: augmentation duality. $$ h(y)=\min_{x\in\mathcal{X}} g(x,y) $$ then $$ \min_{x\in\mathcal{X}}f(x)=\min_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}}g(x,y)=\min_{y\in\mathcal{Y}}\min_{x\in\mathcal{X}}g(x,y)=\min_{y\in\mathcal{Y}}h(y). $$
Alternating Conditional Expectations
The alternating descent conditional gradient method
A ubiquitous prior in modern statistical signal processing asserts that an observed signal is the noisy measurement of a few weighted sources. In other words, compared to the entire dictionary of possible sources, the set of sources actually present is sparse. In many cases of practical interest the sources are parameterized and the measurement of multiple weighted sources is linear in their individual measurements.
As a concrete example, consider the idealized task of identifying the aircraft that lead to an observed radar signal. The sources are the aircraft themselves, and each is parameterized by, perhaps, its position and velocity relative to the radar detectors. The sparse inverse problem is to recover the number of aircraft present, along with each of their parameters.
Convex Relaxations
Quadratic Majorization
A quadratic majorizer of $f$ at $y$ is a quadratic function $g$ such that $g(x)\geq f(x)$ for all $x$ and $g(y)=f(y)$.
- https://pratikac.github.io/
- Block Relaxation Methods in Statistics by Jan de Leeuw
- Deep Relaxation: partial differential equations for optimizing deep neural networks
- Deep Relaxation tutorials
- CS 369S: Hierarchies of Integer Programming Relaxations
- Convex Relaxations and Integrality Gaps
- LP/SDP Hierarchies Reading Group
- Proofs, beliefs, and algorithms through the lens of sum-of-squares
- Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem
- Relaxation and Decomposition Methods for Mixed Integer Nonlinear Programming
In order for primal-dual methods to be applicable to a constrained minimization problem, it is necessary that restrictive convexity conditions are satisfied. A nonconvex problem can be convexified and transformed into one which can be solved with the aid of primal-dual methods.
- Convexification and Global Optimization in Continuous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software, and Applications
- Convexification and Global Optimization of Nonlinear Programs
- Convexification Procedure and Decomposition Methods for Nonconvex Optimization Problem
- Conic Optimization Theory: Convexification Techniques and Numerical Algorithms
- Convexification of polynomial optimization problems by means of monomial patterns
- On convexification/optimization of functionals including an $\ell^2$-misfit term
- Lossless Convexification of Nonconvex Control Bound and Pointing Constraints of the Soft Landing Optimal Control Problem
- A General Class of Convexification Transformation for the Noninferior Frontier of a Multiobjective Program
- Implementation of a Convexification Technique for Signomial Functions
- Sequential quadratic programming
- A method to convexify functions via curve evolution
The fixed point algorithm is initially designed to find approximate solutions of the equation
$$f(x)=0. \tag{1}$$
In this method, we first rewrite equation (1) in the form
$$x=g(x) \tag{2}$$
in such a way that any solution of equation (2), which is a fixed point of the function $g$, is a solution of equation (1). For example, we may take $g(x)=x-f(x)$.
Then we use the iterative scheme:
- Give the initial point $x^{0}$;
- Compute the recursive procedure $x^{n+1}=g(x^n)$, $n=1,2,\ldots$
So finally we obtain a sequence $\{x^n\}$ which, we hope, converges to a fixed point $x^{\ast}$ of $g$.
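A tiny fixed-point iteration sketch (here for $g(x)=\cos x$, whose fixed point is the Dottie number; the helper name is my own):

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- g(x) until successive iterates are close."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = fixed_point(math.cos, 1.0)   # approx 0.739085 (the Dottie number)
```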
The order of convergence is defined as the constant $p$ such that $\lim_{n\to \infty}\frac{\| x^{n+1}-x^{\ast}\|}{\| x^{n}-x^{\ast}\|^p}=C<\infty$, provided $\lim_{n\to\infty}x^{n}=x^{\ast}$.
name | order of convergence
---|---
sublinear | $p=1$, $C=1$
linear | $p=1$, $C\in(0,1)$
superlinear | $p=1$, $C=0$ (or $p>1$)
quadratic | $p=2$, $C<\infty$
Solving the nonlinear equation
$$f(x)=0,\quad x\in\mathbb{R},$$
there are many methods for determining the root $x^{\ast}$, several of which are sketched below.
- Numerical methods: Solving nonlinear equations
- Newton, Chebyshev, and Halley Basins of Attraction; A Complete Geometric Approach by Bart D. Stewart, Department of Mathematics, United States Military Academy
- Solutions of Equations in One Variable Fixed-Point Iteration II
Bisection method
Compute the midpoint of an interval $[a, b]$ with $f(a)f(b)<0$, keep the half-interval on which $f$ changes sign, and repeat until the interval is short enough.
Regula falsi (false position) method
Like bisection, but the midpoint is replaced by the secant intersection $c=\frac{af(b)-bf(a)}{f(b)-f(a)}$, and the sub-interval on which $f$ changes sign is kept.
Secant method
The $n$th iterate is given by $x^{n+1}=x^{n}-f(x^{n})\frac{x^{n}-x^{n-1}}{f(x^{n})-f(x^{n-1})}$, i.e., the derivative in Newton's method is replaced by a difference quotient.
Newton’s method: $x^{n+1}=x^{n}-\frac{f(x^{n})}{f^{\prime}(x^{n})}$.
Steffensen’s Method
Steffensen’s method is a modified Newton’s method that avoids derivatives: the derivative is replaced by $h(x)=\frac{f(x+f(x))-f(x)}{f(x)}$, giving $x^{n+1}=x^{n}-\frac{f(x^{n})}{h(x^{n})}$ while retaining quadratic convergence.
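A compact sketch of two of these root-finding iterations, bisection and the secant method (function names are my own):

```python
def bisection(f, a, b, tol=1e-12):
    """Halve the bracketing interval [a, b] with f(a) * f(b) < 0."""
    while b - a > tol:
        c = 0.5 * (a + b)
        if f(a) * f(c) <= 0:
            b = c
        else:
            a = c
    return 0.5 * (a + b)

def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: approximate the derivative by a difference quotient."""
    for _ in range(max_iter):
        fx0, fx1 = f(x0), f(x1)
        x2 = x1 - fx1 * (x1 - x0) / (fx1 - fx0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

f = lambda x: x ** 2 - 2.0
print(bisection(f, 0.0, 2.0), secant(f, 1.0, 2.0))  # both approx sqrt(2)
```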
- http://www.fyzikazeme.sk/mainpage/stud_mat/nm/lecture2.pdf
- On the development of Steffensen’s method and applications to Hammerstein equations
- An improvement of Steffensen's method for solving nonlinear equations by N. Gattal1, and A. S. Chibi
Muller's Method
Muller’s method is a generalization of the secant method, in the sense that it does not require the derivative of the function.
It is an iterative method that requires three starting points $(x_0, f(x_0))$, $(x_1, f(x_1))$ and $(x_2, f(x_2))$; a parabola is constructed through the three points, and the root of the parabola closest to $x_2$ is taken as the next iterate.
- http://mathworld.wolfram.com/MullersMethod.html
- http://mathfaculty.fullerton.edu/mathews/n2003/mullersmethod/MullersMethodProof.pdf
- Muller's Method Wiki
Chebyshev Method
- http://mathworld.wolfram.com/ChebyshevIteration.html
- http://www.sam.math.ethz.ch/~mhg/pub/Cheby-02ParComp.pdf
Halley's Methods
- Newton's versus Halley's Method: A Dynamical Systems Approach
- MODIFIED HALLEY’S METHOD FOR SOLVING NONLINEAR FUNCTIONS WITH CONVERGENCE OF ORDER SIX AND EFFICIENCY INDEX 1.8171
- http://benisrael.net/NEWTON-MONTHLY.pdf
Aitken’s $\Delta^2$ acceleration
- Let $\{p_{n}\}$ be generated by a method which has linear convergence;
- Having $p_{0}, p_{1}$ and $p_{2}$, compute $\hat{p}_{0}=p_{0}-\frac{\left(p_{1}-p_{0}\right)^{2}}{p_{2}-2 p_{1}+p_{0}}$; then for $n=1,2,\ldots$:
  - compute $p_{n+2}$;
  - compute $\hat{p}_{n}=p_{n}-\frac{\left(p_{n+1}-p_{n}\right)^{2}}{p_{n+2}-2 p_{n+1}+p_{n}}$;
  - the algorithm terminates with $p \approx \hat{p}_{n}$ if $\left|\hat{p}_{n}-\hat{p}_{n-1}\right|<\epsilon$.
- http://mathfaculty.fullerton.edu/mathews/n2003/AitkenSteffensenMod.html
- 2.6 - Accelerating Convergence Aitken’s Delta squared Method
- Aitken’s $\Delta^2$ method extended
- Higher Order Aitken Extrapolation with Application to Converging and Diverging Gauss-Seidel Iterations
- AN ACCELERATION TECHNIQUE FOR SLOWLY CONVERGENT FIXED POINT ITERATIVE METHODS
- https://www.math.usm.edu/lambers/mat460/fall09/lecture13.pdf
- Fixed-Point Iteration
- Lecture 8 : Fixed Point Iteration Method, Newton’s Method
- 2.2 Fixed-Point Iteration
Homotopy Continuation Methods
Homotopy Methods
transform a hard problem into a simpler one with easily calculated zeros, and then gradually deform this simpler problem into the original one, computing the zeros of the intervening problems and eventually ending with a zero of the original problem.
The homotopy method (continuation method, successive loading method) can be used to generate a good starting value for an iterative solver.
Suppose one wishes to obtain a solution to a system of $n$ nonlinear equations in $n$ variables, say $F(x)=0$.
Since we assume that such a priori knowledge is not available, the iteration will often fail, because poor starting values are likely to be chosen.
We construct a parameter-dependent function (a homotopy), for example
$$H(x, t)=tF(x)+(1-t)G(x),$$
where $G$ is a simple function with a known zero, so that $H(x, 0)=G(x)$ and $H(x, 1)=F(x)$; the zeros of $H(\cdot, t)$ are then tracked as $t$ moves from $0$ to $1$.
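For intuition, here is a one-dimensional sketch of this construction: deform $G(x)=x-1$ (known zero) into $F(x)=x^{3}-3x+1$, tracking the zero with a few Newton corrector steps at each value of $t$. The functions, step counts and starting point are all illustrative choices.

```python
def F(x):  return x**3 - 3*x + 1
def dF(x): return 3*x**2 - 3
def G(x):  return x - 1.0

def track(x=1.0, steps=50, newton_iters=3):
    """Follow the zero of H(x, t) = t F(x) + (1-t) G(x) as t goes 0 -> 1."""
    for k in range(1, steps + 1):
        t = k / steps
        for _ in range(newton_iters):   # corrector: Newton's method on H(., t)
            H  = t * F(x) + (1 - t) * G(x)
            dH = t * dF(x) + (1 - t) * 1.0
            x -= H / dH
    return x

x = track()
print(x, F(x))  # a zero of F, reached by continuation from the zero of G
```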
- http://homepages.math.uic.edu/~jan/
- http://people.bu.edu/fdc/H-topy.pdf
- 3.10: Homotopy Method
- CHAPTER 2: Numerical homotopy continuation
- https://blog.csdn.net/jbb0523/article/details/52460408
- Homotopy Continuation, Max Buot (CMU)&Donald Richards (PSU)
- On the Link Between Gaussian Homotopy Continuation and Convex Envelopes
- HOPE: A Homotopy Optimization Method for Protein Structure Prediction
- Solving inequality Constrained Optimization Problems by Differential Homotopy Continuation Methods
- PHCpack: a general-purpose solver for polynomial systems by homotopy continuation
In high dimensional spaces the situation is a little different. Fixed point iteration, as well as the fixed point itself, arises in many cases such as [https://arxiv.org/pdf/1511.06393.pdf].
The contraction mapping theorem (Banach fixed-point theorem) guarantees that the iteration converges to the unique fixed point whenever $g$ is a contraction.
Now we rewrite the necessary condition $\nabla f(x)=0$ of unconstrained optimization problems in the fixed-point form
$$x = x - \alpha M \nabla f(x),$$
where $M$ is the identity matrix, a metric (mirror) matrix, the inverse Hessian $[\nabla^2 f(x)]^{-1}$, or an approximation of it.
These correspond to gradient descent, mirror gradient methods, Newton's methods and quasi-Newton's methods, respectively.
And the projected (sub)gradient methods are in the fixed-point iterative form: $$ x = Proj_{\mathbb{S}}(x-\alpha \nabla f(x)) $$
The mirror gradient and proximal gradient methods take the same form, with the projection operator replaced by a mirror map or a proximal operator.
Expectation maximization can also be viewed as a fixed point iteration, as can the transition of a Markov chain.
(Figures omitted: fixed point iterations, from Formulations to overcome the divergence of iterative method of fixed-point in nonlinear equations solution.)
- https://www.wikiwand.com/en/Fixed-point_theorem
- FixedPoint: A suite of acceleration algorithms with Application
- Books on Fixed Point Theory
- Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
- Acceleration methods TIES594 PDE-solvers, Lecture 14, 6.5.2015, Olli Mali
- The SUNNonlinearSolver_FixedPoint implementation¶
The mapping $\mathrm T: X\to X$ with block components $T_1,\dots,T_m$ induces a Jacobi-style iteration $x^{k+1}=\mathrm T(x^k)$, updating all block components of $x$ simultaneously. The Gauss-Seidel mapping $S$ instead updates the blocks one at a time, each using the most recently updated values of the preceding blocks. An equivalent definition of $S$ is given by the equation
$$S_i(x)=T_{i}(S_1(x), \dots, S_{i-1}(x), x_i, \dots, x_m),$$
and the Gauss-Seidel algorithm based on the mapping $\mathrm T$ is the fixed point iteration $x^{k+1}=S(x^k)$.
Consider the linear system $Ax=b$.
Given a vector $b$ and a matrix $A$, the $\ell_1$ regularized least squares problem is
$$\min_{x}\frac{1}{2}{\|Ax-b\|}_2^2+\lambda{\|x\|}_1.$$
The $\ell_1$ penalty is less sensitive to outliers and obtains much more sparse solutions (as opposed to the $\ell_2$ penalty).
It is clear that the absolute value function is not smooth or differentiable everywhere even though the objective function is convex.
Iterative Shrinkage-Thresholding Algorithms (ISTA) for this $\ell_1$ regularized problem iterate
$$x^{k+1}=\mathcal{T}_{\lambda t}\left(x^k - t A^{T}(Ax^k-b)\right),$$
where $t$ is an appropriate step size and $\mathcal{T}_{\alpha}$ is the shrinkage (soft-thresholding) operator defined componentwise by ${\mathcal{T}_{\alpha}(x)}_i=({|x_i|-\alpha})_{+}\operatorname{sgn}(x_i)$.
FISTA with constant stepsize
- $x^{k}= p_{L}(y^k)$ computed as in ISTA;
- $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;
- $y^{k+1}=x^k+\frac{t_k -1}{t_{k+1}}(x^k-x^{k-1})$.
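A compact NumPy sketch of the ISTA/FISTA steps above, applied to a random LASSO instance (all problem data is illustrative):

```python
import numpy as np

def soft_threshold(v, tau):
    """Shrinkage operator: componentwise (|v| - tau)_+ sign(v)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, iters=200):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)  # p_L(y)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)                   # momentum
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A, x_true = rng.standard_normal((40, 100)), np.zeros(100)
x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(fista(A, b, lam=0.5)[:8], 2))  # the sparse support is recovered
```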
- A Fast Iterative Shrinkage Algorithm for Convex Regularized Linear Inverse Problems
- https://pylops.readthedocs.io/en/latest/gallery/plot_ista.html
- ORF523: ISTA and FISTA
This leads to the operator splitting methods analyzed by Wotao Yin and others.
- ORIE 6326: Convex Optimization Operator Splitting
- Monotone Operator Splitting Methods
- A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization
- Operator Splitting Methods for Convex Optimization Analysis and Implementation
- Some Operator Splitting Methods for Convex Optimization
- FAST ALTERNATING DIRECTION OPTIMIZATION METHODS
- https://www.math.ucla.edu/~wotaoyin/math285j.18f/
- Fixed-Point Continuation for $\ell_1$-Minimization: Methodology and Convergence@https://epubs.siam.org/doi/abs/10.1137/070698920
- Given an existing optimization procedure $M(f, x)$, previous iterates $x^{1},x^{2}, \cdots ,x^{k}$, and a new proposed guess $x^P = M(f, x^{k})$;
- Find the step $x^{k+1}$ using the information at $x^{1},x^{2}, \cdots ,x^{k}, x^P$.
- Discover acceleration
- The zen of gradient descent
- Introduction to Optimization Theory MS&E213 / CS269O - Spring 2017 Chapter 4 Acceleration
- Algorithms, Nature, and Society
- https://damienscieur.com/sections/paper.html
- Generalized Framework for Nonlinear Acceleration
- Nonlinear Acceleration of Stochastic Algorithms
Relaxation and inertia
- A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
- RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIONS
- Weak Convergence of a Relaxed and Inertial Hybrid Projection-Proximal Point Algorithm for Maximal Monotone Operators in Hilbert Space
- FIRE: Fast Inertial Relaxation Engine for Optimization on All Scales
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
- Online Relaxation Method for Improving Linear Convergence Rates of the ADMM
Minimal polynomial extrapolation accelerates a vector sequence by exploiting the minimal polynomial of the iteration matrix with respect to the error vector.
- Efficient implementation of minimal polynomial and reduced rank extrapolation methods
- https://core.ac.uk/download/pdf/82614502.pdf
- Minimal polynomial and reduced rank extrapolation methods are related
- http://www.cs.technion.ac.il/~asidi/
- https://www.di.ens.fr/~aspremon/PDF/FOCM17.pdf
- https://simons.berkeley.edu/sites/default/files/docs/8821/alexsimons17.pdf
- Nonlinear Schwarz iterations with Reduced Rank Extrapolation
- A BLOCK RECYCLED GMRES METHOD WITH INVESTIGATIONS INTO ASPECTS OF SOLVER PERFORMANCE
Let $T$ be a nonexpansive mapping and consider, for a fixed anchor point $x^0$, the Halpern iteration
(named after Benjamin Halpern, who introduced it):
$$x^{k+1}={\lambda}_{k} x^0 + (1-{\lambda}_{k}) T(x^k)$$
with ${\lambda}_k\in(0,1)$.
It is proved that the sequence converges to a fixed point of $T$ under suitable conditions on ${\lambda}_k$, such as ${\lambda}_k=\frac{1}{k+1}$.
Krasnosel'skii-Mann (KM, or averaged) iterations
update
$$x^{k+1}=(1-{\alpha}_k)x^k+{\alpha}_k T(x^k),\quad {\alpha}_k\in[0,1],$$
for the fixed point problem (2). Specially, the Krasnosel'skii-Mann theorem asserts convergence to a fixed point of a nonexpansive $T$ provided $\sum_{k}{\alpha}_k(1-{\alpha}_k)=\infty$.
In 1953, Mann defined an iterative method: $$ x^{k+1}=M(x^k, \alpha_k , T)=(1-\alpha_k)x^k+ {\alpha}_k T(x^k),\quad \alpha_k \in [0,1),\ \text{satisfying}\ \sum_{k=1}^{\infty}{\alpha}_k = \infty. $$
The sequence
$$
x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(x^k), \quad 0 < a\leq {\alpha}_k \leq b < 1
$$
(additionally $\sum_{k=1}^{\infty}{\alpha}_k = \infty$) is called a modified Mann iteration.
Take ${\alpha_k, \beta_k}$ two sequences in $[0, 1]$ satisfying
$$
\sum_{k=1}^{\infty}{\alpha}_k {\beta}_k = \infty, \quad \lim_{k\to\infty}\beta_k =0, \quad 0 \leq {\alpha}_k \leq \beta_k \leq 1.
$$
Then the sequence
$$x^{k+1}=(1-{\alpha}_k)x^k+{\alpha}_k T(y^k),\quad y^{k}=(1-{\beta}_k)x^k+{\beta}_k T(x^k)$$
is called the Ishikawa iteration.
Hojjat Afshari and Hassen Aydi propose another Mann type iteration: $$ x^{k+1} = {\alpha}_k x^k + {\beta}_k T(x^k) + {\gamma}_k T^{2}(x^k) $$
where ${\alpha}_k + {\beta}_k + {\gamma}_k = 1$, ${\alpha}_k, {\beta}_k, {\gamma}_k \in [0,1)$ for all $k\geq 1$, $\sum_{k=1}^{\infty}(1-\alpha_k)=\infty$, $\sum_{k=1}^{\infty}\gamma_k < \infty$.
If the Mann type iteration converges and $T$ is continuous, then the limit is a fixed point of $T$.
- Krasnoselskii-Mann method for non-self mappings
- http://www.krasnoselskii.iitp.ru/papereng.pdf
- Strong convergence of the modified Mann iterative method for strict pseudo-contractions
- Some results on a modified Mann iterative scheme in a reflexive Banach space
- Some results about Krasnosel'skiĭ-Mann iteration process
- Convergence theorems for inertial KM-type algorithms
- Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings
There is an acceleration framework of fixed point iterations for the problem (2) called Anderson acceleration or regularized nonlinear acceleration:
- Set $F_k = (h_{k-m_k}, \dots, h_k)$ where $h_i=g(x_i)-x_i$;
- Determine $\alpha^{k}=(\alpha_0^{k},\dots, \alpha_{m_k}^{k})^{T}$ that solves $\min_{\alpha^{k}}{\|F_k\alpha^k\|}_2^2$ s.t. $\sum_{i=0}^{m_k}\alpha_i^{k}=1$;
- Set $x_{k+1}=\sum_{i=0}^{m_k} \alpha_i^{k} g(x_{k - m_k + i})$.
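A hedged sketch of these three steps for $x=g(x)$: the constrained least squares for $\alpha$ is solved through its KKT system, with `lstsq` guarding against the frequent ill-conditioning of $F_k$; the map $g=\cos$ and the window size $m$ are illustrative choices.

```python
import numpy as np

def anderson(g, x0, m=3, iters=30, tol=1e-12):
    xs, gs = [x0], [g(x0)]
    for _ in range(iters):
        mk = min(m, len(xs) - 1)
        hs = [gs[j] - xs[j] for j in range(len(xs) - 1 - mk, len(xs))]
        if np.linalg.norm(hs[-1]) < tol:          # residual h_k = g(x_k) - x_k
            break
        F = np.column_stack(hs)
        n = F.shape[1]
        # KKT system of: min ||F a||^2  s.t.  sum(a) = 1
        K = np.block([[2 * F.T @ F, np.ones((n, 1))],
                      [np.ones((1, n)), np.zeros((1, 1))]])
        rhs = np.concatenate([np.zeros(n), [1.0]])
        a = np.linalg.lstsq(K, rhs, rcond=None)[0][:n]
        x_new = np.column_stack(gs[-n:]) @ a      # x_{k+1} = sum_i a_i g(x_i)
        xs.append(x_new)
        gs.append(g(x_new))
    return xs[-1]

x = anderson(np.cos, np.ones(4))
print(x, np.linalg.norm(np.cos(x) - x))  # close to the fixed point of cos
```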
It may be interesting to introduce some Bregman divergence into this framework.
It is proved that Anderson acceleration converges if the fixed point mapping is contractive.
- Anderson acceleration for fixed point iterations
- Anderson Acceleration
- MS142 Anderson Acceleration and Applications
- Convergence Analysis For Anderson Acceleration
- Comments on "Anderson Acceleration, Mixing and Extrapolation"
- Globally Convergent Type-I Anderson Acceleration for Non-Smooth Fixed-Point Iterations
- A proof that Anderson acceleration improves the convergence rate in linearly converging fixed point methods (but not in those converging quadratically)
The Baillon-Haddad theorem provides an important link between convex optimization and fixed-point iteration: it proves that if the gradient of a convex and continuously differentiable function is nonexpansive, then it is actually firmly nonexpansive.
- The Baillon-Haddad Theorem Revisited
- A Generic online acceleration scheme for Optimization algorithms via Relaxation and Inertia
- RELAXATION AND INERTIA IN FIXED-POINT ITERATIONS WITH APPLICATIONS
- Monotonicity, Acceleration, Inertia, and the Proximal Gradient algorithm
- Iterative Convex Optimization Algorithms; Part One: Using the Baillon–Haddad Theorem
- Recent Advances in Convex Optimization and Fixed Point Algorithms by Jean-Christophe Pesquet
Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix
A correlation matrix is symmetric, has unit diagonal, and is positive semidefinite. Frequently, asynchronous or missing observations lead to the obtained matrix being indefinite.
A standard way to correct an invalid correlation matrix, by which we mean a real, symmetric indefinite matrix with unit diagonal, is to replace it by the nearest correlation matrix in the Frobenius norm, that is, by the solution of the problem $$\min\{ {\|A - X\|}_F : X \text{ is a correlation matrix} \},$$ where ${\|A\|}_F^2=\sum_{ij}a_{ij}^2$.
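Below is a small NumPy sketch of the alternating projections method for this problem, with the Dykstra-style correction used in Higham's algorithm; the input matrix is an illustrative indefinite example.

```python
import numpy as np

def proj_psd(A):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return V @ np.diag(np.maximum(w, 0)) @ V.T

def proj_unit_diag(A):
    """Project onto the symmetric matrices with unit diagonal."""
    B = A.copy()
    np.fill_diagonal(B, 1.0)
    return B

def nearest_correlation(A, iters=100):
    Y, dS = A.copy(), np.zeros_like(A)
    for _ in range(iters):
        R = Y - dS              # Dykstra's correction for the PSD projection
        X = proj_psd(R)
        dS = X - R
        Y = proj_unit_diag(X)
    return Y

A = np.array([[1.0, 0.9, 0.7],
              [0.9, 1.0, 0.3],
              [0.7, 0.3, 1.0]])  # symmetric, unit diagonal, but indefinite
print(np.round(nearest_correlation(A), 4))
```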
- The Nearest Correlation Matrix
- Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix
- https://github.com/higham/anderson-accel-ncm
- Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms Talk
- Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms
- Accelerating the EM Algorithm for Mixture-density Estimation
- http://nhenderstat.com/research/
- Anderson Accelerated Douglas-Rachford Splitting
- https://ctk.math.ncsu.edu/TALKS/Anderson.pdf
- Using Anderson Acceleration to Accelerate the Convergence of Neutron Transport Calculations with Anisotropic Scattering
As shown before, these acceleration schemes are based on linear combinations of the last iterates.
The question is: why a linear combination?
Why not some other extrapolation of the last updated values?
- Alternating Anderson-Richardson method: An efficient alternative to preconditioned Krylov methods for large, sparse linear systems
- Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems
- REGULARIZED NONLINEAR ACCELERATION
- Regularized Nonlinear Acceleration@lids.mit.edu
- Regularized Nonlinear Acceleration@simons.berkeley
- Regularized Nonlinear Acceleration by Damien Scieur
- Regularized nonlinear acceleration, Mathematical Programming
- http://spars2017.lx.it.pt/index_files/papers/SPARS2017_Paper_16.pdf
- https://github.com/windows7lover/RegularizedNonlinearAcceleration
- https://damienscieur.com/
- Objective acceleration for unconstrained optimization, Optimization methods and software conference 2017, Havana
- Objective acceleration for unconstrained optimization: code
- Objective acceleration for unconstrained optimization by Asbjørn Nilsen Riseth
- http://julianlsolvers.github.io/Optim.jl/stable/#algo/ngmres/
- http://spars2017.lx.it.pt/
In my opinion, however, it is best to start from the necessary condition for optima in non-convex optimization. Another question is how to generalize fixed point iteration to stochastic gradient methods.
The principle of feedback is simple: compare the output with the desired value and use the error to correct the input.
A PID controller continuously calculates an error value $e(t)$ as the difference between a desired setpoint and a measured process variable, and applies a correction
$$u(t)=K_p e(t)+K_i\int_0^{t}e(\tau)\mathrm{d}\tau + K_d \frac{\mathrm{d}e(t)}{\mathrm{d}t},$$
where $K_p$, $K_i$ and $K_d$ are nonnegative coefficients, so the control signal is a sum of three terms:
- the P-term (which is proportional to the error);
- the I-term (which is proportional to the integral of the error);
- and the D-term (which is proportional to the derivative of the error).
The controller can also be parameterized as
$$u(t)=K_p \left(e(t)+\frac{1}{T_i}\int_0^{t}e(\tau)\mathrm{d}\tau + T_d \frac{\mathrm{d}e(t)}{\mathrm{d}t}\right),$$
where $T_i$ is called the integral time and $T_d$ the derivative time.
The proportional part acts on the present value of the error, the integral represents an average of past errors, and the derivative can be interpreted as a prediction of future errors based on linear extrapolation.
- http://www.scholarpedia.org/article/Optimal_control
- EE365: Stochastic Control Spring Quarter 2014
- PID Theory Explained
- Chapter 8: PID Control
- CDS 101/110 -- Fall 2004 Analysis and Design of Feedback Systems
Methods | Recursion | Integration |
---|---|---|
Gradient Descent | $x^{k+1}=x^{k}-{\alpha}_k \nabla f(x^k)$ | |
Momentum Methods | $x^{k+1}=x^{k}-{\alpha}_k \nabla f(x^k)+\rho_k(x^k-x^{k-1})$ | |
Nesterov's Gradient Methods | $x^{k+1}=y^{k}-{\alpha}_k \nabla f(y^k),\ y^{k}=x^k+\frac{k-1}{k+2}(x^k-x^{k-1})$ | |
Newton's Methods | $x^{k+1}=x^{k}-{\alpha}_k [\nabla^2 f(x^k)]^{-1}\nabla f(x^k)$ | |
Mirror Gradient Methods | $x^{k+1}=\arg\min_{x}\{\left<\nabla f(x^k), x\right>+\frac{1}{{\alpha}_k}B(x, x^k)\}$ | |
By viewing the gradient as the error signal, plain gradient descent uses only the present gradient, like a P (proportional) controller.
We rewrite the formula of the momentum method as
$$x^{t+1}=x^{t}-r\sum_{i=0}^{t}{\alpha}^{t-i}g(x^i).$$
One can see that the update of parameters relies on both the present gradient and the integral of past gradients.
The only difference from a PI controller is that there is a decay factor ${\alpha}\in(0,1)$ on the past gradients.
PID optimizer
The proposed PID optimizer updates the parameter $x$ by
- $V^{t+1}=\alpha V^t -r g(x^t)$;
- $D^{t+1}=\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))$;
- $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$.
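A toy sketch of these updates on a badly scaled quadratic; the hyperparameters $r$, $\alpha$ and $K_d$ are illustrative choices.

```python
import numpy as np

def pid_optimize(grad, x0, r=0.1, alpha=0.9, Kd=0.5, iters=200):
    x = x0.copy()
    V, D = np.zeros_like(x0), np.zeros_like(x0)
    g_prev = grad(x0)
    for _ in range(iters):
        g = grad(x)
        V = alpha * V - r * g                        # I-term with decay (momentum)
        D = alpha * D + (1 - alpha) * (g - g_prev)   # D-term: change of gradient
        x = x + V + Kd * D
        g_prev = g
    return x

# f(x) = 0.5 x^T diag(1, 10) x, so grad f(x) = diag(1, 10) x; minimizer is 0.
grad = lambda x: np.array([1.0, 10.0]) * x
print(pid_optimize(grad, np.array([1.0, 1.0])))  # close to the origin
```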
- CVPR 2018 | A new idea for accelerating model convergence (control theory + deep learning) (in Chinese)
- A PID controller approach for stochastic optimization of deep networks (in Chinese)
- A PID Controller Approach for Stochastic Optimization of Deep Networks
- Proportional Integral Distributed Optimization for Dynamic Network Topologies
- Proportional-Integral Distributed Optimization
- Continuous-time Proportional-Integral Distributed Optimization for Networked Systems
- A Control Perspective for Centralized and Distributed Convex Optimization
- https://arxiv.org/pdf/1905.03468
- Feedback-Feedforward Control Approach to Distributed Optimization
(Figures omitted: a sample recurrence relation; the idea of successive approximations.)
- Accelerated Optimization in the PDE Framework: Formulations for the Manifold of Diffeomorphism
- Integration Methods and Accelerated Optimization Algorithms
We will focus on the optimization methods in the form of fixed point iteration and dynamical systems. The task is to minimize the following function: $$ f(x), x\in\mathbb{R}^p, \quad\nabla f(x) = g(x), \quad\nabla^2 f(x)= H(x). $$
Name | Iteration | ODE |
---|---|---|
Gradient descent | $x^{k+1}=x^{k}-\alpha_k g(x^k)$ | $\dot{x}(t)=-g(x(t))$ (gradient flow) |
Newton's method | $x^{k+1}=x^{k}-H(x^k)^{-1} g(x^k)$ | $\dot{x}(t)=-H(x(t))^{-1} g(x(t))$ |
Just as more points determine higher-order terms in Newton interpolation, the dynamics of accelerated gradient methods are expected to correspond to higher-order differential equations.
Some acceleration methods are discretizations of the dynamical system with Asymptotic Vanishing Damping, as Hedy Attouch calls it:
$$\ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla \Phi (x(t)) = 0,\tag{AVD}$$
where $\Phi$ is a convex function and the coefficient $\frac{\alpha}{t}$ of the damping term vanishes as $t\to\infty$.
The fast minimization properties of the trajectories of the second-order evolution equation were also studied by Hedy's group in 2016: $$ \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla^2 \Phi (x(t))\dot{x}(t) + \nabla \Phi (x(t)) =0\tag{HDD} $$
When it comes to the numerical solution of differential equations, the task is to find discrete approximations to the solutions of these equations.
- https://perso.math.univ-toulouse.fr/spot/resumes/
- Fast convex optimization via inertial dynamics with Hessian driven damping
- A proximal-Newton method for monotone inclusions in Hilbert spaces with complexity $O(1/k^2)$
- https://www.researchgate.net/profile/Hedy_Attouch
- https://www.ljll.math.upmc.fr/~plc/sestri/
- Hedy Attouch at https://biography.omicsonline.org
- Rate of convergence of the Nesterov accelerated gradient method in the subcritical case $\alpha\leq 3$
- A Dynamical Approach to an Inertial Forward-Backward Algorithm for Convex Minimization
- An Inertial Proximal Method for Maximal Monotone Operators via Discretization of a Nonlinear Oscillator with Damping
- THE HEAVY BALL WITH FRICTION METHOD, I. THE CONTINUOUS DYNAMICAL SYSTEM: GLOBAL EXPLORATION OF THE LOCAL MINIMA OF A REAL-VALUED FUNCTION BY ASYMPTOTIC ANALYSIS OF A DISSIPATIVE DYNAMICAL SYSTEM
- Viscosity Solutions of Minimization Problems
- A dynamic approach to a proximal-Newton method for monotone inclusions in Hilbert spaces, with complexity $O(\frac{1}{n^2})$
- FAST CONVERGENCE OF INERTIAL DYNAMICS AND ALGORITHMS WITH ASYMPTOTIC VANISHING DAMPING
It is difficult to generalize these methods to stochastic cases.
There is a wonderful summary DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION given by Michael I. Jordan at ICM 2018.
Some new connections between dynamical systems and optimization have been found.
- Variational and Dynamical Perspectives On Learning and Optimization
- Continuous and Discrete Dynamics For Online Learning and Convex Optimization
- DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION
- On Symplectic Optimization
- A variational perspective on accelerated methods in optimization
- A Dynamical Systems Perspective on Nesterov Acceleration
- Generalized Momentum-Based Methods: A Hamiltonian Perspective
- https://people.eecs.berkeley.edu/~jordan/optimization.html
Weijie J. Su (joint with Bin Shi, Simon Du, and Michael Jordan) introduced a set of high-resolution differential equations to model, analyze, interpret, and design accelerated optimization methods.
- A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights
- Acceleration via Symplectic Discretization of High-Resolution Differential Equations
- Understanding the Acceleration Phenomenon via High-Resolution Differential Equations
- Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization
- Analysis of Hamilton-Jacobi Equation: Optimization, Dynamics and Control - Part II of II
- Optimization via the Hamilton-Jacobi-Bellman Method: Theory and Applications by Navin Khaneja, Harvard & KITP
- GRADIENT FLOW DYNAMICS
- Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem
- The dynamics of Lagrange and Hamilton
- Optimization and Dynamical Systems
- Direct Runge-Kutta Discretization Achieves Acceleration
- The Physical systems Behind Optimization Algorithms
- Integration Methods and Optimization Algorithms
ADMM and Dynamics
- A Dynamical Systems Perspective on Nonsmooth Constrained Optimization
- https://kgatsis.github.io/learning_for_control_workshop_CDC2018/assets/slides/Vidal_CDC18.pdf
- ADMM and Accelerated ADMM as Continuous Dynamical Systems
- ADMM, Accelerated-ADMM, and Continuous Dynamical Systems, Talk @DIMACS
- Relax, and Accelerate: A Continuous Perspective on ADMM
- http://people.ee.duke.edu/~lcarin/Xuejun12.11.2015.pdf
- ESAIM: Control, Optimization and Calculus of Variations (ESAIM: COCV)
- MCT'03 Louisiana Conference on Mathematical Control Theory
- International Workshop “Variational Analysis and Applications”, ERICE, August 28 – September 5, 2018
- Games, Dynamics and Optimization, 2019
- System dynamics & optimization
- Introduction to Dynamical Systems by John K. Hunter, Department of Mathematics, University of California at Davis
GRADIENTS AND FLOWS: CONTINUOUS OPTIMIZATION APPROACHES TO THE MAXIMUM FLOW PROBLEM
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root finding problem, where the function is represented as an expected value.
Assume that we have a function $M(\theta)$ and a constant $\alpha$ such that the equation $M(\theta)=\alpha$ has a unique root $\theta^{\ast}$. It is assumed that while we cannot directly observe the function $M(\theta)$, we can instead obtain measurements of a random variable $N(\theta)$ with $\mathbb{E}[N(\theta)]=M(\theta)$. The scheme then iterates
$$\theta_{n+1}=\theta_{n}+a_{n}(\alpha - N(\theta_{n})).\tag{RM}$$
Robbins and Monro proved (Theorem 2) that $\theta_{n}$ converges to $\theta^{\ast}$ in $L^2$ (hence in probability), provided $a_n>0$, $\sum_{n} a_n=\infty$ and $\sum_{n} a_n^2<\infty$.
The basic ideas of the Robbins–Monro scheme can be readily modified to provide successive approximations for the minimum (or maximum) of a unimodal regression function, as was shown by Kiefer and Wolfowitz (1952), who introduced a recursive scheme of the form
$$\theta_{n+1}= {\theta}_{n} - a_{n}\Delta(x_{n})\tag{KW}$$
to find the minimum $\theta^{\ast}$ of $M$.
During the nth stage of the Kiefer–Wolfowitz scheme, observations are taken at the two points ${\theta}_{n}\pm c_n$, and $\Delta(x_n)$ is the corresponding finite-difference estimate of the derivative.
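A minimal simulation of the Robbins–Monro scheme: the regression function, the noise level and the steps $a_n=1/n$ are illustrative choices.

```python
import random

M = lambda theta: 2.0 * theta - 3.0                  # true function, root at 1.5
N = lambda theta: M(theta) + random.gauss(0.0, 0.5)  # noisy observation of M

alpha, theta = 0.0, 0.0
for n in range(1, 10001):
    a_n = 1.0 / n                     # sum a_n = infinity, sum a_n^2 < infinity
    theta = theta + a_n * (alpha - N(theta))         # the (RM) update
print(theta)                          # close to the root theta* = 1.5
```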
- 5 The Stochastic Approximation Algorithm
- Chapter 15: Introduction to Stochastic Approximation Algorithms
- Dynamics of Stochastic Approximation
- Stochastic Approximation by Tze Leung Lai
- A Multivariate Stochastic Approximation Procedure
- The Robbins–Monro Stochastic Approximation Approach to a Discrimination Problem
- Stochastic approximation: invited paper
- Stochastic Approximation and Recursive Algorithms and Applications
- Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
Stochastic gradient descent
belongs to stochastic optimization, which can be considered a generalization of gradient descent.
Stochastic gradient descent takes advantage of a stochastic or estimated gradient to replace the true gradient in gradient descent. The iteration uses a stochastic gradient but may not be descent, so the name stochastic gradient methods may be more appropriate for such methods. They date back to stochastic approximation in statistics.
It aims to solve the finite-sum optimization problem, i.e.
$$
\arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}f(\theta|x_i)
$$
where $x_i$, $i=1,\dots,n$, are the data samples and $\theta$ is the parameter to estimate.
The difficulty lies in the large sample size $n$: evaluating the full gradient requires a pass over all $n$ samples at every iteration.
The stochastic gradient method is defined as
$$
\theta^{k+1}=\theta^{k}-\alpha_{k}\frac{1}{m}\sum_{j=1}^{m}\nabla f(\theta^{k}| x_{j}^{\prime})
$$
where $x_{j}^{\prime}$, $j=1,\dots,m$, are drawn at random from the samples $\{x_i\}_{i=1}^{n}$ and $m$ is the batch size, $m\ll n$.
When $1< m \ll n$, it is in fact mini-batch gradient descent.
There are fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.
A heuristic proposal for avoiding the manual choice and for modifying the learning rate while the learning task runs is the bold driver (BD) method[^14].
The learning rate increases exponentially if successive steps reduce the objective function, and decreases rapidly if an "accident" is encountered (the objective function increases):
$$
{\alpha}_{k+1}=
\begin{cases}
\rho\, {\alpha}_{k}, & \text{if } f(x^{k+1}) < f(x^{k}), \\
{\sigma}^{m}\, {\alpha}_{k}, & \text{otherwise},
\end{cases}
$$
where $\rho$ is slightly greater than 1 (typically $\rho=1.1$), $\sigma$ is significantly smaller than 1 (typically $\sigma=0.5$), and $m$ is the number of consecutive failed steps.
Besides the learning rate or step length, there is yet another hyperparameter: the batch size $m$ used at each iteration.
The fact that the sample size is far larger than the dimension of the parameter is what makes a small batch a cheap yet useful estimate of the full gradient.
- It is simplest to set the step length to a constant, such as ${\alpha}_k=3\times 10^{-3}, \forall k$.
- There are decay schemes, i.e. the step length ${\alpha}_k$ diminishes, such as ${\alpha}_k=\frac{\alpha}{k}$, where $\alpha$ is a constant.
- And another strategy is to tune the step length adaptively, such as AdaGrad or ADAM.
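Putting the pieces together, here is a short sketch of mini-batch SGD with a decaying step length on synthetic least squares; the batch size, the schedule and the data are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 1000, 5, 32                     # sample size, dimension, batch size
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

theta = np.zeros(p)
for k in range(1, 2001):
    idx = rng.choice(n, size=m, replace=False)       # random mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / m  # stochastic gradient
    alpha_k = 0.1 / np.sqrt(k)                       # decaying step length
    theta -= alpha_k * grad
print(np.linalg.norm(theta - theta_true))            # small estimation error
```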
(Figure omitted: the differences of gradient descent and stochastic gradient descent.)
We can see some examples of the advantages of incremental methods, such as the estimation of a mean.
Given $n$ samples $x_1,\dots,x_n$, their mean $\bar{x}_{n}=\frac{1}{n}\sum_{i=1}^{n}x_i$ can be updated incrementally as $\bar{x}_{n}=\bar{x}_{n-1}+\frac{1}{n}(x_n-\bar{x}_{n-1})$, so a new sample only requires one cheap update rather than a full recomputation.
Another example is the Newton interpolation formula in numerical analysis. The task is to fit a function via polynomials, given some points of the function.
The Newton form is incremental: if another point is added, only one extra term has to be computed while all previously computed coefficients remain unchanged.
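The running-mean example in code; the update touches each new sample once and never revisits the old ones.

```python
def running_mean(samples):
    """Incremental mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n
    return mean

data = [2.0, 4.0, 6.0, 8.0]
print(running_mean(data))  # 5.0, identical to sum(data) / len(data)
```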
See the following links for more information on stochastic gradient descent.
- https://www.wikiwand.com/en/Stochastic_gradient_descent
- http://ruder.io/optimizing-gradient-descent/
- https://zhuanlan.zhihu.com/p/22252270
- http://fa.bianp.net/teaching/2018/eecs227at/stochastic_gradient.html
- A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
- A look at SGD from a physicist's perspective - Part 1
- A look at SGD from a physicists's perspective - Part 2, Bayesian Deep Learning
- A look at SGD from a physicists's perspective - Part 3, Langevin Dynamics and Applications
- Convergence Analysis of Gradient Descent Stochastic Algorithms
- Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
- Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems
- Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
- Convex Relaxations of Convolutional Neural Nets
- The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
- Quasi-potential as an implicit regularizer for the loss function in the stochastic gradient descent
- The Multiplicative Noise in Stochastic Gradient Descent: Data-Dependent Regularization, Continuous and Discrete Approximation
- http://blavatnikawards.org/honorees/profile/leon-bottou/
- https://leon.bottou.org/projects/sgd
- https://leon.bottou.org/research/stochastic
- https://leon.bottou.org/papers/bottou-bousquet-2008
- Large-Scale Machine Learning with Stochastic Gradient Descent
- Trade off of Machine Learning
- The Tradeoffs of Large-scale Learning
- https://sites.google.com/view/panos-toulis/implicit-sgd
- http://dustintran.com/blog/on-asymptotic-convergence-of-averaged-sgd
- https://github.com/ptoulis/implicit-sgd
ADAM
combines adaptive step size strategies with momentum methods in some sense. It is widely used in training deep learning models.
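A hedged sketch of the ADAM update rule (following Kingma and Ba's formulas, with the default hyperparameters suggested in the paper) on a badly scaled quadratic:

```python
import numpy as np

def adam(grad, x0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, iters=5000):
    x = x0.copy()
    m = np.zeros_like(x)   # first moment (momentum of the gradient)
    v = np.zeros_like(x)   # second moment (per-coordinate adaptive scaling)
    for t in range(1, iters + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

grad = lambda x: np.array([1.0, 100.0]) * x    # badly scaled quadratic
print(adam(grad, np.array([1.0, 1.0])))        # close to the minimizer 0
```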
- Adam: A Method for Stochastic Optimization
- On the convergence of Adam and Beyond
- http://ruder.io/deep-learning-optimization-2017/
(Figure omitted: the differences of stochastic gradient descent and its variants.)
- VR-SGD
- http://ranger.uta.edu/~heng/CSE6389_15_slides/SGD2.pdf
- Variance reduction techniques for stochastic optimization
- Stochastic Variance-Reduced Optimization for Machine Learning Part I
- Stochastic Variance-Reduced Optimization for Machine Learning Part II
- Laplacian Smoothing Gradient Descent
- Entropy SGD
- https://github.com/tdozat/Optimization
- https://zhuanlan.zhihu.com/p/25473305
Stochastic optimization problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.
We have developed a unified framework that covers all of these books.
We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition function and the objective function, where we search over policies (functions) for making decisions.
We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two classes, creating four classes that cover all of the solution approaches.
Beginning around ten years ago, single-threaded CPU performance stopped improving significantly, due to physical limitations; it is the number of cores in each machine that continues to grow. Today we can buy 8-core phones, 64-core workstations, and 2.5k-core GPUs at affordable prices. On the other hand, most of our algorithms are still single-threaded, and because of that, their running time is about the same as it was ten years ago and will stay so in the future, say, ten or twenty years. To develop faster algorithms, especially for those large-scale problems that arise in signal processing, medical imaging, and machine learning, it is inevitable to consider parallel computing.
Machine learning, especially deep learning, requires more powerful distributed and decentralized optimization methods.
Large scale supervised machine learning methods, which are based on gradient to improve their performance, need online near-real-time feedback from massive users.
- Math 285J: A First Course on Large-Scale Optimization Methods
- http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
- Data Science meets Optimization
- Distributed optimization for control and learning
- ME 555: Distributed Optimization
- https://www.euro-online.org/websites/dso/
- https://labs.criteo.com/2014/09/poh-part-3-distributed-optimization/
- Convex and Distributed Optimization
- (749g) Accelerated Parallel Alternating Method of Multipliers (ADMM) for Distributed Optimization
- 8th IEEE Workshop Parallel / Distributed Computing and Optimization (PDCO 2018)
- Distributed Optimization and Control
- ADOPT: Asynchronous Distributed Constraint Optimization with Quality Guarantees
- On Distributed Optimization in Networked Systems
- Distributed Optimization and Control using Operator Splitting Methods
- Foundations of Distributed and Large Scale Computing Optimization
- Distributed Optimization of Large-Scale Complex Networks
- Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization
- NOVEL GRADIENT-TYPE OPTIMIZATION ALGORITHMS FOR EXTREMELY LARGE-SCALE NONSMOOTH CONVEX OPTIMIZATION
- Projects: Structure Exploitation in Large-Scale Non-Convex Optimisation
- A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm
- http://ecee.colorado.edu/marden/files/dist-opt-journal.pdf
- Hemingway: Modeling Distributed Optimization Algorithms
- http://principlesofoptimaldesign.org/
It is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master.
The loss function of Elastic-SGD couples the local parameters $x^p$ of the workers to the center variable $\tilde{x}$ through a quadratic (elastic) penalty:
$$\min_{x^1,\dots,x^P,\,\tilde{x}}\ \sum_{p=1}^{P}\left( f(x^p)+\frac{\rho}{2}{\|x^p-\tilde{x}\|}_2^2\right).$$
Parle exploits the phenomenon of wide minima that has been shown to improve generalization performance of deep networks and trains multiple “replicas” of a network that are coupled to each other using attractive potentials. It requires infrequent communication with the parameter server and is well-suited to singlemachine-multi-GPU as well as distributed settings.
The method replaces the loss function by a smoother loss called local entropy.
Parle solves for all the replicas jointly, coupled to each other through the attractive potentials.
- Parle: parallelizing stochastic gradient descent
- Unraveling the mysteries of stochastic gradient descent on deep neural networks
Asynchronous Stochastic Gradient (shared memory):
- Each thread repeatedly performs the following updates:
  - For $t = 1, 2, \cdots$:
    - Randomly pick an index $i$;
    - $x\leftarrow x - \nabla f_i(x)$.

Main trick: in shared memory systems, every thread can access the same parameter $x$.
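A toy Hogwild!-style sketch of this trick: several Python threads update one shared NumPy vector without any lock, so updates may overwrite each other; the least squares problem and all constants are illustrative choices.

```python
import threading
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 10))
w_true = rng.standard_normal(10)
y = X @ w_true
w = np.zeros(10)                         # the shared parameter, no lock

def worker(seed, steps=4000, lr=0.01):
    local = np.random.default_rng(seed)  # thread-local randomness
    for _ in range(steps):
        i = local.integers(len(X))       # randomly pick an index i
        g = (X[i] @ w - y[i]) * X[i]     # gradient of f_i at the shared w
        w -= lr * g                      # unsynchronized in-place update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(np.linalg.norm(w - w_true))        # small despite the races
```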
Asynchronous Accelerated Stochastic Gradient Descent:
- Asynchronous Stochastic Gradient Descent with Delay Compensation
- Asynchronous Decentralized Parallel Stochastic Gradient Descent
- https://github.com/AbduElturki/Asynchronous-Stochastic-Gradient-Descent
- Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates
- Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization
Hogwild! allows processors to access shared memory with the possibility of overwriting each other's work.
- https://www.podc.org/data/podc2018/podc2018-tutorial-alistarh.pdf
- https://www.math.ucla.edu/~wotaoyin/math285j.18
- http://seba1511.net/dist_blog/
- Distributed Nesterov-like Gradient Algorithms
- Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks
- Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions
- https://nali.seas.harvard.edu/
- http://users.isr.ist.utl.pt/~jxavier/
- Synchronized Parallel Coordinate Descent
- DISTRIBUTED ASYNCHRONOUS COMPUTATION OF FIXED POINTS
- An Inertial Parallel and Asynchronous Fixed-Point Iteration for Convex Optimization
There are examples such as Forward-Backward, Douglas-Rachford,... for finding a zero of a sum of maximally monotone operators or for minimizing a sum of convex functions.
- For $n=0,1,\cdots$:
  - for $i=1, \cdots, m$: $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_i(x_{1,n},\cdots, x_{m, n})+a_{i, n}-x_{i, n})$

where
- $x_0, a_n\in \mathbb H$ and $\mathbb H$ is a separable real Hilbert space,
- $\epsilon_{n}$ is a random variable in $\{0,1\}^m\setminus\{\mathbf{0}\}$,
- $\lambda_n\in (0, 1)$ with $\liminf \lambda_n>0$ and $\limsup \lambda_n<1$,
- the mapping $\mathrm T:\mathbb H\to \mathbb H$, i.e. $x\mapsto (T_1x, \cdots, T_i x, \cdots, T_m x)$, is a nonexpansive operator.

- For $n=0, 1, \cdots$:
  - $y_n =\mathrm R_n x_n + b_n$;
  - for $i=1, \cdots, m$: $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_{i, n}(y_n)+a_{i, n}-x_{i, n})$
- https://www.ljll.math.upmc.fr/~plc/sestri/pesquet2014.pdf
- https://arxiv.org/abs/1404.7536
- https://arxiv.org/abs/1406.6404
- https://arxiv.org/abs/1406.5429
- https://arxiv.org/abs/1706.00088
- Stochastic Block-Coordinate Fixed Point Iterations with Applications to Splitting
- Stochastic Quasi-Fejer Block-Coordinate Fixed Point Iterations with Random Sweeping
- LINEAR CONVERGENCE OF STOCHASTIC BLOCK-COORDINATE FIXED POINT ALGORITHMS
- ARock: An Algorithmic Framework for Asynchronous Parallel Coordinate Update
- Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Linearly constrained stochastic convex optimization is given by
$$
\min_{x,y}\mathbb{E}_{\vartheta}[F(x,\vartheta)]+h(y),\quad \text{s.t.}\ Ax+By = b,\ x\in\mathbb{X},\ y\in\mathbb{Y},
$$
where typically the expectation $\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ is some loss function and $h$ is a regularizer.
The first problem is that the distribution of $\vartheta$ is unknown or the expectation is intractable, so the objective can only be accessed through stochastic (sub)gradients.
Modified Augmented Lagrangian
Linearize $f(x):=\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ at $x_k$ with a stochastic gradient $g_k$ and add a proximal term:
$$ L_{\beta}^{k}(x,y,\lambda):= f(x_k)+\left<x - x_k, g_k\right>+h(y)-\left< \lambda, Ax+By-b\right>+\frac{\beta}{2}{\|Ax + By-b\|}_2^2 +\frac{1}{2\eta_k}{\|x-x_{k}\|}^2 $$
where $\eta_k$ is a step size. Each iteration then performs:
- $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{k}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
- $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{k}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
- $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
- Stochastic ADMM
- Accelerated Variance Reduced Stochastic ADMM
- Towards optimal stochastic ADMM or the talk in ICML
- V-Cycle or Double Sweep ADMM
- https://arxiv.org/abs/1903.01786
Let us assume that the function $F$ is separable: $F(x)=\sum_{i=1}^{n}f_i(x_i)$.
We can reformulate the problem to get $$ \min_{x} F(x) + g(y), \ s.t.\quad x=y $$
where $y$ is a duplicate (consensus) copy of $x$.
We can define an augmented Lagrangian $$ L_{\beta}(x, y, \lambda) = F(x)+g(y) - \lambda^{T}(x-y) + \frac{\beta}{2}{\|x-y\|}_{2}^{2} = \sum_{i=1}^{n} \left[f_i(x_i) - {\lambda}_i(x_i -y_i)+\frac{\beta}{2}(x_i - y_i)^2\right] + g(y) = \sum_{i=1}^{n}L_{(\beta,i)}(x_i, y, \lambda_i). $$
We can split the optimization over the blocks $x_i$ and update them in parallel (a small numeric sketch follows this list):
- $x_i^{k+1} =\arg\min_{x_i} L_{(\beta,i)}(x_i, y^{k}, \lambda_i^{k}),\quad i=1,2,\cdots, n;$
- $y^{k+1} =\arg\min_{y} L_{\beta}(x^{\color{green}{k+1}}, y, \lambda^{k});$
- $\lambda^{k+1}=\lambda^k - \beta (x^{\color{green}{k+1}} - y^{\color{green}{k+1}}).$

Then the optimization over $x$ decomposes into $n$ independent subproblems that can be carried out on different processors.
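A compact sketch of this splitting for the illustrative choices $f_i(x_i)=\frac{1}{2}(x_i-b_i)^2$ and $g=\lambda{\|\cdot\|}_1$, where every $x_i$-update is independent and closed-form:

```python
import numpy as np

def consensus_admm(b, lam=0.5, beta=1.0, iters=100):
    x = np.zeros_like(b)
    y = np.zeros_like(b)
    lmbd = np.zeros_like(b)
    for _ in range(iters):
        # x_i-update, separable: argmin 0.5(x_i-b_i)^2 - lmbd_i x_i + beta/2 (x_i-y_i)^2
        x = (b + lmbd + beta * y) / (1.0 + beta)
        # y-update: soft thresholding from the l1 term
        v = x - lmbd / beta
        y = np.sign(v) * np.maximum(np.abs(v) - lam / beta, 0.0)
        # dual update
        lmbd = lmbd - beta * (x - y)
    return y

print(consensus_admm(np.array([2.0, 0.3, -1.0])))  # soft-thresholded b
```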
- http://users.isr.ist.utl.pt/~jmota/DADMM/
- http://repository.ust.hk/ir/Record/1783.1-66353
- https://ieeexplore.ieee.org/document/7039496
- DISTRIBUTED ASYNCHRONOUS OPTIMIZATION WITH THE ALTERNATING DIRECTION METHOD OF MULTIPLIERS
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers by S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
- Asynchronous Distributed ADMM for Consensus Optimization
- Notes on Distributed ADMM
Monotone operator splitting methods, which originated in the late 1970’s in the context of partial differential equations, have started to be highly effective for modeling and solving a wide range of data analysis and processing problems, in particular high-dimensional statistical data analysis.
Operator splitting is to decompose one complicated operator (procedure) into some simple operators (procedures). For example, ADMM splits the max-min operator of the augmented Lagrangian $$ \arg\min_{x,y}\max_{\lambda} L_{\beta}(x,y\mid \lambda) $$ into 3 operators: $$ \arg\min_{x}L_{\beta}(x,y\mid \lambda) \circ \arg\min_{y}L_{\beta}(x,y\mid \lambda) \circ \arg\max_{\lambda} L_{\beta}(x,y\mid \lambda). $$
They are really some block relaxation techniques.
- https://web.stanford.edu/class/ee364b/lectures/monotone_split_slides.pdf
- Operator Splitting by Professor Udell @ORIE 6326: Convex Optimization
- A note on the equivalence of operator splitting methods
- Splitting methods for monotone operators with applications to parallel optimization
- Operator Splitting Methods for Fast MPC
- https://staffportal.curtin.edu.au/staff/profile/view/Jie.Sun/
- Operator Splitting Methods in Data Analysis
- https://www.samsi.info/programs-and-activities/research-workshops/operator-splitting-methods-data-analysis/
- http://idda.cuhk.edu.cn/zh-hans/page/10297
- Random monotone operators and application to Stochastic Optimization
- ORQUESTRA - Distributed Optimization and Control of Large Scale Water Delivery Systems
- Ray is a fast and simple framework for building and running distributed applications.
- CoCoA: A General Framework for Communication-Efficient Distributed Optimization
- Federated Optimization: Distributed Machine Learning for On-Device Intelligence
As shown in Principle of Optimal Design, non-gradient methods are classified into 3 classes:
Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. One class of methods constructs a linear or quadratic model of the objective function and defines the next iterate by seeking to minimize this model inside a trust region. We pay particular attention to these model-based approaches because they are related to the unconstrained minimization methods described in earlier chapters. Other widely used DFO methods include the simplex-reflection method of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated annealing.
Heuristic methods will be introduced in computational intelligence, as well as Bayesian Optimization.
Let us start with an example and suppose that we want to solve $$ \arg\min_{x}f(x)=x^{2}+4\sin(2x).\tag{1} $$
The objective function is non-convex, with several local minima.
Another insightful example is to minimize the following cost function:
$$
x^{2}+4\sin(2x) - 1001\, \color{red}{\mathbb{I}_{\sqrt{2}}}(x) \tag{2}
$$
where the last part $\color{red}{\mathbb{I}_{\sqrt{2}}}(x)$ is equal to 1 when $x=\sqrt{2}$ and to 0 otherwise, so the global minimizer sits at an isolated point that no amount of local gradient information can reveal.
- Optimization and Assumptions @ Freemind
- Test functions for optimization
Another related method is graduated optimization
, which is a global optimization technique that attempts to solve a difficult optimization problem by initially solving a greatly simplified problem, and progressively transforming that problem (while optimizing) until it is equivalent to the difficult optimization problem. Further, when certain conditions exist, it can be shown to find an optimal solution to the final problem in the sequence. These conditions are:
- The first optimization problem in the sequence can be solved given the initial starting point.
- The locally convex region around the global optimum of each problem in the sequence includes the point that corresponds to the global optimum of the previous problem in the sequence.
(Figure omitted: graduated optimization.)
- https://www.wikiwand.com/en/Numerical_continuation
- Multi-Resolution Methods and Graduated Non-Convexity
In stochastic gradient descent, the estimated gradient is a partial sum of the population gradient, so it is necessary to compute the gradient of the sample function.
The Kiefer-Wolfowitz algorithm is the gradient-free version of stochastic gradient descent.
It is a recursive scheme to approximate the minimum or maximum of the form
$$
x^{k+1} =x^{k} - a_n \Delta(x^k)
$$
During the n-th stage, observations are taken at the two points $x^{k}\pm c_k$, and $\Delta(x^k)$ is the corresponding finite-difference estimate of the gradient.
- http://pawel.sawicz.eu/tag/kiefer-wolfowitz/
- A companion to the Kiefer-Wolfowitz algorithm
- Archive for Kiefer-Wolfowitz algorithm
- Strong Convergence of a Stochastic Approximation Algorithm
- Almost Sure Approximations to the Robbins-Monro and Kiefer-Wolfowitz Processes with Dependent Noise
- A Kiefer–Wolfowitz Algorithm with Randomized Differences
- Stochastic Approximation by Tze Leung Lai
Multi-Level Optimization is to optimize a related cheap surrogate function $\hat{f}$ in place of the expensive objective $F$.
Multi-level optimization methods repeat three steps:
- Cheap minimization of the modified $\hat{f}$: $$y^{k}=\arg\min_{x\in \mathbb{R}^p} \hat{f}(x) + \left<v_k, x\right>.$$
- Use $y^{k}$ to give a descent direction: $$x^{k+1} = x^k -a_k(x^k - y^k).$$
- Set $v_k$ to satisfy first-order coherence: $$v_{k+1}=\frac{L_f}{L_F} F^{\prime}(x^{k+1}) - f^{\prime}(x^{k+1}).$$
Panos Parpas
- Multilevel Optimization Methods: Convergence and Problem Structure
- A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization
- Multiresolution Algorithms for Faster Optimization in Machine Learning
The simplified scheme of work for the multilevel optimization
procedure can be represented as follows.
- Solving the optimization problem using a surrogate model. For this purpose, the method of indirect optimization based on the self-organization (IOSO) is used. This method allows finding the single solution for single-objective optimization or the Pareto-optimal set of solutions for multi-objective problems.
- For the obtained solution the indicators of efficiency are updated using the high-fidelity analysis tools.
- The adjustment of a current search region is performed.
- The adjustment of the surrogate model is performed. Depending upon the particular features of the applied mathematical simulation, the adjustment procedure can be performed using several approaches. One such approach involves constructing non-linear corrective dependencies. This includes evaluation of the results obtained with different fidelity analysis tools. The other possible approach is application of nonlinear estimation of surrogate model internal parameters.
- Replacement of the surrogate model by the adjusted one and the return to step (1).
- Multilevel Optimization: Convergence Theory, Algorithms and Application to Derivative-Free Optimization
- multilevel optimization iosotech
- OptCom: A Multi-Level Optimization Framework for the Metabolic Modeling and Analysis of Microbial Communities
Roberto Battiti and Mauro Brunato explain how reactive search optimization, the reactive affine shaker (RAS), and memetic algorithms work for optimization problems without gradients in Machine Learning plus Intelligent Optimization.
- Curriculum-based course timetabling solver; uses Tabu Search or Simulated Annealing
- Machine Learning plus Intelligent Optimization by Roberto Battiti and Mauro Brunato
- Reactive Search and Intelligent Optimization
- Zeroth-Order Method for Distributed Optimization With Approximate Projections
- Derivative-Free Optimization (DFO)
- Derivative Free Optimization / Optimisation sans Gradient
- Delaunay-based derivative-free optimization via global surrogates, part I: linear constraints
- Delaunay-based derivative-free optimization via global surrogates, part II: convex constraints
- https://projects.coin-or.org/Dfo
- https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/
- http://adl.stanford.edu/aa222/Lecture_Notes_files/chapter6_gradfree.pdf
- https://code.fb.com/ai-research/nevergrad/
- https://github.com/facebookresearch/nevergrad
- https://www.kthohr.com/optimlib.html
- https://www.infoq.com/presentations/black-box-optimization
There are more topics on optimization, such as this site, and more courses on optimization:
- Convex Optimization (Fall 2018) by Zaiwen Wen
- Courses (2010-) at Shanghai Jiao Tong University by Xiaoqun Zhang
- Optimisation: Master's degree in Modelling for Science and Engineering
- A Course on First-Order, Operator Splitting, and Coordinate Update Methods for Optimization by Wotao Yin
- Math 273C: Numerical Optimization by Wotao Yin
- Math 273, Section 1, Fall 2009: Optimization, Calculus of Variations, and Control Theory
- ECE236B - Convex Optimization (Winter Quarter 2018-19) by Prof. L. Vandenberghe, UCLA
- EECS 127 / 227AT: Optimization Models and Applications — Fall 2018 Instructors: A. Bayen, L. El Ghaoui.
- IEOR 262B: Mathematical Programming II by Professor Javad Lavaei, UC Berkeley
- https://web.stanford.edu/~boyd/teaching.html
- EE364b - Convex Optimization II
- Convex Optimization: Fall 2018 Machine Learning 10-725
- Algorithms for Convex Optimization
- Optimization by Vector Space Methods
- EE 227C (Spring 2018), Convex Optimization and Approximation
- https://ocw.mit.edu/courses/sloan-school-of-management/15-093j-optimization-methods-fall-2009/
- http://www.optimization-online.org/
- http://convexoptimization.com/
- More Optimization Online Links
- TUTORIALS AND BOOKS at http://plato.asu.edu/sub/tutorials.html.
- Provable Nonconvex Methods/Algorithms
- https://optimization.mccormick.northwestern.edu/index.php/Main_Page
- Non-convex Optimization for Machine Learning
- GLOBAL OPTIMALITY CONDITIONS FOR DEEP NEURAL NETWORKS
- http://www.vision.jhu.edu/assets/HaeffeleCVPR17.pdf
- https://www.springer.com/us/book/9783319314822
- https://core.ac.uk/display/83849901
- https://core.ac.uk/display/73408878
- https://zhuanlan.zhihu.com/p/51514687
- http://math.cmu.edu/~hschaeff/research.html
- https://people.eecs.berkeley.edu/~elghaoui/Teaching/EECS127/index.html
- https://blogs.princeton.edu/imabandit/
- http://www.probabilistic-numerics.org/research/index.html
- https://people.eecs.berkeley.edu/~brecht/eecs227c.html
- https://neos-guide.org/content/optimization-under-uncertainty
- Optimization and Gradient Descent on Riemannian Manifolds
- Approximation Theory & Convex Optimization
- https://people.maths.ox.ac.uk/riseth/
- https://www.aritradutta.com/
- http://sites.utexas.edu/mokhtari/
- https://www.di.ens.fr/~aspremon/