﻿ Linear Coupling of Gradient Descent and Mirror Descent

In the summer of 2014, Lorenzo Orecchia and I created a new framework for designing first-order methods with pleasance and ease [1]. At a high level, we categorized the zoo of first-order methods as performing either gradient descent or mirror descent, and discovered a linear-coupling technique that allows one to combine the two methods to yield faster algorithms than either of the descent methods alone. This page is my attempt to keep track of all of relevant papers in this linear of my research.

# Smooth and Convex Problems

Consider the problem of minimizing a convex and smooth function $$f(x)$$. Gradient descent converges at a rate $$1/\varepsilon$$ but Nesterov's celebrated accelerated gradient descent method gives a $$1/\sqrt{\varepsilon}$$ optimal convergene rate.

• In [1], we use linearly coupling and recovers Nesterov's method with the same convergence rate. This paper also includes relevant definitions and examples for gradient descent and mirror descent. This is the simplest application of linear coupling, and many readers have found [1] a very pleasant read to understand Nesterov's method or to understand linear coupling.

In the more challenging coordinate-gradient setting, only a coordinate is given to the algorithm per iteration instead of a full gradient. Nesterov's accelerated coordinate gradient method is only suboptimal in this setting. In particular, if the coodinates are $$L_1,L_2,\dots,L_n$$ smooth respectively, the best known convergence result was due to Lee-Sidford in 2013 that depends on $$O(\sqrt{n (L_1+\cdots L_n)})$$.

• In our paper [7], we use linear coupling to improve this complexity to $$O(\sqrt{L_1}+\cdots \sqrt{L_n})$$, a much more natural complexity formula that can be up to $$\sqrt{n}$$ times faster than Lee-Sidford. Our performance gain also requires a noval choice of the probability distribution which was not observed before.
• Noticing that the complexity of [7] is rotation-variant, one can apply a preprocessing rotation in order to improve its running time. In our paper [8], we provide such an example with machine learning applications.

In the most challenging stochastic-gradient setting, only a random vector is given to the algorithm whose expectation equals to a full gradient. Such a random vector usually comes from the finite average structure of the objective. It was previously conjectured by some experts that "acceleration is impossible due to the noice in the stochastic gradient descent".

• In [10], we use linear coupling and obtain Katyusha, the first accelerated stochastic gradient descent method.

# Smooth and Nonconvex Problems

### Faster Stochastic Gradient Descent for Non-Convex Optimization

Consider the fundamental problem of minimizing a finite average of $$n$$ smooth but non-convex functions. In order to obtain an $$\varepsilon$$-approximate stationary point, the best known first-order methods were previously full gradient descent (that requires $$1/\varepsilon$$ iterations), and stochastic gradient descent that converges slowly in a $$1/\varepsilon^2$$ rate.

• In [9], we use linear coupling and provide a $$1/\varepsilon$$ convergence rate for variance-reduced stochastic gradient descent. Our algorithm is provably at least $$\Omega(n^{1/3})$$ times faster than full gradient descent, for the first time in the history.

# Non-Smooth and Convex Problems

### Positive Linear Programming

Positive LP consists of a pair of LPs: the primal one $$\max\{1^T x \,:\, x\geq 0, Ax\leq 1\}$$ is known as the packing LP, and the dual one $$\min\{1^T x \,:\, x\geq 0, Ax\geq 1\}$$ is known as covering LP, where $A$ is a nonnegative matrix. Positive LP is an extremely important class that contains zero-sum games, maximum weighted bipartite matching, and many others.

A central question across many subfields of computer science is how to solve positive LP in nearly-linear time $$\tilde{O}(N / \varepsilon^c)$$, where $$N$$ is the number of non-zero entries that describe the LP. A nearly-linear time algorithm is also parallel if it can be further divided into $$\tilde{O}(1/\varepsilon^c)$$ iterations, each requiring only matrix-vector multiplications so can be implemented to run in $$O(N)$$ time. Otherwise, we say that the algorithm is sequential.

• In [2], we use linear coupling to obtain a parallel positive LP solver with $$\tilde{O}(1/\varepsilon^3)$$ convergence rate, the first improvement over Luby-Nisan which was $$\tilde{O}(1 / \varepsilon^4)$$ back to 1993. This result was later improved to $$\tilde{O}(1/\varepsilon^2)$$ by [3] using similar techniques on top of linear coupling.
• in [4], we use linear coupling to obtain a sequential positive LP solver with $$\tilde{O}(N/\varepsilon)$$ total running time, the first improvement over Bartal-Byers-Raz which was $$\tilde{O}(N/\varepsilon^2)$$ back to 1997. Due to some technical issues, our solver yields only a suboptimal $$\tilde{O}(N/\varepsilon^{1.5})$$ running time for the dual covering LP, and this was later fixed to $$\tilde{O}(1/\varepsilon)$$ by [5].

### Positive Semidefinite Programming

Positive SDP is the natural SDP extention of positive LP, and includes many important relaxation problems such as MaxCut, BalancedSeparator, and SparsestCut. The central question in solving positive SDP is to construct a parallel solver that converges in $$\tilde{O}(1/\varepsilon^c)$$ iterations, where each iteration consists of only cheap, parallelizable algebraic computations.

• In [6], we use linear coupling (plus some clever new matrix inequality) to obtain an algorithm with only $$\tilde{O}(1/\varepsilon^2)$$ iterations, a first and significant improvement over Jain-Yao which requires $$\tilde{O}(1/\varepsilon^{13})$$ iterations in 2011.

# References

 "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods". By: Zeyuan Allen-Zhu. JMLR 2017: Journal of Machine Learning Research (to appear).STOC 2017: Symposium on Theory of Computing (abstract). We introduce Katyusha, the first direct, primal-only stochastic gradient method that has a provably accelerated convergence rate in convex optimization. In contrast, previous methods are based on dual coordinate descent which are more restrictive, or based on outer-inner loops which make them "blind" to the underlying stochastic nature of the optimization process. Katyusha is the first algorithm that incorporates acceleration directly into stochastic gradient updates. Unlike previous results, Katyusha obtains an optimal convergence rate. It also supports proximal updates, non-Euclidean norm smoothness, non-uniform sampling, and mini-batch sampling. When applied to interesting classes of convex objectives, including smooth objectives (e.g., Lasso, Logistic Regression), strongly-convex objectives (e.g., SVM), and non-smooth objectives (e.g., L1SVM), Katyusha improves the best known convergence rates. The main ingredient behind our result is Katyusha momentum, a novel "negative momentum on top of momentum" that can be incorporated into a variance-reduction based algorithm and speed it up. As a result, since variance reduction has been successfully applied to a fast growing list of practical problems, our paper suggests that in each of such cases, one had better hurry up and give Katyusha a hug. "Variance Reduction for Faster Non-Convex Optimization". By (α-β order): Zeyuan Allen-Zhu, Elad Hazan. ICML 2016: International Conference on Machine Learning. We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $$O(1/\varepsilon)$$ iterations for smooth objectives, and stochastic gradient descent that converges in $$O(1/\varepsilon^2)$$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $$O(1/\varepsilon)$$ rate, and is faster than full gradient descent by $$\Omega(n^{1/3})$$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets. "Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters". By (contribution): Zeyuan Allen-Zhu*, Yang Yuan*, Karthik Sridharan. NIPS 2016: Neural Information Processing Systems (spotlight). The amount of data available in the world is growing faster and bigger than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the most fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with the help from the clustering structure of the data. We introduce a simple notion of raw clustering that can be efficiently obtained with just one pass of the data, and propose two algorithms. Our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using the clustering information, and our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of each cluster. Our algorithms outperform their classical counterparts both in theory and practice. ﻿"Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling". By (α-β order): Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, Yang Yuan. ICML 2016: International Conference on Machine Learning. Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems. Up to a primal-dual transformation, it is also the same as accelerated stochastic gradient descent that is one of the central methods used in machine learning. In this paper, we improve the best known running time of accelerated coordinate descent by a factor up to $$\sqrt{n}$$. Our improvement is based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice. "Using Optimization to Obtain a Width-Independent, Parallel, Simpler, and Faster Positive SDP Solver". By (α-β order): Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. SODA 2016: Symposium on Discrete Algorithms. We study the design of polylogarithmic depth algorithms for approximately solving packing and covering semidefinite programs (or positive SDPs for short). This is a natural SDP generalization of the well-studied positive LP problem. Although positive LPs can be solved in polylogarithmic depth while using only $$\log^{2} n/\varepsilon^3$$ parallelizable iterations [3], the best known positive SDP solvers due to Jain and Yao [18] require $$\log^{14} n /\varepsilon^{13}$$ parallelizable iterations. Several alternative solvers have been proposed to reduce the exponents in the number of iterations [19, 29]. However, the correctness of the convergence analyses in these works has been called into question [29], as they both rely on algebraic monotonicity properties that do not generalize to matrix algebra. In this paper, we propose a very simple algorithm based on the optimization framework proposed in [3] for LP solvers. Our algorithm only needs $$\log^2 n / \varepsilon^3$$ iterations, matching that of the best LP solver. To surmount the obstacles encountered by previous approaches, our analysis requires a new matrix inequality that extends Lieb-Thirring's inequality, and a sign-consistent, randomized variant of the gradient truncation technique proposed in [2, 3]. "Unified Acceleration Method for Packing and Covering Problems via Diameter Reduction". By: Di Wang, Satish Rao, Michael W. Mahoney. August 2015. "Nearly-Linear Time Positive LP Solver with Faster Convergence Rate". By (α-β order): Zeyuan Allen-Zhu, Lorenzo Orecchia. STOC 2015: Symposium on Theory of Computing. Packing and covering linear programs (LP) are an important class of problems that bridges computer science, operation research, and optimization. Efficient algorithms for solving such LPs have received significant attention in the past 20 years [LN93, PST95, BBR97, You01, Nem04, BI04, BBR04, Nes05, AK08, AHK12, KY13, You14, AO15]. Unfortunately, all known nearly-linear time algorithms for producing $$(1+\varepsilon)$$-approximate solutions to packing and covering LPs have a running time dependence that is at least proportional to $$\varepsilon^{-2}$$. This is also known as an $$O(1/\sqrt{T})$$ convergence rate and is particularly poor in many applications. In this paper, we leverage insights from optimization theory to break this longstanding barrier. Our algorithms solve the packing LP in time $$\tilde{O}(N \varepsilon^{-1})$$ and the covering LP in time $$\tilde{O}(N \varepsilon^{-1.5})$$. At high level, they can be described as linear couplings of several first-order descent steps. This is the first application of our linear coupling technique (see [AO14]) to problems that are not amenable to blackbox applications known iterative algorithms in convex optimization. Our work also introduces a sequence of new techniques, including the stochastic and the non-symmetric execution of gradient truncation operations, which may be of independent interest. "Faster Parallel Solver for Positive Linear Programs via Dynamically-Bucketed Selective Coordinate Descent". By: Di Wang, Michael Mahoney, Nishanth Mohan, Satish Rao. November 2015. "Using Optimization to Solve Positive LPs Faster in Parallel". By (α-β order): Zeyuan Allen-Zhu, Lorenzo Orecchia. SODA 2015: Symposium on Discrete Algorithms. We study the design of nearly-linear-time algorithms for approximately solving positive linear programs. Both the parallel and the sequential deterministic versions of these algorithms require $$\tilde{O}(\varepsilon^{-4})$$ iterations, a dependence that has not been improved since the introduction of these methods in 1993 by Luby and Nisan. Moreover, previous algorithms and their analyses rely on update steps and convergence arguments that are combinatorial in nature, and do not seem to arise naturally from an optimization viewpoint. In this paper, we leverage insights from optimization theory to construct a novel algorithm that breaks the longstanding $$\tilde{O}(\varepsilon^{-4})$$ barrier. Our algorithm has a simple analysis and a clear motivation. Our work introduces a number of novel techniques, such as the combined application of gradient descent and mirror descent, and a truncated, smoothed version of the standard multiplicative weight update, which may be of independent interest. "Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent". By (α-β order): Zeyuan Allen-Zhu, Lorenzo Orecchia. ITCS 2017: Innovations in Theoretical Computer Science.INFORMS Annual Meeting 2014. First-order methods play a central role in large-scale machine learning. Even though many variations exist, each suited to a particular problem, almost all such methods fundamentally rely on two types of algorithmic steps: gradient descent, which yields primal progress, and mirror descent, which yields dual progress. We observe that the performances of gradient and mirror descent are complementary, so that faster algorithms can be designed by linearly coupling the two. We show how to reconstruct Nesterov's accelerated gradient methods using linear coupling, which gives a cleaner interpretation than Nesterov's original proofs. We also discuss the power of linear coupling by extending it to many other settings that Nesterov's methods cannot apply to.