Last Update: August 16, 2020.
Note: conference talks are not included here, and they can be found on the publication page.
Despite the success of deep learning, from a theoretical standpoint, it remains absurdly unclear why deep learning is better than shallow learning. The main difficulty comes from that we do not actually know how the network features at intermediate layers are learned hierarchically. Why does the first layer learn edges, the second layer learn textures, the third layer learn patterns? Or, do they? (Spoiler alert, not really.) We present a theory towards explaining how hierarchical feature learning is done, by breaking the learning process into three principles: forward feature learning, backward feature correction, and feature purification. They together show how network features evolve during the training process, and elegantly match what happen in real world deep learning tasks. On the technical side, we give (1) the first theory showing that deep learning can be time efficient on certain learning tasks, where NO KNOWN shallow learning algorithms are efficient; and (2) the first theory showing how adversarial training of a neural network can lead to provably robust models, when lowcomplexity models such as linear models MUST be vulnerable to adversarial attacks even with infinite training data. 
MSR AI Seminar  Aug/11/2020  
We will present an accessible overview of recent advances on theory of deep learning. For example, we will try to address the following questions:

MSR AI Seminar Baidu Research 
Jul/16/2019 Sep/21/2019 

Packing and covering linear programs (PCLPs) form an important class of LPs across computer science, operations research, and optimization. In 1993, Luby and Nisan constructed an iterative algorithm for approximately solving PCLPs in nearly linear time, where the time complexity scales nearly linearly in N, the number of nonzero entries of the matrix, and polynomially in e, the (multiplicative) approximation error. Unfortunately, existing nearly lineartime algorithms [4, 10, 23, 31, 35, 36] for solving PCLP s require time at least proportional to 1/e^2. In this paper, we break this longstanding barrier by designing a solver that runs in time O(N/e). Joint work with Lorenzo Orecchia 
MIT  Mar/02/2018  
The diverse world of deep learning research has given rise to thousands of architectures for neural networks. However, to this date, the underlying training algorithms for neural networks are still stochastic gradient descent (SGD) and its heuristic variants. In this talk, we present a new stochastic algorithm to train any smooth neural network to \(\varepsilon\)approximate local minima, using \(O(\varepsilon^{3.25})\) backpropagations. The best provable result was \(O(\varepsilon^{4})\) by SGD before this work. More broadly, it finds \(\varepsilon\)approximate local minima of any smooth nonconvex function using only \(O(\varepsilon^{3.25})\) stochastic gradient computations 
University of Washington Princeton University NY Academy of Science 
Feb/13/2018 Mar/08/2018 Mar/09/2018 

The experimental design problem concerns the selection of \(k\) points from a potentially very large design pool of \(p\)dimensional vectors, so as to maximize the statistical efficiency regressed on the selected \(k\) design points. We achieve \(1+\varepsilon\) approximations to all popular optimality criteria for this problem, including Aoptimality, Doptimality, Toptimality, Eoptimality, Voptimality and Goptimality, with only \(k = O(p/\varepsilon^2)\) design points. In contrast, to the best of our knowledge, previously (1) no polynomialtime algorithm exists for achieving any \(1+\varepsilon\) approximations for D/T/E/Goptimality, and (2) the algorithm achieving \(1+\varepsilon\) approximation for A/Voptimality require \(k \geq \Omega(p^2/\varepsilon)\). Joint work with Yuanzhi Li, Aarti Singh, and Yining Wang. 
UC Berkeley (BLISS) University of Washington 
Oct/09/2017 Nov/07/2017  
We design a stochastic algorithm to train any smooth neural network to \(\varepsilon\)approximate local minima, using \(O(\varepsilon^{3.25})\) backpropagations. The best result was essentially \(O(\varepsilon^{4})\) by SGD. More broadly, it finds \(\varepsilon\)approximate local minima of any smooth nonconvex function in rate \(O(\varepsilon^{3.25})\), with only oracle access to stochastic gradients and Hessianvector products. 
UC Berkeley (Simons)  Oct/06/2017  
In this tutorial, we will provide an accessible and extensive overview on recent advances to optimization methods based on stochastic gradient descent (SGD), for both convex and nonconvex tasks. In particular, this tutorial shall try to answer the following questions with theoretical support. How can we properly use momentum to speed up SGD? What is the maximum parallel speedup can we achieve for SGD? When should we use dual or primaldual approach to replace SGD? What is the difference between coordinate descent (e.g. SDCA) and SGD? How is variance reduction affecting the performance of SGD? Why does the secondorder information help us improve the convergence of SGD? 
ICML 2017 Tutorial  Aug/6/2017  
Nesterov's momentum is famously known for being able to accelerate "fullgradient descent", and thus useful in building fast firstorder algorithms. However, in the offline stochastic setting, counter examples exist and prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex. In this talk, I shall discuss a Katyusha momentum framework that provides the first direct acceleration to stochastic gradient descent. This work is going to appear in STOC 2017. 
New York Theory Day  Apr/28/2017  
In STOC 2015, we developed a regret minimization framework for matrix games, and applied it to give faster algorithms for "finding graph linearsized spectral sparsifiers." In this talk, I will review and upgrade this framework to obtain nearly optimal algorithms for (1) online eigenvector and for (2) experiment design, two famous problems in learning theory and statistics respectively. 
University of Washington  Mar/10/2017  
Developing novel optimization frameworks can often enrich the dimension of algorithm design. In this talk, I will present a "linearcoupling" framework, and apply it to obtain the fastest firstorder algorithms for (1) packing linear programming, (2) stochastic convex optimization, and (3) stochastic nonconvex optimization. 
University of Washington  Mar/09/2017  
Many machine learning tasks can be solved via generic optimization methods such as gradient descent and stochastic gradient descent. In this talk, I will argue that this path of algorithm design may be restrictive and less efficient, I will show instead how to develop novel optimization tools to enrich the dimension of algorithm design. In particular, I will present a "linearcoupling" framework, and apply it to obtain faster firstorder algorithms for stochastic convex optimization (e.g., SVM, Lasso regression) and for stochastic nonconvex optimization (e.g., deep learning). 
Brown University  Mar/06/2017  
Many computational tasks can be solved via classical optimization methods such as gradient descent, SGD, LP, or SDP. In this talk, I will argue that this path of algorithm design may be restrictive and less efficient. I will show instead how to develop novel optimization tools to enrich the dimension of algorithm design. In particular, I will present a "linearcoupling" framework, and apply it to obtain the fastest firstorder algorithms for (1) packing LP, (2) stochastic convex optimization, and (3) stochastic nonconvex optimization. 
Microsoft Research Redmond Columbia University 
Mar/14/2017 Feb/28/2017  
In this talk I will show how to derive the fastest coordinate descent method [1] and the fastest stochastic gradient descent method [2], both from the linearcoupling framework [3]. I will relate them to linear system solving, conjugate gradient method, the Chebyshev approximation theory, and raise several open questions at the end. No prior knowledge is required on firstorder methods.
[1] Even Faster Accelerated Coordinate Descent Using NonUniform Sampling. 
Institute for Advanced Study  Nov/22/2016  
In this talk, I will discuss how to use optimization to obtain sublinear algorithms, as well as an interpolation between sublineartime and lineartime algorithms using optimization insights and acceleration techniques.

MSRNE / MIT  Oct/14/2016  

Institute for Advanced Study  Sep/26/2016  
Some computer science literature creates new algorithms by reducing the problem they solve to existing optimization methods (such as LP solvers, gradient descent, or followtheregularizedleader). In this talk, I will argue that this path of algorithm design is somewhat restrictive and less likely to yield the best possible algorithm. I will show that, on a conceptual level, how to design CS algorithms using optimization insights as a "guiding light"  rather than applying optimization solvers as blackboxs. With more liberty introduced in this way, one can better harvest the hidden potential of optimizationbased algorithms. More concretely, in two separate examples, I am going to demonstrate how to develop new optimization techniques, in order to (1) solve positive linear programs faster, beating the stateoftheart that has stood for ~20 years, and (2) understand how the Brain compresses data, contributing to the field of computational neuroscience. 
Boston University Simons A&G 
Apr/2/2015 Mar/11/2016  
Regret minimization over the PSD cone is a key component in many fast spectral algorithms, and yields nearlylinear time algorithms for many versions of graph partitioning problems. All applications of this idea rely on the matrix version of the classical multiplicative weight update method (MWU). In this talk, I explain how Matrix MWU naturally arises as an instance of the FollowtheRegularizedLeader method with the entropy regularizer. I will then generalize this approach to a larger class of regularizers. As an application, I will provide an alternative construction of the linearsized spectral sparsifiers of Batson, Spielman and Srivastava, and show how to construct these sparsifiers in almost quadratic time. 
UC Berkeley (Simons)
Princeton University 
Nov/5/2014 Apr/29/2016  
Firstorder methods play a central role in largescale convex optimization. Even though many variations exist, almost all such methods fundamentally rely on two types of algorithmic steps: gradientdescent step and mirrordescent step. In this paper, we observe that the performances of these two types of step are complementary, so that faster algorithms can be designed by linearly coupling the two steps. In particular, we show how to obtain a conceptually simple interpretation of Nesterov's accelerated gradient method, a cornerstone algorithm in convex optimization, as a simple linear coupling of the convergence analyses of the two underlying steps. We believe that the complementary view and the linear coupling proposed in this paper will prove very useful in the design of firstorder methods as it allows us to design fast algorithms in a conceptually easier way. For instance, our technique greatly facilitates the recent breakthroughs in solving packing and covering linear programs [AO14a, AO14b]. 
MSRNE INFORMS Annual Meeting 
Sep/30/2014 Nov/09/2014 

Fast iterative methods from Convex Optimization play a crucial role in a number of recent breakthroughs in the design of nearlylineartime algorithms for fundamental graph problems, such as maximum flow and graph partitioning. Multiplicative Weight Updates, Euclidean and NonEuclidean Gradient Descent, and Nesterov's Method have become a mainstay in the construction and analysis of fast algorithms. However, until recently, the relation between these different methods and the reason for their success have been somewhat murky. What is the exact relation between Multiplicative Weight Updates and Gradient Descent? Why do Multiplicative Weight Updates show up in so many settings? What is the intuition behind Nesterov's iterative algorithm that achieves the asymptotically optimal iteration count for smooth convex functions (hint: it isn't just an algebraic trick)? The answer to these questions was not clear. In this survey, we will provide answers by presenting a unified framework that reveals the power and limitations of each method and provides much needed geometric intuition. Among other insights, we will explain how Multiplicative Weight Updates are a particular instance of a dual iterative method, known as Mirror Descent, and how Nesterov's algorithm can be naturally derived as a combination of Mirror and Gradient Descent. This talk is based on joint work and also colectured with Lorenzo Orecchia. 
MSRNE  Apr/11/2014  
Fast algorithms for solving linear systems Ax=b when A is symmetric diagonally dominant (SDD) or graph Laplacian have found broad applications across both the theory and practice of computer science, playing a foundational role in scientific computing, machine learning, random processes, image processing, network analysis, and computational biology, among others. More recently, they have emerged as a powerful tool in the design of graph algorithms, enabling researchers to break longstanding algorithmic barriers for a wide range of fundamental graph problems, including finding maximum flows and multicommodity flows, generating random spanning trees, sparsifying graphs, routing in distributed systems, and finding sparse cuts and balanced separators. Finding fast solvers for these systems has been an active area of research, culminating in algorithms that solve them in nearlylinear time. These solvers are quite complex, and developing them required multiple fundamental innovations in spectral and combinatorial graph theory, graph algorithms, and computational linear algebra, which were used to construct and analyze an intricate recursively preconditioned iterative solver. In this talk, I will give a very simple combinatorial algorithm that solves SDD systems in nearlylinear time. It uses very little of the machinery that previously appeared to be necessary for a such an algorithm. It does not require recursive preconditioning, or even the Chebyshev Method or Conjugate Gradient. After constructing a "nice" spanning tree of a graph associated to the linear system, the entire algorithm consists of the repeated application of a simple (nonrecursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bitprecision required by previous solvers. As such, the algorithm presented has the fastest known running time under the standard unitcost RAM model. This is joint work with Jonathan Kelner, Lorenzo Orecchia and Aaron Sidford. 
MIT ICML 2013 workshop 
May/08/2013 Jun/19/2013 

Motivated by applications of largescale graph clustering, we study randomwalkbased local algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. All previously known such algorithms guarantee an output conductance of \(\tilde{O}(\sqrt{\phi(A)})\) when the target set A has conductance \(\phi(A)\in[0,1]\). In this paper, we improve it to \[\tilde{O}\bigg( \min\Big\{\sqrt{\phi(A)}, \frac{\phi(A)}{\sqrt{\mathsf{Conn}(A)}} \Big\} \bigg)\enspace, \] where the internal connectivity parameter \(\mathsf{Conn}(A)\in[0,1]\) is defined as the reciprocal of the mixing time of the random walk over the induced subgraph on A. For instance, using \(\mathsf{Conn}(A)=\Omega(\lambda(A)/\log n)\) where \(\lambda\) is the second eigenvalue of the Laplacian of the induced subgraph on A, our conductance guarantee can be as good as \(\tilde{O}(\phi(A)/\sqrt{\lambda(A)})\). This builds an interesting connection to the recent advance of the socalled improved Cheeger's Inequality [KKL+13], which says that global spectral algorithms can provide a conductance guarantee of \(O(\phi_{\mathsf{opt}}/\sqrt{\lambda_3})\) instead of \(O(\sqrt{\phi_{\mathsf{opt}}})\). In addition, we provide theoretical guarantee on the clustering accuracy (in terms of precision and recall) of the output set. We also prove that our analysis is tight, and perform empirical evaluation to support our theory on both synthetic and real data. It is worth noting that, our analysis outperforms prior work when the cluster is wellconnected. In fact, the better it is wellconnected inside, the more significant improvement (both in terms of conductance and accuracy) we can obtain. Our results shed light on why in practice some randomwalkbased algorithms perform better than its previous theory, and help guide future research about local clustering. This is joint work with Silvio Lattanzi and Vahab Mirrokni. 
MIT Google Research 
May/29/2013 Oct/31/2013 


Google Research  Aug/28/2012  

MSRAsia  Aug/19/2009  

MSRAsia  Apr/30/2009  

MSRAsia  Feb/03/2009 