Research Interests

My current research focuses on investigating the physics of language models, and of AI more broadly. This involves designing experiments to elucidate the fundamental principles governing how transformers/GPTs learn to accomplish diverse AI tasks. By probing the neurons of pre-trained transformers, my goal is to uncover and comprehend the intricate (and sometimes surprising!) physical mechanisms behind large language models. Our first paper in this series can be found on arXiv.

Before that, I worked on the mathematics of deep learning. This involved developing rigorous theoretical proofs of the learnability of neural networks, in idealized and theory-friendly settings, to explain certain mysterious phenomena observed in deep learning. In this area, our paper on ensemble / knowledge distillation received an Outstanding Paper Honorable Mention at ICLR 2023, although I am most proud of our COLT 2023 result that provably shows why deep learning is actually deep, that is, better than shallow learners such as layer-wise training, kernel methods, etc.

In my past life, I also worked on machine learning, optimization theory, and theoretical computer science.

Acknowledgements

I would love to thank my wonderful collaborators, without whom the results below would never have been accomplished. In reverse chronological order:

Olga Golovneva (1), Jason Weston (1), Sainbayar Sukhbaatar (1), Cathy Li (1), Emily Wenger (1), Francois Charton (1), Kristin Lauter (1), Edward J. Hu (1), Yelong Shen (1), Phillip Wallis (1), Shean Wang (1), Faeze Ebrahimian (1), Yingyu Liang (1), Zhao Song (2), Michael Jordan (1), Chi Jin (1), David Simchi-Levi (1), Xinshang Wang (1), Dan Alistarh (2), Jerry Li (2), Sébastien Bubeck (2), Ankit Garg (1), Wei Hu (1), Avi Wigderson (2), Rafael Oliveira (2), Yining Wang (2), Aarti Singh (2), Naman Agarwal (1), Brian Bullins (1), Tengyu Ma (1), Yuanzhi Li (26), Elad Hazan (4), Karthik Sridharan (1), Yang Yuan (4), Peter Richtárik (1), Zheng Qu (1), Zhenyu Liao (2), Yin Tat Lee (1), Lorenzo Orecchia (8), Aditya Bhaskara (1), Ilya Razenshteyn (1), Rati Gelashvili (2), Nir Shavit (1), Aaron Sidford (1), Silvio Lattanzi (2), Vahab Mirrokni (2), Jonathan Kelner (2), Sasa Misailovic (1), Martin Rinard (1), Alessandro Chiesa (4), Silvio Micali (5), Wei Chen (1), Pinyan Lu (2), Xiaorui Sun (2), Bo Tang (1), Yajun Wang (2), Zheng Chen (5), Tom Minka (1), Haixun Wang (1), Dong Wang (1), Chenguang Zhu (5), Gang Wang (4), Weizhu Chen (6).

Working Papers

  1. "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    Apr 2024 (version 1)

  1. "Reverse Training to Nurse the Reversal Curse"
    By: Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar.
    Mar 2024

Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A"; this is termed the Reversal Curse. Due to Zipf's law, the issue persists even when training on trillions of tokens, and hence even if we train on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.
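
To make the scheme concrete, here is a minimal Python sketch of entity-preserving reversal (illustrative only; the tokenization, the entity spans, and the example sentence are made up for this note and are not the paper's data or code):

```python
# Minimal sketch of entity-preserving sequence reversal (illustrative only,
# not the paper's implementation). Entities are given as (start, end) spans.
def reverse_preserving_entities(tokens, entity_spans):
    """Reverse a token list while keeping each entity span's internal order."""
    units, i, span_idx = [], 0, 0
    spans = sorted(entity_spans)
    while i < len(tokens):
        if span_idx < len(spans) and i == spans[span_idx][0]:
            start, end = spans[span_idx]
            units.append(tokens[start:end])   # keep the entity's word order intact
            i, span_idx = end, span_idx + 1
        else:
            units.append([tokens[i]])
            i += 1
    return [tok for unit in reversed(units) for tok in unit]

tokens = ["Daphne", "Barrington", "directed", "A", "Journey", "Through", "Time"]
entities = [(0, 2), (3, 7)]   # "Daphne Barrington" and "A Journey Through Time"
print(reverse_preserving_entities(tokens, entities))
# ['A', 'Journey', 'Through', 'Time', 'directed', 'Daphne', 'Barrington']
```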

  1. "Physics of Language Models: Part 3.2, Knowledge Manipulation"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    Sep 2023 (version 1)

  1. "Physics of Language Models: Part 3.1, Knowledge Storage and Extraction"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    Sep 2023 (version 1)

  1. "Physics of Language Models: Part 1, Context-Free Grammar"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    Oct 2023 (version 2)

We design experiments to study how generative language models, like GPT, learn context-free grammars (CFGs): diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and human logic. CFGs are as hard as pushdown automata, and can be ambiguous, so that verifying whether a string satisfies the rules requires dynamic programming. We construct synthetic data and demonstrate that even for very challenging CFGs, pre-trained transformers can learn to generate sentences with near-perfect accuracy and remarkable diversity.

More importantly, we delve into the physical principles behind how transformers learn CFGs. We discover that the hidden states within the transformer implicitly and precisely encode the CFG structure (such as putting tree node information exactly on the subtree boundary), and that the model learns to form "boundary to boundary" attention patterns that resemble dynamic programming. We also cover some extensions of CFGs as well as the robustness of transformers against grammar mistakes. Overall, our research provides a comprehensive and empirical understanding of how transformers learn CFGs, and reveals the physical mechanisms transformers use to capture the structure and rules of languages.
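
For a flavor of the kind of synthetic data involved, here is a tiny Python sketch that samples strings from a toy CFG; the grammar below is a made-up example and is far simpler than the grammars studied in the paper:

```python
import random

# Toy CFG sampler (illustrative only; not one of the grammars from the paper).
# A grammar maps each non-terminal to its possible right-hand sides; symbols
# that do not appear as keys are terminals.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "adj", "noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}

def sample(symbol="S"):
    if symbol not in GRAMMAR:                 # terminal symbol
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])      # pick a production at random
    return [tok for sym in rhs for tok in sample(sym)]

random.seed(0)
for _ in range(3):
    print(" ".join(sample()))
```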

Published Papers

  1. "SALSA VERDE: a machine learning attack on Learning With Errors with sparse small secrets"
    By: Cathy Li, Emily Wenger, Zeyuan Allen-Zhu, Francois Charton, Kristin Lauter.
    NeurIPS 2023: Neural Information Processing Systems.

Learning with Errors (LWE) is a hard math problem used in post-quantum cryptography. Homomorphic Encryption (HE) schemes rely on the hardness of the LWE problem for their security, and two LWE-based cryptosystems were recently standardized by NIST for digital signatures and key exchange (KEM). Thus, it is critical to continue assessing the security of LWE and specific parameter choices. For example, HE uses small secrets, and the HE community has considered standardizing small sparse secrets to improve efficiency and functionality. However, prior work, SALSA and PICANTE, showed that ML attacks can recover sparse binary secrets. Building on these, we propose VERDE, an improved ML attack that can recover sparse binary, ternary, and small Gaussian secrets. Using improved preprocessing and secret recovery techniques, VERDE can attack LWE with larger dimensions (n=512) and smaller moduli (\(\log_2(q)=12\) for n=256), using less time and power. We propose novel architectures for scaling. Finally, we develop a theory that explains the success of ML LWE attacks.

  1. "Backward Feature Correction: How Deep Learning Performs Deep Learning"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    COLT 2023: Conference on Learning Theory.

 
  1. "Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    ICLR 2023: International Conference on Learning Representations.

 
  1. "Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    ICLR 2023: International Conference on Learning Representations.
    ICLR 2023 Outstanding Paper Honorable Mention (top 9 out of 4966 submissions)

 
  1. "LoRA: Low-Rank Adaptation of Large Language Models"
    By: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Weizhu Chen.
    ICLR 2022: International Conference on Learning Representations.
    See huggingface pages on LoRA for large language models and for stable diffusion.

The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on-par or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA.
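
For readers who want to see the core idea in code, below is a minimal PyTorch sketch of a LoRA-style linear layer (an illustration only, not the reference implementation): the pre-trained weight is frozen, and a trainable low-rank update BA, scaled by alpha/r, is added on top; the initialization and default hyperparameters here are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: y = x W^T + (alpha / r) * x A^T B^T.
    Only A and B are trainable; the pre-trained weight W is frozen.
    (Illustrative sketch, not the reference LoRA implementation.)"""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)   # stands in for a pre-trained weight
        self.weight.requires_grad_(False)        # frozen during adaptation
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # B = 0, so the
        self.scaling = alpha / r                 # low-rank update starts as a no-op

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288 trainable vs. 768 * 768 = 589824 frozen
```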

 
  1. "Feature Purification: How Adversarial Training Performs Robust Deep Learning"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    FOCS 2021: Symposium on Foundations of Computer Science.

  1. "Byzantine-Resilient Non-Convex Stochastic Gradient Descent"
    By: Zeyuan Allen-Zhu, Faeze Ebrahimian, Jerry Li, Dan Alistarh.
    ICLR 2021: International Conference on Learning Representations.

We study adversary-resilient stochastic distributed optimization, in which \(m\) machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an \(\alpha\)-fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging non-convex case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present.

Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.

  1. "Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach".
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang.
    MAPR 2020: Mathematical Programming.

The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency of regression on the selected k design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for T-optimality, exact optimization is NP-hard.

We propose a polynomial-time regret minimization framework to achieve a \((1+\varepsilon)\) approximation with only \(O(p/\varepsilon^2)\) design points, for all the optimality criteria above.

In contrast, to the best of our knowledge, before our work, no polynomial-time algorithm achieves \((1+\varepsilon)\) approximations for D/E/G-optimality, and the best poly-time algorithm achieving \((1+\varepsilon)\)-approximation for A/V-optimality requires \(k=\Omega(p^2/\varepsilon)\) design points.

  1. "What Can ResNet Learn Efficiently, Going Beyond Kernels?"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    NeurIPS 2019: Neural Information Processing Systems.

  1. "Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers"
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang.
    NeurIPS 2019: Neural Information Processing Systems.

The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why don't trained neural networks overfit when they are overparameterized (namely, when they have more parameters than statistically needed to overfit the training data)?

In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the overparameterized network.

  1. "Can SGD Learn Recurrent Neural Networks with Provable Generalization?"
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    NeurIPS 2019: Neural Information Processing Systems.

Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC learning language, what concept class can they learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens, without affecting each other? Existing generalization bounds for RNNs scale exponentially with the input length, significantly limiting their practical implications.

In this paper, we show that, using vanilla stochastic gradient descent (SGD), RNNs can actually learn some notable concept class efficiently, meaning that both time and sample complexity scale polynomially in the input length (or almost polynomially, depending on the concept). This concept class at least includes functions where each output token is generated from the inputs of earlier tokens using a smooth two-layer neural network.

  1. "On the Convergence Rate of Training Recurrent Neural Networks"
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song.
    NeurIPS 2019: Neural Information Processing Systems.

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper?

In this paper, we focus on recurrent neural networks (RNNs), which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the same recurrent unit is repeatedly applied across the entire time horizon of length L, which is analogous to a feedforward network of depth L. We show that when the number of neurons is sufficiently large, meaning polynomial in the training data size and in L, SGD is capable of minimizing the regression loss at a linear convergence rate. This gives theoretical evidence of how RNNs can memorize data.

More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and we build a perturbation theory to analyze the first-order approximation of multi-layer networks.

  1. "A Convergence Theory for Deep Learning via Over-Parameterization"
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song.
    ICML 2019: International Conference on Machine Learning.

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice have been going wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer. The theory of multi-layer networks remains somewhat unsettled.

In this work, we prove why simple algorithms such as stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in \(L\), the number of DNN layers, and in \(n\), the number of training samples.

As concrete examples, starting from randomly initialized weights, we show that SGD attains 100% training accuracy in classification tasks, or minimizes regression loss in linear convergence speed \(\varepsilon \propto e^{-\Omega(T)}\), with running time polynomial in \(n\) and \(L\). Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

  1. "Natasha 2: Faster Non-Convex Optimization Than SGD".
    By: Zeyuan Allen-Zhu.
    NeurIPS 2018: Neural Information Processing Systems.

We design a stochastic algorithm to train any smooth neural network to \(\varepsilon\)-approximate local minima, using \(O(\varepsilon^{-3.25})\) backpropagations. The previously best known result was essentially \(O(\varepsilon^{-4})\), achieved by SGD.

More broadly, it finds \(\varepsilon\)-approximate local minima of any smooth nonconvex function in rate \(O(\varepsilon^{-3.25})\), with only oracle access to stochastic gradients.

  1. "How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD".
    By: Zeyuan Allen-Zhu.
    NeurIPS 2018: Neural Information Processing Systems.

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives \(f(x)\). However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when \(f(x)\) is convex.

If \(f(x)\) is convex, to find a point with gradient norm \(\varepsilon\), we design an algorithm SGD3 with a near-optimal rate \(\tilde{O}(\varepsilon^{-2})\), improving the best known rate \(O(\varepsilon^{-8/3})\).

If \(f(x)\) is nonconvex, to find its \(\varepsilon\)-approximate local minimum, we design an algorithm SGD5 with rate \(\tilde{O}(\varepsilon^{-3.5})\), where previously SGD variants are only known to achieve rate \(\tilde{O}(\varepsilon^{-4})\).

  1. "Is Q-Learning Provably Efficient?"
    By: Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, Michael Jordan.
    NeurIPS 2018: Neural Information Processing Systems.
    ICML 2018 eRL workshop best paper award ($1000).

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions.

We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret \(\tilde{O}(\sqrt{H^3SAT})\), where S and A are the numbers of states and actions, H is the number of steps per episode, and T is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single \(\sqrt{H}\) factor. This is the first analysis in the model-free setting that establishes \(\sqrt{T}\) regret without requiring access to a "simulator."
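
To convey the flavor of the algorithm, here is a Python sketch of a single episode of tabular Q-learning with a Hoeffding-style UCB bonus; the learning rate \((H+1)/(H+t)\) follows the paper, while the bonus constant c, the failure probability delta, and the `env` interface are placeholders for this sketch.

```python
import numpy as np

def run_episode(env, Q, V, N, H, S, A, T, c=1.0, delta=0.1):
    """One episode of tabular Q-learning with a UCB (Hoeffding-style) bonus.
    Sketch only: the constant c and the `env` interface (reset/step returning
    state and reward) are placeholders; Q should be initialized optimistically
    (e.g., to H) by the caller. Shapes: Q, N are (H, S, A); V is (H, S)."""
    iota = np.log(S * A * T / delta)              # log factor inside the bonus
    s = env.reset()
    for h in range(H):
        a = int(np.argmax(Q[h, s]))               # act greedily w.r.t. optimistic Q
        s_next, r = env.step(a)
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                 # learning rate from the paper
        bonus = c * np.sqrt(H ** 3 * iota / t)    # UCB exploration bonus
        next_v = V[h + 1, s_next] if h + 1 < H else 0.0
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + next_v + bonus)
        V[h, s] = min(H, Q[h, s].max())           # optimistic value, capped at H
        s = s_next
```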

  1. "The Lingering of Gradients: How to Reuse Gradients Over Time".
    By: Zeyuan Allen-Zhu, David Simchi-Levi, Xinshang Wang.
    NeurIPS 2018: Neural Information Processing Systems.

Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the "lingering" of gradients: once a gradient is computed at \(x_k\), the additional time to compute gradients at \(x_{k+1},x_{k+2},\dots\) may be reduced.

We show how this improves the running time of gradient descent and SVRG. For instance, if the "additional time" scales linearly with respect to the traveled distance, then the "convergence rate" of gradient descent can be improved from \(1/T\) to \(\exp(-T^{1/3})\). On the empirical side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module application with 4.6m users to \(10^{-6}\) error (or \(10^{-12}\) dual error) using 6 passes of the dataset.

  1. "Byzantine Stochastic Gradient Descent".
    By: Dan Alistarh, Zeyuan Allen-Zhu, Jerry Li.
    NeurIPS 2018: Neural Information Processing Systems.

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the \(m\) machines which allegedly compute stochastic gradients every iteration, an \(\alpha\)-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds \(\varepsilon\)-approximate minimizers of convex functions in \(T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)\) iterations. In contrast, traditional mini-batch SGD needs \(T = O\big( \frac{1}{\varepsilon^2 m} \big)\) iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.

  1. "Neon2: Finding Local Minima via First-Order Oracles".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    NeurIPS 2018: Neural Information Processing Systems.

We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm's performance.

As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding algorithms, outperforming some of the best known results.

  1. "Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization".
    By: Zeyuan Allen-Zhu.
    ICML 2018: International Conference on Machine Learning.

The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are averages of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more.

We show how to provably obtain an accelerated stochastic algorithm for minimizing sum-of-nonconvex functions, by adding one additional line to the well-known SVRG method. This line corresponds to momentum, and shows how to directly apply momentum to the finite-sum stochastic minimization of sum-of-nonconvex functions. As a side result, our method enjoys a linear parallel speed-up using mini-batches.

  1. "Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits".
    By: Zeyuan Allen-Zhu, Sébastien Bubeck, Yuanzhi Li.
    ICML 2018: International Conference on Machine Learning.
    Solves a COLT 2017 $250 open problem.

Regret bounds in online learning compare the player's performance to \(L^*\), the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon \(T\). The more refined concept of first-order regret bound replaces this with a scaling \(\sqrt{L^*}\), which may be much smaller than \(\sqrt{T}\). It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.

  1. "Operator Scaling via Geodesically Convex Optimization, Invariant Theory and Polynomial Identity Testing".
    By: Zeyuan Allen-Zhu, Ankit Garg, Yuanzhi Li, Rafael Oliveira, Avi Wigderson.
    STOC 2018: Symposium on Theory of Computing.

We propose a new second-order method for geodesically convex optimization on the natural hyperbolic metric over positive definite matrices. We apply it to solve the operator scaling problem in time polynomial in the input size and logarithmic in the error. This is an exponential improvement over previous algorithms which were analyzed in the usual Euclidean, commutative metric (for which the above problem is not convex). Our method is general and applicable to other settings.

As a consequence, we solve the equivalence problem for the left-right group action underlying the operator scaling problem, and derandomize a new class of Polynomial Identity Testing (PIT) problems, which was the original motivation for studying operator scaling.

  1. "Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls".
    By: Zeyuan Allen-Zhu, Elad Hazan, Wei Hu, Yuanzhi Li.
    NeurIPS 2017: Neural Information Processing Systems.

We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation (1-SVD) in Frank-Wolfe with a top-k singular-vector computation (k-SVD), which can be done by repeatedly applying 1-SVD k times. Our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most k. This improves the convergence rate and the total complexity of the Frank-Wolfe method and its variants.
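
For reference, here is a Python sketch of classical Frank-Wolfe over a trace-norm ball on a toy matrix least-squares problem, where the linear minimization oracle is a 1-SVD of the gradient; the paper's rank-k variant replaces this oracle with a k-SVD (and attains linear convergence under its assumptions), which is not reproduced here.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Classical Frank-Wolfe over {X : ||X||_* <= tau} for f(X) = 0.5*||X - M||_F^2
# (illustrative sketch). The LMO is a rank-1 SVD of the gradient; the paper's
# variant uses a top-k SVD instead.
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 30))   # rank-5 target
tau = np.linalg.svd(M, compute_uv=False).sum()                    # nuclear norm of M

X = np.zeros_like(M)
for t in range(201):
    G = X - M                                    # gradient of f at X
    u, _, vt = svds(G, k=1)                      # top singular pair (the 1-SVD)
    S = -tau * np.outer(u[:, 0], vt[0])          # minimizer of <G, S> over the ball
    D = S - X
    gamma = min(1.0, max(0.0, -(G * D).sum() / (D * D).sum()))    # exact line search
    X = X + gamma * D
    if t % 50 == 0:
        print(t, 0.5 * np.linalg.norm(X - M) ** 2)   # non-increasing, tends to 0
```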

  1. "First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    FOCS 2017: Symposium on Foundations of Computer Science.

We study online principal component analysis (PCA), that is, finding the top k eigenvectors of a hidden \(d\times d\) matrix \(\mathbf{\Sigma}\) using online data samples drawn from a distribution with covariance \(\mathbf{\Sigma}\).

We provide global convergence for Oja's algorithm which is popularly used in practice but lacks theoretical understanding for \(k>1\). We also provide a modified variant \(\mathsf{Oja}^{++}\) that runs even faster than Oja's. Our results match the information theoretic lower bound in terms of dependency on error, on eigengap, on rank \(k\), and on dimension \(d\), up to poly-log factors. In addition, our convergence rate can be made gap-free, that is proportional to the approximation error and independent of the eigengap.

In contrast, for general rank k, before our work (1) it was open to design any algorithm with efficient global convergence rate; and (2) it was open to design any algorithm with (even local) gap-free convergence rate.
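
For reference, here is a minimal numpy sketch of the classical Oja iteration for streaming k-PCA, the baseline analyzed in the paper (the Oja++ variant and the learning-rate schedule from the paper are not reproduced; the constant step size here is a placeholder):

```python
import numpy as np

def oja_streaming_kpca(sample_stream, d, k, eta=0.01):
    """Classical Oja iteration for streaming k-PCA (illustrative sketch)."""
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for x in sample_stream:
        Q = Q + eta * np.outer(x, x @ Q)               # Oja update: Q += eta * x x^T Q
        Q, _ = np.linalg.qr(Q)                         # re-orthonormalize
    return Q

# Toy usage: the covariance has its two largest eigenvalues on the first two axes,
# so the recovered columns should roughly align with those coordinate directions.
rng = np.random.default_rng(1)
cov = np.diag([5.0, 1.0, 0.5, 0.1])
stream = (rng.multivariate_normal(np.zeros(4), cov) for _ in range(5000))
print(np.round(oja_streaming_kpca(stream, d=4, k=2), 2))
```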

  1. "Much Faster Algorithms for Matrix Scaling".
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, Avi Wigderson.
    FOCS 2017: Symposium on Foundations of Computer Science.

We develop several efficient algorithms for the classical Matrix Scaling problem, which is used in many diverse areas, from preconditioning linear systems to approximation of the permanent. On an input \(n\times n\) matrix \(A\), this problem asks to find diagonal scaling matrices \(X, Y\) (if they exist), so that \(X A Y\) \(\varepsilon\)-approximates a doubly stochastic matrix, or more generally a matrix with prescribed row and column sums.

We address the general scaling problem as well as some important special cases. In particular, if \(A\) has \(m\) nonzero entries, and if there exist \(X, Y\) with polynomially large entries such that \(X A Y\) is doubly stochastic, then we can solve the problem in total complexity \(\tilde{O}(m + n^{4/3})\). This greatly improves on the best known previous results, which were either \(\tilde{O}(n^4)\) or \(O(m n^{1/2}/\varepsilon)\).
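
For context, the classical Sinkhorn-Knopp iteration below is the textbook baseline for matrix scaling; the faster algorithms of this paper are different and are not reproduced here.

```python
import numpy as np

def sinkhorn_knopp(A, eps=1e-9, max_iter=10_000):
    """Classical Sinkhorn-Knopp baseline (not this paper's algorithm): alternately
    rescale rows and columns of a positive matrix A until diag(x) A diag(y) is
    approximately doubly stochastic."""
    n = A.shape[0]
    y = np.ones(n)
    for _ in range(max_iter):
        x = 1.0 / (A @ y)                # make all row sums equal to 1
        y = 1.0 / (A.T @ x)              # make all column sums equal to 1
        B = x[:, None] * A * y[None, :]
        if np.abs(B.sum(axis=1) - 1).max() < eps and np.abs(B.sum(axis=0) - 1).max() < eps:
            break
    return np.diag(x), np.diag(y)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
X, Y = sinkhorn_knopp(A)
print(X @ A @ Y)    # approximately doubly stochastic
```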

  1. "Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    ICML 2017: International Conference on Machine Learning.

Matrix multiplicative weight update (MMWU) is an extremely powerful algorithmic tool for computer science and related fields. However, it comes with a slow running time due to the matrix exponential and eigendecomposition computations. For this reason, many researchers have studied the follow-the-perturbed-leader (FTPL) framework, which is faster but whose regret is a factor \(\sqrt{d}\) worse than the optimal regret of MMWU for dimension-d matrices.

In this paper, we propose a follow-the-compressed-leader framework which not only matches the optimal regret of MMWU (up to polylog factors), but also runs even faster than FTPL.

Our main idea is to compress the matrix exponential computation to dimension 3 in the adversarial setting, or dimension 1 in the stochastic setting. This result resolves an open question regarding how to obtain both (nearly) optimal and efficient algorithms for the online eigenvector problem.

  1. "Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter".
    By: Zeyuan Allen-Zhu.
    ICML 2017: International Conference on Machine Learning.

Given a non-convex function \(f(x)\) that is an average of n smooth functions, we design stochastic first-order methods to find its approximate stationary points. The performance of our new methods depends on the smallest (negative) eigenvalue \(-\sigma\) of the Hessian. This parameter \(\sigma\) captures how strongly non-convex \(f(x)\) is, and is analogous to the strong convexity parameter in convex optimization.

Our methods outperform the best known results for a wide range of \(\sigma\), and can also be used to find approximate local minima.

In particular, we find an interesting dichotomy: there exists a threshold \(\sigma_0\) so that the fastest methods for \(\sigma>\sigma_0\) and for \(\sigma<\sigma_0\) have drastically different behaviors: the former scales with \(n^{2/3}\) and the latter scales with \(n^{3/4}\).

  1. "Near-Optimal Design of Experiments via Regret Minimization".
    By: Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, Yining Wang.
    ICML 2017: International Conference on Machine Learning.

This paper has been superseded by our subsequent paper "Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach".

  1. "Faster Principal Component Regression and Stable Matrix Chebyshev Approximation".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    ICML 2017: International Conference on Machine Learning.

We solve principal component regression (PCR), up to a multiplicative accuracy of \(1+\gamma\), by reducing the problem to \(\tilde{O}(\gamma^{-1})\) black-box calls of ridge regression. Therefore, our algorithm does not require any explicit construction of the top principal components, and is suitable for large-scale PCR instances. In contrast, the best previous result requires \(\tilde{O}(\gamma^{-2})\) such black-box calls.

We obtain this result by developing a general stable recurrence formula for matrix Chebyshev polynomials, and a degree-optimal polynomial approximation to the matrix sign function. Our techniques may be of independent interest, especially when designing iterative methods.

  1. "Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    ICML 2017: International Conference on Machine Learning.

We study k-GenEV, the problem of finding the top k generalized eigenvectors, and k-CCA, the problem of finding the top k vectors in canonical-correlation analysis. We propose algorithms LazyEV and LazyCCA to solve the two problems with running times linearly dependent on the input size and on k.

Furthermore, our algorithms are DOUBLY-ACCELERATED: our running times depend only on the square root of the matrix condition number, and on the square root of the eigengap. This is the first such result for either k-GenEV or k-CCA. We also provide the first gap-free results, which give running times that depend on \(1/\sqrt{\varepsilon}\) rather than on the eigengap.

  1. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods".
    By: Zeyuan Allen-Zhu.
    JMLR 2018: Journal of Machine Learning Research.
    STOC 2017: Symposium on Theory of Computing (abstract).

We introduce Katyusha, the first direct, primal-only stochastic gradient method that has a provably accelerated convergence rate in convex optimization. In contrast, previous methods are based on dual coordinate descent which are more restrictive, or based on outer-inner loops which make them "blind" to the underlying stochastic nature of the optimization process. Katyusha is the first algorithm that incorporates acceleration directly into stochastic gradient updates.

Unlike previous results, Katyusha obtains an optimal convergence rate. It also supports proximal updates, non-Euclidean norm smoothness, non-uniform sampling, and mini-batch sampling. When applied to interesting classes of convex objectives, including smooth objectives (e.g., Lasso, Logistic Regression), strongly-convex objectives (e.g., SVM), and non-smooth objectives (e.g., L1SVM), Katyusha improves the best known convergence rates.

The main ingredient behind our result is Katyusha momentum, a novel "negative momentum on top of momentum" that can be incorporated into a variance-reduction based algorithm and speed it up. As a result, since variance reduction has been successfully applied to a fast growing list of practical problems, our paper suggests that in each of such cases, one had better hurry up and give Katyusha a hug.

  1. "Finding Approximate Local Minima Faster Than Gradient Descent".
    By: Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, Tengyu Ma.
    STOC 2017: Symposium on Theory of Computing.

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which is linear in the input representation. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

  1. "Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent".
    By: Zeyuan Allen-Zhu, Lorenzo Orecchia.
    ITCS 2017: Innovations in Theoretical Computer Science.

First-order methods play a central role in large-scale machine learning. Even though many variations exist, each suited to a particular problem, almost all such methods fundamentally rely on two types of algorithmic steps: gradient descent, which yields primal progress, and mirror descent, which yields dual progress.

We observe that the performances of gradient and mirror descent are complementary, so that faster algorithms can be designed by linearly coupling the two. We show how to reconstruct Nesterov's accelerated gradient methods using linear coupling, which gives a cleaner interpretation than Nesterov's original proofs. We also discuss the power of linear coupling by extending it to many other settings that Nesterov's methods cannot apply to.

  1. "LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain".
    By: Zeyuan Allen-Zhu, Yuanzhi Li.
    NeurIPS 2016: Neural Information Processing Systems.

We study k-SVD, that is, the problem of obtaining the top k singular vectors of a matrix A. Recently, a few breakthroughs have been made on k-SVD: Musco and Musco [1] provided the first gap-free theorem for the block Krylov method, Shamir [2] discovered the first variance-reduction stochastic method, and Bhojanapalli et al. [3] provided the fastest \(O(\mathsf{nnz}(A)+\mathsf{poly}(1/\varepsilon))\)-time algorithm using alternating minimization.

In this paper, we put forward a new and simple LazySVD framework to improve the above breakthroughs. This framework leads to a faster gap-free method outperforming [1], and the first accelerated and stochastic method outperforming [2]. In the \(O(\mathsf{nnz}(A)+\mathsf{poly}(1/\varepsilon))\) running-time regime, LazySVD outperforms [3] in certain parameter regimes without even using alternating minimization.

  1. "Optimal Black-Box Reductions Between Optimization Objectives".
    By: Zeyuan Allen-Zhu, Elad Hazan.
    NeurIPS 2016: Neural Information Processing Systems.

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications.

Furthermore, unlike existing results, our new reductions are OPTIMAL and more PRACTICAL. We show how these new reductions give rise to new and faster running times on training linear classifiers for various families of loss functions, and conclude with experiments showing their successes also in practice.

  1. "Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters".
    By: Zeyuan Allen-Zhu, Yang Yuan, Karthik Sridharan.
    NeurIPS 2016: Neural Information Processing Systems.

The amount of data available in the world is growing faster and bigger than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the most fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with help from the clustering structure of the data.

We introduce a simple notion of raw clustering that can be efficiently obtained with just one pass of the data, and propose two algorithms. Our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using the clustering information, and our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of each cluster. Our algorithms outperform their classical counterparts both in theory and practice.

  1. "Variance Reduction for Faster Non-Convex Optimization".
    By: Zeyuan Allen-Zhu, Elad Hazan.
    ICML 2016: International Conference on Machine Learning.

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization have been full gradient descent, which converges in \(O(1/\varepsilon)\) iterations for smooth objectives, and stochastic gradient descent, which converges in \(O(1/\varepsilon^2)\) iterations for objectives that are sums of smooth functions.

We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an \(O(1/\varepsilon)\) rate, and is faster than full gradient descent by \(\Omega(n^{1/3})\).

We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.

  1. "Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling".
    By: Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, Yang Yuan.
    ICML 2016: International Conference on Machine Learning.

Accelerated coordinate descent is widely used in optimization due to its cheap per-iteration cost and scalability to large-scale problems. Up to a primal-dual transformation, it is also the same as accelerated stochastic gradient descent that is one of the central methods used in machine learning.

In this paper, we improve the best known running time of accelerated coordinate descent by a factor up to \(\sqrt{n}\). Our improvement is based on a clean, novel non-uniform sampling that selects each coordinate with a probability proportional to the square root of its smoothness parameter. Our proof technique also deviates from the classical estimation sequence technique used in prior work. Our speed-up applies to important problems such as empirical risk minimization and solving linear systems, both in theory and in practice.
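
A minimal numpy sketch of the sampling rule itself (each coordinate chosen with probability proportional to the square root of its smoothness parameter), shown inside plain, non-accelerated coordinate descent on a quadratic; the accelerated method of the paper involves additional sequences that are not reproduced here.

```python
import numpy as np

# Sketch of the sampling rule only: coordinate i is chosen with probability
# proportional to sqrt(L_i), where L_i is its smoothness parameter. For
# simplicity this is plain (non-accelerated) coordinate descent on a quadratic
# f(x) = 0.5 x'Mx - b'x; the accelerated method in the paper is more involved.
rng = np.random.default_rng(0)
n = 50
R = rng.standard_normal((n, n))
M = R.T @ R / n + np.eye(n)            # well-conditioned PSD matrix
b = rng.standard_normal(n)

L = np.diag(M).copy()                  # coordinate-wise smoothness parameters
p = np.sqrt(L) / np.sqrt(L).sum()      # non-uniform sampling distribution

x = np.zeros(n)
for _ in range(10_000):
    i = rng.choice(n, p=p)             # sample a coordinate ~ sqrt(L_i)
    grad_i = M[i] @ x - b[i]           # i-th coordinate of the gradient
    x[i] -= grad_i / L[i]              # standard 1/L_i coordinate step
print(np.linalg.norm(M @ x - b))       # residual should be tiny
```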

  1. "Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives".
    By: Zeyuan Allen-Zhu and Yang Yuan.
    ICML 2016: International Conference on Machine Learning.

Many classical algorithms are found, only several years later, to outlive the confines in which they were conceived, and continue to be relevant in unforeseen settings. In this paper, we show that SVRG is one such method: originally designed for strongly convex objectives, it is also very robust in non-strongly convex or sum-of-non-convex settings.

More precisely, we provide new analysis to improve the state-of-the-art running times in both settings by either applying SVRG or its novel variant. Since non-strongly convex objectives include important examples such as Lasso and logistic regression, and sum-of-non-convex objectives include famous examples such as stochastic PCA and are even believed to be related to training deep neural nets, our results also imply better performance in these applications.
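
For reference, here is a minimal numpy sketch of the vanilla SVRG estimator (full gradient at a snapshot plus a variance-reduced stochastic correction), applied to least squares; the step size and epoch lengths are ad hoc, and the improved variants analyzed in the paper are not reproduced.

```python
import numpy as np

def svrg_least_squares(A, b, epochs=50):
    """Vanilla SVRG on f(x) = (1/2n)*||Ax - b||^2 (illustrative sketch)."""
    n, d = A.shape
    step = 0.2 / np.max(np.sum(A ** 2, axis=1))       # ad hoc step ~ 1/(5 L_max)
    rng = np.random.default_rng(0)
    x = np.zeros(d)
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n       # exact gradient at snapshot
        for _ in range(2 * n):                         # inner loop
            i = rng.integers(n)
            gi_x = A[i] * (A[i] @ x - b[i])            # component gradient at x
            gi_snap = A[i] * (A[i] @ snapshot - b[i])  # ... and at the snapshot
            x -= step * (gi_x - gi_snap + full_grad)   # variance-reduced update
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 10)), rng.standard_normal(200)
x = svrg_least_squares(A, b)
print(np.linalg.norm(A.T @ (A @ x - b)) / 200)         # gradient norm, should be tiny
```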

  1. "Optimization Algorithms for Faster Computational Geometry".
    By: Zeyuan Allen-Zhu, Zhenyu Liao, Yang Yuan.
    ICALP 2016: International Colloquium on Automata, Languages, and Programming.

We study two fundamental problems in computational geometry: finding the maximum inscribed ball (MaxIB) inside a bounded polyhedron defined by \(m\) hyperplanes, and the minimum enclosing ball (MinEB) of a set of \(n\) points, both in \(d\)-dimensional space. We improve the running time of iterative algorithms on

  • MaxIB from \(\tilde{O}(m d \alpha^3 / \varepsilon^3)\) to \(\tilde{O}(md + m \sqrt{d} \alpha / \varepsilon)\), a speed-up up to \(\tilde{O}(\sqrt{d} \alpha^2 / \varepsilon^2)\), and
  • MinEB from \(\tilde{O}(n d / \sqrt{\varepsilon})\) to \(\tilde{O}(nd + n \sqrt{d} / \sqrt{\varepsilon})\), a speed-up up to \(\tilde{O}(\sqrt{d})\).
Our improvements are based on a novel saddle-point optimization framework. We propose a new algorithm L1L2SPSolver for solving a class of regularized saddle-point problems, and apply a randomized Hadamard space rotation, a technique borrowed from compressive sensing. Interestingly, the motivation for using the Hadamard rotation comes solely from our optimization view and not from the original geometry problem: indeed, it is not immediately clear why MaxIB or MinEB, as geometric problems, should become easier to solve if we rotate the space by a unitary matrix. We hope that our optimization perspective sheds light on solving other geometric problems as well.

  1. "Restricted Isometry Property for General p-Norms".
    By: Zeyuan Allen-Zhu, Rati Gelashvili, Ilya Razenshteyn.
    IEEE-IT 2016: IEEE Transactions on Information Theory.
    SoCG 2015: International Symposium on Computational Geometry (abstract).

The Restricted Isometry Property (RIP) is a fundamental property of a matrix which enables sparse recovery. Informally, an \(m \times n\) matrix satisfies RIP of order \(k\) for the \(\ell_p\) norm, if \(\|Ax\|_p \approx \|x\|_p\) for every \(x\) with at most \(k\) non-zero coordinates.

For every \(1 \leq p < \infty\) we obtain almost tight bounds on the minimum number of rows \(m\) necessary for the RIP property to hold. Before, only the cases \(p \in \big\{1, 1 + \frac{1}{\log k}, 2\big\}\) were studied. Interestingly, our results show that the case \(p = 2\) is a 'singularity' in terms of the optimum value of \(m\).

We also obtain almost tight bounds for the column sparsity of RIP matrices and discuss implications of our results for the Stable Sparse Recovery problem.

  1. "Using Optimization to Obtain a Width-Independent, Parallel, Simpler, and Faster Positive SDP Solver".
    By: Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia.
    SODA 2016: Symposium on Discrete Algorithms.

We study the design of polylogarithmic depth algorithms for approximately solving packing and covering semidefinite programs (or positive SDPs for short). This is a natural SDP generalization of the well-studied positive LP problem.

Although positive LPs can be solved in polylogarithmic depth while using only \(\log^{2} n/\varepsilon^3\) parallelizable iterations [3], the best known positive SDP solvers due to Jain and Yao [18] require \(\log^{14} n /\varepsilon^{13}\) parallelizable iterations. Several alternative solvers have been proposed to reduce the exponents in the number of iterations [19, 29]. However, the correctness of the convergence analyses in these works has been called into question [29], as they both rely on algebraic monotonicity properties that do not generalize to matrix algebra.

In this paper, we propose a very simple algorithm based on the optimization framework proposed in [3] for LP solvers. Our algorithm only needs \(\log^2 n / \varepsilon^3\) iterations, matching that of the best LP solver. To surmount the obstacles encountered by previous approaches, our analysis requires a new matrix inequality that extends Lieb-Thirring's inequality, and a sign-consistent, randomized variant of the gradient truncation technique proposed in [2, 3].

  1. "Expanders via Local Edge Flips".
    By: Zeyuan Allen-Zhu, Aditya Bhaskara, Silvio Lattanzi, Vahab Mirrokni, and Lorenzo Orecchia.
    SODA 2016: Symposium on Discrete Algorithms.

Designing distributed and scalable algorithms to improve network connectivity is a central topic in peer-to-peer networks. In this paper we focus on the following well-known problem: given an n-node d-regular network, we want to design a decentralized, local algorithm that transforms the graph into one that has good connectivity properties (low diameter, expansion, etc.) without affecting the sparsity of the graph. To this end, Mahlmann and Schindelhauer introduced the random "flip" transformation, where in each time step, a random pair of vertices that have an edge decide to 'swap a neighbor'. They conjectured that performing \(\mathrm{poly}(d) \times n\log n\) such flips at random would convert any connected d-regular graph into a d-regular expander graph, with high probability. However, the best known upper bound for the number of steps is roughly \(O(n^{17} d^{23})\), obtained via a delicate Markov chain comparison argument.

Our main result is to prove that a natural instantiation of the random flip produces an expander in at most \(O(n^2 d^2 \sqrt{\log n})\) steps, with high probability. Our argument uses a potential-function analysis based on the matrix exponential, together with the recent beautiful results on the higher-order Cheeger inequality of graphs. We also show that our technique can be used to analyze another well-studied random process known as the 'random switch', and show that it produces an expander in \(O(n d)\) steps for \(d=\Omega(\log n)\), with high probability.

  1. "Nearly Linear-Time Packing and Covering LP Solvers".
    By: Zeyuan Allen-Zhu, Lorenzo Orecchia.
    STOC 2015: Symposium on Theory of Computing (abstract).
    MAPR 2018: Mathematical Programming.

Packing and covering linear programs (PC-LPs) form an important class of linear programs (LPs) across computer science, operations research, and optimization. In 1993, Luby and Nisan constructed an iterative algorithm for approximately solving PC-LPs in nearly linear time, where the time complexity scales nearly linearly in \(N\), the number of nonzero entries of the matrix, and polynomially in \(1/\varepsilon\), where \(\varepsilon\) is the (multiplicative) approximation error. Unfortunately, existing nearly linear-time algorithms for solving PC-LPs require time at least proportional to \(\varepsilon^{-2}\).

In this paper, we break this longstanding barrier by designing a packing solver that runs in time \(\tilde{O}(N \varepsilon^{-1})\) and covering LP solver that runs in time \(\tilde{O}(N \varepsilon^{-1.5})\). Our packing solver can be extended to run in time \(\tilde{O}(N \varepsilon^{-1})\) for a class of well-behaved covering programs. In a follow-up work, Wang et al. showed that all covering LPs can be converted into well-behaved ones by a reduction that blows up the problem size only logarithmically.

  1. "Spectral Sparsification and Regret Minimization Beyond Multiplicative Updates".
    By: Zeyuan Allen-Zhu, Zhenyu Liao, Lorenzo Orecchia.
    STOC 2015: Symposium on Theory of Computing.

In this paper, we provide a novel construction of the linear-sized spectral sparsifiers of Batson, Spielman and Srivastava [BSS14]. While previous constructions required \(\Omega(n^4)\) running time [BSS14, Zou12], our sparsification routine can be implemented in almost-quadratic running time \(O(n^{2+\varepsilon})\).

The fundamental conceptual novelty of our work is the leveraging of a strong connection between sparsification and a regret minimization problem over density matrices. This connection was known to provide an interpretation of the randomized sparsifiers of Spielman and Srivastava [SS11] via the application of matrix multiplicative weight updates (MWU) [CHS11, Vis14]. In this paper, we explain how matrix MWU naturally arises as an instance of the Follow-the-Regularized-Leader framework and generalize this approach to yield a larger class of updates. This new class allows us to accelerate the construction of linear-sized spectral sparsifiers, and give novel insights on the motivation behind Batson, Spielman and Srivastava [BSS14].

  1. "Using Optimization to Solve Positive LPs Faster in Parallel".
    By: Zeyuan Allen-Zhu, Lorenzo Orecchia.
    SODA 2015: Symposium on Discrete Algorithms.

We study the design of nearly-linear-time algorithms for approximately solving positive linear programs. Both the parallel and the sequential deterministic versions of these algorithms require \(\tilde{O}(\varepsilon^{-4})\) iterations, a dependence that has not been improved since the introduction of these methods in 1993 by Luby and Nisan. Moreover, previous algorithms and their analyses rely on update steps and convergence arguments that are combinatorial in nature, and do not seem to arise naturally from an optimization viewpoint. In this paper, we leverage insights from optimization theory to construct a novel algorithm that breaks the longstanding \(\tilde{O}(\varepsilon^{-4})\) barrier. Our algorithm has a simple analysis and a clear motivation. Our work introduces a number of novel techniques, such as the combined application of gradient descent and mirror descent, and a truncated, smoothed version of the standard multiplicative weight update, which may be of independent interest.

  1. "Knightian Analysis of the Vickrey Mechanism"
    By: Alessandro Chiesa, Silvio Micali and Zeyuan Allen Zhu.
    Econometrica 2015: Volume 83, Issue 5, pages 1727-1754.
  1. "Shorter Arithmetization of Nondeterministic Computations".
    By: Alessandro Chiesa, Zeyuan Allen Zhu.
    TCS 2015: Theoretical Computer Science, Volume 600, pages 107-131.

Arithmetizing computation is a crucial component of many fundamental results in complexity theory, including results that gave insight into the power of interactive proofs, multi-prover interactive proofs, and probabilistically-checkable proofs. Informally, an arithmetization is a way to encode a machine's computation so that its correctness can be easily verified via few probabilistic algebraic checks.

We study the problem of arithmetizing nondeterministic computations for the purpose of constructing short probabilistically-checkable proofs (PCPs) with polylogarithmic query complexity. In such a setting, a PCP's proof length depends (at least!) linearly on the length, in bits, of the encoded computation. Thus, minimizing the number of bits in the encoding is crucial for minimizing PCP proof length.

In this paper we show how to arithmetize any T-step computation on a nondeterministic Turing machine by using a polynomial encoding of length $$O\big( T \cdot (\log T)^2 \big) \enspace.$$ Previously, the best known length was \(\Omega(T \cdot (\log T)^4)\). For nondeterministic random-access machines, our length is \(O(T \cdot (\log T)^{2+o(1)})\), while prior work only achieved \(\Omega(T \cdot (\log T)^5)\).

The polynomial encoding that we use is the Reed-Solomon code. When combined with the best PCPs of proximity for this code, our result yields quasilinear-size PCPs with polylogarithmic query complexity that are shorter, by at least two logarithmic factors, than in all prior work.

Our arithmetization also enjoys additional properties. First, it is succinct, i.e., the encoding of the computation can be probabilistically checked in \((\log T)^{O(1)}\) time; this property is necessary for constructing short PCPs with a polylogarithmic-time verifier. Furthermore, our techniques extend, in a certain well-defined sense, to the arithmetization of yet other NEXP-complete languages.

  1. "Johnson-Lindenstrauss Compression with Neuroscience-Based Constraints".
    By: Zeyuan Allen-Zhu, Rati Gelashvili, Silvio Micali, Nir Shavit.
    PNAS 2014: Proceedings of the National Academy of Sciences of the USA, vol 111, no 47.

Johnson-Lindenstrauss (JL) matrices implemented by sparse random synaptic connections are thought to be a prime candidate for how convergent pathways in the brain compress information. However, to date, there is no complete mathematical support for such implementations given the constraints of real neural tissue. The fact that neurons are either excitatory or inhibitory implies that any JL matrix implemented in this way must be sign-consistent (i.e., all entries in a single column must be either all non-negative or all non-positive), and the fact that any given neuron connects to a relatively small subset of other neurons implies that the JL matrix had better be sparse.

We construct sparse JL matrices that are sign-consistent, and prove that our construction is essentially optimal. Our work answers a mathematical question that was triggered by earlier work and is necessary to justify the existence of JL compression in the brain, and emphasizes that inhibition is crucial if neurons are to perform efficient, correlation-preserving compression.
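
To make the object concrete, here is a small numpy sketch of a sign-consistent sparse random matrix (each column carries a single sign) together with an empirical check of norm preservation on sparse vectors; the parameters are arbitrary for illustration and do not reflect the constructions or bounds in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 2000, 400, 8          # input dim, output dim, nonzeros per column (illustrative)

# Sign-consistent sparse matrix: each column gets s random nonzero rows,
# and a single sign shared by the whole column (all +1/sqrt(s) or all -1/sqrt(s)).
A = np.zeros((m, n))
for j in range(n):
    rows = rng.choice(m, size=s, replace=False)
    A[rows, j] = rng.choice([-1.0, 1.0]) / np.sqrt(s)   # one sign per column

# Empirical check: norms of random sparse vectors before and after projection.
for _ in range(3):
    x = np.zeros(n)
    support = rng.choice(n, size=20, replace=False)      # 20-sparse test vector
    x[support] = rng.standard_normal(20)
    print(np.linalg.norm(A @ x) / np.linalg.norm(x))     # should be close to 1
```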

  1. "Reconstructing Markov Processes from Independent and Anonymous Experiments".
    By: Silvio Micali, Zeyuan Allen Zhu.
    DAM 2015: Discrete Applied Mathematics, volume 200, pages 108-122.

We investigate the problem of exactly reconstructing, with high confidence and up to isomorphism, the ball of radius r centered at the starting state of a Markov process from independent and anonymous experiments. In an anonymous experiment, the states are visited according to the underlying transition probabilities, but no global state names are known: one can only recognize whether two states, reached within the same experiment, are the same.

We prove quite tight bounds for such exact reconstruction in terms of both the number of experiments and their lengths.

  1. "Knightian Self Uncertainty in the VCG Mechanism for Unrestricted Combinatorial Auctions".
    By: Alessandro Chiesa, Silvio Micali, Zeyuan Allen Zhu.
    ACM-EC 2014: Conference on Economics and Computation.
  1. "Flow-Based Algorithms for Local Graph Clustering".
    By: Lorenzo Orecchia and Zeyuan Allen Zhu.
    SODA 2014: Symposium on Discrete Algorithms.

Given a subset A of vertices of an undirected graph G, the cut-improvement problem asks us to find a subset S that is similar to A but has smaller conductance. A very elegant algorithm for this problem was given by Andersen and Lang [AL08]; it requires solving a small number of single-commodity maximum flow computations over the whole graph G. In this paper, we introduce LocalImprove, the first cut-improvement algorithm that is local, i.e. that runs in time dependent on the size of the input set A rather than on the size of the entire graph. Moreover, LocalImprove achieves this local behaviour while essentially matching the same theoretical guarantee as the global algorithm of Andersen and Lang.

The main application of LocalImprove is to the design of better local-graph-partitioning algorithms. All previously known local algorithms for graph partitioning are random-walk based and can only guarantee an output conductance of \(\tilde{O}(\sqrt{\phi})\) when the target set has conductance \(\phi \in [0,1]\). Very recently, Zhu, Lattanzi and Mirrokni [ZLM13] improved this to \(O(\phi / \sqrt{\mathsf{Conn}})\), where the internal connectivity parameter \(\mathsf{Conn} \in [0,1]\) is defined as the reciprocal of the mixing time of the random walk over the graph induced by the target set. In this work, we show how to use LocalImprove to obtain a constant approximation \(O(\phi)\) as long as \(\mathsf{Conn}/\phi = \Omega(1)\). This yields the first flow-based algorithm for local graph partitioning. Moreover, it strictly outperforms the random-walk-based algorithms and, surprisingly, matches the performance of the best known global algorithm, which is SDP-based, in this parameter regime [MMV12].

Finally, our results show that spectral methods are not the only viable approach to constructing local graph partitioning algorithms, and they open the door to the study of algorithms with even better approximation and locality guarantees.
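
For readers less familiar with the objective, the short sketch below computes the conductance \(\phi(S)\) that these cut-improvement and clustering results optimize: the number of edges leaving S divided by the smaller side's volume. The graph and set are illustrative.

```python
import networkx as nx

def conductance(G, S):
    """phi(S) = (# edges crossing S) / min(vol(S), vol(V \\ S)), with vol = sum of degrees."""
    S = set(S)
    cut = sum(1 for u, v in G.edges if (u in S) != (v in S))
    vol_S = sum(G.degree(u) for u in S)
    total_vol = 2 * G.number_of_edges()
    return cut / max(1, min(vol_S, total_vol - vol_S))

G = nx.barbell_graph(10, 2)            # two 10-cliques joined by a short path
print(conductance(G, range(10)))       # the left clique is a low-conductance set
```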

  1. "Local Graph Clustering Beyond Cheeger's Inequality".
    By: Zeyuan Allen Zhu, Silvio Lattanzi, Vahab Mirrokni.
    ICML 2013: International Conference on Machine Learning.

Motivated by applications of large-scale graph clustering, we study random-walk-based local algorithms whose running times depend only on the size of the output cluster, rather than the entire graph. All previously known such algorithms guarantee an output conductance of \(\tilde{O}(\sqrt{\phi(A)})\) when the target set A has conductance \(\phi(A)\in[0,1]\). In this paper, we improve it to \[\tilde{O}\bigg( \min\Big\{\sqrt{\phi(A)}, \frac{\phi(A)}{\sqrt{\mathsf{Conn}(A)}} \Big\} \bigg)\enspace, \] where the internal connectivity parameter \(\mathsf{Conn}(A)\in[0,1]\) is defined as the reciprocal of the mixing time of the random walk over the induced subgraph on A.

For instance, using \(\mathsf{Conn}(A)=\Omega(\lambda(A)/\log n)\) where \(\lambda\) is the second eigenvalue of the Laplacian of the induced subgraph on A, our conductance guarantee can be as good as \(\tilde{O}(\phi(A)/\sqrt{\lambda(A)})\). This builds an interesting connection to the recent advance of the so-called improved Cheeger's Inequality [KKL+13], which says that global spectral algorithms can provide a conductance guarantee of \(O(\phi_{\mathsf{opt}}/\sqrt{\lambda_3})\) instead of \(O(\sqrt{\phi_{\mathsf{opt}}})\).

In addition, we provide theoretical guarantees on the clustering accuracy (in terms of precision and recall) of the output set. We also prove that our analysis is tight, and perform empirical evaluations on both synthetic and real data to support our theory.

It is worth noting that our analysis outperforms prior work when the cluster is well connected. In fact, the better connected the cluster is internally, the more significant the improvement (in both conductance and accuracy) we obtain. Our results shed light on why, in practice, some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research on local clustering.
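
As a concrete, much simplified member of this random-walk family, here is a PageRank-Nibble-style sketch: an approximate personalized PageRank push from a seed vertex, followed by a sweep cut over \(p(u)/\deg(u)\). It illustrates the kind of local algorithm being analyzed, not this paper's exact procedure or parameters.

```python
import networkx as nx

def approximate_ppr(G, seed, alpha=0.15, eps=1e-4):
    """Andersen-Chung-Lang 'push': approximate personalized PageRank from `seed`."""
    p, r, queue = {}, {seed: 1.0}, [seed]
    while queue:
        u = queue.pop()
        ru = r.get(u, 0.0)
        if ru < eps * G.degree(u):
            continue
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * G.degree(u))
        for v in G[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * G.degree(v):
                queue.append(v)
        if r[u] >= eps * G.degree(u):
            queue.append(u)
    return p

def sweep_cut(G, p):
    """Best-conductance prefix of vertices sorted by p(u) / deg(u)."""
    order = sorted(p, key=lambda u: p[u] / G.degree(u), reverse=True)
    total_vol = 2 * G.number_of_edges()
    S, vol, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for u in order[:-1]:                                 # never sweep up to the full set
        vol += G.degree(u)
        cut += sum(-1 if v in S else 1 for v in G[u])    # incremental cut update
        S.add(u)
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi

G = nx.barbell_graph(20, 3)                              # two cliques joined by a path
cluster, phi = sweep_cut(G, approximate_ppr(G, seed=0))
print(sorted(cluster), round(phi, 4))
```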

  1. "A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time".
    By: Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, Zeyuan Allen Zhu.
    STOC 2013: Symposium on Theory of Computing.

In this paper, we present a simple combinatorial algorithm that solves symmetric diagonally dominant (SDD) linear systems in nearly-linear time. It uses very little of the machinery that previously appeared to be necessary for such an algorithm. It does not require recursive preconditioning, spectral sparsification, or even the Chebyshev Method or Conjugate Gradient. After constructing a "nice" spanning tree of a graph associated with the linear system, the entire algorithm consists of the repeated application of a simple (non-recursive) update rule, which it implements using a lightweight data structure. The algorithm is numerically stable and can be implemented without the increased bit-precision required by previous solvers. As such, the algorithm has the fastest known running time under the standard unit-cost RAM model. We hope that the simplicity of the algorithm and the insights yielded by its analysis will be useful in both theory and practice.
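
The flavor of the update rule can be conveyed by a toy cycle-toggling sketch: route the demands on a spanning tree, then repeatedly pick an off-tree edge and cancel the voltage drop around its tree cycle, reading potentials off the tree at the end. This is only a sketch of the idea; the actual algorithm relies on a low-stretch spanning tree, resistance-weighted cycle sampling, and a data structure that performs each cycle update in logarithmic time, none of which are reproduced here.

```python
import random
from collections import defaultdict
import networkx as nx

def toy_cycle_toggling_solver(G, b, iters=20000, seed=0):
    """Toy sketch: approximately solve the Laplacian system L x = b (sum of b = 0)
    by routing b on a spanning tree and repeatedly 'toggling' off-tree cycles."""
    rng = random.Random(seed)
    res = lambda u, v: 1.0 / G[u][v].get("weight", 1.0)    # resistance = 1 / conductance

    flow = defaultdict(float)          # flow[(u, v)] with u < v means flow from u to v
    def add_flow(u, v, amount):
        if u < v: flow[(u, v)] += amount
        else:     flow[(v, u)] -= amount
    def get_flow(u, v):
        return flow[(u, v)] if u < v else -flow[(v, u)]

    # 1) Route the demands b along a spanning tree (leaves push their surplus to parents).
    T = nx.minimum_spanning_tree(G)
    root = next(iter(G.nodes))
    order = list(nx.dfs_preorder_nodes(T, root))
    parent = {root: None}
    for u in order:
        for v in T[u]:
            parent.setdefault(v, u)
    surplus = dict(b)
    for v in reversed(order):
        if parent[v] is not None:
            add_flow(v, parent[v], surplus[v])
            surplus[parent[v]] += surplus[v]

    # 2) The simple update rule: pick an off-tree edge and cancel the voltage drop
    #    (Kirchhoff's voltage law residual) around its unique tree cycle.
    off_tree = [e for e in G.edges if not T.has_edge(*e)]
    for _ in range(iters if off_tree else 0):
        u, v = off_tree[rng.randrange(len(off_tree))]
        path = nx.shortest_path(T, v, u)
        cycle = [(u, v)] + list(zip(path, path[1:]))
        drop = sum(res(a, c) * get_flow(a, c) for a, c in cycle)
        delta = drop / sum(res(a, c) for a, c in cycle)
        for a, c in cycle:
            add_flow(a, c, -delta)

    # 3) Read potentials off the tree: x[parent] - x[child] = resistance * flow(parent -> child).
    x = {root: 0.0}
    for v in order[1:]:
        x[v] = x[parent[v]] - res(parent[v], v) * get_flow(parent[v], v)
    return x

# Toy usage: inject one unit of current at one corner of a grid and extract it at the other.
G = nx.grid_2d_graph(5, 5)                     # node labels are comparable tuples
b = {v: 0.0 for v in G}
b[(0, 0)], b[(4, 4)] = 1.0, -1.0
x = toy_cycle_toggling_solver(G, b)
print(round(x[(0, 0)] - x[(4, 4)], 3))         # approximates the effective resistance between the corners
```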

  1. "Mechanism Design with Approximate Valuations".
    By: Alessandro Chiesa, Silvio Micali, Zeyuan Allen Zhu.
    ITCS 2012: Innovations in Theoretical Computer Science.
  1. "Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations".
    By: Zeyuan Allen Zhu, Sasa Misailovic, Jonathan A. Kelner, Martin Rinard.
    POPL 2012: Symposium on Principles of Programming Languages.

Despite the fact that approximate computations have come to dominate many areas of computer science, the field of program transformations has focused almost exclusively on traditional semantics-preserving transformations. Such transformations do not attempt to exploit the opportunity, available in many computations, to acceptably trade off accuracy for benefits such as increased performance and reduced resource consumption.

We present a model of computation for approximate computations and an algorithm for optimizing these computations. The algorithm works with two classes of transformations: substitution transformations (which select one of a number of available implementations for a given function, with each implementation offering a different combination of accuracy and resource consumption) and sampling transformations (which randomly discard some of the inputs to a given reduction). The algorithm produces a (1+ε) randomized approximation to the optimal randomized computation (which minimizes resource consumption subject to a probabilistic accuracy specification in the form of a maximum expected error or maximum error variance).
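
A minimal illustration of the sampling transformation (with a hypothetical reduction and sampling rates chosen by hand, not by the paper's optimization algorithm): keep each input to a reduction with probability p and reweight, trading a controlled increase in expected error for a proportional drop in work.

```python
import random

def exact_mean(xs):
    return sum(xs) / len(xs)

def sampled_mean(xs, p, rng):
    """Sampling transformation: keep each input with probability p and reweight.
    The estimate is unbiased; its variance (hence the expected error) grows as p
    shrinks, while the work drops to roughly a p-fraction of the full reduction."""
    kept = [x for x in xs if rng.random() < p]
    return sum(kept) / (p * len(xs)) if kept else 0.0

rng = random.Random(0)
data = [rng.gauss(10.0, 3.0) for _ in range(100_000)]
for p in (1.0, 0.1, 0.01):
    est = sampled_mean(data, p, rng)
    print(f"p={p:>4}: estimate={est:8.4f}  error={abs(est - exact_mean(data)):.4f}")
```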

  1. "Optimal Pricing in Social Networks with Incomplete Information".
    By: Wei Chen, Pinyan Lu, Xiaorui Sun, Bo Tang, Yajun Wang, Zeyuan Allen Zhu.
    WINE 2011: Workshop on Internet & Network Economics.

In revenue maximization of selling a digital product in a social network, the utility of an agent is often considered to have two parts: a private valuation, and linearly additive influences from other agents. We study the incomplete information case where agents know a common distribution about others' private valuations, and make decisions simultaneously. The "rational behavior" of agents in this case is captured by the well-known Bayesian Nash equilibrium.

Two challenging questions arise: how to compute an equilibrium and how to optimize a pricing strategy accordingly to maximize the revenue assuming agents follow the equilibrium? In this paper, we mainly focus on the natural model where the private valuation of each agent is sampled from a uniform distribution, which turns out to be already challenging.

Our main result is a polynomial-time algorithm that can exactly compute the equilibrium and the optimal price, when pairwise influences are non-negative. If negative influences are allowed, computing any equilibrium even approximately is PPAD-hard. Our algorithm can also be used to design an FPTAS for optimizing a discriminative price profile.

  1. "Asymptotically Optimal Strategy-Proof Mechanisms for Two-Facility Games".
    By: Pinyan Lu, Xiaorui Sun, Yajun Wang, Zeyuan Allen Zhu.
    ACM-EC 2010: Conference on Economics and Computation.

We consider the problem of locating facilities in a metric space to serve a set of selfish agents. The cost of an agent is the distance between her own location and the nearest facility. The social cost is the total cost of the agents. We are interested in designing strategy-proof mechanisms without payment that have a small approximation ratio for social cost. A mechanism is a (possibly randomized) algorithm which maps the locations reported by the agents to the locations of the facilities. A mechanism is strategy-proof if no agent can benefit from misreporting her location in any configuration.

This setting was first studied by Procaccia and Tennenholtz [21]. They focused on the facility game where agents and facilities are located on the real line. Alon et al. studied the mechanisms for the facility games in a general metric space [1]. However, they focused on the games with only one facility. In this paper, we study the two-facility game in a general metric space, which extends both previous models.

We first prove an \(\Omega(n)\) lower bound of the social cost approximation ratio for deterministic strategy-proof mechanisms. Our lower bound even holds for the line metric space. This significantly improves the previous constant lower bounds [21, 17]. Notice that there is a matching linear upper bound in the line metric space [21]. Next, we provide the first randomized strategy-proof mechanism with a constant approximation ratio of 4. Our mechanism works in general metric spaces. For randomized strategy-proof mechanisms, the previous best upper bound is \(O(n)\) which works only in the line metric space.

  1. "A Novel Click Model and Its Applications to Online Advertising".
    By: Zeyuan Allen Zhu, Weizhu Chen, Tom Minka, Chenguang Zhu, Zheng Chen.
    WSDM 2010: International Conference on Web Search and Data Mining.

Recent advances in click models have positioned them as an attractive method for representing user preferences in web search and online advertising. Yet most existing works train the click model for individual queries, and cannot accurately model tail queries due to the lack of training data. At the same time, most existing works consider only the query, URL, and position, neglecting other important attributes in click log data, such as the local time; obviously, the click-through rate differs between daytime and midnight. In this paper, we propose a novel click model based on a Bayesian network, which is capable of modeling tail queries because it builds the click model on attribute values, with those values shared across queries. We call our model the General Click Model (GCM), since most existing works can be seen as special cases of GCM under different parameter assignments. Experimental results on a large-scale commercial advertisement dataset show that GCM leads to significantly and consistently better results than state-of-the-art methods.
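
The attribute-sharing idea can be illustrated with a model far simpler than GCM: a click predictor fit on attribute values (position, local hour) that are shared across queries, so tail queries still receive usable estimates. The data and features below are synthetic stand-ins, not the paper's Bayesian-network model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in click log: each impression carries attribute values that are
# shared across queries -- here the ad position and the local hour -- plus a click label.
rng = np.random.default_rng(0)
n = 50_000
position = rng.integers(1, 9, n)                              # slot 1..8
hour = rng.integers(0, 24, n)                                 # local time of the impression
p_click = (0.35 / position) * np.where((hour >= 8) & (hour <= 22), 1.0, 0.5)
clicks = (rng.random(n) < p_click).astype(int)

# Because the model is fit on shared attribute values rather than per query,
# tail queries with few impressions still inherit sensible click-rate estimates.
enc = OneHotEncoder().fit(np.c_[position, hour])
model = LogisticRegression(max_iter=1000).fit(enc.transform(np.c_[position, hour]), clicks)

probe = np.array([[1, 12], [1, 3], [8, 12]])                  # (position, hour) combinations
print(model.predict_proba(enc.transform(probe))[:, 1].round(3))
```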

  1. "Inverse Time Dependency in Convex Regularized Learning".
    By: Zeyuan Allen Zhu, Weizhu Chen, Chenguang Zhu, Gang Wang, Haixun Wang, Zheng Chen.
    ICDM 2009: International Conference on Data Mining.
    Best student paper award runner up.

In conventional regularized learning, training time increases as the training set expands. Recent work on the L2 linear SVM challenges this common wisdom by establishing an inverse dependency of training time on the training set size. In this paper, we first put forward a Primal Gradient Solver (PGS) to effectively solve the convex regularized learning problem. The solver is based on stochastic gradient descent with a Fenchel conjugate adjustment, employing the well-known online strongly convex optimization algorithm with logarithmic regret. We then theoretically prove the inverse dependency property of our PGS, embracing the previous work on the L2 linear SVM as a special case, and enable ℓp-norm optimization within a bounded sphere, which admits more convex loss functions into PGS. We further illustrate the solver with three examples: SVM, logistic regression, and regularized least squares. Experimental results substantiate the inverse dependency on training data size.
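
For the L2-SVM special case mentioned above, the inverse-time flavor can be seen in a classical Pegasos-style primal SGD sketch; this is only the special case PGS generalizes, not the paper's full solver with the Fenchel conjugate adjustment.

```python
import numpy as np

def pegasos_l2_svm(X, y, lam=0.01, iters=100_000, seed=0):
    """Pegasos-style primal SGD for the L2-regularized hinge loss. The number of
    iterations needed to reach a fixed optimization accuracy does not grow with
    the number of examples -- the inverse-dependency regime discussed above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)                  # one random example per step
        margin = y[i] * (X[i] @ w)
        eta = 1.0 / (lam * t)                # step size for a strongly convex objective
        w *= (1.0 - eta * lam)               # shrinkage from the regularizer
        if margin < 1.0:                     # hinge loss active: take a gradient step
            w += eta * y[i] * X[i]
    return w

# Toy usage on synthetic linearly separable data (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 20))
y = np.sign(X @ rng.standard_normal(20))
w = pegasos_l2_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```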

  1. "P-packSVM: Parallel Primal grAdient desCent Kernel SVM".
    By: Zeyuan Allen Zhu, Weizhu Chen, Gang Wang, Chenguang Zhu, Zheng Chen.
    ICDM 2009: International Conference on Data Mining.

(Several minor typos were found and fixed after the conference version was published.)

It is extremely challenging to produce a nonlinear SVM classifier on very large-scale data. In this paper we describe a novel P-packSVM algorithm that can solve the Support Vector Machine (SVM) optimization problem with an arbitrary kernel. The algorithm embraces the best known stochastic gradient descent method to optimize the primal objective, and has a 1/ε dependency in complexity to obtain a solution of optimization error ε. The algorithm can be highly parallelized with a special packing strategy, and achieves sub-linear speed-up with hundreds of processors. We demonstrate that P-packSVM achieves accuracy sufficiently close to that of SVM-light, and outperforms the state-of-the-art parallel SVM trainer PSVM in both accuracy and efficiency. As an illustration, our algorithm trains on the CCAT dataset of 800k samples in 13 minutes with 95% accuracy, while PSVM needs 5 hours and reaches only 92% accuracy. Finally, we demonstrate the capability of P-packSVM on 8 million training samples.
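
In the same spirit, the kernel case can be sketched with a kernelized Pegasos-style update that keeps one coefficient per training example and needs only kernel evaluations; the packing strategy that gives P-packSVM its parallel speed-up is not reproduced here.

```python
import numpy as np

def kernel_pegasos(K, y, lam=0.01, iters=20_000, seed=0):
    """Kernelized Pegasos-style SGD with a precomputed kernel matrix K.
    alpha[j] counts how often example j triggered an update; the decision value
    for example i at step t is (1 / (lam * t)) * sum_j alpha[j] * y[j] * K[j, i]."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    alpha = np.zeros(n)
    for t in range(1, iters + 1):
        i = rng.integers(n)
        decision = (alpha * y) @ K[:, i] / (lam * t)
        if y[i] * decision < 1.0:
            alpha[i] += 1.0
    return alpha

# Toy usage with an RBF kernel on two Gaussian blobs (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.r_[np.full(200, -1.0), np.full(200, 1.0)]
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # RBF kernel
alpha = kernel_pegasos(K, y)
pred = np.sign((alpha * y) @ K)          # positive rescaling does not change the sign
print("training accuracy:", np.mean(pred == y))
```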

  1. "To Divide and Conquer Search Ranking by Learning Query Difficulty".
    By: Zeyuan Allen Zhu, Weizhu Chen, Tao Wan, Chenguang Zhu, Gang Wang, Zheng Chen.
    CIKM 2009: Conference on Information and Knowledge Management.

Learning to rank plays an important role in information retrieval. In most existing solutions for learning to rank, all queries and their returned search results are learned and ranked with a single model. In this paper, we demonstrate that it is highly beneficial to divide queries into multiple groups and conquer search ranking based on query difficulty. To this end, we propose a method which first characterizes a query using a variety of features extracted from user search behavior, such as the click entropy and the query reformulation probability. Next, a classification model is built on these features to assign a score representing how difficult a query is. Based on this score, our method automatically divides queries into groups and trains a specific ranking model for each group to conquer search ranking. Experimental results on RankSVM and RankNet with a large-scale evaluation dataset show that the proposed method achieves significant improvement in the task of web search ranking.
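
The overall divide-and-conquer pipeline can be sketched as follows; the feature names, labels, and the choice of classifier and rankers are illustrative stand-ins rather than the paper's exact models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in data: per-query behavior features (e.g. click entropy, reformulation rate)
# with a binary "difficult" label, plus per-document ranking features and relevance.
rng = np.random.default_rng(0)
n_queries = 1000
query_feats = rng.random((n_queries, 2))            # [click_entropy, reformulation_rate]
difficult = (query_feats @ np.array([0.7, 0.3]) > 0.5).astype(int)
doc_feats = rng.random((n_queries, 5))              # one document per query, for brevity
relevance = rng.integers(0, 5, n_queries)

# Step 1: learn a query-difficulty classifier from user-behavior features.
difficulty_clf = LogisticRegression().fit(query_feats, difficult)

# Step 2: divide training queries into groups (here simply by the difficulty label)
# and train one ranking model per group; pointwise regressors stand in for RankSVM/RankNet.
rankers = {g: GradientBoostingRegressor().fit(doc_feats[difficult == g], relevance[difficult == g])
           for g in (0, 1)}

# Step 3: at query time, route each query to the ranker of its predicted group.
def score(query_feat, doc_feat):
    g = int(difficulty_clf.predict(query_feat.reshape(1, -1))[0])
    return rankers[g].predict(doc_feat.reshape(1, -1))[0]

print(round(score(query_feats[0], doc_feats[0]), 3))
```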

  1. "A General Magnitude-Preserving Boosting Algorithm for Search Ranking".
    By: Chenguang Zhu, Weizhu Chen, Zeyuan Allen Zhu, Gang Wang, Dong Wang, Zheng Chen.
    CIKM 2009: Conference on Information and Knowledge Management.

Traditional boosting algorithms for ranking problems usually employ the pairwise approach and convert the document rating preference into a binary-valued label, as in RankBoost. However, such a pairwise approach ignores the information about the magnitude of preference in the learning process. In this paper, we present the directed distance function (DDF) as a substitute for binary labels in the pairwise approach to preserve the magnitude of preference, and propose a new boosting algorithm called MPBoost, which applies GentleBoost optimization and directly incorporates DDF into the exponential loss function. We establish the boundedness property of MPBoost through theoretical analysis. Experimental results demonstrate that MPBoost not only leads to better NDCG accuracy compared to state-of-the-art ranking solutions on both public and commercial datasets, but also avoids overfitting in the task of learning ranking functions.

Earlier Publications

  • "Three Steps to Make the Traffic Circle Go Round". Outstanding paper for the Mathematical Contest of Modeling 2009.
    Zeyuan A. Zhu, Tianyi Mao, Yichen Huang.
    UMAP Journal 30.3 Fall 2009 Edition.

  • "New Algorithm for Half-plane Intersection and its Practical Value".
    Zeyuan A. Zhu.
    22nd National Olympiad in Informatics Winter Camp (WC 2006).
  • "The Breakthrough Thinking: Back to the Start - A Modified Union-Find Algorithm". [in Chinese]
    Zeyuan A. Zhu.
    21st National Olympiad in Informatics Winter Camp (WC 2005).
  • "Multi-pattern Matching Algorithm and Its Inspiration". [in Chinese].
    Zeyuan A. Zhu.
    20th National Olympiad in Informatics Winter Camp (WC 2004).
  • "In Introduction to the Heuristic Search and Its Optimization". [in Chinese].
    Zeyuan A. Zhu.
    Journal of National Olympiad in Informatics 2000.