Project Title: Communication-Scalable Machine Learning
(co-supervised with Prof. Ce Zhang)
Stochastic Gradient Descent (SGD) is one of the standard tools in machine learning and data science, and forms the backbone of many large-scale learning systems, from training deep neural networks to Stanford’s DeepDive. When running SGD at scale, one of the main concerns is reducing the amount of communication between nodes. Recently, several communication-efficient variants of SGD have been proposed, including lower-precision quantization (e.g., in Google’s TensorFlow), one-bit quantization (e.g., in Microsoft’s CNTK), and randomized quantization. The goal of this project is to investigate the trade-off between reduced communication and the precision of SGD training for state-of-the-art deep neural networks.
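To make the randomized-quantization idea concrete, here is a minimal NumPy sketch of stochastic gradient quantization (the function name and the number of levels are our own illustrative choices, not taken from any of the systems above): each coordinate is snapped to a small grid of magnitudes, and the rounding direction is randomized so the compressed vector remains an unbiased estimate of the original gradient.

```python
import numpy as np

def stochastic_quantize(v, levels=4):
    """Quantize each coordinate of v to one of `levels` uniformly spaced
    magnitudes in [0, max|v|], rounding randomly so that the result is an
    unbiased estimator of v (only the rounding direction is random)."""
    norm = np.max(np.abs(v))
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * (levels - 1)   # now in [0, levels - 1]
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part: this makes
    # E[quantized] == scaled, hence the estimator unbiased.
    quantized = lower + (np.random.rand(*v.shape) < scaled - lower)
    return np.sign(v) * quantized * norm / (levels - 1)
```

With 4 levels, each coordinate needs only a sign and 2 bits (plus one shared float for the norm) instead of 32 bits, which is where the communication saving comes from.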
With the recent effort in building beefy machines such as NVIDIA’s DGX-1 (170 TFLOPs in a single box), our first hypothesis is that, on the DGX-1, even with a fast interconnect such as NVLink, limited-precision network communication is necessary to achieve reasonable scalability, for both synchronous and asynchronous scheduling. To validate this hypothesis, we first need to implement existing quantization approaches, optimize their performance, and benchmark them on standard tasks and datasets.
If this first hypothesis holds, a natural next step is to build a low-precision version of each of the popular distributed deep learning systems. This goal is out of the scope of this master thesis proposal, but it represents one direction that the results of this thesis will enable down the road. The hypothesis there is that, instead of revising each tool one by one, it is possible to build a new network protocol with limited-precision primitives on which all of these distributed deep learning engines can rely. In the ideal case, to obtain a low-precision distributed deep learning system, we would simply switch its network communication code to this new protocol and change nothing else.
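One candidate for such a limited-precision primitive is 1-bit compression with error feedback, in the spirit of CNTK’s 1-bit SGD. The sketch below is our own simplified illustration (the function name and single-scale scheme are assumptions, not CNTK’s actual implementation): only the sign of each entry plus one scale float would go over the network, while the quantization residual stays local and is added back to the next step’s gradient.

```python
import numpy as np

def one_bit_quantize(grad, error):
    """Compress a gradient tensor to one bit per entry: transmit only the
    sign (plus a single scale float) of the error-corrected gradient, and
    carry the quantization residual to the next step ("error feedback")."""
    corrected = grad + error                 # add residual from last step
    scale = np.mean(np.abs(corrected))       # one float sent with the signs
    quantized = np.sign(corrected) * scale   # what actually goes on the wire
    new_error = corrected - quantized        # residual kept locally
    return quantized, new_error
```

A protocol exposing an operation like this (sign bits plus scale, residual handled behind the scenes) is exactly the kind of primitive that the different engines could share.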
Objectives: The goal of this project is to investigate the trade-off between reduced communication and the precision of SGD training for state-of-the-art deep neural networks. Your mission, should you choose to accept it, will be the following:
· Become familiar with parallel SGD and its communication-efficient variants.
· Implement SGD to train DNNs on a multi-GPU machine using a state-of-the-art framework (e.g., TensorFlow or CNTK).
· Understand the performance/accuracy trade-offs, and potentially design new quantization algorithms.
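The experimental loop behind these objectives can be prototyped in a few lines. The sketch below is a simplified, single-process model of one synchronous data-parallel SGD step; the `quantize` hook is a hypothetical parameter of ours marking exactly where a compression scheme would plug in on a real multi-GPU system.

```python
import numpy as np

def parallel_sgd_step(params, worker_grads, lr=0.1, quantize=None):
    """One synchronous data-parallel SGD step: each worker's local gradient
    is (optionally) passed through a quantization hook before averaging.
    The averaging stands in for the all-reduce, which is where the
    communication saving would occur in a real implementation."""
    if quantize is not None:
        worker_grads = [quantize(g) for g in worker_grads]
    avg_grad = np.mean(worker_grads, axis=0)   # models the all-reduce
    return params - lr * avg_grad
```

Running the same training task with `quantize=None` versus various compression hooks, and comparing wall-clock time against final accuracy, is one way to map out the performance/accuracy trade-off described above.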