Matei Zaharia
Assistant Professor
Douglas T. Ross Career Development Professor of Software Technology
CSAIL, EECS, MIT
matei@mit.edu
I’m an assistant professor at MIT CSAIL, where I work on computer systems and big data. I’m also co-founder and CTO of Databricks, the big data company commercializing Apache Spark.
NOTE: In Fall 2016, I’ll be moving to Stanford University.
Projects
I work on computer systems, with a focus on large-scale data processing. My recent projects include:
- Spark, a unified engine for distributed data processing that generalizes the MapReduce model with efficient data sharing (paper). Now developed at the Apache Software Foundation as Apache Spark. I’ve continued to work on modules over Spark including Spark SQL, Shark, MLlib and Spark Streaming.
- Mesos, a resource manager for datacenters that supports heterogeneous applications by giving them control of their scheduling (paper). Now developed as Apache Mesos.
- Multi-Resource Scheduling Algorithms to divide heterogeneous resources among users of a computer system, including Dominant Resource Fairness (DRF), FairRide, DRFQ, and Choosy.
- Vuvuzela, the first private messaging system that hides metadata about which pairs of users are communicating while scaling linearly with the number of users (paper).
Teaching
- Spring 2016: 6.033 Computer Systems Engineering (recitations).
- Fall 2015: 6.S897 Large-Scale Systems.
Publications
2016
- F. Abuzaid, J. Bradley, F. Liang, A. Feng, L. Yang, M. Zaharia and A. Talwalkar. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale, to appear at NIPS 2016.
- R.B. Zadeh, X. Meng, A. Staple, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Ulanov and M. Zaharia. Matrix Computations and Optimizations in Apache Spark, to appear at KDD 2016. Best Paper Award Runner-Up.
- A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez and M. Zaharia. GraphFrames: An Integrated API for Mixing Graph and Relational Queries, GRADES 2016.
- M. Vartak, H. Subramanyam, W.E. Lee, S. Viswanathan, S. Husnoo, S. Madden and M. Zaharia. ModelDB: A System for Machine Learning Model Management, HILDA 2016.
- S. Venkataraman, Z. Yang, D. Liu, E. Liang, X. Meng, R. Xin, A. Ghodsi, M. Franklin, I. Stoica and M. Zaharia. SparkR: Scaling R Programs with Spark, SIGMOD 2016.
- X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine Learning in Apache Spark, JMLR, 17(34):1–7, 2016.
- Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica. FairRide: Near-Optimal, Fair Cache Sharing, NSDI 2016.
2015
- J. van den Hooff, D. Lazar, M. Zaharia and N. Zeldovich. Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis, SOSP 2015, October 2015.
- M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin and M. Zaharia. Scaling Spark in the Real World: Performance and Usability, VLDB 2015, August 2015.
- M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. Spark SQL: Relational Data Processing in Spark, SIGMOD 2015, June 2015.
2014
- H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica, Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, SOCC 2014, November 2014.
- S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu, A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples, Genome Research, June 2014.
2013
- M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters. PhD Dissertation, 2014 ACM Doctoral Dissertation Award.
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013.
- K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013.
- R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013.
- A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, EuroSys 2013, April 2013.
2012
- A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. Multi-Resource Fair Queueing for Packet Processing, SIGCOMM 2012, August 2012. Best Paper Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Fast and Interactive Analytics over Hadoop Data with Spark, USENIX ;login:, August 2012.
- M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, HotCloud 2012, June 2012.
- L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems, USENIX ATC 2012, June 2012.
- C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo), SIGMOD 2012, May 2012. Best Demo Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
2011
- T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M.J. Franklin, P. Abbeel, and A.M. Bayen. Scaling the Mobile Millennium System in the Cloud, SOCC 2011, October 2011.
- M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011, August 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: Flexible Resource Sharing for the Cloud, USENIX ;login:, August 2011.
- M. Zaharia, B. Hindman, A. Konwinski, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, The Datacenter Needs an Operating System, HotCloud 2011, June 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011.
- A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011.
2010
- M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker and I. Stoica. Spark: Cluster Computing with Working Sets, HotCloud 2010, June 2010.
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
- M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica and M. Zaharia, Above the Clouds: A View of Cloud Computing, Communications of the ACM, April 2010.
- S. Guo, M. Derakhshani, M.H. Falaki, U. Ismail, R. Luk, E.A. Oliver, S. Ur Rahman, A. Seth, M.A. Zaharia, S. Keshav, Design and Implementation of the KioskNet System, Computer Networks, ISSN 1389-1286, DOI: 10.1016/j.comnet.2010.08.001
Earlier
- B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, A Common Substrate for Cluster Computing, HotCloud 2009, June 2009.
- R. Luk, M. Zaharia, M. Ho, B. Levine and P. Aoki, ICTD for Healthcare in Ghana: Two Parallel Case Studies, ICTD 2009, April 2009.
- M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, December 2008.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, U. Ismail, and S. Keshav, Design and Implementation of the KioskNet System, ICTD 2007, December 2007.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, and S. Keshav, Very Low-Cost Internet Access Using KioskNet, ACM Computer Communication Review, October 2007.
- M. Zaharia and S. Keshav, Gossip-based Search Selection in Hybrid Peer-to-Peer Networks, J. Concurrency and Computation: Practice and Experience, 2007.
- M. Zaharia, A. Chandel, S. Saroiu, and S. Keshav, Finding Content in File-Sharing Networks When You Can’t Even Spell, Proc. IPTPS, February 2007.
- A. Seth, D. Kroeker, M. Zaharia, S. Guo, S. Keshav, Low-cost Communication for Rural Internet Kiosks Using Mechanical Backhaul, Proc. MOBICOM 2006, September 2006.
- M. Zaharia and S. Keshav, Gossip-Based Search Selection in Hybrid Peer-to-Peer Networks, Proc. IPTPS, February 2006.
Full Publication List and Technical Reports
Open Source
Almost all of my work is open source:
- The Spark engine is now an Apache project at spark.apache.org. We have also open sourced subsequent projects including Shark, Spark SQL, MLlib, GraphFrames and Spark Streaming.
- The Mesos cluster manager is a top-level Apache project.
- The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.
I’m also a committer on the Apache Hadoop, Spark and Mesos projects.