6.506 Algorithm Engineering Spring 2024

Basic Information

Instructor: Julian Shun
TA: Amy Hu
Schedule: Tuesdays and Thursdays 11am-12:30pm ET
Location: 3-370
Office hours: By appointment
Email: jshun AT mit.edu, amyhu AT mit.edu
Piazza
Units: 3-0-9
Prerequisites: 6.122 (6.046), 6.106 (6.172)

Course Description

This is a research-oriented course on algorithm engineering, covering both the theory and practice of algorithms and data structures. Students will learn about models of computation, algorithm design and analysis, and performance engineering of algorithm implementations. We will study the design and implementation of sequential, parallel, cache-efficient, external-memory, and write-efficient algorithms for fundamental problems in computing, and many of the principles of algorithm engineering will be illustrated through parallel algorithms and graph problems. Students will read and present research papers, write paper reviews, complete a problem set, participate in classroom discussions, and carry out a semester-long research project. Class time will consist of instructor and student presentations as well as group project meetings. The course is suitable for graduate students and for advanced undergraduates who have taken 6.122 (6.046) and 6.106 (6.172); mathematical maturity and familiarity with algorithm analysis and performance engineering will be assumed. Lecture attendance is required, and participation counts toward the grade.


Announcements

Schedule (tentative)

This is a tentative schedule that will be updated before the course begins. However, it should give you an idea of the topics that we will cover.

Each entry below lists the date and topic, the speaker(s), the required reading, and the optional reading.
Tuesday 2/6: Course Introduction. Speaker: Julian Shun.
Required Reading:
Algorithm Engineering - An Attempt at a Definition

A Theoretician's Guide to the Experimental Analysis of Algorithms

Slides
Optional Reading:
Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice

A Guide to Experimental Algorithmics

Algorithm engineering: an attempt at a definition using sorting as an example

Algorithm Engineering for Parallel Computation

Distributed Algorithm Engineering

Experimental algorithmics

Programming Pearls

Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time
Thursday 2/8: Parallel Algorithms. Speaker: Julian Shun.
Required Reading:
Parallel Algorithms

Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques (Chapters 4-8)

CLRS Chapter 26 (Parallel Algorithms)

Slides
Optional Reading:
Prefix Sums and Their Applications

Algorithm Design: Parallel and Sequential

Introduction to Parallel Algorithms

Scheduling Multithreaded Computations by Work Stealing

Thread Scheduling for Multiprogrammed Multiprocessors

Multidimensional Included and Excluded Sums

Work-Efficient Parallel Algorithms for Accurate Floating-Point Prefix Sums

Problem Based Benchmark Suite
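The scan (prefix-sum) primitive recurs throughout these parallel-algorithms readings. As an illustration only (not course-provided code), here is a Python sketch of Blelloch's work-efficient exclusive prefix sum; the loops marked `parallel_for` have independent iterations and would be spawned in parallel in a real Cilk/C++ implementation.

```python
def plus_scan(a):
    """Exclusive prefix sum ("plus-scan") via Blelloch's two-pass
    upsweep/downsweep scheme: O(n) work and O(log n) span.
    Loops tagged parallel_for have independent iterations."""
    n = len(a)
    size = 1                      # pad to a power of two for simplicity
    while size < n:
        size *= 2
    t = list(a) + [0] * (size - n)
    stride = 1                    # upsweep: build a tree of partial sums
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):   # parallel_for
            t[i] += t[i - stride]
        stride *= 2
    t[size - 1] = 0               # downsweep: push prefixes back down
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):   # parallel_for
            left = t[i - stride]
            t[i - stride] = t[i]
            t[i] += left
        stride //= 2
    return t[:n]
```

For example, `plus_scan([1, 2, 3, 4])` returns `[0, 1, 3, 6]`.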
Tuesday 2/13: Parallel Graph Traversal. Speakers: Daniel Zeng and Joseph Zhang.
Required Reading:
Direction-Optimizing Breadth-First Search Slides

The More the Merrier: Efficient Multi-Source Graph Traversal Slides

(Presentations postponed to 2/15 due to MIT closing.)
Optional Reading:
A Work-Efficient Parallel Breadth-First Search Algorithm (or How to Cope with the Nondeterminism of Reducers)

Internally Deterministic Parallel Algorithms Can Be Fast

SlimSell: A Vectorizable Graph Representation for Breadth-First Search
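For readers new to the direction-optimizing idea in the first required paper, the following is a simplified Python sketch (my own, not the paper's code): small frontiers take top-down steps, large frontiers switch to bottom-up. The switch test here (frontier size versus n/alpha) is a stand-in; Beamer et al. compare edge counts and use two thresholds.

```python
def bfs_hybrid(adj, source, alpha=4):
    """Direction-optimizing BFS sketch. In a bottom-up step, every
    unvisited vertex scans its neighbors for a current-frontier member
    and stops at the first hit, which can skip most edge traversals on
    large frontiers. `adj` is an adjacency list of an undirected graph."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    while frontier:
        next_frontier = []
        if len(frontier) * alpha > n:           # bottom-up step
            in_frontier = [False] * n
            for u in frontier:
                in_frontier[u] = True
            for v in range(n):                  # parallelizable over v
                if parent[v] == -1:
                    for u in adj[v]:
                        if in_frontier[u]:
                            parent[v] = u       # first frontier neighbor wins
                            next_frontier.append(v)
                            break
        else:                                   # top-down step
            for u in frontier:                  # parallelizable over u
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.append(v)
        frontier = next_frontier
    return parent
```

On a 4-vertex path graph, `bfs_hybrid([[1], [0, 2], [1, 3], [2]], 0)` returns the parent array `[0, 0, 1, 2]`.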
Thursday 2/15: Parallel Graph Traversal. Presentations from 2/13.
Tuesday 2/20: No class.
Thursday 2/22: External-Memory Algorithms. Speakers: Alex Bardon and Nguyen Le.
Required Reading:
The input/output complexity of sorting and related problems Slides

A Functional Approach to External Graph Algorithms Slides
Optional Reading:
Algorithms and Data Structures for External Memory

I/O-Complexity of Graph Algorithms

External-Memory Graph Algorithms

Beyond Synchronous: New Techniques for External-Memory Graph Connectivity and Minimum Spanning Forest

Fundamental parallel algorithms for private-cache chip multiprocessors

Parallel External Memory Graph Algorithms

Efficient Parallel and External Matching

External-Memory Computational Geometry

External Memory Geometric Data Structures

A Bulk-Parallel Priority Queue in External Memory with STXXL
Tuesday 2/27: Cache-Oblivious Algorithms. Speakers: Alex Bardon and Julian Shun.
Required Reading:
Cache-Oblivious Algorithms

Engineering a cache-oblivious sorting algorithm Slides

A Simple and Practical Linear-Work Parallel Algorithm for Connectivity Slides

Optional Reading:
Cache-Oblivious Algorithms and Data Structures

Cache Oblivious Distribution Sweeping

Cache-oblivious databases: Limitations and opportunities

An Optimal Cache-Oblivious Priority Queue and Its Application to Graph Algorithms

Cache-Oblivious B-Trees

Cache-oblivious streaming B-trees

An experimental comparison of cache-oblivious and cache-conscious programs
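The signature move in these papers is divide-and-conquer that never mentions cache parameters. As a toy illustration (my own sketch, not from the readings), here is a cache-oblivious matrix transpose: recursion halves the larger dimension, so every submatrix eventually fits in every cache level; `base` is only a constant-factor tuning knob, not a cache size.

```python
def co_transpose(A, B, r0=0, r1=None, c0=0, c1=None, base=16):
    """Cache-oblivious transpose of A (n x m) into B (m x n). No cache
    size or line size appears anywhere; the recursion alone yields the
    optimal O(nm/B) cache complexity (with a tall cache)."""
    if r1 is None:
        r1, c1 = len(A), len(A[0])
    if (r1 - r0) <= base and (c1 - c0) <= base:
        for i in range(r0, r1):       # base case: copy the small block
            for j in range(c0, c1):
                B[j][i] = A[i][j]
    elif (r1 - r0) >= (c1 - c0):      # split the taller dimension
        mid = (r0 + r1) // 2
        co_transpose(A, B, r0, mid, c0, c1, base)
        co_transpose(A, B, mid, r1, c0, c1, base)
    else:                             # split the wider dimension
        mid = (c0 + c1) // 2
        co_transpose(A, B, r0, r1, c0, mid, base)
        co_transpose(A, B, r0, r1, mid, c1, base)
```

The two recursive halves are also independent, so the same structure parallelizes directly.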
Thursday 2/29: Sorting Algorithms. Speakers: Kasra Mazaheri and Alex Wang.
Required Reading:
Engineering In-place (Shared-memory) Sorting Algorithms Slides

High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems Slides

Optional Reading:
Theoretically-Efficient and Practical Parallel In-Place Radix Sorting

Super Scalar Sample Sort

Parallel Merge Sort

Parallelism in Randomized Incremental Algorithms

Optimal and sublogarithmic time randomized parallel sorting algorithms

A comparison of sorting algorithms for the connection machine CM-2

Efficient implementation of sorting on multi-core SIMD CPU architecture

Sorting Data on Ultra-Large Scale with RADULS

Even Faster Sorting of (Not Only) Integers

Highly scalable parallel sorting

A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Designing efficient sorting algorithms for manycore GPUs

TritonSort: A Balanced Large-Scale Sorting System

A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort

Practical Massively Parallel Sorting

Engineering a Multi-core Radix Sort

PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort

Parallel In-Place Algorithms: Theory and Practice

Sort Benchmark
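Several of the sorting papers above (Super Scalar Sample Sort and its relatives) share one skeleton: sample splitters, route keys to buckets, then sort buckets independently. The following is a deliberately simplified sequential Python sketch of that skeleton; the function name and the `num_buckets`/`oversample` parameters are illustrative choices, not values from any paper.

```python
import bisect
import random

def sample_sort(a, num_buckets=8, oversample=8):
    """Sample-sort skeleton: (1) pick splitters from a random oversampled
    set, (2) route each key to a bucket by binary search over the
    splitters, (3) sort buckets independently (in parallel, in a real
    implementation) and concatenate."""
    if len(a) <= num_buckets * oversample:
        return sorted(a)              # tiny inputs: no point partitioning
    sample = sorted(random.sample(a, num_buckets * oversample))
    splitters = sample[oversample::oversample][:num_buckets - 1]
    buckets = [[] for _ in range(num_buckets)]
    for x in a:                       # bucket i gets keys <= splitters[i]
        buckets[bisect.bisect_left(splitters, x)].append(x)
    out = []
    for b in buckets:                 # each bucket sorts independently
        out.extend(sorted(b))
    return out
```

Oversampling makes severely unbalanced buckets unlikely; the engineering papers above are largely about making the routing step branch-free and cache-friendly.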
Monday 3/4: Problem Set due today.
Tuesday 3/5: Pre-proposal Meeting.
Thursday 3/7: Distributed Graph Frameworks. Speakers: Julian Shun, Miranda Cai, and Daniel Zeng.
Required Reading:
Scalability! But at what COST? Slides

Pregel: A System for Large-Scale Graph Processing* Slides

PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs* Slides

(Pick one of the starred papers to write a paper review for.)
Optional Reading:
Distributed Memory Graph Frameworks
Tuesday 3/12: Single-Machine Graph Frameworks. Speakers: Jonathan Li and Daniel Schaffer.
Required Reading:
GraphChi: Large-Scale Graph Computation on Just a PC Slides

CoroGraph: Bridging Cache Efficiency and Work Efficiency for Graph Algorithm Execution Slides
Optional Reading:
External Memory Graph Frameworks

Shared Memory Graph Frameworks
Thursday 3/14: Scheduling and Triangle Counting. Speakers: Jason Liu and Julian Shun.
Required Reading:
Thread Scheduling for Multiprogrammed Multiprocessors Slides

Multicore Triangle Computations Without Tuning Slides
Optional Reading:
The Implementation of the Cilk-5 Multithreaded Language

Scheduling Multithreaded Computations by Work Stealing

The Data Locality of Work Stealing

OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code
Friday 3/15: Project Proposal due today.
Tuesday 3/19: Locality in Graph Processing. Speakers: Miranda Cai and Julian Shun.
Required Reading:
Making Caches Work for Graph Analytics Slides

Chapters 7-8 of Shared-Memory Parallelism Can Be Simple, Fast, and Scalable Slides
Optional Reading:
Reducing Pagerank Communication via Propagation Blocking

GraphIt: A High-Performance DSL for Graph Analytics Locality Optimizations

Shared Memory Graph Frameworks
Thursday 3/21: Subgraph Finding. Speakers: Raymond Pan and Xander Morgan.
Required Reading:
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra Slides

ProbGraph: High-Performance and High-Accuracy Graph Mining with Probabilistic Set Representations Slides
Optional Reading:
Subgraph Finding
Tuesday 3/26: Spring Break.
Thursday 3/28: Spring Break.
Tuesday 4/2: Graph Compression and Reordering. Speakers: Raymond Pan and Anika Cheerla.
Required Reading:
An Experimental Analysis of a Compact Graph Representation Slides

CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression Slides
Optional Reading:
Compact Representations of Separable Graphs

Chapter 8 of Shared-Memory Parallelism Can Be Simple, Fast, and Scalable

Graph Compression

Graph Partitioning and Reordering
Thursday 4/4: Compression. Speakers: Xander Morgan and Joseph Zhang.
Required Reading:
Decoding billions of integers per second through vectorization Slides

Techniques for Inverted Index Compression Slides
Optional Reading:
An Experimental Study of Bitmap Compression vs. Inverted List Compression

Stream VByte: Faster byte-oriented integer compression

Roaring bitmaps: Implementation of an optimized software library

SIMD compression and the intersection of sorted integers

A General SIMD-Based Approach to Accelerating Compression Algorithms

Lightweight Data Compression Algorithms: An Experimental Survey

Compact inverted index storage using general-purpose compression libraries

An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication

SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms
Tuesday 4/9: Dynamic Graph Algorithms. Speakers: Ian Gatlin and Kasra Mazaheri.
Required Reading:
A New Parallel Algorithm for Connected Components in Dynamic Graphs Slides

Work-Efficient Parallel Union-Find Slides
Optional Reading:
Dynamic Graph Algorithms

Streaming Graph Frameworks
Thursday 4/11: Stencil Computations. Speakers: Vishaal Ram and Alex Wang.
Required Reading:
Cache oblivious stencil computations Slides

Fast Stencil Computations using Fast Fourier Transforms Slides
Optional Reading:
The cache complexity of multithreaded cache oblivious algorithms

The pochoir stencil compiler
Tuesday 4/16: Aggregation and Sorting (Mid-term report due today). Speakers: Ian Gatlin and Jonathan Li.
Required Reading:
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation Slides

Parallel Integer Sort: Theory and Practice Slides
Optional Reading:
A Top-Down Parallel Semisort

Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited

Main-Memory Hash Joins on Modern Processor Architectures

Sort vs. Hash Join Revisited for Near-Memory Execution

Many-query join: efficient shared execution of relational joins on modern hardware

An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory

Optimal and sublogarithmic time randomized parallel sorting algorithms
Thursday 4/18: Static and Streaming Graph Processing. Speaker: Julian Shun.
Required Reading:
GraphIt: A High-Performance DSL for Graph Analytics

Low-Latency Graph Streaming Using Compressed Purely-Functional Trees

Slides
Optional Reading:
Streaming Graph Frameworks

PaC-trees: supporting parallel and compressed purely-functional collections
Tuesday 4/23: Concurrent Algorithms. Speaker: Yuanhao Wei.
Required Reading:
The Art of Multiprocessor Programming, Sections 3.1-3.6 and Chapter 9

Slides
Optional Reading:
Linearizability: A Correctness Condition for Concurrent Objects

A Lazy Concurrent List-Based Set Algorithm

A Pragmatic Implementation of Non-Blocking Linked-Lists

Non-blocking binary search trees
Thursday 4/25: Sparse Computations on GPUs. Speaker: Changwan Hong.
Required Reading:
Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU

spECK: Accelerating GPU Sparse Matrix-matrix Multiplication through Lightweight Analysis

Slides
Optional Reading:
Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications

CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication

yaSpMV: yet another SpMV framework on GPUs

Merge-Based Parallel Sparse Matrix-Vector Multiplication

Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication

Adaptive Sparse Matrix-Matrix Multiplication on the GPU
Tuesday 4/30: Clustering. Speaker: Julian Shun.
Required Reading:
Theoretically-Efficient and Practical Parallel DBSCAN

Parallel Index-Based Structural Graph Clustering and Its Approximation

Slides
Optional Reading:
Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering

ParChain: A Framework for Parallel Hierarchical Agglomerative Clustering using Nearest-Neighbor Chain
Thursday 5/2: Nearest Neighbors.
Required Reading:
Parallel Nearest Neighbors in Low Dimensions with Batch Updates

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
Optional Reading:
Scaling Graph-Based ANNS Algorithms to Billion-Size Datasets: A Comparative Analysis

Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
Tuesday 5/7: Graph Neural Networks.
Required Reading:
Scalable and Efficient Full-Graph GNN Training for Large Graphs

MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
Optional Reading:
Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning

LightNE: A Lightweight Graph Processing System for Network Embedding

Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining

Marius: Learning Massive Graph Embeddings on a Single Machine

ThunderRW: An In-Memory Graph Random Walk Engine

Random Walks on Huge Graphs at Cache Efficiency

Graph Representation Learning Book

CS224W: Machine Learning with Graphs (Stanford course)
Thursday 5/9: Batch-Dynamic Algorithms.
Required Reading:
Parallel Batch-Dynamic Algorithms for k-Core Decomposition and Related Graph Problems
Optional Reading:
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Tuesday 5/14: Project Presentations. Final report due today.

Course Policy

This is a graduate-level course where we will cover research in algorithm engineering. Advanced undergraduates may enroll if they have taken 6.122 (6.046) and 6.106 (6.172). The course units are 3-0-9.

Assignments will be posted and submitted on Canvas.

Lectures

For each lecture we will study several research papers; for each paper one student will give a presentation, which will be followed by a discussion.

Grading

The assignments consist of answering questions about papers, writing paper reviews, completing a problem set, doing several paper presentations, class participation, and completing an open-ended research project. The grading breakdown is as follows.

Grading Breakdown
Paper Reviews: 20%
Problem Set: 10%
Paper Presentations: 15%
Research Project: 45%
Class Participation: 10%

You must complete all assignments to pass the class.

Paper Reviews and Problem Set

Prior to each lecture, you are expected to read the papers under "Required Reading" in the schedule for that lecture. In addition, students are required to submit one paper review before every lecture, due at 10am on the day of the lecture. The review should be on a paper chosen from any of the papers under "Required Reading" for the lecture.

The review should first describe the problem the paper is trying to solve, why it is important, the main ideas proposed, and the results obtained. The review should then describe how the ideas are novel compared to existing work at the time, the strengths and weaknesses of the paper, any ideas you may have for improving the techniques and/or evaluation, and any open problems or directions for further work that you can think of. The review should be 3-4 paragraphs, written in your own words. Please do not copy text from the paper, your classmates, or ChatGPT.

Finally, there will be a problem set on parallel algorithms to be released in the second week of the semester and due at 11:59pm on Monday 3/4.

Paper Presentations

Students will give presentations on assigned papers during the semester. Each presentation should be 25-30 minutes long with slides, followed by a discussion. The presentation should cover the motivation for the problem being solved, any definitions needed to understand the paper, the key technical ideas, theoretical results and proofs, experimental results, and how the paper relates to existing work; theoretical proofs may be presented on the board. The presentation should also include any relevant content from related work needed to fully understand the paper being presented. The presenter should share their own thoughts on the paper (perceived strengths and weaknesses, directions for future work, etc.), pose several questions for discussion, and be prepared to lead the discussion on the paper.

Research Project

A large part of the work in this course is in proposing and completing an open-ended research project. The project may involve (but is not limited to) any of the following tasks: implementation of non-trivial and theoretically-efficient algorithms; analyzing and optimizing the performance of existing algorithm implementations; designing and implementing new algorithms that are theoretically and/or practically efficient; applying algorithms in the context of larger applications; and improving or designing new programming frameworks/systems for classes of algorithms. The project may explore parallelism, cache-efficiency, I/O-efficiency, and memory-efficiency. There should be an implementation component to the project. The project can be related to research that you are currently doing, subject to instructor approval.

The project will be done in groups of 1-3 people and consist of a proposal, mid-term report, final presentation, and final report. The timeline for the project is as follows.

Pre-proposal Meeting: 3/5
Proposal: 3/15
Weekly Progress Reports: 3/22, 4/5, 4/12, 4/19, 4/26, 5/3, 5/10
Mid-term Report: 4/16
Project Presentations: 5/14
Final Report: 5/14

Piazza

We will be using Piazza for answering questions. You may publicly post questions about the papers or any cool ideas that you have. Discussion among students is encouraged. If possible, please use Piazza instead of emailing the course staff. You may use a private note if you want your post to be visible only to the course staff.

Collaboration Policy

You are welcome to read the papers with other students, but all of your written work should be your own. You should carefully acknowledge any collaborators and any outside sources you use to complete the paper reviews, problem set, and the project reports.

Computing Resources

You can obtain student credits on a cloud computing provider, such as Google Cloud Platform.

Useful Resources

Introduction to Algorithms, 4th Edition by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein (CLRS)

Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice

Networks, Crowds, and Markets by David Easley and Jon Kleinberg

A list of papers related to graph analytics