GPU Computing

Location, Time
Instructor: Xuhao Chen
Office Hours: Time, Location

Course Description

This course is an introduction to parallel computing using graphics processing units (GPUs). We will be focusing on CUDA programming, but the concepts taught will apply to other GPU frameworks as well. The course will start by covering CUDA syntax extensions and the CUDA runtime API, then move on to more advanced topics such as bandwidth optimization, memory access performance, and floating point considerations. We will learn about common parallel computing patterns such as scans and reductions, and study use cases for GPU acceleration such as matrix multiplication and convolution.


As CUDA is an extension of the C language, students taking this course should be familiar with C programming.
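To give a flavor of what these C extensions look like, below is a minimal sketch of an axpy (y = a*x + y) program, the first example covered in the course. The kernel name, launch configuration, and use of unified memory here are illustrative choices, not a prescribed course solution.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// axpy: y = a*x + y. __global__ marks a function that runs on the GPU;
// it is one of CUDA's syntax extensions to C.
__global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *x, *y;
    cudaMallocManaged(&x, bytes);  // unified memory, visible to CPU and GPU
    cudaMallocManaged(&y, bytes);
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // <<<blocks, threads>>> is the CUDA kernel-launch syntax extension.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    axpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();  // wait for the GPU before reading results

    printf("y[0] = %f\n", y[0]);  // expect 4.0 = 2*1 + 2
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Everything outside the kernel and the launch is ordinary C, which is why C fluency is the main prerequisite.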

Prior knowledge of computer architecture concepts such as data locality will be useful but not required.


Grades for this course will be based on a series of 3-5 programming assignments in which students apply the GPU programming skills taught in the lectures.

Textbook (Optional)

Programming Massively Parallel Processors, Third Edition: A Hands-on Approach
David B. Kirk and Wen-mei W. Hwu.

The Second Edition is available online here

Computing Resources

For the programming assignments, students will need access to a computer with a CUDA-compatible GPU. I can help arrange access to a remote CUDA-capable machine for students without local access.

The NVIDIA Deep Learning Institute (DLI) Teaching Kit Program

WebGPU: A System for Online GPU Development

Fundamentals of Accelerated Computing with CUDA Python

Teaching GPU Accelerated Computing: Hands-on with the NVIDIA Teaching Kit for Educators

Schedule and Slides (subject to change)

Date   Topic                                                        Assignment
3/27   Course Introduction                                          Paper reading
3/29   Intro to CUDA C                                              axpy
4/03   CUDA Parallelism Model                                       Paper reading
4/05   Memory and Data Locality; Thread Execution Efficiency        Paper reading
4/10   Memory Performance; Stencil Pattern                          Paper reading
4/12   Prefix Sum Pattern                                           Paper reading
4/17   Histogram Pattern                                            TiledMatrixMultiplication due
4/19   Sparse Matrix Pattern                                        Paper reading
4/24   Reduction Pattern                                            Assignment 2
4/26   Graph Traversal Pattern                                      Paper reading
5/01   Advanced Host/Device Interface; Streams, Events, and Concurrency   Paper reading
5/03   Dynamic Parallelism / Recursion                              Paper reading
5/08   Floating Point Considerations; Intrinsic Functions           Assignment 2 due; Final project
5/10   In-Warp Shuffles                                             Paper reading
5/15   Multi-GPU Programming                                        Final project proposal due
5/17   Using CUDA Libraries (cuDNN)
5/22   OpenCL / OpenACC                                             Paper reading
5/24   Deep Learning and Tensor Cores                               Paper reading
5/29   Graph Processing with GPUs                                   Paper reading
5/31   Data Science and Bioinformatics with GPUs; Edge AI and Robotics with GPUs   Paper reading
5/31   Ray Tracing                                                  Final project due