Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2016 offering of 15-418/618. The full listing of lecture videos is available on the Panopto site here.
Lecture 1: Why Parallelism
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001
Lecture 2: A Modern Multi-Core Processor
(forms of parallelism + understanding latency and bandwidth)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
Lecture 3: Parallel Programming Models
(ways of thinking about parallel programs, and their corresponding hardware implementations)
Lecture 4: Parallel Programming Basics
(the thought process of parallelizing a program)
Lecture 5: GPU Architecture and CUDA Programming
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. D. Blythe (Proceedings of the IEEE, 2008), a nice overview of GPU history
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
Lecture 6: Performance Optimization I: Work Distribution and Scheduling
(achieving good work balance while minimizing the overhead of making the assignment, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al. PLDI 1998
Lecture 7: Performance Optimization II: Locality, Communication, and Contention
(message passing, async vs. blocking sends/receives, pipelining, techniques to increase arithmetic intensity, avoiding contention)
Lecture 8: Parallel Programming Case Studies
(examples of optimizing parallel programs)
Lecture 9: Workload-Driven Performance Evaluation
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
Lecture 10: Snooping-Based Cache Coherence
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
Lecture 11: Directory-Based Cache Coherence
(scaling problem of snooping, implementation of directories, directory storage optimization)
Lecture 12: A Basic Snooping-Based Multi-Processor Implementation
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
Lecture 13: Memory Consistency
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
Lecture 14: Scaling a Web Site
(scale out, load balancing, elasticity, caching)
Further Reading:
- www.highscalability.com. A cool site with many case studies (see "All Time Favorites" section)
- James Hamilton's Blog
Lecture 15: Interconnection Networks
(network properties, topology, basics of flow control)
Lecture 16: Implementing Synchronization
(machine-level atomic operations, implementing locks, implementing barriers)
Lecture 17: Fine-Grained Synchronization and Lock-Free Programming
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
Further Reading:
- A Pragmatic Implementation of Non-Blocking Linked-Lists. by T. Harris, 2001
- Lock-Free Linked Lists and Skip Lists. by M. Fomitchev and E. Ruppert, 2004
- Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. by M. Michael, IEEE Trans on Parallel and Distributed Systems, 2004
- Lock-Free Data Structures with Hazard pointers. by A. Alexandrescu and M. Michael, Dr. Dobbs, 2004
Lecture 18: Transactional Memory
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
Lecture 19: Heterogeneous Parallelism and Hardware Specialization
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
Lecture 20: Domain-Specific Programming Systems
(motivation for DSLs, case studies on Liszt and Halide)
Lecture 21: Domain-Specific Programming on Graphs
(GraphLab abstractions, GraphLab implementation, streaming graph processing, graph compression)
Further Reading:
- GraphLab Documentation
- Ligra: A Lightweight Graph Processing Framework for Shared Memory. by Shun and Blelloch, PPoPP 2013
- GraphChi: Large-Scale Graph Computation on Just a PC. by Kyrola et al. OSDI 2012
Lecture 22: In-Memory Distributed Computing using Spark
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
Further Reading:
- Apache Spark Web Site
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. by Zaharia et al. NSDI 2012
Lecture 23: Addressing the Memory Wall
(how DRAM works, cache compression, DRAM compression, upcoming memory technologies)
Further Reading:
- Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. by Pekhimenko et al. PACT 2012
Lecture 24: The Future of High-Performance Computing
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
Lecture 25: Efficiently Evaluating Deep Networks
(intro to deep networks, what convolution does, mapping convolution to matrix multiplication, deep network compression)
Lecture 26: Parallel Deep Network Training
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
Lecture 27: Parallelizing the 3D Graphics Pipeline
(parallel rasterization, Z/color-buffer compression, tiled rendering, sort-everywhere parallel rendering)
Lecture 28: Course Wrap Up + How to Give a Talk
(tips for giving a clear talk, a bit of philosophy)
Student Final Projects
(the students explore high-performance and high-efficiency topics of their choosing)