This course equips students with the knowledge and skills needed to develop fast algorithms and implement them on massively parallel modern supercomputers, using parallel programming techniques such as SIMD, OpenMP, MPI, and CUDA. It covers how to use various linear algebra libraries for parallel execution on both CPUs and GPUs. Tutorials on using debuggers and profilers in a massively parallel environment will also be given, along with demonstrations of performance primitives such as MapReduce and of graph partitioning tools, and tips on running deep learning frameworks on large GPU supercomputers.
By the end of this course, students will be able to:
1. Use SIMD vectorization, shared memory parallelization via OpenMP, and distributed memory parallelization via MPI
2. Program GPUs using OpenACC, CUDA, and OpenCL
5. Understand how high-performance numerical libraries work, and use them appropriately
4. Debug and profile code in a parallel environment by using parallel debuggers and profilers
5. Use performance primitives such as ModernGPU and MapReduce to achieve high performance with minimal effort
6. Use graph partitioning tools and deep learning frameworks on massively parallel computers
Vectorization, Shared memory parallelism, Distributed memory parallelism, GPU programming, Numerical libraries, Matrix multiplication, Linear solvers, FFT, Parallel debuggers, Parallel profilers, Graph partitioning, Deep learning
✔ Specialist skills | Intercultural skills | Communication skills | ✔ Critical thinking skills | ✔ Practical and/or problem-solving skills
Sample codes will be prepared for each lecture, and exercises will be performed on TSUBAME. Minimal sketches of the kind of code the exercises involve are shown after the course schedule below.
Class | Course schedule | Required learning |
---|---|---|
Class 1 | How to use TSUBAME | Log in to Tokyo Tech's supercomputer TSUBAME and learn how to use libraries and the job scheduler |
Class 2 | Shared memory parallelization | Use pthreads and OpenMP to achieve shared memory parallelization |
Class 3 | Distributed memory parallelization | Use MPI to achieve distributed memory parallelization |
Class 4 | SIMD parallelization | Use SSE, AVX, and AVX512 to achieve SIMD vectorization |
Class 5 | GPU programming | Use OpenACC, CUDA, and OpenCL to program GPUs |
Class 6 | Multi-GPU programming | Combine CUDA and MPI to use multiple GPUs on TSUBAME |
Class 7 | Cache blocking | Use BLISlab and cuBLAS as examples to practice cache blocking |
Class 8 | Numerical libraries | Understand how LAPACK, ScaLAPACK, and FFTW work, and learn to use them appropriately |
Class 9 | Fast linear solvers | Understand how to choose the appropriate solvers in PETSc and Trilinos |
Class 10 | I/O libraries | Use NetCDF, HDF5, and MPI-IO to read and write files on large parallel file systems |
Class 11 | Parallel debugger | Use CUDA-GDB, Valgrind, and TotalView to debug parallel code |
Class 12 | Parallel profiler | Use gprof, VTune, PAPI, TAU, and Vampir to profile parallel code |
Class 13 | Performance primitives | Learn how to use performance primitives such as ModernGPU and MapReduce |
Class 14 | Graph partitioning | Use METIS and ParMETIS to partition a large graph in parallel |
Class 15 | Deep Learning | Use ChainerMN to train a large neural network on a parallel computer |
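The sketches below illustrate the kind of sample code the exercises involve. They are minimal illustrations, not the course's official materials; file names, sizes, and build commands are assumptions. First, a sketch for Class 2: summing an array with an OpenMP reduction.

```cpp
// Class 2 sketch: shared memory parallelization with OpenMP.
// Assumed build: g++ -fopenmp omp_sum.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
  std::vector<double> a(1 << 20, 1.0);
  double sum = 0.0;
  // Each thread accumulates a private partial sum; the reduction
  // clause combines the partial sums when the loop ends.
  #pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < (long)a.size(); i++)
    sum += a[i];
  printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
  return 0;
}
```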
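A sketch for Class 3, assuming an MPI compiler wrapper such as mpicxx: every rank contributes one value and MPI_Allreduce combines them across processes.

```cpp
// Class 3 sketch: distributed memory parallelization with MPI.
// Assumed build and run: mpicxx mpi_sum.cpp && mpirun -np 4 ./a.out
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // Sum the rank numbers over all processes; every rank receives the result.
  int local = rank, total = 0;
  MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("rank %d of %d: total = %d\n", rank, size, total);
  MPI_Finalize();
  return 0;
}
```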
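A sketch for Class 4 using AVX intrinsics (the SSE and AVX-512 variants differ mainly in register width and intrinsic names); the array size is an assumption.

```cpp
// Class 4 sketch: SIMD vectorization with AVX intrinsics.
// Assumed build: g++ -mavx avx_add.cpp
#include <cstdio>
#include <immintrin.h>

int main() {
  alignas(32) double a[8], b[8], c[8];
  for (int i = 0; i < 8; i++) { a[i] = i; b[i] = 2.0 * i; }
  // Process 4 doubles per iteration: one 256-bit AVX register holds 4 doubles.
  for (int i = 0; i < 8; i += 4) {
    __m256d va = _mm256_load_pd(a + i);
    __m256d vb = _mm256_load_pd(b + i);
    _mm256_store_pd(c + i, _mm256_add_pd(va, vb));
  }
  for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);  // 0 3 6 9 ...
  printf("\n");
  return 0;
}
```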
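A sketch for Class 5 in CUDA (OpenACC and OpenCL express the same element-per-thread pattern with different syntax); unified memory is used only to keep the sketch short.

```cpp
// Class 5 sketch: GPU programming with CUDA.
// Assumed build: nvcc vec_add.cu
#include <cstdio>

__global__ void add(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const int n = 1 << 20;
  float *a, *b, *c;
  cudaMallocManaged(&a, n * sizeof(float));
  cudaMallocManaged(&b, n * sizeof(float));
  cudaMallocManaged(&c, n * sizeof(float));
  for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
  add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 256 threads per block
  cudaDeviceSynchronize();
  printf("c[0] = %f\n", c[0]);  // expect 3.0
  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}
```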
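A sketch for Class 6 showing the usual first step of multi-GPU programming: mapping MPI ranks to devices. The round-robin mapping assumes ranks are placed onto nodes in order; production codes derive a node-local rank instead.

```cpp
// Class 6 sketch: combining MPI and CUDA; bind each rank to one GPU.
// Assumed build: MPI compiler wrapper with the CUDA runtime linked in.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, ndev = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);  // round-robin: assumes in-order rank placement
  printf("rank %d -> GPU %d of %d\n", rank, rank % ndev, ndev);
  MPI_Finalize();
  return 0;
}
```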
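A sketch of the cache-blocking idea behind Class 7 (BLISlab and cuBLAS implement far more refined versions): a matrix multiply is tiled so each block of the operands stays in cache while it is reused. The matrix and block sizes are assumptions.

```cpp
// Class 7 sketch: cache blocking applied to matrix multiplication.
#include <cstdio>
#include <vector>

int main() {
  const int N = 256, B = 32;  // assumed matrix size and block size
  std::vector<double> A(N * N, 1.0), Bm(N * N, 1.0), C(N * N, 0.0);
  // Tile all three loops; each B x B tile is reused while it is hot in cache.
  for (int ii = 0; ii < N; ii += B)
    for (int kk = 0; kk < N; kk += B)
      for (int jj = 0; jj < N; jj += B)
        for (int i = ii; i < ii + B; i++)
          for (int k = kk; k < kk + B; k++)
            for (int j = jj; j < jj + B; j++)
              C[i * N + j] += A[i * N + k] * Bm[k * N + j];
  printf("C[0][0] = %.0f (expect %d)\n", C[0], N);
  return 0;
}
```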
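A sketch for Class 8, assuming the LAPACKE C interface to LAPACK is available: dgesv factorizes a dense matrix and solves a linear system in one call.

```cpp
// Class 8 sketch: calling LAPACK's dgesv through LAPACKE.
// Assumed link line: -llapacke -llapack -lblas
#include <cstdio>
#include <lapacke.h>

int main() {
  double A[4] = {2, 1,
                 1, 3};       // 2x2 matrix in row-major order
  double b[2] = {3, 5};       // right-hand side
  lapack_int ipiv[2];
  // Solves A x = b in place; b is overwritten with the solution x.
  lapack_int info =
      LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
  printf("info = %d, x = (%.1f, %.1f)\n", (int)info, b[0], b[1]);
  return 0;
}
```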
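A sketch for Class 10 using MPI-IO (NetCDF and HDF5 layer structured metadata on a similar model): each rank writes its own disjoint block of one shared file. The file name is an assumption.

```cpp
// Class 10 sketch: parallel I/O to a single shared file with MPI-IO.
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  int data[4] = {rank, rank, rank, rank};
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "output.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  // Each rank writes at its own offset, so the blocks never overlap.
  MPI_Offset offset = (MPI_Offset)rank * sizeof(data);
  MPI_File_write_at(fh, offset, data, 4, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}
```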
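A sketch for Class 14 using serial METIS (ParMETIS distributes the same operation over MPI): a small graph in CSR form is split into two parts.

```cpp
// Class 14 sketch: partitioning a graph with METIS.
// Assumed link line: -lmetis
#include <cstdio>
#include <metis.h>

int main() {
  idx_t nvtxs = 4, ncon = 1, nparts = 2, objval = 0;
  // CSR adjacency of the path graph 0-1-2-3.
  idx_t xadj[5]   = {0, 1, 3, 5, 6};
  idx_t adjncy[6] = {1, 0, 2, 1, 3, 2};
  idx_t part[4];
  // NULL arguments request unit weights and default options.
  METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy, NULL, NULL, NULL,
                      &nparts, NULL, NULL, NULL, &objval, part);
  for (idx_t v = 0; v < nvtxs; v++)
    printf("vertex %d -> part %d\n", (int)v, (int)part[v]);
  return 0;
}
```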
Evaluation is based on written reports (40%) and a final report (60%).