I have long wanted to find a course that teaches CUDA in a systematic, rigorous way, and CUDA-MODE seems to fit that need. The course is run by several PyTorch core developers, so it is quite systematic and professional. Since the lectures are English-language videos on YouTube, working through and digesting them takes a fair amount of time, so I am keeping notes on each lecture here, in the hope that they help readers who are interested in the course and in CUDA absorb the material faster. Enough said; if you are interested, please read the per-lecture notes in this folder.
Here is the course plan: there are three instructors, Andreas, Thomas, and Mark, and roughly every two weeks they release a video covering one CUDA topic plus an engineering or pair-programming session. The topics follow the book Programming Massively Parallel Processors, which Mark strongly recommends around the 8-minute mark. Around the 6-minute mark Mark also points out that what makes CUDA hard for beginners is that they can easily get stuck in an endless loop of looking things up in the documentation, which is quite painful.
The goal of Lecture 1 is to show how to embed a CUDA kernel into PyTorch and how to profile it. The related code is at: https://github.com/cuda-mode/lectures/tree/main/lecture_001 . Mark also notes that, compared with older pure tutorials, this course focuses more on what we can actually do with CUDA rather than burying readers in the details of CUDA jargon, which would be very painful.
```python
import torch

def time_pytorch_function(func, input):
    # CUDA IS ASYNC, so we can't use the Python time module; use CUDA Events instead
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup (so CUDA context initialization doesn't skew the measurement)
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # Synchronize once the work is enqueued so the end event has actually been recorded
    torch.cuda.synchronize()
    return start.elapsed_time(end)

b = torch.randn(10000, 10000).cuda()

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)
```
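Since the lecture's goal also includes profiling and not just raw timing, here is a minimal sketch of my own (not necessarily the exact invocation used in the lecture code) showing how `torch.profiler` can reveal which CUDA kernels each implementation actually launches:

```python
import torch
from torch.profiler import profile, ProfilerActivity

b = torch.randn(10000, 10000).cuda()

def square_2(a):
    return a * a

# Profile one implementation and print the CUDA kernels it launched
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    square_2(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Running the same snippet for `torch.square`, `a * a`, and `a ** 2` makes it easy to compare which low-level kernels PyTorch dispatches to in each case.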
```python
# Look at this test for inspiration
# https://github.com/pytorch/pytorch/blob/main/test/test_cpp_extensions_jit.py
import torch
from torch.utils.cpp_extension import load_inline

# Define the CUDA kernel and C++ wrapper
cuda_source = '''
__global__ void square_matrix_kernel(const float* matrix, float* result, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);

    auto result = torch::empty_like(matrix);

    // Launch configuration: 16x16 threads per block is a typical choice here
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);

    return result;
}
'''
```
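To actually call this from Python, the `cuda_source` string above still needs a matching C++ declaration and a `load_inline` call. The following is a minimal sketch assuming `cuda_source` from the block above; the extension name, the `-O2` flag, and the test tensor are my own choices rather than necessarily the lecture's exact ones:

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ declaration matching the function defined in cuda_source above
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# JIT-compile the C++/CUDA sources into a loadable PyTorch extension
square_matrix_extension = load_inline(
    name="square_matrix_extension",   # hypothetical module name
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,         # the string defined in the previous block
    functions=["square_matrix"],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device="cuda")
print(square_matrix_extension.square_matrix(a))
```

`load_inline` compiles the sources into a shared library on first use and exposes `square_matrix` as a regular Python function, which is exactly the "embed a CUDA kernel into PyTorch" step this lecture is about.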
```python
# Adapted straight from https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html
import triton
import triton.language as tl
import torch

# if @triton.jit(interpret=True) does not work, please use the following two lines to enable interpret mode
# import os
# os.environ["TRITON_INTERPRET"] = "1"

@triton.jit
def square_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # The rows of the softmax are independent, so we parallelize across those
    row_idx = tl.program_id(0)
    # The stride represents how much we need to increase the pointer to advance 1 row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # The block size is the next power of two greater than n_cols, so we can fit each
    # row in a single block
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    # Load the row into SRAM, using a mask since BLOCK_SIZE may be greater than n_cols
    row = tl.load(input_ptrs, mask=col_offsets < n_cols, other=-float('inf'))

    square_output = row * row

    # Write back output to DRAM
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, square_output, mask=col_offsets < n_cols)


def square(x):
    n_rows, n_cols = x.shape
    # The block size is the smallest power of two greater than the number of columns in `x`
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Another trick we can use is to ask the compiler to use more threads per row by
    # increasing the number of warps (`num_warps`) over which each row is distributed.
    # You will see in the next tutorial how to auto-tune this value in a more natural
    # way so you don't have to come up with manual heuristics yourself.
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    # Allocate output
    y = torch.empty_like(x)
    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row
    # of the input matrix
    square_kernel[(n_rows, )](
        y,
        x,
        x.stride(0),
        y.stride(0),
        n_cols,
        num_warps=num_warps,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y
```
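As a quick sanity check, here is a small usage sketch of my own (the tensor shape and seed are arbitrary, not from the lecture) that compares the Triton kernel above against PyTorch's built-in implementation:

```python
# Usage sketch: verify the Triton `square` wrapper against torch.square
torch.manual_seed(0)
x = torch.randn(1823, 781, device="cuda")
y_triton = square(x)
y_torch = torch.square(x)
assert torch.allclose(y_triton, y_torch)
print("Triton and PyTorch results match")
```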