专栏名称: GiantPandaCV

专注于机器学习、深度学习、计算机视觉、图像处理等多个方向技术分享。团队由一群热爱技术且热衷于分享的小伙伴组成。我们坚持原创，每天一到两篇原创技术分享。希望在传播知识、分享知识的同时能够启发你，大家一起共同进步(･ω<)☆

CUDA-MODE 课程笔记第一课: 如何在 PyTorch 中 profile CUDA kernels

GiantPandaCV · 公众号 · · 2024-07-01 19:28

正文

记录 CUDA-MODE 课程学习笔记

课程的 Slides 和脚本：https://github.com/cuda-mode/lectures
课程地址：https://www.youtube.com/@CUDAMODE
我的课程笔记：https://github.com/BBuf/how-to-optim-algorithm-in-cuda/tree/master/cuda-mode

一直想系统看一下某个课程系统和科学的学习下 CUDA ，感觉 CUDA-MODE 这个课程能满足我的需求。这个课程是几个 PyTorch 的 Core Dev 搞的，比较系统和专业。不过由于这个课程是 Youtube 上的英语课程，所以要学习和理解这个课程还是需要花不少时间的，我这里记录一下学习这个课程的每一课的笔记，希望可以通过这个笔记帮助对这个课程以及 CUDA 感兴趣的读者更快吸收这个课程的知识。这个课程相比于以前的纯教程更加关注的是我们可以利用 CUDA 做什么事情，而不是让读者陷入到 CUDA 专业术语的细节中，那会非常痛苦。伟大无需多言，感兴趣请阅读本文件夹下的各个课程的学习笔记。

这里的记录可能会随意些，主要是记录一下个人觉得有用的知识或者有趣的话。

我的课程笔记，欢迎关注：https://github.com/BBuf/how-to-optim-algorithm-in-cuda/tree/master/cuda-mode

第一课: 如何在 PyTorch 中 profile CUDA kernels

这里是课程规划，有三位讲师 Andreas, Thomas, Mark，然后大概2周出一个 CUDA 主题的讲解以及工程或者结对编程的视频。课程讨论的主题是根据《Programming Massively Parallel Processors》这本书来的，Mark 也是在8分钟的时候强推了这本书。另外在6分钟左右 Mark 指出，学习 CUDA 的困难之处在于对于新手来说，可能会陷入不断循环查找文档的状态，非常痛苦。

这里是说Lecture 1的目标是如何把一个 CUDA kernel 嵌入到 PyTorch 里面，以及如何对它进行 Profile 。相关的代码都在：https://github.com/cuda-mode/lectures/tree/main/lecture_001 。Mark 还提到说这个课程相比于以前的纯教程更加关注的是我们可以利用 CUDA 做什么事情，而不是让读者陷入到 CUDA 专业术语的细节中，那会非常痛苦。

这一页 Slides 中的代码在 https://github.com/cuda-mode/lectures/blob/main/lecture_001/pytorch_square.py

import torch

a = torch.tensor([1., 2., 3.])

print(torch.square(a))
print(a ** 2)
print(a * a)

def time_pytorch_function(func, input):
    # CUDA IS ASYNC so can't use python time module
    # CUDA是异步的，所以你不能使用python的时间模块，而应该使用CUDA Event
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup (防止CUDA Context初始化影响时间记录的准确性)
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # 程序完成之后需要做一次 CUDA 同步
    torch.cuda.synchronize()
    return start.elapsed_time(end)

b = torch.randn(10000, 10000).cuda()

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)

print("=============")
print("Profiling torch.square")
print("=============")

# Now profile each function using pytorch profiler
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.square(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

print("=============")
print("Profiling a * a")
print("=============")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_2(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

print("=============")
print("Profiling a ** 2")
print("=============")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_3(b)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

这里通过在 PyTorch 中实现平方和立方函数并使用 autograd profiler 工具进行 profile 。 time_pytorch_function 这个函数的计时功能和 torch.autograd.profiler.profile 类似，第三页 Slides 里面我们可以通过 PyTorch Profiler 的结果看到当前被 torch.autograd.profiler.profile context manager 包起来的 PyTorch 程序 cuda kernel 在 cpu, cuda 上的执行时间以及占比以及 kernel 的调用次数，当前 kernel 的执行时间占总时间的比例。

这一页Slides是对 https://github.com/cuda-mode/lectures/blob/main/lecture_001/pt_profiler.py 这个文件进行讲解，之前我也翻译过PyTorch Profiler TensorBoard 插件教程，地址在 https://zhuanlan.zhihu.com/p/692749819

可以看到 aten::square 实际上是调用的 aten::pow ，然后 aten::pow 下方的 cud 指的是cuda kernel dispatch也就是启动CUDA kernel，我们还可以看到这个CUDA kernel的名字是 naive_vectorized_elementwise_kernel<4, ..> ，其中4表示Block的数量。但是这里的问题是，我们只能看到kernel的名称，无法知道它运行得多快。然后up主推荐去了解和学习PyTorch的 .cu 实现，这些实现是一个很好的工具。

PyTorch的 load_inline 可以把c/c++源码以函数的方式加载到模块中。接着作则还展示了一下怎么使用 load_inline 加载cuda的源代码：https://github.com/cuda-mode/lectures/blob/main/lecture_001/load_inline.py 。

# Look at this test for inspiration
# https://github.com/pytorch/pytorch/blob/main/test/test_cpp_extensions_jit.py

import torch
from torch.utils.cpp_extension import load_inline

# Define the CUDA kernel and C++ wrapper
cuda_source = '''
__global__ void square_matrix_kernel(const float* matrix, float* result, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row         int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);

    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<>>(
        matrix.data_ptr(), result.data_ptr(), width, height);

    return result;
    }
'''

cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# Load the CUDA kernel as a PyTorch extension
square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"




    
],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))

# (cudamode) ubuntu@ip-172-31-9-217:~/cudamode/cudamodelecture1$ python load_inline.py 
# tensor([[ 1.,  4.,  9.],
#         [16., 25., 36.]], device='cuda:0')

注意到这里的 build_directory='./load_inline_cuda', 表示构建过程生成的代码一集编译的中间产物都会保存到 https://github.com/cuda-mode/lectures/tree/main/lecture_001/load_inline_cuda 这个文件夹中。

如果想避免这种编译过程，可以考虑使用Triton，它是一个Python程序。

这个是用Triton写的square kernel，下面展示了 torch.compile , naive torch, Triton 实现的kernel在A10的性能对比：

可以看到naive torch的kernel比Triton和 torch.compile 生产的kernel都更快一点。接着又在4090上做了实验，得到了类似的结果。作者写的kernel在：https://github.com/cuda-mode/lectures/blob/main/lecture_001/triton_square.py

Triton kernel为：

# Adapted straight from https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html
import triton
import triton.language as tl
import torch

# if @triton.jit(interpret=True) does not work, please use the following two lines to enable interpret mode
# import os
# os.environ["TRITON_INTERPRET"] = "1"

@triton.jit
def square_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # The rows of the softmax are independent, so we parallelize across those
    row_idx = tl.program_id(0)
    # The stride represents how much we need to increase the pointer to advance 1 row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # The block size is the next power of two greater than n_cols, so we can fit each
    # row in a single block
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    # Load the row into SRAM, using a mask since BLOCK_SIZE may be > than n_cols
    row = tl.load(input_ptrs, mask=col_offsets 'inf'))

    square_output = row * row
    
    # Write back output to DRAM
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, square_output, mask=col_offsets 

def square(x):
    n_rows, n_cols = x.shape
    # The block size is the smallest power of two greater than the number of columns in `x`
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Another trick we can use is to ask the compiler to use more threads per row by
    # increasing the number of warps (`num_warps`) over which each row is distributed.
    # You will see in the next tutorial how to auto-tune this value in a more natural
    # way so you don't have to come up with manual heuristics yourself.
    num_warps = 4
    if BLOCK_SIZE >= 2048:
        num_warps = 8
    if BLOCK_SIZE >= 4096:
        num_warps = 16
    # Allocate output
    y = torch.empty_like(x)
    # Enqueue kernel. The 1D launch grid is simple: we have one kernel instance per row o
    # f the input matrix
    square_kernel[(n_rows, )](
        y,
        x,
        x.stride(0),
        y.stride(0),
        n_cols,
        num_warps=num_warps,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return y

这个kernel是Triton的fused softmax 教程改过来的，在那个教程里 Triton 的速度比 PyTorch 和 torch.compile 都要快，所以这里的性能表现似乎有点奇怪，因为两者都是element-wise操作。接着作者把上面的 BLOCK_SIZE 固定为1024，观察到性能有很大提升

这里如果固定了BLOCK_SIZE，那上面的Kernel也要做对应的修改比如以BLOCK_SIZE的步长来循环加载列方向的数据。

下一页Slides提到Triton现在提供了一个debugger：

开启debugger模式之后你就可以在Triton kernel里的任意一行打断点一行行检查代码，几乎所有的变量都是Tensor，你可以使用 var_name.tensor 来打印。

CUDA-MODE 课程笔记 第一课: 如何在 PyTorch 中 profile CUDA kernels

正文

(adsbygoogle = window.adsbygoogle || []).push({}); 记录 CUDA-MODE 课程学习笔记

第一课: 如何在 PyTorch 中 profile CUDA kernels

请到「今天看啥」查看全文

CUDA-MODE 课程笔记第一课: 如何在 PyTorch 中 profile CUDA kernels

记录 CUDA-MODE 课程学习笔记