This article looks at the problems that come up when measuring the execution time of GPU CUDA kernels, for example why identical inputs can yield very different measurements, and presents ways to time kernels precisely. It analyzes the likely causes, including torch.cuda.Event timings that accidentally include other work, the influence of the GPU cache, and changes in GPU clock frequency, and closes with practical suggestions such as using the nsys tool for more accurate measurement.
Background
The previous post, Rainlin: How to elegantly measure GPU CUDA kernel time? (Part 1), introduced the common ways of timing GPU code. In practice, however, further questions come up, such as:
Why do measurements of the same input differ so much?
How can kernel time be measured precisely?
Problem
Consider the following common snippet, which does nothing but a linear operation:
```python
import torch
import torch.nn.functional as F

def test():
    a_size = (20, 8192)
    b_size = (5120, 8192)
    events = [
        [torch.cuda.Event(enable_timing=True) for _ in range(6)]
        for _ in range(50)
    ]

    # warm up
    for _ in range(10):
        a = torch.rand(a_size, dtype=torch.float16).cuda()
        b = torch.rand(b_size, dtype=torch.float16).cuda()
        c = F.linear(a, b)

    # measure: time three back-to-back linear calls per iteration
    for i in range(10):
        a = torch.rand(a_size, dtype=torch.float16).cuda()
        b = torch.rand(b_size, dtype=torch.float16).cuda()

        events[i][0].record()
        c = F.linear(a, b)
        events[i][1].record()

        events[i][2].record()
        c = F.linear(a, b)
        events[i][3].record()

        events[i][4].record()
        c = F.linear(a, b)
        events[i][5].record()

    torch.cuda.synchronize()

    # print the elapsed times
    for i in range(5):
        print(
            f"{i}: t1:{events[i][0].elapsed_time(events[i][1])},"
            f"t2:{events[i][2].elapsed_time(events[i][3])},"
            f"t3:{events[i][4].elapsed_time(events[i][5])}"
        )
    torch.cuda.synchronize()
```
Even for these identical calls, the measured times can differ noticeably. One suspect is the host-side tensor creation and host-to-device copy, so the measurement loop can be changed to generate the random data directly on the GPU instead of copying it over:
```python
    for i in range(10):
        # generate the rand data directly on the GPU instead of copying it over
        a = torch.rand(a_size, dtype=torch.float16, device="cuda")
        b = torch.rand(b_size, dtype=torch.float16, device="cuda")

        events[i][0].record()
        c = F.linear(a, b)
        events[i][1].record()

        events[i][2].record()
        c = F.linear(a, b)
        events[i][3].record()

        events[i][4].record()
        c = F.linear(a, b)
        events[i][5].record()
    # ... (synchronize and print as before)
```
If the timings still vary, the GPU L2 cache is the next suspect: later calls can reuse data that the first call already pulled into cache. To rule this out, flush the cache before every timed kernel by zeroing a buffer larger than the L2 cache:
```python
# a 40 MB scratch buffer, intended to be larger than the GPU's L2 cache
fc = torch.empty(int(40 * (1024**2)), dtype=torch.int8, device="cuda")

def flush_cache():
    fc.zero_()
```
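The 40 MB constant is a device-specific choice; the buffer only has to be at least as large as the L2 cache. Below is a minimal sketch for sizing it from the device properties, assuming a PyTorch build that exposes `L2_cache_size` (older builds may not, hence the fallback):
```python
import torch

props = torch.cuda.get_device_properties(0)
# L2_cache_size is reported in bytes; fall back to 40 MB if the
# attribute is not available in this PyTorch build
l2_bytes = getattr(props, "L2_cache_size", 40 * 1024**2)
# over-allocate 2x so that zeroing the buffer evicts the entire L2 cache
fc = torch.empty(2 * l2_bytes, dtype=torch.int8, device="cuda")
```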
With flushing in place, the measurement loop becomes:
```python
...
for i in range(10):
    a = torch.rand(a_size, dtype=torch.float16, device="cuda")
    b = torch.rand(b_size, dtype=torch.float16, device="cuda")

    flush_cache()
    events[i][0].record()
    c = F.linear(a, b)
    events[i][1].record()

    flush_cache()
    events[i][2].record()
    c = F.linear(a, b)
    events[i][3].record()

    flush_cache()
    events[i][4].record()
    c = F.linear(a, b)
    events[i][5].record()
...
```
Running it again, the three timings come out essentially equal, and the nsys trace confirms it: the three compute kernels now take almost exactly the same time. The cache really was distorting the per-kernel measurements.
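For reference, a trace like this can be collected with Nsight Systems; a typical invocation looks like the following (the script name `test.py` and the report name are placeholders):
```bash
# profile the run and print per-kernel statistics when it finishes
nsys profile --stats=true -o linear_report python test.py
```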
Besides the cache, another source of variance is the GPU clock frequency changing between runs. The clock can be pinned with the following code:
```python
import os
import subprocess

DEVICE = os.environ.get("CUDA_VISIBLE_DEVICES")
CLOCK_SPEED = 1350  # Must choose a clock speed that's supported on your device.

def set_clock_speed():
    """
    Set GPU clock speed to a specific value.
    This doesn't guarantee a fixed value due to throttling, but can help reduce variance.
    """
    process = subprocess.Popen("nvidia-smi", stdout=subprocess.PIPE, shell=True)
    stdout, _ = process.communicate()
    subprocess.run(f"nvidia-smi -pm ENABLED -i {DEVICE}", shell=True)
    subprocess.run(f"nvidia-smi -lgc {CLOCK_SPEED} -i {DEVICE}", shell=True)
```
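Once the experiments are done, it makes sense to undo the lock. A matching sketch under the same assumptions (`DEVICE` is the variable defined above; `nvidia-smi -rgc` resets the locked GPU clocks):
```python
def reset_clock_speed():
    """
    Reset the GPU clocks to their default (unlocked) behavior.
    """
    subprocess.run(f"nvidia-smi -pm ENABLED -i {DEVICE}", shell=True)
    subprocess.run(f"nvidia-smi -rgc -i {DEVICE}", shell=True)
```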