Rainlin: 如何优雅地测量GPU CUDA Kernel耗时?(一) ("How to Elegantly Measure GPU CUDA Kernel Time, Part 1") introduced the common ways to measure GPU time. In practice, however, other problems come up, for example:

Why do identical inputs yield noticeably different measured times?

How can kernel time be measured accurately?
The problem
Consider the following common code, which does nothing more than a linear operation:
```python
import torch
import torch.nn.functional as F


def test():
    a_size = (20, 8192)
    b_size = (5120, 8192)
    events = [
        [torch.cuda.Event(enable_timing=True) for _ in range(6)]
        for _ in range(50)
    ]

    # warm up
    for _ in range(10):
        a = torch.rand(a_size, dtype=torch.float16).cuda()
        b = torch.rand(b_size, dtype=torch.float16).cuda()
        c = F.linear(a, b)

    # measure
    for i in range(10):
        a = torch.rand(a_size, dtype=torch.float16).cuda()
        b = torch.rand(b_size, dtype=torch.float16).cuda()

        events[i][0].record()
        c = F.linear(a, b)
        events[i][1].record()

        events[i][2].record()
        c = F.linear(a, b)
        events[i][3].record()

        events[i][4].record()
        c = F.linear(a, b)
        events[i][5].record()
        torch.cuda.synchronize()

    # print the timings
    for i in range(5):
        print(
            f"{i}: t1:{events[i][0].elapsed_time(events[i][1])},"
            f"t2:{events[i][2].elapsed_time(events[i][3])},"
            f"t3:{events[i][4].elapsed_time(events[i][5])}"
        )
    torch.cuda.synchronize()
```
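As an independent sanity check on the event-based numbers, the same computation can also be bracketed with torch.cuda.synchronize() and a host-side timer. Below is a minimal sketch of such a cross-check; the function name wall_clock_check and its defaults are illustrative, not part of the original code:

```python
import time

import torch
import torch.nn.functional as F


def wall_clock_check(a_size=(20, 8192), b_size=(5120, 8192), iters=10):
    # Create the inputs directly on the GPU so no host-to-device copy is timed
    a = torch.rand(a_size, dtype=torch.float16, device="cuda")
    b = torch.rand(b_size, dtype=torch.float16, device="cuda")

    torch.cuda.synchronize()  # drain any pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        c = F.linear(a, b)
    torch.cuda.synchronize()  # wait for the queued kernels to finish
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"avg wall-clock per linear: {elapsed_ms / iters:.3f} ms")
```

A large disagreement between this averaged wall-clock figure and the event-based t1/t2/t3 usually indicates that the events are capturing something besides the kernel itself, such as host-to-device copies or launch gaps.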
If we instead generate the random data directly on the GPU rather than creating it on the CPU and copying it over:

```python
    for i in range(10):
        # Generate the rand data directly on the GPU instead of copying it from the CPU
        a = torch.rand(a_size, dtype=torch.float16, device="cuda")
        b = torch.rand(b_size, dtype=torch.float16, device="cuda")
```