Tensor Cores are an important feature of NVIDIA GPUs beginning with the Volta architecture. They allow CUDA developers to use mixed-precision computation for higher throughput without sacrificing accuracy, and they already accelerate deep learning training in many frameworks, including TensorFlow, PyTorch, MXNet, and Caffe2. This article shows how to use Tensor Cores in your own application through CUDA libraries, and how to program them directly in CUDA C++ device code.
Using Tensor Cores in cuBLAS
// First, create a cuBLAS handle:
cublasStatus_t cublasStat = cublasCreate(&handle);

// Set the math mode to allow cuBLAS to use Tensor Cores:
cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

// Allocate and initialize your matrices (only the A matrix is shown):
size_t matrixSizeA = (size_t)rowsA * colsA;
T_ELEM_IN *devPtrA = 0;
cudaMalloc((void**)&devPtrA, matrixSizeA * sizeof(devPtrA[0]));

T_ELEM_IN *A = (T_ELEM_IN *)malloc(matrixSizeA * sizeof(A[0]));
memset(A, 0xFF, matrixSizeA * sizeof(A[0]));
cublasStat = cublasSetMatrix(rowsA, colsA, sizeof(A[0]), A, rowsA, devPtrA, rowsA);

// ... allocate and initialize the B and C matrices similarly (devPtrB, devPtrC; not shown) ...

// Invoke the GEMM, ensuring k, lda, ldb, and ldc are all multiples of 8,
// and m is a multiple of 4:
cublasStat = cublasGemmEx(handle, transa, transb, m, n, k, alpha,
                          devPtrA, CUDA_R_16F, lda,
                          devPtrB, CUDA_R_16F, ldb,
                          beta, devPtrC, CUDA_R_16F, ldc,
                          CUDA_R_32F, algo);
cuBLAS users will notice a few changes compared to their existing cuBLAS GEMM code:
The routine must be a GEMM; currently, only GEMMs support Tensor Core execution.
The math mode must be set to CUBLAS_TENSOR_OP_MATH.
k, lda, ldb, and ldc must be multiples of 8; m must be a multiple of 4. Tensor Core math routines stride through input data in steps of eight values, so the matrix dimensions must be multiples of eight.
The input and output data types for the matrices must be either half precision or single precision.
GEMMs that do not satisfy the above rules fall back to a non-Tensor Core implementation; a minimal host-side check for these alignment rules is sketched below.
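These alignment rules are easy to verify at run time. The sketch below only requests CUBLAS_TENSOR_OP_MATH when the problem shape qualifies; the helpers tensorCoreEligible and setMathModeForGemm are illustrative names introduced here, not part of the cuBLAS API.

#include <cublas_v2.h>

// Illustrative helper: check the alignment rules listed above.
static bool tensorCoreEligible(int m, int k, int lda, int ldb, int ldc)
{
    // k, lda, ldb, and ldc must be multiples of 8; m must be a multiple of 4.
    return (k % 8 == 0) && (lda % 8 == 0) && (ldb % 8 == 0) &&
           (ldc % 8 == 0) && (m % 4 == 0);
}

// Request Tensor Core math only when the shape qualifies; otherwise keep the
// default math mode (cuBLAS would fall back to a non-Tensor Core kernel anyway).
static void setMathModeForGemm(cublasHandle_t handle,
                               int m, int k, int lda, int ldb, int ldc)
{
    cublasSetMathMode(handle, tensorCoreEligible(m, k, lda, ldb, ldc)
                                  ? CUBLAS_TENSOR_OP_MATH
                                  : CUBLAS_DEFAULT_MATH);
}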
Using Tensor Cores in cuDNN
Using Tensor Cores in cuDNN is also straightforward, and again requires only minor changes to existing code.
// Create a cuDNN handle:
checkCudnnErr(cudnnCreate(&handle_));

// Create your tensor descriptors:
checkCudnnErr( cudnnCreateTensorDescriptor( &cudnnIdesc ));
checkCudnnErr( cudnnCreateFilterDescriptor( &cudnnFdesc ));
checkCudnnErr( cudnnCreateTensorDescriptor( &cudnnOdesc ));
checkCudnnErr( cudnnCreateConvolutionDescriptor( &cudnnConvDesc ));

// Set tensor dimensions as multiples of eight (only the input tensor is shown here):
int dimA[] = {1, 8, 32, 32};
int strideA[] = {8192, 1024, 32, 1};
checkCudnnErr( cudnnSetTensorNdDescriptor(cudnnIdesc, getDataType(),
                                          convDim+2, dimA, strideA) );

// Allocate and initialize tensors (again, only the input tensor is shown):
checkCudaErr( cudaMalloc((void**)&(devPtrI), insize * sizeof(devPtrI[0]) ));
hostI = (T_ELEM*)calloc(insize, sizeof(hostI[0]));
initImage(hostI, insize);
checkCudaErr( cudaMemcpy(devPtrI, hostI, sizeof(hostI[0]) * insize, cudaMemcpyHostToDevice) );

// Set the compute data type (below as CUDNN_DATA_FLOAT):
checkCudnnErr( cudnnSetConvolutionNdDescriptor(cudnnConvDesc,
                                               convDim,
                                               padA,
                                               convstrideA,
                                               dilationA,
                                               CUDNN_CONVOLUTION,
                                               CUDNN_DATA_FLOAT) );

// Set the math type to allow cuDNN to use Tensor Cores:
checkCudnnErr( cudnnSetConvolutionMathType(cudnnConvDesc, CUDNN_TENSOR_OP_MATH) );

// Choose a supported algorithm:
cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;

// Allocate your workspace:
checkCudnnErr( cudnnGetConvolutionForwardWorkspaceSize(handle_, cudnnIdesc,
                                                       cudnnFdesc, cudnnConvDesc,
                                                       cudnnOdesc, algo, &workSpaceSize) );
if (workSpaceSize > 0) {
    cudaMalloc(&workSpace, workSpaceSize);
}

// Invoke the convolution:
checkCudnnErr( cudnnConvolutionForward(handle_, (void*)(&alpha), cudnnIdesc, devPtrI,
                                       cudnnFdesc, devPtrF, cudnnConvDesc, algo,
                                       workSpace, workSpaceSize, (void*)(&beta),
                                       cudnnOdesc, devPtrO) );
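The calls above go through checkCudnnErr and checkCudaErr wrappers whose definitions are not part of the sample and are not provided by cuDNN itself. A minimal sketch of what such helpers might look like (an assumption, shown only so the snippet is self-contained):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cudnn.h>

// Assumed error-checking helpers (not shown in the sample above): print the
// library error string and abort on failure.
#define checkCudnnErr(call) do {                                          \
        cudnnStatus_t s_ = (call);                                        \
        if (s_ != CUDNN_STATUS_SUCCESS) {                                 \
            fprintf(stderr, "cuDNN error %s at %s:%d\n",                  \
                    cudnnGetErrorString(s_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

#define checkCudaErr(call) do {                                           \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(e_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)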
Programmatic Access to Tensor Cores in CUDA C++
Tensor Cores can also be programmed directly in CUDA C++ device code through the WMMA (warp matrix multiply-accumulate) API, declared in mma.h under the nvcuda::wmma namespace. The kernel below begins a simple mixed-precision GEMM that computes C = alpha*A*B + beta*C with half-precision inputs and single-precision accumulation; each warp is responsible for one 16x16 tile of the output.

#include <mma.h>
using namespace nvcuda;

// The only dimensions currently supported by WMMA
const int WMMA_M = 16;
const int WMMA_N = 16;
const int WMMA_K = 16;

__global__ void wmma_example(half *a, half *b, float *c,
                             int M, int N, int K,
                             float alpha, float beta)
{
    // Leading dimensions. Packed with no transpositions.
    int lda = M;
    int ldb = K;
    int ldc = M;

    // Tile using a 2D grid
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = (blockIdx.y * blockDim.y + threadIdx.y);
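    // Note: each warp owns one WMMA_M x WMMA_N tile of the output, so blockDim.x
    // should be a multiple of warpSize. As an assumed launch configuration (not
    // shown in this excerpt), dim3 block(128, 4) gives a 4x4 arrangement of warps
    // per block, with dim3 grid((M + WMMA_M*4 - 1) / (WMMA_M*4),
    //                           (N + WMMA_N*4 - 1) / (WMMA_N*4))
    // covering the full output matrix.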