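NCCL (NVIDIA Collective Communications Library) is NVIDIA's library for high-performance communication between GPUs. As deep learning models grow (GPT-3, for example, has 175 billion parameters), a single GPU can no longer meet the needs of training. The model or the data has to be split across multiple GPUs for parallel training, and those GPUs must then exchange data with one another. NCCL was built for exactly this scenario. It mainly addresses the following problems: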
Topology matrix (the format printed by nvidia-smi topo -m) for an 8-GPU machine without NVLink, where every GPU pair is reported as SYS:

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0    X     SYS   SYS   SYS   SYS   SYS   SYS   SYS   0-15,32-47     0              N/A
GPU1    SYS   X     SYS   SYS   SYS   SYS   SYS   SYS   0-15,32-47     0              N/A
GPU2    SYS   SYS   X     SYS   SYS   SYS   SYS   SYS   0-15,32-47     0              N/A
GPU3    SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   0-15,32-47     0              N/A
GPU4    SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   16-31,48-63    1              N/A
GPU5    SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   16-31,48-63    1              N/A
GPU6    SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   16-31,48-63    1              N/A
GPU7    SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     16-31,48-63    1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

The same matrix for an 8-GPU machine where every GPU pair is connected by a bonded set of 18 NVLinks (NV18):

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0    X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU1    NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU2    NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU3    NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  0-47,96-143    0              N/A
GPU4    NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  48-95,144-191  1              N/A
GPU5    NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  48-95,144-191  1              N/A
GPU6    NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  48-95,144-191  1              N/A
GPU7    NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     48-95,144-191  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
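
A topology matrix like the ones above can also be read programmatically, for example to check which link type connects each GPU pair before launching a multi-GPU job. The following is a minimal Python sketch, assuming the whitespace-separated layout shown above. The function name `parse_topo_matrix` and the 3-GPU `SAMPLE` matrix are made up for illustration and are not taken from a real machine.

```python
from collections import Counter


def parse_topo_matrix(text):
    """Return {(row_gpu, col_gpu): link_type} for every off-diagonal pair."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # Keep only real GPU columns (GPU0, GPU1, ...); this drops the
    # "CPU Affinity", "NUMA Affinity" and "GPU NUMA ID" header words.
    gpus = [t for t in lines[0].split() if t.startswith("GPU") and t[3:].isdigit()]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        if not tokens or tokens[0] not in gpus:
            continue  # legend lines and anything else that is not a matrix row
        row = tokens[0]
        for col, link in zip(gpus, tokens[1:1 + len(gpus)]):
            if link != "X":  # "X" marks the GPU itself
                links[(row, col)] = link
    return links


# Hypothetical 3-GPU matrix, for illustration only.
SAMPLE = """
        GPU0  GPU1  GPU2  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0    X     NV18  SYS   0-47          0              N/A
GPU1    NV18  X     SYS   0-47          0              N/A
GPU2    SYS   SYS   X     48-95         1              N/A
"""

links = parse_topo_matrix(SAMPLE)
link_counts = Counter(links.values())
print(link_counts)
```

On a real machine the same function could be fed the captured text of the topology command instead of `SAMPLE`. A matrix that contains only SYS entries, like the first one above, means GPU-to-GPU traffic crosses PCIe and the inter-socket link rather than NVLink.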
Per-link NVLink speeds for GPU 0 (the format printed by nvidia-smi nvlink -s):

GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-5a10e6e5-95f7-2785-ed63-6f6147f304f7)
Link 0: 26.562 GB/s
Link 1: 26.562 GB/s
Link 2: 26.562 GB/s
Link 3: 26.562 GB/s
Link 4: 26.562 GB/s
Link 5: 26.562 GB/s
Link 6: 26.562 GB/s
Link 7: 26.562 GB/s
Link 8: 26.562 GB/s
Link 9: 26.562 GB/s
Link 10: 26.562 GB/s
Link 11: 26.562 GB/s
Link 12: 26.562 GB/s
Link 13: 26.562 GB/s
Link 14: 26.562 GB/s
Link 15: 26.562 GB/s
Link 16: 26.562 GB/s
Link 17: 26.562 GB/s
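
The per-link figures above can be summed to get the total link rate of one GPU. This is a small Python sketch, assuming lines of the form `Link N: X GB/s` as shown above. The function name `total_link_bandwidth` is hypothetical, and the result is simply the sum of the listed values, with no claim about whether each value is per direction.

```python
import re


def total_link_bandwidth(text):
    """Return (number_of_links, sum_of_rates_in_GB_per_s) from a link status listing."""
    rates = [float(v) for v in re.findall(r"Link\s+\d+:\s+([\d.]+)\s+GB/s", text)]
    return len(rates), sum(rates)


# Rebuild the 18 lines listed above (Link 0 through Link 17, 26.562 GB/s each).
status_text = "\n".join(f"Link {i}: 26.562 GB/s" for i in range(18))

n_links, total = total_link_bandwidth(status_text)
print(f"{n_links} links, {total:.3f} GB/s in total")
```

For the listing above this gives 18 links and 18 × 26.562 = 478.116 GB/s.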
Per-link NVLink capabilities for GPU 0 (the format printed by nvidia-smi nvlink -c):

GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-5a10e6e5-95f7-2785-ed63-6f6147f304f7)
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: true
Link 0, Link is supported: true
Link 1, P2P is supported: true
Link 1, Access to system memory supported: true
Link 1, P2P atomics supported: true
Link 1, System memory atomics supported: true
Link 1, SLI is supported: true
Link 1, Link is supported: true
Link 2, P2P is supported: true
Link 2, Access to system memory supported: true
Link 2, P2P atomics supported: true
Link 2, System memory atomics supported: true
Link 2, SLI is supported: true
Link 2, Link is supported: true
Link 3, P2P is supported: true
Link 3, Access to system memory supported: true
Link 3, P2P atomics supported: true
Link 3, System memory atomics supported: true
Link 3, SLI is supported: true
Link 3, Link is supported: true
Link 4, P2P is supported: true
Link 4, Access to system memory supported: true
Link 4, P2P atomics supported: true
Link 4, System memory atomics supported: true
Link 4, SLI is supported: true
Link 4, Link is supported: true
Link 5, P2P is supported: true
Link 5, Access to system memory supported: true
Link 5, P2P atomics supported: true
Link 5, System memory atomics supported: true
Link 5, SLI is supported: true
Link 5, Link is supported: true
Link 6, P2P is supported: true
Link 6, Access to system memory supported: true
Link 6, P2P atomics supported: true
Link 6, System memory atomics supported: true
Link 6, SLI is supported: true
Link 6, Link is supported: true
Link 7, P2P is supported: true
Link 7, Access to system memory supported: true
Link 7, P2P atomics supported: true
Link 7, System memory atomics supported: true
Link 7, SLI is supported: true
Link 7, Link is supported: true
Link 8, P2P is supported: true
Link 8, Access to system memory supported: true
Link 8, P2P atomics supported: true
Link 8, System memory atomics supported: true
Link 8, SLI is supported: true
Link 8, Link is supported: true
Link 9, P2P is supported: true
Link 9, Access to system memory supported: true
Link 9, P2P atomics supported: true
Link 9, System memory atomics supported: true
Link 9, SLI is supported: true
Link 9, Link is supported: true
Link 10, P2P is supported: true
Link 10, Access to system memory supported: true
Link 10, P2P atomics supported: true
Link 10, System memory atomics supported: true
Link 10, SLI is supported: true
Link 10, Link is supported: true
Link 11, P2P is supported: true
Link 11, Access to system memory supported: true
Link 11, P2P atomics supported: true
Link 11, System memory atomics supported: true
Link 11, SLI is supported: true
Link 11, Link is supported: true
Link 12, P2P is supported: true
Link 12, Access to system memory supported: true
Link 12, P2P atomics supported: true
Link 12, System memory atomics supported: true
Link 12, SLI is supported: true
Link 12, Link is supported: true
Link 13, P2P is supported: true
Link 13, Access to system memory supported: true
Link 13, P2P atomics supported: true
Link 13, System memory atomics supported: true
Link 13, SLI is supported: true
Link 13, Link is supported: true
Link 14, P2P is supported: true
Link 14, Access to system memory supported: true
Link 14, P2P atomics supported: true
Link 14, System memory atomics supported: true
Link 14, SLI is supported: true
Link 14, Link is supported: true
Link 15, P2P is supported: true
Link 15, Access to system memory supported: true
Link 15, P2P atomics supported: true
Link 15, System memory atomics supported: true
Link 15, SLI is supported: true
Link 15, Link is supported: true
Link 16, P2P is supported: true
Link 16, Access to system memory supported: true
Link 16, P2P atomics supported: true
Link 16, System memory atomics supported: true
Link 16, SLI is supported: true
Link 16, Link is supported: true
Link 17, P2P is supported: true
Link 17, Access to system memory supported: true
Link 17, P2P atomics supported: true
Link 17, System memory atomics supported: true
Link 17, SLI is supported: true
Link 17, Link is supported: true
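
With 18 links and six capabilities per link, a listing like the one above is tedious to check by eye. The sketch below, in Python, parses lines of the form `Link N, <capability>: true|false` and reports which links lack a given capability. The function names and the `false` entry in the sample text are hypothetical, added only to show a non-empty result.

```python
import re
from collections import defaultdict

CAP_LINE = re.compile(r"Link\s+(\d+),\s+(.+?):\s+(true|false)")


def parse_link_capabilities(text):
    """Return {link_index: {capability_name: bool}}."""
    caps = defaultdict(dict)
    for line in text.splitlines():
        match = CAP_LINE.search(line)
        if match:
            link, name, value = match.groups()
            caps[int(link)][name] = (value == "true")
    return dict(caps)


def links_missing(caps, capability):
    """Return the sorted link indices where the capability is absent or false."""
    return sorted(link for link, c in caps.items() if not c.get(capability, False))


# Link 0 lines are copied from the listing above; the Link 1 line is a
# hypothetical "false" entry, for illustration only.
sample_caps_text = """\
Link 0, P2P is supported: true
Link 0, P2P atomics supported: true
Link 0, Link is supported: true
Link 1, P2P is supported: false
"""

caps = parse_link_capabilities(sample_caps_text)
print(links_missing(caps, "P2P is supported"))
```

Applied to the full listing above, where every entry is `true`, `links_missing` would return an empty list for each capability name that appears in it.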
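There are many ways to monitor GPUs. Here I recommend nvitop: it is very convenient, installs with pip, and is the most pleasant to look at. A separate, similarly named monitor is nvtop (https://github.com/Syllo/nvtop).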
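
For scripted monitoring alongside an interactive tool, one option is to poll `nvidia-smi` in CSV mode and parse the result. The sketch below is in Python and assumes the query `--query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The function names, the dictionary keys, and the sample numbers are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int, or None when it is not numeric (e.g. "[N/A]")."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(csv_text):
    """Parse one CSV line per GPU into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (_to_int(f) for f in line.split(","))
        stats.append({"index": index, "util_pct": util,
                      "mem_used_mib": used, "mem_total_mib": total})
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


# Made-up sample output for two GPUs, so the parser can be tried without a GPU.
sample_stats = parse_gpu_stats("0, 35, 10240, 81559\n1, 0, 0, 81559\n")
print(sample_stats)
```

Calling `query_gpu_stats()` in a loop with a short sleep gives a basic text log of utilization and memory use over time.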