The Impact of GPU Networking (Part 1)

This post examines the impact of GPU networking on transformer model training performance using Distributed Data Parallel (DDP), comparing high-speed intra-node NVLink with slower inter-node InfiniBand.

The question of how much GPU interconnects influence training performance in large transformer-based models recently sparked a lively discussion with a colleague. Specifically, we debated whether, for simpler parallelism approaches—namely distributed data parallelism (DDP)—network bandwidth and latency still affect performance as strongly as they do for more communication-heavy strategies, such as model parallelism.

In data parallelism, each GPU independently processes a slice of the overall batch, requiring gradients to be synchronized at each training step. Because this synchronization involves only the transfer of gradients—rather than splitting the entire model across devices, as in model parallelism—it is tempting to assume that the interconnect between GPU nodes might not be a major bottleneck.

To gain empirical insights, I conducted a series of experiments using PyTorch Lightning and DDP. In this post, I document my process, present the experimental results, and reflect on what they reveal about GPU interconnects in multi-GPU training. While this is not a formal research paper, I hope it serves as a helpful exploration of how networking can affect large-scale transformer training.

Initial Hypotheses

My prior experience with both training transformer-based models and running large-scale HPC simulations made me skeptical of these hypotheses. However, I wanted to test them rigorously rather than rely on intuition or anecdotal evidence.

Prior Knowledge

In previous work published on Zenodo, I evaluated different HPC configurations for training GPT-2 and explored various parallelism strategies. Those findings hinted that:

I set out to expand on these observations here by running new experiments in a more controlled setting.

What is GPT-2?

GPT-2 is a transformer-based language model designed for text generation. Its architecture leverages self-attention mechanisms, allowing it to effectively model contextual relationships in sequences of text. In my experiments, I worked with two variants: the base GPT-2 model (~85M parameters) and GPT-2 Large (~708M parameters).

These models were trained in BF16 precision, with a batch size of 16, on the Shakespeare dataset. The training code relies on lightning-GPT, which wraps the minGPT implementation in PyTorch Lightning to simplify multi-GPU experiments.

Distributed Data Parallelism (DDP)

PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research, designed to abstract most boilerplate code and provide a high-level interface for training models. lightning-GPT enables training on multiple GPUs and nodes, supporting different parallelism strategies; for this work, I focus only on DDP.
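As a rough sketch (not the exact lightning-GPT entry point), a DDP run across GPUs and nodes can be configured through the Lightning Trainer. The `train_ddp` helper below is hypothetical; `model` and `train_loader` stand in for the lightning-GPT module and the Shakespeare DataLoader:

```python
import lightning as L  # in older releases: import pytorch_lightning as pl

def train_ddp(model: L.LightningModule, train_loader, num_nodes: int, gpus_per_node: int):
    """Minimal sketch of a DDP run; `model` and `train_loader` stand in for
    the lightning-GPT module and the Shakespeare DataLoader."""
    trainer = L.Trainer(
        accelerator="gpu",
        devices=gpus_per_node,   # GPUs used on each node
        num_nodes=num_nodes,     # nodes connected via InfiniBand on Baskerville
        strategy="ddp",          # Distributed Data Parallel
        precision="bf16-mixed",  # BF16 training, as in the experiments
        max_epochs=2,            # epoch 2 is the one being timed
    )
    trainer.fit(model, train_loader)
```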

When using DDP, gradient synchronization across GPUs happens at every training step. The global batch is split into smaller per-GPU portions (mini-batches), and each GPU processes its own mini-batch in parallel. After computing gradients locally on its slice of the data, each GPU takes part in an all-reduce operation that averages the gradients, so every replica applies the same update and the model stays consistent across devices.
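DDP performs this synchronization automatically, bucketing gradients and overlapping communication with the backward pass; conceptually, though, the step boils down to something like this torch.distributed sketch:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks -- conceptually what DDP does after
    every backward pass (the real implementation buckets and overlaps this)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across every GPU in the job...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so each rank ends up with the average.
            param.grad /= world_size
```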

Experimental Setup

The experiments were conducted on Baskerville, a GPU cluster at the University of Birmingham. Each node contains four NVIDIA A100 GPUs (40GB/80GB) connected via NVLink 3.0, and nodes are interconnected with InfiniBand HDR (see Appendix A for system specifications).

Measurement Approach

I recorded the time taken to complete the second epoch of training rather than the first, which keeps one-off start-up costs out of the measurement.

The second epoch timing captures the pure training speed once the framework has loaded the data and established the initial model states.
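One simple way to capture this (a sketch, not necessarily the exact mechanism used for the numbers below) is a small Lightning callback that records the wall-clock time of each epoch:

```python
import time
import lightning as L

class EpochTimer(L.Callback):
    """Records wall-clock time per training epoch. Epochs are 0-indexed,
    so epoch_times[1] corresponds to the second epoch reported below."""

    def __init__(self):
        self.epoch_times = {}
        self._start = None

    def on_train_epoch_start(self, trainer, pl_module):
        self._start = time.perf_counter()

    def on_train_epoch_end(self, trainer, pl_module):
        self.epoch_times[trainer.current_epoch] = time.perf_counter() - self._start
```

The callback is then passed to the Trainer via `callbacks=[EpochTimer()]`.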

Experiments

GPT-2 Experiments

I first tested the smaller GPT-2 model (~85M parameters) in three node/GPU configurations: a single GPU, two GPUs on one node (communicating over NVLink), and one GPU on each of two nodes (communicating over InfiniBand).

Results:

| Experiment | Model | Parallelism | Nodes | GPUs per Node | Epoch 2 time (s) | Peak memory (MiB) |
|------------|-------|-------------|-------|---------------|------------------|-------------------|
| e-15-1 | gpt2 | DDP | 1 | 1 | 157.60 | 3,687.91 |
| e-15-2 | gpt2 | DDP | 1 | 2 | 62.96 | 3,693.55 |
| e-15-3 | gpt2 | DDP | 2 | 1 | 62.55 | 3,696.01 |

Observations:

Adding a second GPU on the same node cuts the epoch time from 157.60 s to 62.96 s. More importantly, the two-node configuration (62.55 s) performs essentially identically to the single-node, two-GPU configuration, suggesting that for a model of this size the inter-node interconnect is not a bottleneck.

GPT-2 Large Experiments

Next, I repeated the same three configurations with GPT-2 Large (~708M parameters).

Results:

| Experiment | Model | Parallelism | Nodes | GPUs per Node | Epoch 2 time (s) | Peak memory (MiB) |
|------------|-------|-------------|-------|---------------|------------------|-------------------|
| e-15-4-1 | gpt2-large | DDP | 1 | 1 | 498.37 | 24,258.05 |
| e-15-5-3 | gpt2-large | DDP | 1 | 2 | 266.83 | 24,258.38 |
| e-15-6-2 | gpt2-large | DDP | 2 | 1 | 406.12 | 24,297.53 |

Observations:

With a single GPU, epoch 2 takes 498.37 s. Two GPUs on one node (NVLink) bring this down to 266.83 s, close to an ideal halving. Two GPUs split across two nodes (InfiniBand), however, only reach 406.12 s, about 139 s slower than the single-node run even though the computation per GPU is identical. The difference points to the cost of synchronizing roughly 1.4 GB of gradients over the slower inter-node link at every step.

Analysis

For GPT-2 Large, the number of steps per epoch depends on the number of GPUs: with DDP each GPU iterates over its own shard of the dataset, so the two-GPU runs execute half as many synchronization steps per epoch as the single-GPU run.

The theoretical peak transfer speeds are: NVLink 3.0 offers up to 600 GB/s of total bidirectional bandwidth per A100 (about 300 GB/s per direction), while an InfiniBand HDR link provides 200 Gb/s, i.e. roughly 25 GB/s.

Model Size Calculation: GPT-2 Large has ~708M parameters; in BF16 (2 bytes per value) this corresponds to roughly 1.4 GB of gradient data that must be all-reduced at every synchronization step.

Total Data Transfer per Epoch: approximately the per-step gradient size multiplied by the number of synchronization steps in the epoch (and by the all-reduce traffic factor of about 2(N-1)/N per rank, which is exactly 1 for two GPUs).

Communication Time Estimates: dividing the per-epoch transfer volume by each interconnect's peak bandwidth gives a lower bound on the time spent synchronizing gradients; at ~25 GB/s over InfiniBand HDR this is roughly an order of magnitude more than over NVLink.

Comparing to Measured Times: these lower bounds account for part, but not all, of the measured difference between the single-node and two-node runs.

Note: The discrepancies are likely due to factors like latency, network overhead, and resource contention.
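To make the arithmetic concrete, here is a small sketch of the estimate in Python. The parameter count and peak bandwidths come from the text above; STEPS_PER_EPOCH is a placeholder rather than the actual step count from the runs, which depends on the Shakespeare dataset size and the per-GPU batch size:

```python
# Back-of-the-envelope estimate of per-epoch gradient-synchronization time.
PARAMS = 708e6            # GPT-2 Large parameter count (~708M)
BYTES_PER_PARAM = 2       # BF16
WORLD_SIZE = 2            # both two-GPU configurations use two ranks
STEPS_PER_EPOCH = 1000    # placeholder -- not the actual count from the runs

grad_bytes = PARAMS * BYTES_PER_PARAM                         # ~1.4 GB per sync
# A ring all-reduce moves about 2*(N-1)/N times the buffer size per rank,
# which is exactly 1x for N = 2.
traffic_per_step = 2 * (WORLD_SIZE - 1) / WORLD_SIZE * grad_bytes

def comm_time_per_epoch(bandwidth_gb_s: float) -> float:
    """Lower-bound communication time per epoch at the given bandwidth (GB/s)."""
    return STEPS_PER_EPOCH * traffic_per_step / (bandwidth_gb_s * 1e9)

# Theoretical peaks quoted above: ~300 GB/s per direction for NVLink 3.0,
# ~25 GB/s for an InfiniBand HDR link.
print(f"NVLink 3.0     : {comm_time_per_epoch(300):7.1f} s per epoch")
print(f"InfiniBand HDR : {comm_time_per_epoch(25):7.1f} s per epoch")
```

Substituting the measured speeds from the benchmark in the next section (159.29 and 21.67 GB/s) in place of the theoretical peaks yields the recalculated figures discussed later.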

Communication Benchmark Experiments

To isolate the communication overhead, I wrote a small benchmark script measuring the throughput of a distributed all-reduce on 700M parameters in BF16 (~1.32 GB). The script relies on PyTorch's collective communication backends (NCCL or Gloo) and reports the effective transfer speed achieved.
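The script itself is not reproduced here, but its core is roughly the following sketch, launched with torchrun (one process per GPU). The size/time metric is a simplification and may not match the exact figure the original script reports, and older Gloo builds may require falling back to float32:

```python
import os
import time
import torch
import torch.distributed as dist

def benchmark_allreduce(num_params: int = 700_000_000, iters: int = 10) -> None:
    """Time an all-reduce over ~700M BF16 values (~1.4 GB) and report GB/s."""
    backend = os.environ.get("BACKEND", "nccl")  # "nccl" or "gloo"
    dist.init_process_group(backend=backend)

    if backend == "nccl":
        device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", 0)))
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Note: older Gloo builds may lack BF16 support; use float32 if needed.
    tensor = torch.ones(num_params, dtype=torch.bfloat16, device=device)
    size_gb = tensor.element_size() * tensor.nelement() / 1e9

    dist.all_reduce(tensor)  # warm-up, so setup costs are excluded from timing
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        print(f"{backend}: {size_gb / elapsed:.2f} GB/s effective, {elapsed:.3f} s per all-reduce")

    dist.destroy_process_group()

if __name__ == "__main__":
    benchmark_allreduce()
```

For the intra-node case this can be launched with, for example, `torchrun --nproc_per_node=2 benchmark_allreduce.py`.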

Results:

| Experiment | Nodes | GPUs per Node | Interconnect | Backend | Measured transfer speed (GB/s) | Time (s) |
|------------|-------|---------------|--------------|---------|--------------------------------|----------|
| e-16-1 | 1 | 2 | NVLink 3.0 | NCCL | 159.29 | 0.02 |
| e-16-1 | 1 | 2 | PCIe 4.0 | NCCL | 26.90 | 0.10 |
| e-16-2 | 2 | 1 | InfiniBand (HDR) | NCCL | 21.67 | 0.12 |
| e-16-2 | 2 | 1 | Ethernet (25GbE) | NCCL | 3.01 | 0.88 |
| e-16-2 | 2 | 1 | InfiniBand (HDR) | Gloo | 1.60 | 1.70 |
| e-16-2 | 2 | 1 | Ethernet (25GbE) | Gloo | 1.60 | 1.65 |

Observations:

Intra-node NVLink 3.0 delivers roughly 7x the measured throughput of inter-node InfiniBand HDR (159.29 vs 21.67 GB/s). With NCCL, InfiniBand comes close to its theoretical limit of ~25 GB/s, whereas 25GbE Ethernet manages only ~3 GB/s; routing an intra-node transfer over PCIe 4.0 instead of NVLink drops throughput to 26.90 GB/s, comparable to inter-node InfiniBand. The Gloo backend cannot exploit the fast interconnects at all, plateauing at about 1.6 GB/s regardless of the underlying network.

Recalculating Training Time Differences:

For e-15-5-3 (NVLink 3.0, 159.29 GB/s measured), the additional time due to gradient communication is small compared with the 266.83 s epoch time.

For e-15-6-2 (InfiniBand HDR, 21.67 GB/s measured), the same gradient volume takes roughly seven times longer to synchronize, which explains much of the ~139 s gap between the two runs.

Again, discrepancies are due to other factors affecting training time.

Conclusions

These experiments demonstrate that as transformer models grow larger, the significance of GPU interconnects becomes harder to ignore—even in Distributed Data Parallel setups. For smaller models like GPT-2 (~85M parameters), interconnect speed has minimal effect, in line with initial hypotheses H1 and H2. However, GPT-2 Large (~708M parameters) reveals a striking gap between intra-node NVLink performance and inter-node InfiniBand.

In short: for small models, DDP is largely insensitive to the interconnect, but as model size grows, the network between nodes starts to dominate scaling behaviour even under this simple parallelism strategy.

Next Steps

In Part 2, I will extend these experiments to 4 GPUs spread across 1, 2, and 4 nodes. This will shed further light on how network topologies (e.g., ring versus tree) influence performance when the number of GPUs per node varies.

Additional Information

Appendix A: Baskerville HPC System Specifications

The experiments were conducted on the Baskerville HPC system, which has the following specifications:

Compute Nodes

There are 57 Lenovo SD650-N V2 liquid-cooled compute trays, each equipped with four NVIDIA A100 GPUs.

The GPUs on 11 nodes have 80GB of memory, while those on the remaining 46 nodes have 40GB. Within a node, the GPUs are interconnected using NVIDIA NVLink.

Network

Baskerville uses three networks; the one relevant to these experiments is the InfiniBand HDR fabric connecting the compute nodes.

Storage

The system is equipped with Lenovo DSS-G storage systems running IBM® Spectrum Scale™.