ScaleOps is a repository for collecting and sharing findings from exploring and optimizing the scalability and performance of AI workloads.
I warmly welcome any feedback or discussion, especially if you spot potential oversights in my reasoning or experiments.
I hope this repository proves a useful resource for anyone learning about, or interested in, the performance and scalability of AI workloads.
Networking
Networking Part 1. This post examines the impact of GPU networking on transformer model training performance with Distributed Data Parallel (DDP), comparing high-bandwidth intra-node NVLink with slower inter-node InfiniBand.
Networking Part 2. This post builds on the earlier experiments by examining how distributing 4 GPUs across 1, 2, and 4 nodes affects transformer model training, with a focus on network topology and NIC sharing.
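For readers unfamiliar with DDP, a minimal sketch of the training pattern both posts benchmark is shown below. It uses the CPU `gloo` backend in a single process purely so it runs anywhere; the actual experiments use NCCL over NVLink or InfiniBand, and the model here is a stand-in, not the transformer from the posts.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings; with torchrun these are set automatically.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo backend so this sketch runs on CPU; the posts use NCCL on GPUs.
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP replicates the model per rank and all-reduces gradients each backward pass.
# The interconnect (NVLink vs. InfiniBand) determines how fast that all-reduce is.
model = DDP(torch.nn.Linear(8, 4))

out = model(torch.randn(2, 8))
out.sum().backward()  # gradient all-reduce happens here

dist.destroy_process_group()
```

In a real multi-GPU run, this script would be launched once per GPU (e.g. via `torchrun`), and the per-step gradient synchronization is the traffic whose cost the two posts measure.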