
Common Architecture for Large-Scale Model Training

Model Parallelism Part 2

In this article, we'll provide a high-level overview of the infrastructure commonly employed by leading players such as OpenAI, Meta, and Google for large-scale model training. This infrastructure leverages a hybrid parallelism approach, integrating data parallelism, tensor parallelism, and pipeline parallelism.

Key Challenges in Large-Scale Model Training

Before diving into the architecture, let's look at the key challenges that have to be addressed:

Challenge 1: Data Sharding

Training datasets can span petabytes. No single accelerator can ingest or preprocess this volume fast enough.

Challenge 2: Model Size

State-of-the-art (SOTA) models now often exceed hundreds of billions of parameters, requiring more memory than any single device can provide.

Challenge 3: Communication Overhead

Exchanging gradients and activations between devices can saturate network links and become a training bottleneck.

Challenge 4: Fault Tolerance

At scale, hardware failures become common; restarting from scratch is prohibitively expensive.

GPU Counts at a Glance

Small AI startups typically run on fewer than 100 GPUs for development and light training workloads.

Mid-sized AI companies often deploy on the order of a few hundred GPUs for their main model-training runs.

Meta planned to deploy 350,000 NVIDIA H100 GPUs by the end of 2024, with overall compute equivalent to nearly 600,000 H100s.

Google invented and operates its own Cloud TPUs rather than relying on GPUs. A single TPU v4 pod contains 4,096 TPU chips, and Google runs multiple such pods. Its next-generation “Ironwood” TPU v7x design scales up to 9,216 chips per pod, with architectures reaching up to 400,000 TPU chips across pods.

OpenAI has not officially disclosed the exact GPU counts it uses. Community estimates and analyses suggest that ChatGPT inference alone ran on around 30,000 NVIDIA GPUs, and that training may have used 25,000 NVIDIA A100 GPUs over roughly 90 days.

Anthropic has not published its training-cluster size. Community estimates put it on the order of a few thousand GPUs.

Overview of the Training Architecture

Top-Level Parallelization: Data Parallelism

At the highest level, large-scale training uses data parallelism. The training dataset is partitioned and distributed across multiple compute clusters (often called islands or pods), and each island maintains a full copy of the model.

Step 1: Data Partitioning. At the beginning of each training cycle, the dataset is divided into mini-batches, and each GPU is assigned a unique mini-batch.

Step 2: Local Computation. Each accelerator independently performs forward and backward passes on its assigned data subset, computing individual gradients for the model parameters.
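
To make Steps 1 and 2 concrete, here is a minimal sketch. The article doesn't prescribe a framework, so PyTorch with torch.distributed is assumed, launched with torchrun (one process per accelerator), with a stand-in Linear layer and random tensors in place of a real model and data loader:

```python
import torch
import torch.distributed as dist

# One process per accelerator, typically launched with `torchrun`.
dist.init_process_group(backend="nccl")            # assumes one GPU per process
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

torch.manual_seed(0)                               # identical initial weights on every rank;
model = torch.nn.Linear(4096, 4096).to(device)     # real systems broadcast weights from rank 0

# Step 1: carve the global batch into per-rank mini-batches (random stand-in data here).
global_batch = torch.randn(64 * world, 4096)
mini_batch = global_batch.chunk(world)[rank].to(device)

# Step 2: each rank runs its own forward and backward pass; gradients stay local for now.
loss = model(mini_batch).pow(2).mean()             # stand-in loss
loss.backward()                                    # per-parameter gradients are in p.grad
```

In practice a sharded data pipeline (for example, PyTorch's DistributedSampler) plays the role of the chunk call above.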

Gradient Aggregation: Reduce-Scatter Approach

To handle gradient aggregation efficiently and conserve memory:

Reduce Operation: Gradients from all accelerators in the pod are summed element-wise.

Scatter Operation: The aggregated gradient is divided into distinct segments, with each GPU receiving a unique segment. While no single GPU holds the entire gradient, the complete gradient is effectively distributed across all GPUs within the pod.
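
Assuming the same PyTorch/torch.distributed setup as above, the reduce-scatter step could look like the following sketch (tensor sizes are arbitrary): the flattened local gradient is summed across ranks and scattered in a single collective, so each rank ends up owning one segment of the result.

```python
import torch
import torch.distributed as dist

# Sketch of intra-pod gradient aggregation via reduce-scatter (sizes are illustrative).
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

flat_grad = torch.randn(1024 * world, device=device)   # stand-in for this rank's flattened gradient
my_segment = torch.empty(1024, device=device)          # this rank's 1/world slice of the result

# Sum the gradients from all ranks and hand each rank one distinct segment of the sum.
dist.reduce_scatter_tensor(my_segment, flat_grad, op=dist.ReduceOp.SUM)
# No single GPU now holds the whole reduced gradient, but together the segments cover it.
```

Compared with a plain all-reduce, each GPU keeps only 1/N of the reduced gradient, which is exactly the memory saving mentioned above.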

Cross-Pod Gradient Reduction

Once gradients are locally aggregated and partitioned:

Cross-Pod All-Reduce: Each pod performs an "all-reduce" operation across pods using the assigned gradient segments. Each participating host rank contributes to this global reduction.

Result Distribution: Following the all-reduce operation, each host rank receives a reduced gradient segment representing aggregated contributions from all pods. This operation inherently distributes the correct, reduced gradient segments back to each host rank.
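
Putting the two levels together, here is a hedged sketch of the hierarchy in the same PyTorch/torch.distributed style; the pod size, group layout, and tensor sizes are illustrative assumptions rather than details of any specific production stack.

```python
import torch
import torch.distributed as dist

# Two-level sketch: reduce-scatter inside each pod, then all-reduce each segment across pods.
POD_SIZE = 8                                              # illustrative: 8 accelerators per pod
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()      # world is assumed a multiple of POD_SIZE
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
pod_id, local = rank // POD_SIZE, rank % POD_SIZE

# Every process must create every subgroup, in the same order, then keep the ones it belongs to.
intra_pod_groups = [dist.new_group(list(range(p * POD_SIZE, (p + 1) * POD_SIZE)))
                    for p in range(world // POD_SIZE)]
cross_pod_groups = [dist.new_group(list(range(l, world, POD_SIZE))) for l in range(POD_SIZE)]
intra_pod, cross_pod = intra_pod_groups[pod_id], cross_pod_groups[local]

flat_grad = torch.randn(1024 * POD_SIZE, device=device)   # stand-in flattened local gradient
segment = torch.empty(1024, device=device)

# 1) Inside the pod: each rank ends up owning one reduced segment of the pod's gradient sum.
dist.reduce_scatter_tensor(segment, flat_grad, op=dist.ReduceOp.SUM, group=intra_pod)
# 2) Across pods: all-reduce that segment with the peers in other pods holding the same index.
dist.all_reduce(segment, op=dist.ReduceOp.SUM, group=cross_pod)
# `segment` now holds this rank's slice of the gradient summed over every GPU in every pod.
```

If a replica later needs the full reduced gradient, an intra-pod all-gather over these segments reconstructs it.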

Next Post

We will dive deep into more collective operations and continue exploring challenges and solutions for training and serving large language models.

AI Infra System Design Private Course 📒

  • Would you like to master modern AI infrastructure from FAANG engineers with 15+ years of experience? Contact us at [email protected]