
Model Parallelism Part 1

How Does Parallelism Work in Model Training?

When training trillion-parameter models such as GPT-4 or even larger successors, it becomes necessary to combine many GPUs to spread the workload. Generally, there are three main approaches to parallelizing it:

Simple approach - Sharded data parallelism

1. Model Replication: Each GPU holds its own copy of the model.

2. Data Partitioning: Each GPU receives and processes a subset of the dataset.

Pros: This approach has the lowest communication requirements among parallelism methods.
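To see why the communication cost is so low, here is a minimal NumPy sketch (a toy simulation, not a framework implementation) of data parallelism on a tiny linear model: every "GPU" keeps a full copy of the weights, computes gradients on its own data shard, and the only communication is a single gradient-averaging step per iteration, which stands in for the all-reduce.

```python
import numpy as np

np.random.seed(0)
num_gpus = 4
W = np.random.randn(8, 1)            # full model copy, identical on every "GPU"
X = np.random.randn(64, 8)           # global batch
y = X @ np.random.randn(8, 1)        # synthetic targets

# Data partitioning: each GPU gets its own shard of the batch.
x_shards = np.split(X, num_gpus)
y_shards = np.split(y, num_gpus)

# Each GPU computes a local gradient on its shard (no communication needed here).
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    pred = xs @ W
    grad = 2 * xs.T @ (pred - ys) / len(xs)   # MSE gradient
    local_grads.append(grad)

# The only communication: one all-reduce that averages gradients across GPUs.
avg_grad = sum(local_grads) / num_gpus
W -= 0.01 * avg_grad                  # every replica applies the same update
```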

Cons: GPU memory becomes the bottleneck, because every GPU must hold the full model. For instance, GPT-4-level models may require up to 10 TB of GPU RAM for training.

Another example: DeepSeek-V2 has approximately 236 billion total parameters. In FP16 (2 bytes per parameter), the model weights alone require roughly 500 GB. Optimizer states can double this requirement, and the intermediate activations stored during the forward and backward passes add more on top, so total memory usage during training could approach ~1.5 TB.
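As a rough back-of-the-envelope check of these numbers, the sketch below estimates training memory from a parameter count. The byte counts (FP16 weights, optimizer state roughly doubling the weight footprint, and a fixed activation budget) are simplifying assumptions, not measured values.

```python
def estimate_training_memory_gb(num_params, bytes_per_param=2,
                                optimizer_multiplier=2.0,
                                activation_gb=500):
    """Very rough training-memory estimate (assumptions, not measurements).

    - weights:         num_params * bytes_per_param (FP16 here)
    - optimizer state: assumed to roughly double the weight footprint
    - activations:     a fixed budget; in reality it depends heavily on
                       batch size, sequence length, and checkpointing
    """
    weights_gb = num_params * bytes_per_param / 1e9
    optimizer_gb = weights_gb * (optimizer_multiplier - 1)
    return weights_gb + optimizer_gb + activation_gb

# DeepSeek-V2: ~236B parameters -> ~472 GB of FP16 weights,
# roughly ~1.4-1.5 TB once optimizer state and activations are included.
print(estimate_training_memory_gb(236e9))
```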

DeepSeek

Given that DeepSeek also offers small models, such as the DeepSeek-R1 distilled variants with around ~10 billion parameters (roughly 20–30 GB of storage), which are reported to perform on par with larger top-tier models like GPT-4, would it be appropriate to use this sharded data approach for further fine-tuning? What are the potential concerns?

Parallelizing the Tensors

Let's look at this example: Consider a neural network layer with a hidden dimension of 2048 being trained on 4 GPUs. Each GPU would handle computations for 512 of these dimensions. After processing, the GPUs synchronize their partial results through an all-reduce operation, effectively combining their outputs to match the result as if the computations were performed on a single GPU.
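Here is a minimal NumPy simulation of that example (treating each array slice as one GPU's shard rather than using real devices): the 2048-wide weight matrix of a single linear layer is split across 4 "GPUs", each computes a partial output from its 512 dimensions, and summing the partials plays the role of the all-reduce.

```python
import numpy as np

np.random.seed(0)
hidden, out_dim, num_gpus = 2048, 1024, 4
shard = hidden // num_gpus            # 512 dimensions per GPU

x = np.random.randn(16, hidden)       # a batch of activations
W = np.random.randn(hidden, out_dim)  # the full layer weight (for reference)

# Single-GPU reference result.
reference = x @ W

# Tensor parallelism: each GPU holds a 512-row slice of W and the matching
# slice of the input, and computes only a partial output.
partials = []
for rank in range(num_gpus):
    rows = slice(rank * shard, (rank + 1) * shard)
    partials.append(x[:, rows] @ W[rows, :])

# The all-reduce: summing the partial outputs reproduces the full result.
combined = sum(partials)
print(np.allclose(reference, combined))   # True
```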

Tensor Parallelism

In tensor parallelism, each layer's computations and model weights are partitioned across multiple GPUs. GPUs frequently exchange intermediate results through all-reduce operations during self-attention, feed-forward layers, and layer normalizations. As a result, all GPUs collaborate on every layer simultaneously, functioning together as a unified computational unit.

Cons: Requires very high interconnect bandwidth and extremely low latency between GPUs.

Pipeline Parallelism

Let's take a look at another approach: pipeline parallelism. In pipeline parallelism, each GPU is assigned a specific subset of the model's layers. During training, data flows sequentially through these GPUs: each one processes its assigned layers and then passes the output to the next GPU in the pipeline.

This approach significantly reduces memory usage, as each GPU only stores a portion of the model. While it introduces communication overhead when transferring data between GPUs, the volume is generally lower than in tensor parallelism.

Imagine you have a transformer model with a total of 48 layers. If you use pipeline parallelism across 4 GPUs, the layers could be divided equally, like:

GPU 0: Layers 1 to 12

GPU 1: Layers 13 to 24

GPU 2: Layers 25 to 36

GPU 3: Layers 37 to 48

Each GPU only holds the weights and activations for its subset of layers. Each GPU performs computations (like self-attention and feed-forward operations) only on the layers it stores.

Here GPU 0 computes the outputs of layers 1 through 12, GPU 1 computes layers 13 through 24, and so forth.

Without pipeline parallelism, each GPU must store the entire 48-layer model. With pipeline parallelism, each GPU holds only 12 layers, reducing the memory required per GPU roughly in proportion to the number of pipeline stages (here, about 4× less memory per GPU).
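The sketch below simulates that 48-layer / 4-GPU split in plain NumPy (a toy stand-in, not an actual multi-GPU pipeline): each "stage" materializes only its own 12 layers, and the activation tensor is handed from one stage to the next exactly once per stage boundary.

```python
import numpy as np

np.random.seed(0)
num_layers, num_stages, hidden = 48, 4, 256
layers_per_stage = num_layers // num_stages   # 12 layers per GPU

# Each stage only holds the weights for its own 12 layers.
stages = [
    [np.random.randn(hidden, hidden) * 0.02 for _ in range(layers_per_stage)]
    for _ in range(num_stages)
]

def run_stage(stage_weights, activations):
    """Run one pipeline stage: apply its block of layers to the activations."""
    for W in stage_weights:
        activations = np.maximum(activations @ W, 0.0)   # toy layer: linear + ReLU
    return activations

x = np.random.randn(8, hidden)        # a microbatch entering the pipeline
for rank, stage_weights in enumerate(stages):
    x = run_stage(stage_weights, x)   # this hand-off is the only inter-stage transfer
    print(f"stage {rank} (layers {rank * 12 + 1}-{(rank + 1) * 12}) done")
```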

Pipeline vs Tensor Parallelism

In tensor parallelism, GPUs must exchange intermediate results multiple times per layer—during operations like self-attention, layer normalization, and feed-forward networks. This frequent synchronization demands extremely high bandwidth and very low latency.

Pipeline parallelism, on the other hand, involves less frequent communication. Each GPU only sends activation outputs once, after completing its assigned block of layers. While these activations can be large, the overall communication overhead is lower compared to the repeated, fine-grained exchanges required in tensor parallelism.
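For a rough back-of-the-envelope comparison, the sketch below estimates the bytes moved per microbatch under each scheme. The tensor shapes, the "two all-reduces per layer" count, and the ring all-reduce cost formula are illustrative assumptions, not numbers from any particular model.

```python
def activation_bytes(batch, seq_len, hidden, bytes_per_elem=2):
    """Size of one activation tensor in bytes (FP16 assumed)."""
    return batch * seq_len * hidden * bytes_per_elem

# Illustrative shapes -- assumptions for the sake of the comparison.
b, s, h, layers = 1, 4096, 8192, 48
act = activation_bytes(b, s, h)

# Tensor parallelism across 4 GPUs: assume ~2 all-reduces per layer
# (after self-attention and after the feed-forward block). A ring
# all-reduce moves roughly 2 * (N - 1) * tensor_size bytes in total.
tp = 4
tp_total = layers * 2 * 2 * (tp - 1) * act

# Pipeline parallelism across 4 stages: one activation hand-off per
# stage boundary, i.e. 3 point-to-point sends per microbatch.
stages = 4
pp_total = (stages - 1) * act

print(f"tensor parallel : {tp_total / 1e9:6.1f} GB moved per microbatch")
print(f"pipeline        : {pp_total / 1e9:6.2f} GB moved per microbatch")
```

Even with generous assumptions in favor of tensor parallelism, the pipeline scheme moves orders of magnitude less data per microbatch, which is why it tolerates slower interconnects better.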

Next Post

We will explore additional real-world model training techniques, such as OpenAI's approach to training very large language models using hybrid parallelism strategies that combine tensor, pipeline, and data parallelism.

AI Infra System Design Private Course 📒

  • Would you like to master modern AI infrastructure from FAANG engineers with 15+ years of experience? Contact us at [email protected]