In ML at Scale: Data Parallelism, we learned how to run many copies of a model in parallel. But what do we do once a model gets so big that it can’t fit on one accelerator anymore? We start to split it across multiple machines.
In ML at Scale: Tensor Parallelism, we’ll learn how to shard a model through tensor parallelism. In this post, we’ll talk about pipeline parallelism: how to slice a model into sequential sections and then “pipeline” data through them.
Imagine no data parallelism; we have one batch that we’re sending through one model copy. But the model is so big that we have to split it sequentially across four machines in a pipeline-parallel fashion. So machine_0 has the first quarter of the model, machine_1 has the next quarter, and so on.
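To make the split concrete, here is a minimal sketch (in PyTorch, purely as an illustration; the toy layer sizes and the nn.Sequential model are assumptions, not a real setup) of carving a model into four consecutive stages, one per machine.

```python
import torch.nn as nn

# A toy 8-layer model; in practice it would be far too large for one device.
layers = [nn.Linear(1024, 1024) for _ in range(8)]

# Split the layers into four consecutive stages, one per machine.
# Here the "machines" are just stand-ins (e.g. cuda:0..cuda:3 or separate hosts).
num_stages = 4
per_stage = len(layers) // num_stages
stages = [
    nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage])
    for i in range(num_stages)
]
# stages[0] -> machine_0 (first quarter), stages[1] -> machine_1, and so on.
```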
How would this work?
First, all machines would need to perform the forward pass — we need the output of the last model chunk to be able to calculate a loss.
Once we have the loss, we need to calculate gradients. For this, each chunk of the model needs its own saved activations and the gradients flowing back from the stage that follows it in the pipeline. Therefore, gradients are calculated from the last chunk all the way back to the first chunk.
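Here is a minimal sketch of this naive schedule, reusing the `stages` list from the sketch above and keeping everything in one process for readability: the forward pass walks the stages front to back, and `loss.backward()` then sends gradients through them in reverse.

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 1024)        # one full batch
target = torch.randn(32, 1024)

# Forward pass: activations flow stage by stage (machine_0 -> machine_3).
# At any moment only one stage is doing useful work.
activation = x
for stage in stages:             # `stages` comes from the previous sketch
    activation = stage(activation)

loss = F.mse_loss(activation, target)

# Backward pass: autograd propagates gradients through the stages in
# reverse order (machine_3 -> machine_0), again one stage at a time.
loss.backward()
```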
Let’s visualize this with a diagram.
As shown in GPipe [1], here is another way of visualizing it:
Think about this: machine_1, machine_2, and machine_3 are completely idle while machine_0 executes its forward or backward pass. With four stages running strictly one after another, each machine does useful work for only about a quarter of the time. This is very bad! Given four machines, you want them to be performing meaningful operations as often as possible.
So how can we overlap work between the machines? In the current regime, we can’t: each time step depends on the result of the previous one. Instead, we have to change the setup by breaking our task down into smaller chunks.
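As a sketch of just that first step (reusing the assumed `x` and `stages` from the earlier sketches): the batch is split into micro-batches, and the pipelined schedule that interleaves them across machines is what comes next.

```python
# Split the single batch of 32 samples into four micro-batches of 8.
micro_batches = x.chunk(4, dim=0)

# Each micro-batch still visits every stage in order, but because the unit
# of work is smaller, machine_0 can start on micro-batch 1 while machine_1
# is still busy with micro-batch 0, and so on down the pipeline. (This loop
# only shows the splitting; the interleaved schedule is not implemented here.)
for mb in micro_batches:
    out = mb
    for stage in stages:
        out = stage(out)
```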