In ML at Scale: Data Parallelism, we learned how to run many copies of a model in parallel. But what do we do once a model gets so big that it can’t fit on one accelerator anymore? We start to split it across multiple machines.
In ML at Scale: Tensor Parallelism, we’ll learn how to shard a model through tensor parallelism. In this post, we’ll talk about pipeline parallelism: how to slice a model into sequential sections and then “pipeline” data through them.
Imagine no data parallelism; we have one batch that we’re sending through one model copy. But the model is so big that we have to split it sequentially across four machines in a pipeline-parallel fashion. So machine_0 has the first quarter of the model, machine_1 has the next quarter, and so on.
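To make the split concrete, here is a minimal sketch (in PyTorch, purely as an illustration; the toy layer sizes and the nn.Sequential model are assumptions, not a real setup) of carving a model into four consecutive stages, one per machine.

```python
import torch.nn as nn

# A toy 8-layer model; in practice it would be far too large for one device.
layers = [nn.Linear(1024, 1024) for _ in range(8)]

# Split the layers into four consecutive stages, one per machine.
# Here the "machines" are just stand-ins (e.g. cuda:0..cuda:3 or separate hosts).
num_stages = 4
per_stage = len(layers) // num_stages
stages = [
    nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage])
    for i in range(num_stages)
]
# stages[0] -> machine_0 (first quarter), stages[1] -> machine_1, and so on.
```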
How would this work?
First, all machines would need to perform the forward pass — we need the output of the last model chunk to be able to calculate a loss.
Once we have the loss, we need to calculate gradients. For this, each chunk of the model needs its own saved activations and the gradients flowing back from the stage that follows it in the pipeline. Therefore, gradients are calculated from the last chunk all the way back to the first chunk.
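Here is a minimal sketch of this naive schedule, reusing the `stages` list from the sketch above and keeping everything in one process for readability: the forward pass walks the stages front to back, and `loss.backward()` then sends gradients through them in reverse.

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 1024)        # one full batch
target = torch.randn(32, 1024)

# Forward pass: activations flow stage by stage (machine_0 -> machine_3).
# At any moment only one stage is doing useful work.
activation = x
for stage in stages:             # `stages` comes from the previous sketch
    activation = stage(activation)

loss = F.mse_loss(activation, target)

# Backward pass: autograd propagates gradients through the stages in
# reverse order (machine_3 -> machine_0), again one stage at a time.
loss.backward()
```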
Let’s visualize this with a diagram.
As shown in GPipe [1], here is another way of visualizing it:
Think about this: machine_1, machine_2, and machine_3 are completely idle while machine_0 executes its forward or backward pass. With four stages running strictly one after another, each machine does useful work for only about a quarter of the time. This is very bad! Given four machines, you want them to be performing meaningful operations as often as possible.
So how can we overlap work between the machines? In the current regime, we can’t: each time step depends on the result of the previous one. Instead, we have to change the setup by breaking our task down into smaller chunks.
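As a sketch of just that first step (reusing the assumed `x` and `stages` from the earlier sketches): the batch is split into micro-batches, and the pipelined schedule that interleaves them across machines is what comes next.

```python
# Split the single batch of 32 samples into four micro-batches of 8.
micro_batches = x.chunk(4, dim=0)

# Each micro-batch still visits every stage in order, but because the unit
# of work is smaller, machine_0 can start on micro-batch 1 while machine_1
# is still busy with micro-batch 0, and so on down the pipeline. (This loop
# only shows the splitting; the interleaved schedule is not implemented here.)
for mb in micro_batches:
    out = mb
    for stage in stages:
        out = stage(out)
```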