In ML at Scale: Data Parallelism, we learned about how to run many copies of models in parallel. But what do we do once models get so big that they can't fit on one accelerator anymore? We start to split them across multiple machines.
In ML at Scale: Tensor Parallelism, we'll learn how to shard a model through tensor parallelism. In this post, we'll talk about pipeline parallelism: how to slice a model into several sections and then "pipeline" data through them.
Imagine no data parallelism; we have one batch that we're sending through one model copy. But the model is so big that we have to split it sequentially across four machines in a pipeline-parallel fashion. So machine_0 has the first quarter of the model, machine_1 has the next quarter, and so on.
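To make the split concrete, here is a rough sketch (not code from this post) of partitioning a sequential stack of layers into four contiguous stages. The layer list and the `split_into_stages` helper are made up for illustration; real frameworks ship their own partitioning utilities.

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy "model" of 16 identical linear layers, kept as plain weight matrices.
layers = [rng.normal(scale=0.1, size=(64, 64)) for _ in range(16)]

def split_into_stages(layers, num_stages):
    """Partition the layer list into contiguous, equally sized chunks."""
    per_stage = len(layers) // num_stages
    return [layers[i * per_stage:(i + 1) * per_stage] for i in range(num_stages)]

stages = split_into_stages(layers, num_stages=4)
# stages[0] would live on machine_0, stages[1] on machine_1, and so on.
```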
How would this work?
First, all machines would need to perform the forward pass: we need the output of the last model chunk to be able to calculate a loss.
Once we have the loss, we need to calculate gradients. For this, each chunk of the model needs its own activations and the gradients passed back from the stage after it. Therefore, gradients will be calculated from the last chunk all the way back to the first chunk.
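As a toy illustration of these two sweeps, here is a single-process simulation, a sketch I'm assuming rather than the post's actual code, that treats each stage as one linear layer: the forward pass runs stage by stage while caching each stage's input, and the backward pass then walks the stages in reverse, with each stage using its cached activations plus the gradient handed back by the stage after it.

```python
import numpy as np

rng = np.random.default_rng(0)
stages = [rng.normal(scale=0.1, size=(64, 64)) for _ in range(4)]  # one weight per machine

x = rng.normal(size=(32, 64))   # one full batch entering machine_0
cached_inputs = []              # each stage keeps its own activations for backward

# Forward sweep: machine_0 -> machine_1 -> machine_2 -> machine_3.
h = x
for W in stages:
    cached_inputs.append(h)
    h = h @ W

loss_grad = 2 * h               # gradient of a dummy sum-of-squares loss w.r.t. the output

# Backward sweep: machine_3 -> machine_2 -> machine_1 -> machine_0.
grad_out = loss_grad
weight_grads = [None] * len(stages)
for i in reversed(range(len(stages))):
    weight_grads[i] = cached_inputs[i].T @ grad_out  # dL/dW for this stage
    grad_out = grad_out @ stages[i].T                # gradient sent to the stage before it
```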
Let's visualize this with a diagram.
As shown in GPipe [1], here is another way of visualizing it.
Think about this: machine_1, machine_2, and machine_3 are completely idle while machine_0 executes its forward or backward pass. With one batch and four stages, only one machine is working at any moment, so each machine is busy for roughly a quarter of the total time. This is very bad! Given four machines, you want them to be performing meaningful operations as often as possible.
So how can we overlap work between the machines? In the current regime, we can't: each time step depends on the output of the previous one. Instead, we have to change the setup by breaking our batch down into smaller chunks.
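Here is a minimal sketch of the resulting forward schedule, under my own assumptions of 4 stages, 8 micro-batches, and one time step per forward chunk. It only prints which micro-batch each stage works on at each tick, to show how splitting the batch lets the machines overlap.

```python
NUM_STAGES, NUM_MICROBATCHES = 4, 8

# Stage s processes micro-batch m at time step t = m + s, so the work forms
# a staircase: once the pipeline fills, all four stages are busy at once.
for t in range(NUM_STAGES + NUM_MICROBATCHES - 1):
    row = []
    for stage in range(NUM_STAGES):
        mb = t - stage                        # micro-batch this stage sees at time t
        busy = 0 <= mb < NUM_MICROBATCHES
        row.append(f"mb{mb}" if busy else "idle")
    print(f"t={t}: " + "  ".join(row))
```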