Last time, we talked about data parallelism, a strategy for splitting the training data across multiple machines when training large models. Today, we'll extend this idea to address how we can split the model itself across multiple machines. We need to start doing this when our models get too big: billions of parameters no longer fit on a single GPU.
Of course, you say, just split the model into two halves, place each half on its own machine, and treat the pair as one unit! Problem solved!
What you have invented is known as pipeline parallelism, and it comes with its own set of complications. We are not covering pipeline parallelism today. Instead, we're talking about tensor parallelism. Very generally and imprecisely, the idea of tensor parallelism is something like this:
Put another way, if you imagine a language model as a series of steps $s_1, \dots, s_n$, pipeline parallelism splits the model between steps: $s_1, \dots, s_{i-1}$ run on one machine and $s_i, \dots, s_n$ on another. Tensor parallelism, on the other hand, is about how we can split a single step $s_i$ across multiple machines.
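In symbols (a loose sketch using the composition-of-steps picture above, nothing more formal than that):

$$
\underbrace{s_{i-1} \circ \cdots \circ s_1}_{\text{machine 1}} \;\longrightarrow\; \underbrace{s_n \circ \cdots \circ s_i}_{\text{machine 2}} \quad \text{(pipeline parallelism)}
$$

$$
s_i = \text{combine}\big(s_i^{(1)}, s_i^{(2)}\big), \quad s_i^{(1)} \text{ on machine 1}, \; s_i^{(2)} \text{ on machine 2} \quad \text{(tensor parallelism)}
$$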
So let’s get into it!
Here we'll discuss how you can shard a matrix multiplication and parallelize it. Imagine a simple multiplication $Y = AB$ with the shapes shown below:
Our core insight here is that each column of $Y$ depends only on the corresponding column of $B$ (and on all of $A$), so the columns can be computed independently in parallel.
Thus we can create an equivalent multi-machine version of the matrix multiply that shards $B$ column-wise, computes the individual $y_i$ on separate machines, and gathers them into $Y$ at the end.
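Here's a minimal sketch of that column-parallel multiply, simulated in plain NumPy on a single process. The shard count, shapes, and variable names are all made up for illustration; on real hardware each shard would live on its own device and the final concatenation would be an all-gather.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shards = 4                      # number of simulated machines
A = rng.normal(size=(8, 16))      # replicated on every machine
B = rng.normal(size=(16, 32))     # will be sharded column-wise

# Shard B along its columns: machine i holds B_i of shape (16, 32 / n_shards).
B_shards = np.split(B, n_shards, axis=1)

# Each machine computes its slice of Y independently: Y_i = A @ B_i.
Y_shards = [A @ B_i for B_i in B_shards]

# Collect the slices (an all-gather in a real multi-device setup).
Y = np.concatenate(Y_shards, axis=1)

assert np.allclose(Y, A @ B)      # identical to the unsharded result
```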
Here I'll remind readers of a bitter lesson we learned last time: communication cost is the boogeyman of large model training. If we did this for every operation in a large language model, the tensor-parallel version would be far slower than the unparallelized one, purely because of all the communication overhead! Let's illustrate why by expanding this example.
Suppose we now have a computation similar to the one before, but with two sequential matrix multiplies, i.e. $Y = ABC$. If we follow the same trick and shard $B$ column-wise, we'll get screwed!
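To see the problem concretely, here's the same NumPy simulation extended to $Y = ABC$ (shapes and names again just illustrative, and this is only the naive way of handling the second multiply). With $B$ sharded column-wise, each machine holds only a slice of the columns of $AB$, but the second multiply needs all of them, so we are forced to gather the full intermediate before we can touch $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shards = 4
A = rng.normal(size=(8, 16))
B = rng.normal(size=(16, 32))     # sharded column-wise, as before
C = rng.normal(size=(32, 8))      # second multiply

B_shards = np.split(B, n_shards, axis=1)
AB_shards = [A @ B_i for B_i in B_shards]   # each machine's slice of A @ B

# Communication step: reassemble the full (8, 32) intermediate
# (an all-gather across machines in a real setup).
AB_full = np.concatenate(AB_shards, axis=1)

# Only now can we apply C. Paying this gather between every pair of
# layers is exactly the communication overhead the text warns about.
Y = AB_full @ C
assert np.allclose(Y, A @ B @ C)
```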