Last time, we talked about data parallelism, a strategy for splitting the training data across multiple machines when training large models. Today, we'll extend this idea to address how we can split the model itself across multiple machines. We need to start doing this when our models get too big: billions of parameters no longer fit on a single GPU.
Of course, you say, just split the model into two halves, place each half on its own machine, and treat the pair as one unit! Problem solved!
What you have invented is known as pipeline parallelism, and it comes with its own set of complications. We are not covering pipeline parallelism today. Instead, we're talking about tensor parallelism. Very generally and imprecisely, the idea of tensor parallelism is something like this:
Put another way, if you imagine a language model as a series of steps $s_1, \dots, s_n$, pipeline parallelism splits the model between steps: $s_1, \dots, s_{i-1}$ run on one machine and $s_i, \dots, s_n$ on another. Tensor parallelism, on the other hand, is about how we can split a single step $s_i$ across multiple machines.
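In symbols (a loose sketch using the composition-of-steps picture above, nothing more formal than that):

$$
\underbrace{s_{i-1} \circ \cdots \circ s_1}_{\text{machine 1}} \;\longrightarrow\; \underbrace{s_n \circ \cdots \circ s_i}_{\text{machine 2}} \quad \text{(pipeline parallelism)}
$$

$$
s_i = \text{combine}\big(s_i^{(1)}, s_i^{(2)}\big), \quad s_i^{(1)} \text{ on machine 1}, \; s_i^{(2)} \text{ on machine 2} \quad \text{(tensor parallelism)}
$$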
So let’s get into it!
Here we'll discuss how you can shard a matrix multiplication and parallelize it. Imagine a simple multiplication $Y = AB$ with the shapes shown below:
Our core insight here is that each column of $Y$ depends only on the corresponding column of $B$ (and on all of $A$), so the columns can be computed independently in parallel.
Thus we can create an equivalent multi-machine version of the matrix multiply that shards $B$ column-wise, computes the individual $y_i$ on separate machines, and gathers them into $Y$ at the end.
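Here's a minimal sketch of that column-parallel multiply, simulated in plain NumPy on a single process. The shard count, shapes, and variable names are all made up for illustration; on real hardware each shard would live on its own device and the final concatenation would be an all-gather.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shards = 4                      # number of simulated machines
A = rng.normal(size=(8, 16))      # replicated on every machine
B = rng.normal(size=(16, 32))     # will be sharded column-wise

# Shard B along its columns: machine i holds B_i of shape (16, 32 / n_shards).
B_shards = np.split(B, n_shards, axis=1)

# Each machine computes its slice of Y independently: Y_i = A @ B_i.
Y_shards = [A @ B_i for B_i in B_shards]

# Collect the slices (an all-gather in a real multi-device setup).
Y = np.concatenate(Y_shards, axis=1)

assert np.allclose(Y, A @ B)      # identical to the unsharded result
```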
Here I'll remind readers of a bitter lesson we learned last time: communication cost is the boogeyman of large model training. If we did this for every operation in a large language model, the tensor-parallel version would be far slower than the unparallelized one, purely because of all the communication overhead! Let's illustrate why by expanding this example.
Suppose we now have a computation similar to the one before, but with two sequential matrix multiplies, i.e. $Y = ABC$. If we follow the same trick and shard $B$ column-wise, we'll get screwed!
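To see the problem concretely, here's the same NumPy simulation extended to $Y = ABC$ (shapes and names again just illustrative, and this is only the naive way of handling the second multiply). With $B$ sharded column-wise, each machine holds only a slice of the columns of $AB$, but the second multiply needs all of them, so we are forced to gather the full intermediate before we can touch $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shards = 4
A = rng.normal(size=(8, 16))
B = rng.normal(size=(16, 32))     # sharded column-wise, as before
C = rng.normal(size=(32, 8))      # second multiply

B_shards = np.split(B, n_shards, axis=1)
AB_shards = [A @ B_i for B_i in B_shards]   # each machine's slice of A @ B

# Communication step: reassemble the full (8, 32) intermediate
# (an all-gather across machines in a real setup).
AB_full = np.concatenate(AB_shards, axis=1)

# Only now can we apply C. Paying this gather between every pair of
# layers is exactly the communication overhead the text warns about.
Y = AB_full @ C
assert np.allclose(Y, A @ B @ C)
```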