📝 @Arushi Somani May 19, 2024 10:35 PM (PDT)

<aside> 💡 Live-updating repo: https://github.com/somaniarushi/data-parallel

</aside>

… to really understand the bottlenecks and flaws. I’ve previously done a logical breakdown of the concept in ML at Scale: Data Parallelism. However, understanding how something works does not grant the ability to implement it.

In this post, we will attempt to implement data parallelism across 4 GPUs, from scratch. This is not novel research, but it will force us to understand the systems we rely on more deeply.

Conceptual Refresher

The core idea behind data parallelism (sketched in code after this list) is to:

  1. Make replicas of your model
  2. Train them on different (but similar) data, calculate loss and gradients for each
  3. Average the gradients
  4. Update the model based on the averaged gradients
  5. Repeat.
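
Here is a minimal, single-process sketch of those five steps in PyTorch. This is roughly what V0 below does, but the model, data, and hyperparameters here are placeholders rather than what the repo actually uses:

```python
import copy
import torch
import torch.nn as nn

n_replicas = 4

# (1) Make replicas of the model, all starting from the same weights.
base_model = nn.Linear(2, 2)
replicas = [copy.deepcopy(base_model) for _ in range(n_replicas)]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in replicas]

# (2) Give each replica its own shard of the batch and compute its loss/gradients.
full_x, full_y = torch.randn(16, 2), torch.randn(16, 2)
shards = list(zip(full_x.chunk(n_replicas), full_y.chunk(n_replicas)))

for model, (x, y) in zip(replicas, shards):
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()  # gradients land on this replica's own parameters

# (3) Average the gradients across replicas, parameter by parameter.
for params in zip(*(m.parameters() for m in replicas)):
    avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg_grad.clone()

# (4) Step every optimizer; the replicas started equal and saw identical
#     averaged gradients, so they stay equal after the update.
for opt in optimizers:
    opt.step()
    opt.zero_grad()

# (5) Repeat for the next batch.
```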

V0: Simple one-iteration implementation, no parallelism

I implemented a single training iteration in which 1) the models start out equal, 2) each gets its own shard of the data, 3) each calculates its own gradients, 4) the gradients are reduced and returned, and 5) an optimizer step is taken, so that all models are equal at the end.

Yay!
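
To sanity-check the “all models are equal at the end” property, a comparison along these lines works (a hypothetical helper, reusing the replicas from the sketch above, not the repo’s code):

```python
def replicas_equal(models, atol=1e-6):
    # Compare every parameter of every replica against the first replica.
    reference = list(models[0].parameters())
    return all(
        torch.allclose(p, q, atol=atol)
        for model in models[1:]
        for p, q in zip(reference, model.parameters())
    )

assert replicas_equal(replicas), "replicas diverged after the optimizer step"
```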

This setup is not parallel, and is perhaps even slower than no parallelism at all. For the most part, though, the system is close to what a practical application would look like… except that we make n copies of the optimizer, and each model replica keeps its own copy of the optimizer state, which seems wasteful.
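
To make the waste concrete: with a stateful optimizer like Adam (assumed here purely for illustration; V0 itself may use something simpler), each replica materializes its own per-parameter buffers after a step, so optimizer memory grows linearly with the number of replicas:

```python
import torch
import torch.nn as nn

models = [nn.Linear(1024, 1024) for _ in range(4)]
opts = [torch.optim.Adam(m.parameters()) for m in models]

# One dummy step per replica so Adam's exp_avg / exp_avg_sq buffers exist.
for model, opt in zip(models, opts):
    model(torch.randn(8, 1024)).sum().backward()
    opt.step()

# Count every tensor held in optimizer state, across all four copies.
state_elems = sum(
    t.numel()
    for opt in opts
    for param_state in opt.state.values()
    for t in param_state.values()
    if torch.is_tensor(t)
)
print(f"optimizer state elements across 4 replicas: {state_elems:,}")
```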

I couldn’t find any direct citations on this, so I asked a question here: https://stackoverflow.com/questions/78504808/how-is-optimizerth-step-implemented-for-data-parallelism-in-pytorch

We also note that the losses look roughly equal to those from PyTorch’s DataParallel utility: 1.17 vs 1.19 (perhaps not close enough?). Diffing the models now…

tensor([[5.8021e-07, 6.5863e-06],
        [1.7134e-06, 1.6799e-06]],
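
For reference, a diff like the one above can be produced by comparing the two models parameter by parameter. A minimal sketch, where `ours` and `baseline` are stand-ins for the from-scratch model and the DataParallel-trained one:

```python
def param_diff(model_a, model_b):
    # Elementwise absolute difference between corresponding parameters.
    return {
        name: (p_a - p_b).abs()
        for (name, p_a), (_, p_b) in zip(
            model_a.named_parameters(), model_b.named_parameters()
        )
    }

# e.g. param_diff(ours, baseline)["weight"]
```

Absolute differences on the order of 1e-6, as above, are about what float32 rounding from a different accumulation order would produce, though whether that also accounts for the 1.17 vs 1.19 loss gap is still an open question.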