what even is a mixture of experts model?

The latest model making waves in the Open Source Language Model Sweepstakes is Mixtral, an 8 x 7B mixture of experts model from the folks at Mistral AI. If you look at the LMSys Chatbot Arena Leaderboard, Mistral’s models sit third or fourth, behind only the OpenAIs, Googles, and Anthropics of the world. Mixtral is the best open-source model according to these metrics:

[Image: LMSys Chatbot Arena Leaderboard rankings]

There are also rumors that GPT-4 (by far the reigning champion) is itself an [8 x 220B](https://community.openai.com/t/gpt-4-cost-estimate-updated/578008#:~:text=GPT-4%20is%20a%20Mixture,at%200.002%24%20per%201k%20token.) mixture of experts model. All of this leads to an important question: What even is a mixture of experts model?

Note that there’s a bunch of MoE-shaped work outside of the LLM domain; in this blog, we’re going to stick to LLM MoEs.

<aside> 💡 FYI: I write a similar piece every weekend. Subscribe by emailing literally anything to [email protected]

</aside>

🤔 why do MoE?

Let’s start with a simple question: Why would you try to make an 8x7B Frankenstein when you could make a 56B dense model?

There are two different goals to keep in mind:

  1. Ensemble learning: When you train multiple different versions of the same base architecture in slightly different ways, you’re able to create a set of models that complement each other. This is the idea behind random forests or multi-head attention.
  2. Conditional computation: If you’re able to retain the same level of performance while activating fewer parameters per iteration at inference or training time, you can make your model run faster and more compute efficient (see the back-of-the-envelope sketch after this list). Yoshua Bengio of Turing Award fame has some really interesting work in this domain.
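
To make the conditional computation point concrete with the post’s own numbers (a rough back-of-the-envelope that ignores the attention layers the experts share): Mixtral routes each token to 2 of its 8 experts, so per token it only touches

$$
2 \times 7\,\text{B} = 14\,\text{B} \;\;\text{active expert parameters} \qquad \text{vs.} \qquad 8 \times 7\,\text{B} = 56\,\text{B} \;\;\text{in the dense equivalent.}
$$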

MoEs are intended to bring both of these benefits to large language models.

🎰 but how does it work?

An MoE in concept looks something like this:

[Image: conceptual diagram of an MoE layer]

It is useful to think of an MoE as a replacement for an MLP in the network, so you can stack as many MoE blocks as needed in the model itself. Here’s another visualization that might help encapsulate this idea:

[Image: MoE blocks stacked throughout the model]
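
To pin the idea down, here’s a minimal sketch in PyTorch of what “an MoE in place of the MLP” can look like. This is an illustration, not Mixtral’s actual implementation; the names (`Expert`, `MoELayer`, `n_experts`, `top_k`) are made up for this post, with top-2-of-8 routing as in Mixtral.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A plain feed-forward block -- the thing the MoE layer keeps several copies of."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Drop-in replacement for the MLP in a transformer block: a linear gate scores
    the experts for each token, only the top_k experts actually run, and their
    outputs are mixed using the (softmaxed) gate scores as weights."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])            # flatten to (n_tokens, d_model)
        scores = self.gate(tokens)                     # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)        # gating weights over the chosen experts

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == i).nonzero(as_tuple=True)  # tokens routed to expert i
            if token_ids.numel() == 0:
                continue                               # conditional computation: skip unused experts
            expert_out = expert(tokens[token_ids])
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert_out

        return out.reshape(x.shape)
```

Calling something like `MoELayer(d_model, d_hidden)` where the dense MLP would normally sit (e.g. `x = x + moe(norm(x))`) is all the “stacking” amounts to: every transformer block gets its own set of experts and its own gate.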

📏 the math behind MoEs

The output of the MoE is a weighted sum of the “expert outputs”, with the “gating weights” as the weights.
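
In symbols, using the standard formulation from the MoE literature (where $G(x)_i$ is the gating weight the router assigns to expert $i$, and $E_i(x)$ is that expert’s output):

$$
y \;=\; \sum_{i=1}^{n} G(x)_i \, E_i(x)
$$

With top-$k$ routing, $G(x)_i$ is zero for every expert the router didn’t pick, so those experts never need to be evaluated at all.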