The latest model making waves in the Open Source Language Model Sweepstakes is Mixtral — an 8 x 7B mixture of experts model from the folks at Mistral AI. If you look at the LMSys Chatbot Arena Leaderboard, Mistral’s models are third or fourth, behind only the OpenAIs, Googles, and Anthropics of the world. Mixtral is the best open-source model according to these metrics:
There are also persistent rumors that GPT-4 (by far the reigning champion) is an [8 x 220B](https://community.openai.com/t/gpt-4-cost-estimate-updated/578008#:~:text=GPT-4 is a Mixture,at 0.002%24 per 1k token.) mixture of experts model. All of this leads to an important question: what even is a mixture of experts model?
Note that there’s a bunch of MoE-shaped work outside of the LLM domain — in this blog, we’re going to stick to LLM MoEs.
<aside> 💡 FYI: I write a similar piece every weekend. Subscribe by emailing literally anything to [email protected]
</aside>
Let’s start with a simple question: why would you build an 8x7B Frankenstein when you could build a 56B dense model?
There are two different goals to keep in mind:
MoEs are intended to confer both of these benefits on large language models.
An MoE in concept looks something like this:
It is useful to think of them as replacements for the MLP blocks in the network — so you can stack as many MoE blocks as needed in the model itself. Here’s another visualization that might help encapsulate the idea:
The output of the MoE is a weighted sum of the expert outputs, with the "gating weights" deciding how much each expert contributes.
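To make that concrete, here is a minimal sketch of an MoE layer in PyTorch. This is an illustration of the idea, not Mixtral's actual implementation: the class name `SimpleMoE`, the dimensions, and the top-2 routing are assumptions made for the example. A router produces gating weights over the experts, each token is sent to its top-k experts, and the layer's output is the gating-weighted sum of those experts' outputs.

```python
# Minimal MoE layer sketch (illustrative, not Mixtral's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router ("gating network") scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an ordinary MLP; the MoE layer drops in where a
        # single dense MLP block would normally sit in the transformer.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> one row per token.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                       # (n_tokens, n_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # gating weights per token

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Which (token, top-k slot) pairs routed to this expert?
            token_ids, slot = (indices == expert_id).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # Accumulate gating weight * expert output for those tokens.
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape(x.shape)


moe = SimpleMoE()
y = moe(torch.randn(2, 16, 512))  # same shape in and out, like a regular MLP block
```

Because only `top_k` experts run per token, the compute per token stays close to that of a single MLP, even though the layer stores `n_experts` MLPs' worth of parameters — which is exactly the appeal of 8x7B over a 56B dense model.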