Mixture of Experts (MoE)
A "70 billion parameter model" sounds intimidating and expensive. But if that model is built using a technique called Mixture of Experts, only about 12 billion of those parameters are doing any work for any given token. The rest are sitting idle. This is the architecture behind some of the most capable models available today, and it changes how you should think about model size.
The Problem with Dense Models
In a standard transformer (sometimes called a "dense" model), every parameter participates in processing every token. The model is fully engaged on every forward pass. This makes sense for small models, but as models grew to hundreds of billions of parameters, the compute cost of evaluating all of them for every token became enormous.
Mixture of Experts (MoE) is an architectural response to that problem. Instead of one big feed-forward network that always runs, MoE replaces it with a set of smaller "expert" subnetworks, plus a lightweight router that decides which experts to use for each token.
How the Routing Works
The router (sometimes called a gating network) is a small learned function. For each token, it assigns weights to each expert and selects the top-k highest-weighted ones. A common configuration is top-2 of 8 experts: each token activates exactly two experts, and their outputs are combined proportionally to the router's assigned weights.
The experts that aren't selected do nothing. No computation, no memory bandwidth. They might as well not exist for that particular token.
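The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalise with a softmax over only the selected experts' scores are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Rank experts by router score and keep the top k.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Assumption: weights are renormalised over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    output = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Hypothetical toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]

# Hypothetical router scores for one token (in a real model these are learned).
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

selected = route_top_k(logits, k=2)
combined = moe_layer([1.0, 2.0], logits, experts, k=2)
print(selected)
print(combined)
```

Only two of the eight expert functions are ever called for this token, which is the source of the compute saving.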
The Core Trade-off
MoE gives you a favorable deal on compute: the model activates far fewer parameters per token than its total count suggests, so inference is cheaper than a comparable dense model. Mixtral 8×7B, for example, has 46.7 billion total parameters but activates around 12.9 billion per forward pass. It runs closer to a 13B model in terms of compute, while achieving quality competitive with much larger dense models.
The catch is memory. All the experts have to live somewhere. Even though most experts aren't active on any given token, the full model must be loaded into memory (typically GPU VRAM) to run. You can't load just the experts you need on the fly, because the routing happens in real time.
This makes MoE attractive on servers with ample memory, but harder to deploy on constrained hardware. Running a 46B-parameter MoE model still requires the VRAM to hold all 46B parameters, even though you only compute with 13B of them at a time.
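A back-of-the-envelope calculation makes the memory side of the trade-off concrete. The sketch below uses the Mixtral 8×7B figures quoted above and assumes 2 bytes per parameter for 16-bit weights and 0.5 bytes for 4-bit quantized weights. It counts weights only, so the KV cache, activations, and framework overhead would come on top. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(total_params_billion, bytes_per_param=2.0):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return total_params_billion * bytes_per_param

TOTAL_B = 46.7   # total parameters (billions), from the text
ACTIVE_B = 12.9  # active parameters per forward pass (billions), from the text

mem_16bit = weight_memory_gb(TOTAL_B)        # assumes 16-bit weights
mem_4bit = weight_memory_gb(TOTAL_B, 0.5)    # assumes 4-bit quantization
active_fraction = ACTIVE_B / TOTAL_B

print(f"16-bit weights: ~{mem_16bit:.1f} GB")
print(f"4-bit weights:  ~{mem_4bit:.1f} GB")
print(f"Share of parameters active per token: {active_fraction:.0%}")
```

Memory scales with the total parameter count, while per-token compute scales with the active count, which is why the two numbers have to be considered separately.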
Keeping Experts Balanced
A naïve router quickly develops a preference for a small subset of experts. Left to its own devices, the model might always pick expert 2 and expert 5, leaving the other six perpetually idle. This wastes the capacity you built in.
Training mitigates this with an auxiliary loss, an additional term in the loss function that penalizes uneven expert utilization. The router is nudged to distribute tokens more evenly across all experts, ensuring each one develops distinct, useful specializations. When it works well, different experts specialize in different kinds of content: some handle technical language, others narrative prose, others code syntax.
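One widely used form of such a penalty, assumed here for illustration rather than taken from this article, multiplies the fraction of tokens sent to each expert by the average router probability for that expert, sums over experts, and scales by the number of experts. The sketch uses top-1 assignments to keep it short. The function name and the toy inputs are hypothetical, and in training the result would be multiplied by a small coefficient before being added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary balance penalty: num_experts * sum_i(f_i * P_i).

    router_probs: one list of per-expert probabilities for each token.
    assignments:  the expert index chosen for each token (top-1 for simplicity).
    """
    n_tokens = len(router_probs)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[i] for tok in router_probs) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced case: four tokens spread evenly over four experts.
balanced = load_balancing_loss(
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    assignments=[0, 1, 2, 3],
    num_experts=4,
)

# Collapsed case: the router sends every token to expert 0.
collapsed = load_balancing_loss(
    router_probs=[[0.7, 0.1, 0.1, 0.1]] * 4,
    assignments=[0, 0, 0, 0],
    num_experts=4,
)

print(balanced, collapsed)
```

The penalty is smallest when tokens are spread uniformly and grows as the router concentrates on a few experts, which is the pressure toward balance that the paragraph above describes.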
Which Models Use MoE?
MoE is increasingly mainstream:
- Mixtral 8×7B (Mistral AI): one of the first widely available open-weight MoE models; 8 experts, top-2 routing, ~12.9B active parameters
- Mixtral 8×22B: the larger follow-up, with 39B active parameters out of 141B total
- GPT-4: widely rumored to use MoE based on inference cost patterns, though OpenAI has not confirmed the architecture publicly
- Gemini 1.5: confirmed MoE architecture by Google
The trend is clear. As models scale, MoE is becoming the standard way to expand capacity without proportionally increasing per-token compute cost.
Why This Matters for Evaluations
When you're comparing models or evaluating a provider's offering, total parameter count is a poor proxy for what you actually care about: speed and cost per token. Two models with the same total parameters can have very different inference costs if one is dense and the other is MoE.
The relevant metric is active parameters per forward pass: how many parameters are actually engaged when the model processes your input. This is what determines compute, latency, and cost. A 70B MoE model may be faster and cheaper to run than a 13B dense model, depending on the specific architecture.
When comparing models, "total parameters" is a poor proxy for inference cost. For MoE models, ask for "active parameters per forward pass", since that drives latency and pricing. A 70B MoE model may be cheaper to run than a 13B dense model. When vendors quote model sizes, always ask which number they mean.
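To see how the ranking changes depending on which number you use, the sketch below sorts three models by total and by active parameters. The first two entries are hypothetical models invented for the comparison; the Mixtral figures are the ones quoted earlier. The per-token compute estimate relies on a common rule of thumb, assumed here, of roughly 2 floating-point operations per active parameter per token, and it ignores memory bandwidth, batching, and attention cost.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per forward pass, in billions

    def approx_gflops_per_token(self) -> float:
        # Rule of thumb (an assumption): ~2 FLOPs per active parameter per token.
        return 2 * self.active_params_b

models = [
    ModelSpec("dense-13B (hypothetical)", 13.0, 13.0),
    ModelSpec("moe-70B (hypothetical)", 70.0, 12.0),
    ModelSpec("Mixtral 8x7B", 46.7, 12.9),
]

by_total = sorted(models, key=lambda m: m.total_params_b)
by_active = sorted(models, key=lambda m: m.active_params_b)

print("Smallest by total parameters: ", by_total[0].name)
print("Cheapest by active parameters:", by_active[0].name)
for m in by_active:
    print(f"{m.name}: ~{m.approx_gflops_per_token():.1f} GFLOPs per token")
```

Under these assumptions the model with the largest total size comes out cheapest per token, which is why a vendor's headline parameter count needs the follow-up question.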
Further Reading
src/data/papers.ts.