Mixture of Experts (MoE)

PM: Skim — 15 min

A "70 billion parameter model" sounds intimidating and expensive. But if that model is built using a technique called Mixture of Experts, only about 12 billion of those parameters are doing any work for any given token. The rest are sitting idle. This is the architecture behind some of the most capable models available today — and it changes how you should think about model size.

The Problem with Dense Models

In a standard transformer (sometimes called a "dense" model), every parameter participates in processing every token. The model is fully engaged on every forward pass. This makes sense for small models, but as models grew to hundreds of billions of parameters, the compute cost of evaluating all of them for every token became enormous.

Mixture of Experts (MoE) is an architectural response to that problem. In each transformer layer, the single large feed-forward network that always runs is replaced by a set of smaller "expert" subnetworks, plus a lightweight router that decides which experts to use for each token.

How the Routing Works

The router (sometimes called a gating network) is a small learned function. For each token, at every MoE layer, it scores each expert and selects the top-k highest-scoring ones. A common configuration is top-2 of 8 experts: each token activates exactly two experts per layer, and their outputs are combined in proportion to the router's assigned weights.

The experts that aren't selected do nothing. No computation, no memory bandwidth. They might as well not exist for that particular token.
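
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names `route_top_k` and `moe_layer` are hypothetical, the router scores are passed in directly rather than produced by a learned layer, and the weights are assumed to come from a softmax over the selected scores only (some models normalize differently).

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: weights are a softmax over the selected experts' scores only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    output = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 0.0, 1.0, -0.5, 0.2]  # made-up router scores

print(route_top_k(logits))               # experts 1 and 5 are selected
print(moe_layer([1.0, 2.0], logits, experts))
```

Only the two selected experts are ever called; the other six contribute no computation for this token, which is where the savings come from.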

The Core Trade-off

MoE gives you a favorable deal on compute: the model activates far fewer parameters per token than its total count suggests, so inference is cheaper than a comparable dense model. Mixtral 8×7B, for example, has 46.7 billion total parameters but activates around 12.9 billion per forward pass. It runs closer to a 13B model in terms of compute, while achieving quality competitive with much larger dense models.
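
The two Mixtral figures above are enough for a rough back-of-envelope split. The sketch below assumes a simplified model in which every parameter is either shared (always active, such as attention and embeddings) or belongs to exactly one of 8 equally sized experts, with 2 experts active per token. The helper name `split_parameters` is hypothetical, and the result is an estimate derived from the quoted totals, not an official breakdown.

```python
def split_parameters(total, active, n_experts, top_k):
    """Solve the two equations
         total  = shared + n_experts * per_expert
         active = shared + top_k     * per_expert
       for the shared and per-expert parameter counts."""
    per_expert = (total - active) / (n_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert

shared, per_expert = split_parameters(
    total=46.7e9, active=12.9e9, n_experts=8, top_k=2
)
print(f"shared ~ {shared / 1e9:.1f}B, per expert ~ {per_expert / 1e9:.1f}B")
# prints: shared ~ 1.6B, per expert ~ 5.6B
```

Under these assumptions, this also suggests why "8×7B" does not add up to 56B: the shared parameters are counted once rather than eight times.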

The catch is memory. All the experts have to live somewhere. Even though most experts aren't active on any given token, the full model must be loaded into memory (typically GPU VRAM) to run. Loading only the experts you need on demand is impractical, because the router makes a fresh choice for every token at every MoE layer, and you don't know which experts will be needed until that moment.

This makes MoE attractive on servers with ample memory, but harder to deploy on constrained hardware. Running a 46B-parameter MoE model still requires the VRAM to hold all 46B parameters, even though you only compute with 13B of them at a time.
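
A rough way to size the memory requirement is to multiply total parameters by the bytes used to store each one. The sketch below is an estimate for weights only: it ignores the KV cache, activations, and runtime overhead, and the precision table and the function name `weight_memory_gb` are illustrative assumptions.

```python
# Assumed storage cost per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params, precision="fp16"):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9

# Memory is driven by TOTAL parameters, not active ones.
for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(46.7e9, precision)
    print(f"{precision}: about {gb:.0f} GB for 46.7B total parameters")
```

At 16-bit precision that is roughly 93 GB of weights for the 46.7B-parameter example, even though only about 12.9B parameters take part in any one forward pass.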

Keeping Experts Balanced

A naïve router quickly develops a preference for a small subset of experts. Left to its own devices, the model might always pick expert 2 and expert 5, leaving the other six perpetually idle. This wastes the capacity you built in.

Training mitigates this with an auxiliary loss: an additional term in the loss function that penalizes uneven expert utilization. The router is nudged to distribute tokens more evenly across all experts, so that each one receives enough training signal to become useful. The resulting specializations are often less tidy than labels such as "technical language," "narrative prose," or "code" would suggest; the Mixtral authors, for example, reported routing patterns that followed syntax more than subject matter.
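
One common formulation of such an auxiliary loss, in the style used by the Switch Transformer, multiplies each expert's share of routed tokens by its average router probability and sums the result. The sketch below assumes top-1 routing for simplicity and is not necessarily the loss any specific model listed here uses; the function name `load_balancing_loss` is hypothetical. In training, this value would typically be scaled by a small coefficient and added to the main loss.

```python
def load_balancing_loss(router_probs, assignments, n_experts):
    """Compute n_experts * sum_i(f_i * P_i), where
         f_i = fraction of tokens dispatched to expert i
         P_i = mean router probability assigned to expert i.
       router_probs: one probability list per token.
       assignments:  the expert index chosen for each token (top-1)."""
    n_tokens = len(router_probs)
    f = [0.0] * n_experts
    for chosen in assignments:
        f[chosen] += 1.0 / n_tokens
    p = [sum(probs[i] for probs in router_probs) / n_tokens
         for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced: each of 4 tokens goes to a different expert.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], n_experts=4)

# Collapsed: every token goes to expert 0 with full confidence.
collapsed = load_balancing_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], n_experts=4)

print(balanced, collapsed)  # 1.0 4.0
```

The loss is smallest when tokens are spread evenly and grows as routing collapses onto a few experts, which is the pressure that keeps the router from always picking the same ones.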

Which Models Use MoE?

MoE is increasingly mainstream:

Mixtral 8×7B (Mistral AI): one of the first widely available open-weight MoE models; 8 experts, top-2 routing, ~12.9B active parameters
Mixtral 8×22B: the larger follow-up, with 39B active parameters out of 141B total
GPT-4: widely rumored to use MoE, though OpenAI has not confirmed the architecture publicly
Gemini: Google has confirmed an MoE architecture for Gemini 1.5 and later
DeepSeek-V2 / V3: open-weight MoE models from DeepSeek (China); V2 has 236B total / 21B active parameters, V3 has 671B total / 37B active. Both reported results competitive with closed frontier models at a comparatively low stated training cost

The trend is clear. As models scale, MoE is becoming the standard way to expand capacity without proportionally increasing per-token compute cost.

Why This Matters for Evaluations

When you're comparing models or evaluating a provider's offering, total parameter count is a poor proxy for what you actually care about — speed and cost per token. Two models with the same total parameters can have very different inference costs if one is dense and the other is MoE.

The more relevant metric is active parameters per forward pass: how many parameters are actually engaged when the model processes each token. This drives per-token compute, and with it much of the latency and cost. Total parameters still set the memory footprint, and therefore the hardware you need. A 70B MoE model with roughly 12B active parameters may be faster per token than a 13B dense model, but it needs far more memory to host.
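
To make the comparison concrete, the sketch below puts a dense model and an MoE model side by side. Both models are hypothetical, using the round numbers from this article. It relies on the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, and assumes 16-bit weights; treat the output as an order-of-magnitude guide rather than a benchmark.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    total_params: float
    active_params: float  # equals total_params for a dense model

    def flops_per_token(self):
        # Rule of thumb: about 2 FLOPs per active parameter per token.
        return 2 * self.active_params

    def weight_memory_gb(self, bytes_per_param=2):
        # Weights only; assumes 16-bit storage by default.
        return self.total_params * bytes_per_param / 1e9

# Hypothetical models for illustration only.
dense = ModelSpec("dense 13B", total_params=13e9, active_params=13e9)
moe = ModelSpec("MoE 70B (12B active)", total_params=70e9, active_params=12e9)

for m in (dense, moe):
    print(f"{m.name}: {m.flops_per_token() / 1e9:.0f} GFLOPs/token, "
          f"{m.weight_memory_gb():.0f} GB of weights")
```

Under these assumptions the MoE model needs slightly less compute per token than the dense one, while requiring several times the memory to hold its weights.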

PM Roadmap Tip

When comparing models, "total parameters" is a poor proxy for inference cost. For MoE models, ask for "active parameters per forward pass," which drives per-token compute and much of the latency and pricing. A 70B MoE model with about 12B active parameters may be cheaper per token than a 13B dense model, though it needs more memory to host. When vendors quote model sizes, always ask which number they mean.
