AI & machine learning

MoE

Mixture of Experts. A neural network architecture where many specialised "expert" sub-models exist alongside a router that picks which experts to use for each input. Only a fraction of the model's parameters are active for any single query.

Also known as: Mixture of Experts, sparse model

Mixture of Experts is the architectural choice that lets modern AI models grow huge without becoming proportionally expensive to run. A traditional dense model uses all of its parameters for every input: a 70B-parameter model does 70B parameters' worth of work for every token. An MoE model has many more parameters in total but activates only a small fraction for each query (often somewhere between roughly 5% and 30%, as the examples below show). The router decides which “experts” to consult based on the input, and only those experts contribute to the output.
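
A minimal sketch makes the mechanism concrete. The PyTorch code below is purely illustrative (the expert count, top-k value, and layer sizes are assumptions, not any particular model's configuration): a linear router scores all experts for each token, only the top-k are run, and their outputs are mixed using the normalised router weights.

```python
# Illustrative top-k MoE layer. NUM_EXPERTS, TOP_K and the expert
# architecture are assumptions for this sketch, not a real model's config.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 8  # experts in the layer
TOP_K = 2        # experts consulted per token

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.router = nn.Linear(d_model, NUM_EXPERTS)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(NUM_EXPERTS)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                       # (tokens, NUM_EXPERTS)
        weights, chosen = logits.topk(TOP_K, dim=-1)  # pick TOP_K experts per token
        weights = F.softmax(weights, dim=-1)          # normalise over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert is idle for the whole batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

layer = MoELayer(d_model=512, d_hidden=2048)
y = layer(torch.randn(4, 512))  # each token runs only 2 of the 8 experts
```

Every token still passes through the router, but the bulk of the compute (the expert feed-forward networks) is skipped for all but the chosen few.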

The economics are striking. Mixtral 8x7B has 47B total parameters but uses only ~13B per token, so it runs at roughly the speed of a 13B model while having the capability of something closer to a 47B one. DeepSeek V3 has 671B total parameters but uses only 37B per token. GPT-4 is widely believed to be an MoE model, based on leaked architecture details. Frontier models are increasingly MoE because the alternative (dense models that use all their parameters every time) hits scaling limits much faster.
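
A quick back-of-the-envelope check, using only the figures above, shows the active fraction per token:

```python
# Active-parameter fractions for the models mentioned above.
models = {
    "Mixtral 8x7B": (47e9, 13e9),   # (total params, active per token)
    "DeepSeek V3":  (671e9, 37e9),
}
for name, (total, active) in models.items():
    print(f"{name}: ~{active / total:.0%} of parameters active per token")
# Mixtral 8x7B: ~28% of parameters active per token
# DeepSeek V3: ~6% of parameters active per token
```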

The central technical challenge with MoE is training the router. The router has to learn which experts are good at which kinds of inputs, and it has to balance load across experts so that some don’t get overworked while others sit idle. Early MoE training was unstable for exactly this reason. Modern MoE training techniques (auxiliary load-balancing losses, expert dropout, capacity factors) have largely solved the stability problems, but the engineering is still more complex than dense training.
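
To make one of those techniques concrete, below is a sketch of an auxiliary load-balancing loss in the style popularised by the Switch Transformer: the loss is smallest when the fraction of tokens each expert receives and the router's average probability per expert are both uniform. Top-1 routing and the tensor shapes are simplifying assumptions.

```python
# Sketch of a Switch-Transformer-style auxiliary load-balancing loss.
# Assumes top-1 routing for simplicity; shapes are illustrative.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        chosen_expert: torch.Tensor,
                        num_experts: int) -> torch.Tensor:
    # router_logits: (tokens, num_experts); chosen_expert: (tokens,) expert ids
    probs = F.softmax(router_logits, dim=-1)
    # f_e: fraction of tokens actually dispatched to each expert
    frac_tokens = F.one_hot(chosen_expert, num_experts).float().mean(dim=0)
    # P_e: mean router probability assigned to each expert
    frac_probs = probs.mean(dim=0)
    # Minimised when both distributions are uniform (1/num_experts each)
    return num_experts * torch.sum(frac_tokens * frac_probs)
```

Added to the main training loss with a small coefficient, this nudges the router toward even expert utilisation without dictating which expert handles which kind of input.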

In DeAI specifically, MoE matters for two reasons. First, the most capable open-weight models in 2026 are increasingly MoE (Mixtral, DeepSeek V3, several others), so DeAI inference networks need to support them efficiently. Second, MoE inference has unusual hardware patterns (only some experts are active at any moment), which creates opportunities for clever compute scheduling that pure dense inference doesn’t have. This is one of the more active areas of optimisation work in distributed inference systems.
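
As a deliberately simplified illustration of that scheduling opportunity, the sketch below groups tokens by their routed expert so each group can be sent to whichever machine hosts that expert. The expert-to-node mapping and function name are invented for the example; real systems handle this at a much finer grain.

```python
# Hypothetical dispatch planner: group tokens by routed expert so each
# group goes to the node hosting that expert. The mapping is invented.
from collections import defaultdict

EXPERT_TO_NODE = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}

def plan_dispatch(token_ids, expert_ids):
    batches = defaultdict(list)
    for tok, exp in zip(token_ids, expert_ids):
        batches[EXPERT_TO_NODE[exp]].append((tok, exp))
    return dict(batches)  # node -> (token, expert) pairs for this step

print(plan_dispatch([0, 1, 2, 3], [2, 0, 2, 1]))
# {'node-b': [(0, 2), (2, 2)], 'node-a': [(1, 0), (3, 1)]}
```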

Related terms