Mixture of Experts¶
MoE combines several smaller models, each of which performs better in certain domains, and routes inputs to the most relevant ones. The approach is notable in part because it has been widely reported, though never officially confirmed, that GPT-4 is powered by a mixture of 8 expert models.
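As a rough illustration of the core idea (not GPT-4's actual architecture), here is a minimal sketch of a sparse MoE layer in plain NumPy: a gating network scores the experts for each input and only the top-k experts run, so total capacity grows with the number of experts while per-input compute stays roughly constant. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoELayer:
    def __init__(self, d_model, n_experts, top_k=2):
        self.top_k = top_k
        # Each "expert" here is just a dense projection; in practice each
        # expert is a full feed-forward block specialized during training.
        self.experts = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(n_experts)]
        self.gate = rng.normal(0, 0.02, (d_model, n_experts))

    def __call__(self, x):                      # x: (batch, d_model)
        scores = softmax(x @ self.gate)         # routing probabilities per expert
        top = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for i in range(x.shape[0]):
            for e in top[i]:
                # Weight each selected expert's output by its gate score.
                out[i] += scores[i, e] * (x[i] @ self.experts[e])
        return out

layer = SparseMoELayer(d_model=16, n_experts=8, top_k=2)
print(layer(rng.normal(size=(4, 16))).shape)    # (4, 16)
```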
Scaling Expert Language Models with Unsupervised Domain Discovery
"Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models."
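A hedged sketch of the idea described in the quote: cluster a corpus, train one expert LM per cluster (training is stubbed out here), then at inference weight each expert by the query's distance to its cluster centroid and keep only the nearest clusters (a sparse ensemble). The tf-idf embeddings, k-means clustering, and all names are illustrative stand-ins, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["patch notes for the kernel", "league standings and scores",
          "quarterly earnings report", "playoff injury update",
          "gradient descent convergence", "central bank rate decision"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)

# 1) Unsupervised domain discovery: cluster related documents together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(doc_vecs)

# 2) One expert per cluster. Training each expert on its cluster's documents
#    is the embarrassingly parallel step; here the experts are placeholders.
experts = {c: f"expert_lm_{c}" for c in range(kmeans.n_clusters)}

def sparse_ensemble_weights(query, top_k=2):
    """Return per-expert weights for a query, zeroing all but the nearest clusters."""
    q = vectorizer.transform([query])
    dists = kmeans.transform(q)[0]             # distance to each cluster centroid
    scores = np.exp(-dists)                    # closer cluster -> larger score
    keep = np.argsort(dists)[:top_k]
    weights = np.zeros_like(scores)
    weights[keep] = scores[keep] / scores[keep].sum()
    return weights                             # mix expert next-token probs with these

print(sparse_ensemble_weights("who won the playoff game"))
```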
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
"The codebase is built on T5X, which defines the model and training loop; Flaxformer, which defines the model computation; Flax, which defines the low level model layers; and Jax, which provides the execution." Paper
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
Paper The authors demonstrate that randomly selecting which of several smaller, differently trained models generates each response turn can yield significant performance improvements for smaller models. Here is the algorithm (a Python sketch of the loop follows the listing):
Algorithm 1 Blended Algorithm¶
1. k ← 1
2. while true do
3.     u_k ← user's current input turn
4.     Sample model parameters θ_n ~ P_θ
5.     Generate response r_k according to:
6.     r_k ~ P(r | u_1:k, r_1:k−1; θ_n)
7.     k = k + 1
8. end while
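The sketch below is a direct transcription of Algorithm 1: at every turn, one of the base models is drawn uniformly at random and generates the reply conditioned on the full conversation so far. The `models` list and the `generate` callable are placeholders for whatever chat LLMs are actually blended.

```python
import random

def blended_chat(models, get_user_turn, generate, max_turns=100):
    user_turns, responses = [], []
    for k in range(1, max_turns + 1):          # stands in for "while true do"
        u_k = get_user_turn(k)                 # u_k <- user's current input turn
        if u_k is None:
            break
        user_turns.append(u_k)
        theta_n = random.choice(models)        # sample model parameters theta_n ~ P_theta
        # r_k ~ P(r | u_1:k, r_1:k-1; theta_n): condition on the whole history.
        r_k = generate(theta_n, user_turns, responses)
        responses.append(r_k)
    return responses

# Toy usage with stub models that just tag their name onto the reply.
models = ["model_a", "model_b", "model_c"]
turns = iter(["hi", "tell me a joke", None])
print(blended_chat(models,
                   get_user_turn=lambda k: next(turns),
                   generate=lambda m, us, rs: f"[{m}] reply to: {us[-1]}"))
```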