Mixture of Experts¶
MoE combines several smaller models, each of which performs better in certain domains, and routes inputs to the most relevant ones. The approach is notable in part because it has been widely reported, though never officially confirmed, that GPT-4 is powered by a mixture of 8 expert models.
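As a rough illustration of the core idea (not GPT-4's actual architecture), here is a minimal sketch of a sparse MoE layer in plain NumPy: a gating network scores the experts for each input and only the top-k experts run, so total capacity grows with the number of experts while per-input compute stays roughly constant. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoELayer:
    def __init__(self, d_model, n_experts, top_k=2):
        self.top_k = top_k
        # Each "expert" here is just a dense projection; in practice each
        # expert is a full feed-forward block specialized during training.
        self.experts = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(n_experts)]
        self.gate = rng.normal(0, 0.02, (d_model, n_experts))

    def __call__(self, x):                      # x: (batch, d_model)
        scores = softmax(x @ self.gate)         # routing probabilities per expert
        top = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for i in range(x.shape[0]):
            for e in top[i]:
                # Weight each selected expert's output by its gate score.
                out[i] += scores[i, e] * (x[i] @ self.experts[e])
        return out

layer = SparseMoELayer(d_model=16, n_experts=8, top_k=2)
print(layer(rng.normal(size=(4, 16))).shape)    # (4, 16)
```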
Scaling Expert Language Models with Unsupervised Domain Discovery
"Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models."
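A hedged sketch of the idea described in the quote: cluster a corpus, train one expert LM per cluster (training is stubbed out here), then at inference weight each expert by the query's distance to its cluster centroid and keep only the nearest clusters (a sparse ensemble). The tf-idf embeddings, k-means clustering, and all names are illustrative stand-ins, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["patch notes for the kernel", "league standings and scores",
          "quarterly earnings report", "playoff injury update",
          "gradient descent convergence", "central bank rate decision"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)

# 1) Unsupervised domain discovery: cluster related documents together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(doc_vecs)

# 2) One expert per cluster. Training each expert on its cluster's documents
#    is the embarrassingly parallel step; here the experts are placeholders.
experts = {c: f"expert_lm_{c}" for c in range(kmeans.n_clusters)}

def sparse_ensemble_weights(query, top_k=2):
    """Return per-expert weights for a query, zeroing all but the nearest clusters."""
    q = vectorizer.transform([query])
    dists = kmeans.transform(q)[0]             # distance to each cluster centroid
    scores = np.exp(-dists)                    # closer cluster -> larger score
    keep = np.argsort(dists)[:top_k]
    weights = np.zeros_like(scores)
    weights[keep] = scores[keep] / scores[keep].sum()
    return weights                             # mix expert next-token probs with these

print(sparse_ensemble_weights("who won the playoff game"))
```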
Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning
"The codebase is built on T5X, which defines the model and training loop; Flaxformer, which defines the model computation; Flax, which defines the low level model layers; and Jax, which provides the execution." Paper
Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
Paper The authors demonstrate that randomly selecting which of several smaller, differently trained models generates each response turn can yield significant performance improvements for smaller models. Here is the algorithm (a Python sketch of the loop follows the listing):
Algorithm 1 Blended Algorithm¶
1. k ← 1
2. while true do
3.     u_k ← user's current input turn
4.     Sample model parameters θ_n ~ P_θ
5.     Generate response r_k according to:
6.     r_k ~ P(r | u_1:k, r_1:k−1; θ_n)
7.     k = k + 1
8. end while
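The sketch below is a direct transcription of Algorithm 1: at every turn, one of the base models is drawn uniformly at random and generates the reply conditioned on the full conversation so far. The `models` list and the `generate` callable are placeholders for whatever chat LLMs are actually blended.

```python
import random

def blended_chat(models, get_user_turn, generate, max_turns=100):
    user_turns, responses = [], []
    for k in range(1, max_turns + 1):          # stands in for "while true do"
        u_k = get_user_turn(k)                 # u_k <- user's current input turn
        if u_k is None:
            break
        user_turns.append(u_k)
        theta_n = random.choice(models)        # sample model parameters theta_n ~ P_theta
        # r_k ~ P(r | u_1:k, r_1:k-1; theta_n): condition on the whole history.
        r_k = generate(theta_n, user_turns, responses)
        responses.append(r_k)
    return responses

# Toy usage with stub models that just tag their name onto the reply.
models = ["model_a", "model_b", "model_c"]
turns = iter(["hi", "tell me a joke", None])
print(blended_chat(models,
                   get_user_turn=lambda k: next(turns),
                   generate=lambda m, us, rs: f"[{m}] reply to: {us[-1]}"))
```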