Pre-trained Models¶
Dynamic Field
The field moves too quickly to track every pre-trained model manually. For the most up-to-date information, refer to the Hugging Face Open LLM Leaderboard.
Because of the cost of aggregating sufficient data and performing large-scale training, it is often preferable to start from a pre-trained model. Pre-trained models may be open source or closed source, and choosing between them is an important decision driven by project requirements.
To ensure models meet technical, customer, and organizational requirements, it is important to compare and evaluate them.
API-Based Models¶
API Access
- OpenAI: Access to hosted GPT models through an API
- Hugging Face Transformers: Popular library for downloading and running open transformer models locally (a minimal sketch of both access paths follows this list)
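As a rough illustration, the sketch below queries a hosted model through the OpenAI API and runs an open model locally through Transformers. The model identifiers (`gpt-4o-mini`, `gpt2`) and the prompt are placeholder assumptions, not recommendations, and an `OPENAI_API_KEY` environment variable is assumed to be set.

```python
# Hedged sketch: API-based vs. locally hosted access to a pre-trained model.
from openai import OpenAI
from transformers import pipeline

prompt = "Summarize the benefits of starting from a pre-trained model."

# API-based access (assumes OPENAI_API_KEY is set in the environment).
client = OpenAI()
api_reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model id; substitute any model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(api_reply.choices[0].message.content)

# Local access through Hugging Face Transformers (downloads weights on first use).
generator = pipeline("text-generation", model="gpt2")  # placeholder open checkpoint
print(generator(prompt, max_new_tokens=50)[0]["generated_text"])
```

The trade-off mirrors the open versus closed split discussed in this section: the API path hides infrastructure but meters cost per token, while the local path requires your own hardware yet keeps data and weights in-house.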
Open Source Models¶
Latest Developments¶
Llama 3
Trained on 15T multilingual tokens, with up to 405B parameters (a local loading sketch follows this list):
- Strong data selection and data-synthesis strategy
- Relatively simple post-training recipe: supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO)
- 4D parallelism combining tensor (TP), pipeline (PP), context (CP), and data (DP) parallelism
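The 405B flagship is far beyond a single machine, but the same family ships smaller instruction-tuned checkpoints. Below is a minimal sketch of running one of them through Transformers; `meta-llama/Meta-Llama-3.1-8B-Instruct` is an assumed repo id, the weights are gated behind a license acceptance on Hugging Face, and chat-style pipeline input assumes a reasonably recent Transformers release.

```python
# Hedged sketch: running a smaller instruction-tuned Llama 3 checkpoint locally.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed, gated repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread layers across whatever GPUs are available
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain rejection sampling in one sentence."},
]
result = chat(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```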
Multimodal Models¶
Text Models¶
Llama 2
Open-source set of 7B-70B models:
- Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models
- Strong performance across tasks
Mistral
Released September 2023:
- Announcement
- Hugging Face
Additional Text Models
Qwen
Open-source models including Qwen-72B and Qwen-1.8B:
- Trained on 3T tokens of high-quality data
- 32K context window length
- Enhanced system prompt capability
- Qwen-1.8B optimized for efficiency (3GB GPU memory); see the loading sketch after this list
- GitHub Repository
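Because the 1.8B variant is small enough for a modest GPU, a local loading sketch is straightforward. `Qwen/Qwen-1_8B-Chat` is the assumed Hugging Face repo id, and the original Qwen releases ship custom modeling code, hence `trust_remote_code=True`.

```python
# Hedged sketch: loading the small Qwen checkpoint in half precision.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-1_8B-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps memory in the low-GB range
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Qwen models support a context window of", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```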
Vision Models¶
Vision-Focused Models
Speech Models¶
Moshi
Speech-text foundation model for real-time dialogue
Closed Source Models¶
OpenAI o1
Next-generation model with integrated chain-of-thought reasoning (a minimal API sketch follows this list):
- Improved complex reasoning and transparent explanations
- Scales performance with inference compute
- Introduces AGI-benchmark 1.0 with 27 categories
- Demonstrates inference time scaling laws
- Reproducible Results
- System Card
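Since o1 is closed source, it is reached the same way as other OpenAI models, through the hosted API. The sketch below is an assumed usage example, with `o1-mini` as a placeholder model id; availability and exact identifiers depend on your account.

```python
# Hedged sketch: calling a reasoning model through the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="o1-mini",  # placeholder reasoning-model id
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
)
print(response.choices[0].message.content)
```

Because the model spends additional inference compute on its internal chain of thought, responses are typically slower and costlier per request than standard chat models.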
Gemini
Google's multimodal model (a minimal API sketch follows this list):
- Technical Report
- AlphaCode2 Report
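For completeness, here is a minimal sketch of calling Gemini through Google's `google-generativeai` SDK; the model name `gemini-1.5-flash` and the API-key handling are assumptions, so check Google's documentation for current identifiers.

```python
# Hedged sketch: querying Gemini through the google-generativeai SDK.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes the key is set
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name
response = model.generate_content("Describe, in two sentences, what makes a model multimodal.")
print(response.text)
```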