Speculative Streaming: Fast LLM Inference without Auxiliary Models

The authors propose "a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction," reporting that "Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks."
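The key change is the training objective: instead of supervising only the next token, the fine-tuned model is also supervised on several future positions. Below is a minimal PyTorch sketch of such a future n-gram loss. The names (`MultiStreamHead`, `future_ngram_loss`), the use of separate linear heads, and all shapes are illustrative assumptions to keep the sketch short, not the paper's actual stream-fused architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStreamHead(nn.Module):
    """Hypothetical n-gram heads on top of a decoder's hidden states.

    The paper fuses speculative streams into the target model itself;
    separate linear heads are used here only for illustration.
    """
    def __init__(self, hidden_size: int, vocab_size: int, n_streams: int = 4):
        super().__init__()
        # Head k predicts the token at position t + k + 1.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(n_streams)]
        )

    def forward(self, hidden_states: torch.Tensor) -> list[torch.Tensor]:
        # hidden_states: (batch, seq_len, hidden_size) from the target model.
        return [head(hidden_states) for head in self.heads]

def future_ngram_loss(stream_logits: list[torch.Tensor], input_ids: torch.Tensor):
    """Average cross-entropy over future offsets 1..n instead of just offset 1."""
    total = 0.0
    for k, logits in enumerate(stream_logits):
        shift = k + 1
        pred = logits[:, :-shift, :]   # states that have a label shift steps ahead
        target = input_ids[:, shift:]  # tokens shift positions in the future
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return total / len(stream_logits)

# Toy usage with random tensors standing in for a real decoder's outputs.
heads = MultiStreamHead(hidden_size=64, vocab_size=100, n_streams=4)
hidden = torch.randn(2, 16, 64)
ids = torch.randint(0, 100, (2, 16))
loss = future_ngram_loss(heads(hidden), ids)
loss.backward()
```

Because the same forward pass now yields guesses for several future tokens, the model can draft and verify within itself rather than consulting an auxiliary draft model.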

Speculative Sampling
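For contrast, classic speculative sampling relies on a separate draft model that proposes tokens for the target model to verify in one parallel pass; that auxiliary model is exactly what Speculative Streaming eliminates. A minimal greedy sketch, assuming `target` and `draft` are callables mapping `(1, seq)` token ids to `(1, seq, vocab)` logits and batch size 1:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, gamma=4):
    """One greedy step of classic two-model speculative decoding."""
    # 1. Draft model proposes gamma tokens autoregressively (cheap).
    proposal = ids
    for _ in range(gamma):
        next_tok = draft(proposal)[:, -1:].argmax(-1)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2. Target model scores every proposed position in one parallel pass.
    logits = target(proposal)
    target_pred = logits[:, ids.size(1) - 1 : -1].argmax(-1)  # (1, gamma)
    drafted = proposal[:, ids.size(1):]                       # (1, gamma)

    # 3. Keep the longest prefix where draft and target agree, then take the
    #    target's token at the first disagreement for free. (The extra bonus
    #    token when all gamma drafts are accepted is omitted for brevity.)
    matches = (target_pred == drafted)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())
    return torch.cat([ids, target_pred[:, : n_accept + 1]], dim=-1)
```

The acceptance test here is an exact greedy match for simplicity; the stochastic version accepts each drafted token with probability min(1, p_target/p_draft), which keeps the output distribution identical to sampling from the target model alone.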
