Speculative Streaming: Fast LLM Inference without Auxiliary Models
The authors describe Speculative Streaming as a "single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction." They report that "Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks."
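To make the changed training objective concrete, here is a minimal sketch of what a future n-gram loss could look like, assuming a toy model with one output head per speculative "stream" (head k is trained to predict the token k+1 steps ahead). All names here (ToyDrafterLM, N_STREAMS, ngram_loss) are illustrative stand-ins, not the paper's actual architecture, and a GRU is used only as a small placeholder for the target transformer:

```python
# Hedged sketch: future n-gram fine-tuning objective (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, N_STREAMS = 1000, 64, 4  # N_STREAMS = assumed n-gram size

class ToyDrafterLM(nn.Module):
    """Tiny causal LM with one output head per future-token stream."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # stand-in for a transformer
        # Head k predicts the token at offset k+1 (head 0 is the usual next-token head).
        self.heads = nn.ModuleList(nn.Linear(HIDDEN, VOCAB) for _ in range(N_STREAMS))

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return [head(h) for head in self.heads]  # list of (B, T, VOCAB) logits

def ngram_loss(logits_per_stream, ids):
    """Sum of cross-entropies: stream k at position t is scored on token t+k+1."""
    loss = 0.0
    for k, logits in enumerate(logits_per_stream):
        shift = k + 1
        pred = logits[:, :-shift, :]   # positions that still have a valid target
        tgt = ids[:, shift:]           # token `shift` steps ahead
        loss = loss + F.cross_entropy(pred.reshape(-1, VOCAB), tgt.reshape(-1))
    return loss

model = ToyDrafterLM()
batch = torch.randint(0, VOCAB, (2, 16))  # fake token ids for demonstration
loss = ngram_loss(model(batch), batch)
loss.backward()                            # all streams train jointly in one pass
print(float(loss))
```

At inference time, the extra streams supply draft tokens that the same model verifies in parallel, which is what removes the need for a separate auxiliary draft model.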