Generation
Generating new data from an input involves selecting the next token, or set of tokens, from the model's output logit vector at each step.
Contrastive Decoding¶
Contrastive decoding uses the differences between a stronger (expert) model and a weaker (amateur) model and demonstrates substantial improvements in generative quality.
Contrastive inference:
"Any method which controls behavior differential at inference time, directly contrasting outputs from a desirable inference process with outputs from an undesirable inference process." --Sean O'Brien
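As a rough illustration, the sketch below scores a single decoding step by the difference between an expert and an amateur model's log-probabilities. The tensor names and the `alpha` plausibility cutoff are assumptions made for the example, not the authors' implementation.

```python
# Minimal sketch of a contrastive decoding step (hypothetical tensors;
# real use would take logits from two actual models).
import torch
import torch.nn.functional as F

def contrastive_decode_step(expert_logits, amateur_logits, alpha=0.1):
    """Pick the next token from one step's logits (shape: [vocab_size])."""
    expert_logp = F.log_softmax(expert_logits, dim=-1)
    amateur_logp = F.log_softmax(amateur_logits, dim=-1)

    # Plausibility constraint: only consider tokens whose expert probability
    # is within a factor alpha of the expert's most likely token.
    cutoff = torch.log(torch.tensor(alpha)) + expert_logp.max()
    plausible = expert_logp >= cutoff

    # Contrastive score: reward tokens the expert likes but the amateur doesn't.
    score = expert_logp - amateur_logp
    score[~plausible] = float("-inf")
    return int(score.argmax())

# Toy usage with random logits standing in for real model outputs.
vocab = 32
next_token = contrastive_decode_step(torch.randn(vocab), torch.randn(vocab))
```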
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
"(They) amplify the factual knowledge in an LM
through a contrastive decoding approach, where the output probability over the next word is obtained from
the difference in logits obtained from a higher layer versus a lower layer"
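A minimal sketch of that layer contrast, assuming per-layer hidden states and a shared LM head are already available. The `dola_step` function and the fixed `premature` layer index are simplifications; the actual method selects the contrasted layer dynamically.

```python
# Illustrative sketch of a DoLa-style layer contrast (hypothetical names).
# Both layers' hidden states are projected through the same LM head
# ("early exit"), and the next token comes from the difference of their
# log-probabilities.
import torch
import torch.nn.functional as F

def dola_step(hidden_states, lm_head, mature=-1, premature=2):
    """hidden_states: list of per-layer tensors of shape [hidden_dim]."""
    mature_logp = F.log_softmax(lm_head(hidden_states[mature]), dim=-1)
    premature_logp = F.log_softmax(lm_head(hidden_states[premature]), dim=-1)
    return int((mature_logp - premature_logp).argmax())

# Toy usage with random tensors standing in for layer outputs.
hidden = [torch.randn(64) for _ in range(12)]   # stand-in layer outputs
head = torch.nn.Linear(64, 32)                  # stand-in LM head
token = dola_step(hidden, head)
```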
Speculative Approaches¶
Speculative Streaming¶
Speculative Streaming: Fast LLM Inference without Auxiliary Models
The authors show in their paper that it is a "single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks."
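As a loose sketch of a future n-gram objective, one can attach extra prediction heads so the model is trained to predict tokens several steps ahead, not just the next one. The `NGramHeads` module below is a hypothetical stand-in; the paper's own architecture differs.

```python
# Rough sketch of training extra heads to predict future tokens
# (hypothetical module; not the paper's architecture).
import torch
import torch.nn.functional as F

class NGramHeads(torch.nn.Module):
    def __init__(self, hidden_dim, vocab_size, n_future=3):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(hidden_dim, vocab_size) for _ in range(n_future)
        )

    def loss(self, hidden, targets):
        """hidden: [seq, hidden_dim]; targets: [seq] token ids."""
        total = 0.0
        for k, head in enumerate(self.heads, start=1):
            # Head k predicts the token k steps ahead of each position.
            logits = head(hidden[:-k])
            total = total + F.cross_entropy(logits, targets[k:])
        return total / len(self.heads)

# Toy usage with random hidden states and targets.
heads = NGramHeads(hidden_dim=64, vocab_size=32)
loss = heads.loss(torch.randn(10, 64), torch.randint(0, 32, (10,)))
```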
Speculative Sampling¶
Speculative sampling is a technique that exploits generation parallelism to produce multiple next tokens per target-model pass, reducing latency. It starts by using a smaller draft model to generate a draft of k tokens. The target model then scores all k draft positions in a single parallel forward pass (instead of the standard serial decoding) to produce output logits. Each draft token is compared against the target model's distribution and randomly accepted or rejected; a rejected token is replaced by a freshly sampled one and the rest of the draft is discarded.
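A simplified sketch of the accept/reject loop, assuming the draft and target distributions for the k drafted positions are already computed. The names are hypothetical, and the extra bonus token sampled when every draft token is accepted is omitted.

```python
# Sketch of speculative sampling's accept/reject step (greatly simplified).
# Each draft token is kept with probability min(1, p_target / p_draft);
# on rejection, a replacement is drawn from the residual distribution so
# the overall output still matches the target model's distribution.
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """draft_probs, target_probs: [k, vocab]; draft_tokens: [k]."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
        if torch.rand(()) < torch.clamp(p_t / p_d, max=1.0):
            accepted.append(int(tok))          # keep the draft token
        else:
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break                              # discard the rest of the draft
    return accepted

# Toy usage: random distributions stand in for real model outputs, and the
# draft tokens are sampled independently per position for simplicity.
k, vocab = 4, 32
draft = torch.softmax(torch.randn(k, vocab), -1)
target = torch.softmax(torch.randn(k, vocab), -1)
tokens = torch.multinomial(draft, 1).squeeze(1)
print(speculative_step(draft, target, tokens))
```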
Joint Decoding¶
Co-LLM: Learning to Decode Collaboratively with Multiple Language Models
The authors show in their paper how multiple models can be used to improve generated content, with the output tokens of one model serving as context for the others.
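As a toy illustration of that idea, the sketch below lets two models share one growing context and defers to the assistant when the base model's confidence is low. The threshold rule is an assumption made for the example; Co-LLM instead learns the per-token deferral decision.

```python
# Toy sketch of joint decoding with two models sharing one context
# (hypothetical deferral rule and model interfaces). Each model is assumed
# to map a token sequence to per-position logits of shape [seq, vocab].
import torch
import torch.nn.functional as F

def joint_decode(base_model, assistant_model, tokens, steps=20, tau=0.5):
    for _ in range(steps):
        probs = F.softmax(base_model(tokens)[-1], dim=-1)
        if probs.max() < tau:                   # base model is unsure:
            probs = F.softmax(assistant_model(tokens)[-1], dim=-1)  # defer
        next_tok = torch.multinomial(probs, 1)
        tokens = torch.cat([tokens, next_tok])  # shared context for both
    return tokens

# Toy usage: tiny linear "models" stand in for real language models.
vocab, dim = 32, 16
emb = torch.randn(vocab, dim)
toy = lambda w: lambda toks: emb[toks] @ w
out = joint_decode(toy(torch.randn(dim, vocab)), toy(torch.randn(dim, vocab)),
                   torch.tensor([1, 2, 3]))
```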