Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
FOS: Computer and information sciences
Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2402.13720
Publication Date: 2024-02-21
AUTHORS (6)
ABSTRACT
Drafting-then-verifying decoding methods such as speculative decoding are widely adopted training-free methods to accelerate the inference of large language models (LLMs). Instead of employing an autoregressive process to decode tokens sequentially, speculative decoding initially creates drafts with an efficient small model. Then the LLM is required to conduct verification and correction in a non-autoregressive fashion to minimize time overhead. Generating longer drafts can lead to even more significant speedups once verified, but also incurs substantial trial-and-error costs if verification fails. Suffering from this high failure probability, existing methods cannot draft too much content for verification at one time, achieving sub-optimal acceleration. In this paper, we introduce Ouroboros, which constructs a phrase candidate pool from the verification process of the LLM to provide candidates for the draft generation of the small model. Thereby, Ouroboros can further improve the efficiency and effectiveness of the initial drafts. The experimental results on typical text generation tasks show that Ouroboros achieves speedups of up to 1.9x and 2.8x compared to lookahead decoding and speculative decoding, respectively. The source code is available at https://github.com/thunlp/Ouroboros.
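To make the scheme concrete, below is a minimal, dependency-free Python sketch of one draft-then-verify step with a phrase candidate pool. Everything in it is illustrative rather than the paper's implementation: large_model_next and small_model_next are pure-function stand-ins for real model forward passes, the pool lookup and acceptance policies are simplified assumptions, and the verification loop runs sequentially where the actual method uses a single non-autoregressive pass. See the repository above for the real code.

def large_model_next(prefix):
    # Stand-in for one greedy decoding step of the target LLM
    # (a pure function so the sketch runs without ML dependencies).
    return (sum(prefix) * 31 + 7) % 100

def small_model_next(prefix):
    # Cheap draft model: mostly agrees with the target, sometimes not.
    t = large_model_next(prefix)
    return t if t % 5 else (t + 1) % 100

def speculative_step(prefix, pool, draft_len=6):
    # 1) Drafting: reuse a pool phrase whose first token matches the
    #    small model's guess (a deliberately naive lookup policy),
    #    then extend token by token with the small model.
    first = small_model_next(prefix)
    match = next((p for p in pool if p[0] == first), None)
    draft = list(match) if match else []
    cur = list(prefix) + draft
    while len(draft) < draft_len:
        t = small_model_next(cur)
        draft.append(t)
        cur.append(t)
    # 2) Verification: the target model's token at every draft position.
    #    (Sequential here for clarity; the real method obtains all of
    #    these from a single non-autoregressive forward pass.)
    cur, verified = list(prefix), []
    for t in draft:
        verified.append(large_model_next(cur))
        cur.append(t)
    # 3) Accept the longest agreeing prefix, then one corrected token.
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1
    out = list(prefix) + draft[:n]
    if n < len(draft):
        out.append(verified[n])  # correction token from the verifier
        # Tokens past the mismatch were computed during verification
        # anyway (under the now-rejected draft); keep them as an
        # approximate phrase candidate for future drafts -- the pool idea.
        if verified[n + 1:]:
            pool.append(tuple(verified[n + 1:]))
    else:
        out.append(large_model_next(out))  # bonus token on full accept
    return out, pool

seq, pool = [1, 2, 3], []
for _ in range(8):
    seq, pool = speculative_step(seq, pool)
print("generated:", seq)
print("phrase pool:", pool)

The point of the pool in this toy is that the verifier's extra tokens come for free: even when a draft is rejected mid-way, the tokens the LLM computed past the mismatch can seed future drafts, which is how, per the abstract, Ouroboros lengthens and improves drafts without raising the trial-and-error cost of drafting from scratch.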