Extending LLMs' Context Window with 100 Samples

DOI: 10.48550/arxiv.2401.07004 Publication Date: 2024-01-01
ABSTRACT
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.
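The abstract names two ingredients: a larger RoPE base frequency and an extra scale on the attention logits that keeps attention entropy stable as the context grows. Below is a minimal NumPy sketch of how these two pieces fit together; it is not the authors' implementation (see the linked repository for that), and the concrete base value and logit-scale formula used here are illustrative assumptions rather than the paper's reported hyperparameters.

```python
# Sketch of RoPE with an adjusted base frequency plus attention-logit scaling.
# Assumptions (not from the paper): base=50_000 for the extended window, and a
# logit scale that grows logarithmically with sequence length.
import numpy as np

def rope_angles(head_dim, seq_len, base=10_000.0):
    """Rotary angles m * theta_i with theta_i = base^(-2i/d)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)            # (seq_len, head_dim/2)

def apply_rope(x, angles):
    """Rotate (even, odd) feature pairs of queries/keys by the RoPE angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def scaled_attention(q, k, v, train_len, logit_scale=None):
    """Softmax attention with an extra factor on the logits, intended to keep
    attention entropy from drifting upward at longer context lengths."""
    d = q.shape[-1]
    if logit_scale is None:
        # Illustrative choice: scale grows with log(current_len)/log(train_len).
        logit_scale = max(1.0, np.log(q.shape[0]) / np.log(train_len))
    logits = logit_scale * (q @ k.T) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: a model pre-trained at 4,096 tokens evaluated at a longer window.
# (The paper targets 16,384; a shorter length is used here to keep the demo light.)
head_dim, train_len, long_len = 64, 4_096, 8_192
rng = np.random.default_rng(0)
q = rng.standard_normal((long_len, head_dim), dtype=np.float32)
k = rng.standard_normal((long_len, head_dim), dtype=np.float32)
v = rng.standard_normal((long_len, head_dim), dtype=np.float32)
angles = rope_angles(head_dim, long_len, base=50_000.0)   # adjusted base frequency
out = scaled_attention(apply_rope(q, angles), apply_rope(k, angles), v, train_len)
print(out.shape)  # (8192, 64)
```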