Ring Attention with Blockwise Transformers for Near-Infinite Context
DOI:
10.48550/arxiv.2310.01889
Publication Date:
2023-01-01
AUTHORS (3): Hao Liu, Matei Zaharia, Pieter Abbeel
ABSTRACT
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable with prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens in context size and improving performance.
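
The core idea described in the abstract, blockwise attention whose key-value blocks rotate around a ring of devices while a running softmax is accumulated, can be sketched in a few lines of JAX. The following is a minimal single-host simulation of that ring schedule, not the authors' implementation: names such as ring_attention and blockwise_update are illustrative, the rotation uses jnp.roll in place of an actual cross-device jax.lax.ppermute, and the paper's overlap of communication with computation is omitted for clarity.

# Minimal single-host sketch of a ring-attention schedule (assumed names, not the paper's API).
# Each simulated "device" holds one query block; key/value blocks rotate around the
# ring, and a numerically stable blockwise softmax is accumulated along the way.
import jax
import jax.numpy as jnp


def blockwise_update(q, k, v, acc, row_max, row_sum):
    """Fold one key/value block into the running softmax statistics."""
    scores = q @ k.T / jnp.sqrt(q.shape[-1])              # (q_blk, kv_blk)
    new_max = jnp.maximum(row_max, scores.max(axis=-1))   # updated row-wise max
    correction = jnp.exp(row_max - new_max)               # rescale the old accumulator
    p = jnp.exp(scores - new_max[:, None])                # unnormalized probabilities
    acc = acc * correction[:, None] + p @ v
    row_sum = row_sum * correction + p.sum(axis=-1)
    return acc, new_max, row_sum


def ring_attention(q_blocks, k_blocks, v_blocks):
    """q/k/v_blocks: (num_devices, block_len, dim); returns the full attention output."""
    num_devices, block_len, dim = q_blocks.shape
    acc = jnp.zeros((num_devices, block_len, dim))
    row_max = jnp.full((num_devices, block_len), -jnp.inf)
    row_sum = jnp.zeros((num_devices, block_len))
    k, v = k_blocks, v_blocks
    for _ in range(num_devices):
        # Every "device" attends its query block to the kv block it currently holds.
        acc, row_max, row_sum = jax.vmap(blockwise_update)(
            q_blocks, k, v, acc, row_max, row_sum)
        # Rotate kv blocks one step around the ring (ppermute on real hardware,
        # where the send/receive would overlap with the blockwise compute above).
        k = jnp.roll(k, shift=1, axis=0)
        v = jnp.roll(v, shift=1, axis=0)
    return acc / row_sum[..., None]


if __name__ == "__main__":
    qkv = jax.random.normal(jax.random.PRNGKey(0), (3, 4, 8, 16))  # 4 blocks, len 8, dim 16
    out = ring_attention(*qkv)
    # Compare against vanilla attention over the concatenated sequence.
    q, k, v = (x.reshape(4 * 8, 16) for x in qkv)
    ref = jax.nn.softmax(q @ k.T / jnp.sqrt(16.0)) @ v
    print(jnp.allclose(out.reshape(4 * 8, 16), ref, atol=1e-4))

Because every query block eventually sees every key-value block, the result matches ordinary attention exactly; the memory saving comes from never materializing more than one kv block per device at a time.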