RWKV: Reinventing RNNs for the Transformer Era
DOI: 10.48550/arXiv.2305.13048
Publication Date: 2023-01-01
AUTHORS (30)
ABSTRACT
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, which parallelizes computations during training and maintains constant computational and memory complexity during inference. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling the trade-offs between computational efficiency and model performance in sequence processing tasks.
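To make the constant-memory inference claim concrete, the sketch below evaluates the paper's WKV (weighted key value) operator in its recurrent, RNN-mode form: the running numerator/denominator state has a fixed size per channel, independent of sequence length. This is a minimal NumPy sketch, assuming the WKV recurrence with per-channel decay w and current-step bonus u described in the paper; the function name, the plain-NumPy form, and the omission of the numerical-stability trick used in practice are illustrative choices, not the reference implementation.

import numpy as np

def wkv_recurrent(k, v, w, u):
    # Sequential (RNN-mode) evaluation of a WKV-style operator.
    # k, v : (T, C) keys and values for T timesteps and C channels.
    # w    : (C,) per-channel decay rate (state is scaled by exp(-w) each step).
    # u    : (C,) per-channel bonus applied to the current timestep only.
    # Returns a (T, C) array of outputs.
    T, C = k.shape
    a = np.zeros(C)          # running weighted sum of values (numerator)
    b = np.zeros(C)          # running sum of weights (denominator)
    out = np.zeros((T, C))
    for t in range(T):
        e_k = np.exp(k[t])
        e_uk = np.exp(u + k[t])
        # Output at step t: accumulated past plus a bonus-weighted current token.
        out[t] = (a + e_uk * v[t]) / (b + e_uk)
        # Update the fixed-size state: decay the past, absorb the current token.
        a = np.exp(-w) * a + e_k * v[t]
        b = np.exp(-w) * b + e_k
    return out

# Example: the carried state is only two (C,) vectors, regardless of T.
T, C = 8, 4
rng = np.random.default_rng(0)
out = wkv_recurrent(rng.normal(size=(T, C)), rng.normal(size=(T, C)),
                    w=np.ones(C) * 0.5, u=np.zeros(C))
print(out.shape)  # (8, 4)

Because the loop carries only a and b forward, per-token inference cost and memory stay constant in T; the same operator can also be unrolled as a softmax-free weighted sum over all past keys, which is what permits the parallel, Transformer-style training the abstract describes.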