What can a Single Attention Layer Learn? A Study Through the Random Features Lens

DOI: 10.48550/arxiv.2307.11353 Publication Date: 2023-01-01
ABSTRACT
Attention layers -- which map a sequence of inputs to a sequence of outputs -- are core building blocks of the Transformer architecture, which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a rigorous theoretical study on the learning and generalization of a single multi-head attention layer that takes a sequence of key vectors and a separate query vector as input. We consider the random feature setting where the attention layer has a large number of heads, with randomly sampled frozen query and key matrices and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random-feature attention with finitely many heads. Our results have several implications unique to the attention structure compared with existing random features theory for neural networks, such as (1) advantages in sample complexity over standard two-layer random-feature networks; (2) concrete and natural classes of functions that can be learned efficiently by a random-feature attention layer; and (3) the effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrices), where Gaussian random weights with a non-zero mean result in better sample complexities than the zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
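
As a concrete illustration of the model class described above, the following is a minimal Python sketch of a random-feature attention layer: M heads whose query-key matrices W_m are randomly sampled and then frozen, with only the value vectors v_m acting as trainable parameters. The softmax parameterization, the scalar readout, and the function name random_feature_attention are assumptions made for illustration, not the paper's exact definition.

import numpy as np

def random_feature_attention(q, K, W_list, v_list):
    # q: (d,) query vector; K: (N, d) key vectors.
    # W_list: M frozen (d, d) query-key matrices (randomly sampled).
    # v_list: M trainable (d,) value vectors.
    # Returns a scalar prediction averaged over the M heads.
    out = 0.0
    for W, v in zip(W_list, v_list):
        scores = K @ (W @ q)               # attention scores <k_i, W q>
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                 # softmax over the N key positions
        pooled = attn @ K                  # attention-weighted average of keys
        out += pooled @ v                  # scalar readout through the value vector
    return out / len(W_list)

# Example usage: d = 4 dimensions, N = 5 keys, M = 32 random heads.
rng = np.random.default_rng(0)
d, N, M = 4, 5, 32
q, K = rng.normal(size=d), rng.normal(size=(N, d))
W_list = [rng.normal(size=(d, d)) for _ in range(M)]   # frozen random weights
v_list = [rng.normal(size=d) for _ in range(M)]        # trainable in practice
print(random_feature_attention(q, K, W_list, v_list))

Note that the softmax pooling over key positions makes the output permutation invariant to the key vectors, matching the target function class studied in the paper; shifting the mean of the Gaussian used to sample each W_m corresponds to the non-zero-mean sampling distribution discussed in implication (3).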