What can a Single Attention Layer Learn? A Study Through the Random Features Lens

DOI: 10.48550/arxiv.2307.11353 Publication Date: 2023-01-01
ABSTRACT
Attention layers -- which map a sequence of inputs to a sequence of outputs -- are core building blocks of the Transformer architecture, which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a rigorous theoretical study on the learning and generalization of a single multi-head attention layer that takes a sequence of key vectors and a separate query vector as input. We consider the random feature setting where the attention layer has a large number of heads, with randomly sampled frozen query and key matrices and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random-feature attention with finitely many heads. Our results have several implications unique to the attention structure compared with existing random features theory for neural networks, such as (1) advantages in sample complexity over standard two-layer random-feature networks; (2) concrete and natural classes of functions that can be learned efficiently by a random-feature attention layer; and (3) the effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrices), where Gaussian random weights with a non-zero mean result in better sample complexities than the zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
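
As a concrete illustration of the model class described above, the following is a minimal Python sketch of a random-feature attention layer: M heads whose query-key matrices W_m are randomly sampled and then frozen, with only the value vectors v_m acting as trainable parameters. The softmax parameterization, the scalar readout, and the function name random_feature_attention are assumptions made for illustration, not the paper's exact definition.

import numpy as np

def random_feature_attention(q, K, W_list, v_list):
    # q: (d,) query vector; K: (N, d) key vectors.
    # W_list: M frozen (d, d) query-key matrices (randomly sampled).
    # v_list: M trainable (d,) value vectors.
    # Returns a scalar prediction averaged over the M heads.
    out = 0.0
    for W, v in zip(W_list, v_list):
        scores = K @ (W @ q)               # attention scores <k_i, W q>
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                 # softmax over the N key positions
        pooled = attn @ K                  # attention-weighted average of keys
        out += pooled @ v                  # scalar readout through the value vector
    return out / len(W_list)

# Example usage: d = 4 dimensions, N = 5 keys, M = 32 random heads.
rng = np.random.default_rng(0)
d, N, M = 4, 5, 32
q, K = rng.normal(size=d), rng.normal(size=(N, d))
W_list = [rng.normal(size=(d, d)) for _ in range(M)]   # frozen random weights
v_list = [rng.normal(size=d) for _ in range(M)]        # trainable in practice
print(random_feature_attention(q, K, W_list, v_list))

Note that the softmax pooling over key positions makes the output permutation invariant to the key vectors, matching the target function class studied in the paper; shifting the mean of the Gaussian used to sample each W_m corresponds to the non-zero-mean sampling distribution discussed in implication (3).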