Is the Number of Trainable Parameters All That Actually Matters?
Keywords: Spurious relationship, Scaling law
DOI:
10.48550/arxiv.2109.11928
Publication Date:
2021-01-01
AUTHORS (5)
ABSTRACT
Recent work has identified simple empirical scaling laws for language models, linking compute budget, dataset size, model size, and autoregressive modeling loss. The validity of these power laws across orders of magnitude in scale provides compelling evidence that larger models are also more capable models. However, scaling up under the constraints of hardware and infrastructure is no easy feat, and rapidly becomes a hard and expensive engineering problem. We investigate ways to tentatively cheat scaling laws and train for cheaper. We emulate an increase in effective parameters using efficient approximations: either by doping the models with frozen random parameters, or by using fast structured transforms in place of dense linear layers. We find that the scaling relationship between test loss and compute depends only on the actual number of trainable parameters; scaling laws cannot be deceived by spurious parameters.
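The "spurious parameter" idea in the abstract can be illustrated with a toy layer. The following is a minimal sketch, assuming PyTorch; the class name DopedLinear, the doping_ratio argument, and the weight scaling are illustrative assumptions, not the authors' exact construction. A frozen random matrix is added on top of a trainable weight, so the layer has more effective parameters while its trainable count stays unchanged.

import torch
import torch.nn as nn

class DopedLinear(nn.Module):
    """Linear layer whose weight is a trainable matrix plus a frozen random one."""

    def __init__(self, in_features, out_features, doping_ratio=1.0):
        super().__init__()
        # Trainable component: this is what the scaling law actually counts.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Frozen random "doping": extra effective parameters that are never updated.
        frozen = torch.randn(out_features, in_features) / in_features ** 0.5 * doping_ratio
        self.register_buffer("frozen_weight", frozen)

    def forward(self, x):
        # Apply the combined (trainable + frozen) weight, then the bias.
        return x @ (self.weight + self.frozen_weight).t() + self.bias

layer = DopedLinear(512, 512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = layer.frozen_weight.numel()
print(f"trainable parameters: {trainable}")
print(f"frozen 'spurious' parameters: {frozen}")

Per the abstract's finding, the frozen component does not shift the loss-versus-parameters curve: test loss follows only the trainable count, so such doping (or replacing dense layers with cheaper structured transforms) does not buy the capability of a genuinely larger model.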