Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
FOS: Computer and information sciences
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI:
10.48550/arXiv.2502.07490
Publication Date:
2025-02-11
AUTHORS (7)
ABSTRACT
Large Language Models (LLMs) have been found to struggle with accurately retrieving key information from their context. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively with a decoder-only Transformer. This eliminates the need for the bidirectional attention or encoder-decoder architectures required by MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key-information retrieval and long-context reasoning tasks, while performing on par with or better than it on commonsense reasoning tasks. The benefits also extend to supervised fine-tuning, where MEAP shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating attention on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
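To make the described procedure concrete, below is a minimal sketch of a MEAP-style training step in PyTorch. It assumes a Hugging Face-style causal language model interface (`model(ids).logits`), a dedicated mask token id, and a 15% mask ratio; none of these specifics are given in the abstract, and targeting the original (uncorrupted) next tokens is one natural reading of "standard next-token prediction" on a masked input, not the paper's confirmed implementation.

import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One MEAP-style training step: corrupt a small fraction of the
    input with mask tokens, then run ordinary causal next-token
    prediction on the corrupted sequence.

    `mask_ratio=0.15` and `mask_token_id` are illustrative assumptions;
    the abstract does not specify the exact masking setup.
    """
    # Randomly choose a small fraction of positions to replace with [MASK].
    corrupted = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    corrupted[mask] = mask_token_id

    # Standard decoder-only NTP: predict the original token at position
    # t+1 from the (partially masked) tokens up to position t.
    logits = model(corrupted).logits  # assumes HF-style causal LM output
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return loss

Note that the sketch changes only the input corruption, consistent with the abstract's claim: the model remains a standard decoder-only Transformer with causal attention, so no bidirectional attention, encoder-decoder architecture, or extra compute is introduced at pre-training or inference time.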