
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

FOS: Computer and information sciences; Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

DOI: 10.48550/arxiv.2407.01976

Publication Date: 2024-01-01


AUTHORS (12)

Lu, Jinghui

Yu, Haiyang

Wang, Yanjie

Ye, Yongjie

Tang, Jingqun

Yang, Ziwei

Wu, Binghong

Liu, Qi

Feng, Hao

Wang, Han

Liu, Hao

Huang, Can

ABSTRACT

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects each bounding box to a single embedding and interleaves it with text, avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in Key Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive benchmark evaluations reveal significant improvements, with a 27.2% increase on KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art (SOTA) document understanding multimodal LLMs (MLLMs), as well as a 15.1% improvement over other SOTA OCR-based LLMs on KIE tasks.
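The interleaving idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that each bounding box is projected to a single embedding, so the linear projection used here is an assumption, and every name (`W_BOX`, `TOK_TABLE`, `project_box`, `interleave`, `HIDDEN`) and all random weights are hypothetical stand-ins for learned parameters.

```python
import random

random.seed(0)
HIDDEN = 8  # hypothetical embedding width

# Hypothetical stand-ins for learned parameters (random, for illustration only).
W_BOX = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(4)]
TOK_TABLE = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(100)]


def project_box(bbox):
    """Map four coordinates (x1, y1, x2, y2) to one HIDDEN-dim vector.

    Assumption: a plain linear projection; the abstract does not specify one.
    """
    return [sum(c * W_BOX[i][j] for i, c in enumerate(bbox)) for j in range(HIDDEN)]


def interleave(segments):
    """Build one sequence: a single layout embedding, then that box's text tokens."""
    sequence = []
    for bbox, token_ids in segments:
        sequence.append(project_box(bbox))                # one "token" per bounding box
        sequence.extend(TOK_TABLE[t] for t in token_ids)  # text token embeddings
    return sequence


# Two OCR segments with normalized boxes and toy token ids.
segments = [
    ((0.10, 0.05, 0.40, 0.08), [11, 12]),
    ((0.55, 0.05, 0.90, 0.08), [13, 14, 15]),
]
seq = interleave(segments)
print(len(seq), len(seq[0]))  # 7 8
```

Each box adds exactly one position to the sequence, so two boxes and five text tokens yield seven positions. Writing the four coordinates out as text would instead cost several tokens per box.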

