Understanding HTML with Large Language Models

Keywords: Benchmark, Natural language understanding
DOI: 10.48550/arxiv.2210.03945 Publication Date: 2022-01-01
ABSTRACT
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192x less data than the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl.
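To make the Semantic Classification task concrete, the sketch below shows one plausible way to serialize a raw HTML element together with its surrounding snippet into a plain-text sequence that a fine-tuned encoder-decoder LLM (such as T5) could classify. The serialization format, the `<target>` marker, and the example category are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: preparing an HTML element for LLM-based semantic
# classification. The prompt format and marker tokens below are assumptions
# for illustration; the paper's actual preprocessing may differ.

def serialize_element(element_html: str, context_html: str) -> str:
    """Concatenate the target element with its local HTML context,
    wrapping the element in a marker so the model knows which node
    to classify."""
    return (
        f"classify element: <target>{element_html}</target> "
        f"context: {context_html}"
    )

# Example: a username field inside a login form. A fine-tuned model would
# decode a semantic category (e.g. "username") for this input sequence.
snippet = serialize_element(
    '<input type="text" id="uname">',
    '<form><label for="uname">Username</label>'
    '<input type="text" id="uname"></form>',
)
print(snippet)
```

A label such as the adjacent `<label>` text gives the model the natural-language signal that pretraining on ordinary text corpora makes it well-suited to exploit.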