Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

DOI: 10.48550/arxiv.2309.08963 Publication Date: 2023-09
ABSTRACT
Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, and Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting format-specific instructions from the intended outputs to populate this benchmark. Addressing a gap in task-centered evaluation, we propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), to more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains, outshining its LLM counterparts across most measures. An in-depth error analysis and an ability map across six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future enhancement and suggest forthcoming research trajectories. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
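This page reproduces only the abstract, so the formal definitions of P-Score and H-Score are not given here. As a rough, hypothetical sketch of what a heuristic, structure-aware comparison in the spirit of H-Score could look like, the Python snippet below scores a generated pipe-delimited text table against a reference by combining a shape penalty with cell-level string similarity. The function names, table format, and weighting are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual metrics.

```python
from difflib import SequenceMatcher

def parse_table(text: str) -> list[list[str]]:
    """Split a pipe-delimited text table into rows of stripped cells."""
    return [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
        if line.strip()
    ]

def h_score_table(generated: str, reference: str) -> float:
    """Hypothetical heuristic score (NOT the paper's definition):
    average cell-level string similarity, scaled by a row-count penalty."""
    gen, ref = parse_table(generated), parse_table(reference)
    if not ref:
        return 0.0
    # Structural penalty: how closely do the row counts match?
    shape = min(len(gen), len(ref)) / max(len(gen), len(ref), 1)
    sims = []
    for g_row, r_row in zip(gen, ref):
        for g_cell, r_cell in zip(g_row, r_row):
            sims.append(SequenceMatcher(None, g_cell, r_cell).ratio())
        # Missing or extra cells in this row count as zero similarity.
        sims.extend([0.0] * abs(len(g_row) - len(r_row)))
    content = sum(sims) / len(sims) if sims else 0.0
    return shape * content

reference = "| name | score |\n| Alice | 42 |"
generated = "| name | score |\n| Alice | 41 |"
print(f"{h_score_table(generated, reference):.3f}")  # 0.875: one cell differs
```

Under this toy scoring, 1.0 would mean identical shape and cell contents, while mismatched rows or cells pull the score toward 0.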