An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios

Keywords: Robustness, Empirical Research
DOI: 10.48550/arXiv.2402.01723 · Publication Date: 2024-01-26
ABSTRACT
Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Furthermore, looking ahead, one of the key future applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study on LLMs in the Chinese industrial context. We manually collected 1,200 domain-specific problems from 8 different industrial sectors to evaluate LLM accuracy. We further designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, totaling 13,631 question variants, to evaluate LLM robustness. In total, we evaluated 9 LLMs developed by Chinese vendors, as well as LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all models scoring less than 0.6. (2) The robustness scores vary across sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities: global LLMs are more robust under logical-related variants, while local LLMs are more advanced in problems related to understanding Chinese industrial terminology. Our results provide valuable guidance for promoting LLMs' industrial domain capabilities from both development and enterprise perspectives, and further motivate possible research directions and tooling support.
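The metamorphic testing idea described above can be sketched in a few lines: apply a semantics-preserving transformation to a question and check that the model's answer is unchanged. This is a minimal illustration, not the paper's framework; the `answer` stub stands in for an LLM API call, and the irrelevant-sentence transformation is one assumed example of a stability-category variant.

```python
# Minimal metamorphic-testing sketch. `answer` is a toy stand-in for an
# LLM call; the variant generator is an illustrative assumption, not the
# paper's exact transformation set.

def answer(question: str) -> str:
    """Toy model: returns a canned answer keyed on the question's content."""
    return "4" if "2 + 2" in question else "unknown"

def add_irrelevant_sentence(question: str) -> str:
    """Variant: prepend a distractor sentence that must not change the answer."""
    return "The factory operates in three shifts. " + question

def is_robust(question: str, make_variant) -> bool:
    """Metamorphic relation: original and variant must yield the same answer."""
    return answer(question) == answer(make_variant(question))

base = "What is 2 + 2?"
print(is_robust(base, add_irrelevant_sentence))  # True for the toy model
```

A real evaluation would repeat this check over each question and each variant generator, and aggregate the pass rate per sector and per ability into a robustness score.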