Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study
Preprint
DOI: 10.2196/58329
Publication Date: 2024-09-24T13:31:56Z
AUTHORS (10)
ABSTRACT
Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.
Objective: This study aimed to develop and validate an evaluation framework for assessing the applicability of LLM-generated emergency department (ED) records, aiming to enhance artificial intelligence integration in health care.
Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.
Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.
Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.
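The two statistics named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis code: the abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form ICC(2,1) is an assumption here, and the function names and the small score table are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated record and one column
    per evaluator. The choice of ICC(2,1) is an assumption of this sketch.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: records, evaluators, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-record mean square
    msc = ss_cols / (k - 1)              # between-evaluator mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Hypothetical Likert scores: 5 records (rows) rated by 4 evaluators (columns).
    scores = [
        [4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [3, 3, 3, 2],
        [1, 2, 2, 1],
    ]
    # Hypothetical error counts for the same 5 records.
    error_counts = [2, 6, 1, 4, 9]
    mean_scores = [mean(row) for row in scores]

    print("ICC(2,1):", round(icc2_1(scores), 3))
    print("Pearson r (errors vs mean score):",
          round(pearson_r(error_counts, mean_scores), 3))
```

In this setup, a negative Pearson r between error counts and mean scores would correspond to the pattern the abstract reports, where records with more errors receive lower clinical ratings.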