Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

Preprint

DOI: 10.2196/58329

Publication Date: 2024-09-24T13:31:56Z

AUTHORS (10)

Junhyuk Seo

Dasol Choi

Taerim Kim

Won Chul Cha

Minha Kim

Haanju Yoo

Namkee Oh

YongJin Yi

Kye Hwa Lee

Edward Choi

ABSTRACT

Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, aiming to enhance artificial intelligence integration in health care documentation.

Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.

Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.

Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.
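The two statistics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which ICC form the authors used, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is an assumption here. The function names and the layout of the ratings table are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sd_x * sd_y)


def icc2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1), assumed form: two-way random effects, absolute agreement.

    ratings[i][j] is the score that rater j gave to record i.
    """
    n = len(ratings)     # number of rated records
    k = len(ratings[0])  # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 records and raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denom
```

In this sketch, `pearson_r` would take one error count and one clinical score per record, and `icc2_1` would take one row per record with one column per evaluator. A different ICC form (for example, consistency rather than absolute agreement) would change the denominator, so the values are only comparable to the reported ones if the forms match.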
