Sample Size in Natural Language Processing within Healthcare Research
DOI:
10.48550/arxiv.2309.02237
Publication Date:
2023-01-01
AUTHORS (7)
ABSTRACT
Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, a common diagnosis code in the database. Simulations were performed using various classifiers with different sample sizes and class proportions. This was repeated with a comparatively less common diagnosis code within the database, diabetes mellitus without mention of complication. Smaller sample sizes resulted in better results with the K-nearest neighbours classifier, whereas larger sample sizes provided better results with support vector machines and BERT models. Overall, more than 1000 samples were needed to provide decent performance metrics. The simulations conducted in this study provide guidelines for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual data. The methodology used here can be modified for sample size estimates and calculations with other datasets.
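The simulation approach described in the abstract can be illustrated with a minimal sketch: train a classifier on progressively larger samples and measure held-out performance to see how sample size affects the metrics. Everything below is a hypothetical stand-in, not the paper's method: the synthetic "clinical note" generator, vocabulary lists, and the simple nearest-centroid bag-of-words classifier are all invented for illustration (the study itself used MIMIC-III documents with KNN, SVM, and BERT models).

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical vocabularies: a few class-specific tokens plus shared filler words.
POS_WORDS = ["hypertension", "bp", "elevated", "pressure"]
NEG_WORDS = ["normal", "stable", "routine", "clear"]
SHARED = ["patient", "exam", "note", "history", "care"]

def make_doc(label):
    """Generate a synthetic note: mostly shared words, a few class-specific words."""
    words = random.choices(SHARED, k=8)
    words += random.choices(POS_WORDS if label else NEG_WORDS, k=2)
    random.shuffle(words)
    return words, label

def centroid(docs):
    """Average bag-of-words counts over all documents of one class."""
    total = Counter()
    for words, _ in docs:
        total.update(words)
    n = max(len(docs), 1)
    return {w: c / n for w, c in total.items()}

def predict(words, cen_pos, cen_neg):
    """Nearest-centroid decision via dot product with the document's word counts."""
    counts = Counter(words)
    score_pos = sum(c * cen_pos.get(w, 0.0) for w, c in counts.items())
    score_neg = sum(c * cen_neg.get(w, 0.0) for w, c in counts.items())
    return 1 if score_pos >= score_neg else 0

# Fixed held-out test set with balanced classes.
test_set = [make_doc(i % 2) for i in range(400)]

results = {}
for n in (50, 200, 1000):  # candidate training sample sizes
    train = [make_doc(i % 2) for i in range(n)]
    cen_pos = centroid([d for d in train if d[1] == 1])
    cen_neg = centroid([d for d in train if d[1] == 0])
    correct = sum(predict(w, cen_pos, cen_neg) == y for w, y in test_set)
    results[n] = correct / len(test_set)
    print(f"n={n:5d}  accuracy={results[n]:.3f}")
```

The same loop structure generalizes to real corpora: swap in actual documents, a real classifier, and class-proportion sweeps to build learning curves like those used in the paper's simulations.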