Sample Size in Natural Language Processing within Healthcare Research
DOI:
10.48550/arxiv.2309.02237
Publication Date:
2023-01-01
AUTHORS (7)
ABSTRACT
Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, a common diagnosis code in the database. Simulations were performed using various classifiers with different sample sizes and class proportions. This was repeated with a comparatively less common diagnosis code within the database, diabetes mellitus without mention of complication. Smaller sample sizes resulted in better results with the K-nearest neighbours classifier, whereas larger sample sizes provided better results with support vector machines and BERT models. Overall, more than 1000 samples were needed to provide decent performance metrics. The simulations conducted in this study provide guidelines for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual data. The methodology used here can be modified for sample size estimates and calculations with other datasets.
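The simulation approach described in the abstract can be illustrated with a minimal sketch: train a classifier on progressively larger samples and measure held-out performance to see how sample size affects the metrics. Everything below is a hypothetical stand-in, not the paper's method: the synthetic "clinical note" generator, vocabulary lists, and the simple nearest-centroid bag-of-words classifier are all invented for illustration (the study itself used MIMIC-III documents with KNN, SVM, and BERT models).

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical vocabularies: a few class-specific tokens plus shared filler words.
POS_WORDS = ["hypertension", "bp", "elevated", "pressure"]
NEG_WORDS = ["normal", "stable", "routine", "clear"]
SHARED = ["patient", "exam", "note", "history", "care"]

def make_doc(label):
    """Generate a synthetic note: mostly shared words, a few class-specific words."""
    words = random.choices(SHARED, k=8)
    words += random.choices(POS_WORDS if label else NEG_WORDS, k=2)
    random.shuffle(words)
    return words, label

def centroid(docs):
    """Average bag-of-words counts over all documents of one class."""
    total = Counter()
    for words, _ in docs:
        total.update(words)
    n = max(len(docs), 1)
    return {w: c / n for w, c in total.items()}

def predict(words, cen_pos, cen_neg):
    """Nearest-centroid decision via dot product with the document's word counts."""
    counts = Counter(words)
    score_pos = sum(c * cen_pos.get(w, 0.0) for w, c in counts.items())
    score_neg = sum(c * cen_neg.get(w, 0.0) for w, c in counts.items())
    return 1 if score_pos >= score_neg else 0

# Fixed held-out test set with balanced classes.
test_set = [make_doc(i % 2) for i in range(400)]

results = {}
for n in (50, 200, 1000):  # candidate training sample sizes
    train = [make_doc(i % 2) for i in range(n)]
    cen_pos = centroid([d for d in train if d[1] == 1])
    cen_neg = centroid([d for d in train if d[1] == 0])
    correct = sum(predict(w, cen_pos, cen_neg) == y for w, y in test_set)
    results[n] = correct / len(test_set)
    print(f"n={n:5d}  accuracy={results[n]:.3f}")
```

The same loop structure generalizes to real corpora: swap in actual documents, a real classifier, and class-proportion sweeps to build learning curves like those used in the paper's simulations.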