Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections
Optical character recognition
Component (thermodynamics)
DOI: 10.3897/rio.6.e55789
Publication Date: 2020-07-03T14:30:49Z
AUTHORS (8)
ABSTRACT
We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen, including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some state-of-the-art technologies. Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing, as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images. Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed significant advances in this area. Google's Cloud Vision, which is based on deep learning and trained on large-scale datasets, is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text. Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library, to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html). We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
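The pipeline shape described in the abstract (segment the specimen image, send only text-bearing regions to OCR, then run NER on the recognised text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the `Region` and `LabelRecord` structures, the `run_pipeline` function, and the toy stand-in stages are all hypothetical. In a real workflow the `ocr` stage might wrap Tesseract 4.0.0 or Google's Cloud Vision, and the `ner` stage an existing NER tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Region:
    """One segmented part of a specimen image (hypothetical structure)."""
    kind: str      # e.g. "label", "barcode", "specimen"
    payload: str   # stand-in for pixel data in this sketch


@dataclass
class LabelRecord:
    """Recognised text for one label plus the named entities found in it."""
    text: str
    entities: Dict[str, List[str]] = field(default_factory=dict)


def run_pipeline(
    image: str,
    segment: Callable[[str], List[Region]],
    ocr: Callable[[Region], str],
    ner: Callable[[str], Dict[str, List[str]]],
    text_kinds: Sequence[str] = ("label",),
) -> List[LabelRecord]:
    """Segment first, then OCR only the text-bearing regions, then run NER."""
    records = []
    for region in segment(image):
        if region.kind not in text_kinds:
            continue  # skip non-textual regions that could cause false positives
        text = ocr(region)
        records.append(LabelRecord(text=text, entities=ner(text)))
    return records


# Toy stand-ins so the sketch runs without any external tools.
def toy_segment(image: str) -> List[Region]:
    # Assumed toy encoding: "kind:payload|kind:payload"
    return [Region(*part.split(":", 1)) for part in image.split("|")]


def toy_ocr(region: Region) -> str:
    return region.payload.strip()


def toy_ner(text: str) -> Dict[str, List[str]]:
    # Tiny illustrative gazetteer; a real NER model would replace this lookup.
    gazetteer = {"Kenya": "LOCATION", "J. Smith": "PERSON"}
    found: Dict[str, List[str]] = {}
    for name, label in gazetteer.items():
        if name in text:
            found.setdefault(label, []).append(name)
    return found


if __name__ == "__main__":
    image = "specimen:leafpixels|label: Coll. J. Smith, Kenya, 1932 |barcode:0101"
    for record in run_pipeline(image, toy_segment, toy_ocr, toy_ner):
        print(record.text, record.entities)
```

Passing the stages in as callables keeps each component replaceable, which matches the abstract's framing of per-component recommendations; the toy stages exist only to make the control flow visible.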
REFERENCES (51)
CITATIONS (5)