BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Keywords: Biomedical text mining, Named Entity Recognition, Relationship extraction, Text corpus, Representation, F1 score
DOI: 10.1093/bioinformatics/btz682 | Publication date: 2019-09-05
ABSTRACT
Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT at https://github.com/dmis-lab/biobert.
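
The abstract notes that BioBERT keeps almost the same architecture as BERT across tasks and only swaps in a task-specific output layer during fine-tuning. As a minimal sketch of that idea (not code from the paper, which distributes TensorFlow weights via the GitHub links above), the snippet below loads a BioBERT checkpoint for token classification, i.e. biomedical NER. The Hugging Face Transformers library and the checkpoint id "dmis-lab/biobert-base-cased-v1.1" are assumptions, not part of the original release described here.

# Minimal sketch: fine-tuning setup for biomedical NER with a BioBERT checkpoint.
# Assumptions: Hugging Face Transformers + PyTorch are installed, and the
# community-hosted checkpoint "dmis-lab/biobert-base-cased-v1.1" is used in
# place of the original TensorFlow weights from github.com/naver/biobert-pretrained.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels depends on the NER tag set of the target corpus (e.g. B/I/O tags);
# 3 is just an illustrative value.
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=3)

text = "Mutations in the BRCA1 gene increase the risk of breast cancer."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, sequence_length, num_labels)
predicted_tags = logits.argmax(dim=-1)  # one predicted tag id per subword token
print(predicted_tags)

The untrained classification head here would of course need task-specific fine-tuning on an annotated NER corpus before the predictions are meaningful; the point is only that the BERT encoder itself is reused unchanged across tasks.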