Jouni Luoma

ORCID: 0000-0001-9286-1868
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Biomedical Text Mining and Ontologies
  • Text Readability and Simplification
  • Web Data Mining and Analysis
  • Artificial Intelligence in Healthcare
  • Bioinformatics and Genomic Networks
  • Computational Drug Discovery Methods
  • Data Quality and Management
  • Forest Management and Policy
  • Speech Recognition and Synthesis
  • Genomics and Phylogenetic Studies
  • Radiomics and Machine Learning in Medical Imaging
  • Semantic Web and Ontologies
  • Medical Imaging Techniques and Applications
  • Medical Image Segmentation Techniques

University of Turku
2020-2024

Deep learning-based language models pretrained on large unannotated text corpora have been demonstrated to allow efficient transfer learning for natural language processing, with recent approaches such as the transformer-based BERT model advancing the state of the art across a variety of tasks. While most work on these models has focused on high-resource languages, in particular English, a number of efforts have introduced multilingual models that can be fine-tuned to address tasks in different languages. However, we still lack a thorough...

10.48550/arxiv.1912.07076 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Named entity recognition (NER) is frequently addressed as a sequence classification task with each input consisting of one sentence of text. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships and represent inputs consisting of several sentences. This creates opportunities for adding cross-sentence information to natural language processing tasks. This paper presents a systematic study exploring the use of cross-sentence information for NER using five...

10.18653/v1/2020.coling-main.78 article EN Proceedings of the 28th International Conference on Computational Linguistics 2020-01-01
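The cross-sentence setup studied in the paper above can be illustrated with a small sketch: each target sentence is packed together with neighboring sentences into one model input, within a fixed token budget. This is a hypothetical simplification — the function name, the symmetric expansion strategy, and the whitespace token count are illustrative assumptions, not the paper's implementation (which uses BERT's subword tokenizer).

```python
def build_context_window(sentences, target_idx, max_tokens=128):
    """Greedily pack the target sentence plus alternating left/right
    neighbor sentences into one input string, within a whitespace-token
    budget, to give a tagger cross-sentence context."""
    chosen = {target_idx}
    budget = max_tokens - len(sentences[target_idx].split())
    left, right = target_idx - 1, target_idx + 1
    while left >= 0 or right < len(sentences):
        progressed = False
        for idx in (left, right):
            if 0 <= idx < len(sentences) and idx not in chosen:
                cost = len(sentences[idx].split())
                if cost <= budget:
                    chosen.add(idx)
                    budget -= cost
                    progressed = True
        if not progressed:
            break  # no neighbor fits the remaining budget
        left, right = left - 1, right + 1
    # Return the window in document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

For example, with a budget of 4 whitespace tokens, the target sentence `"two three"` picks up both of its one-token neighbors.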

Abstract It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug–gene/protein interactions, the challenge is even bigger, considering the scattered sources and types of interactions. However, their systematic, large-scale exploitation is key to developing tools, impacting fields of knowledge as diverse as drug design or metabolic pathway research. Previous efforts on the extraction of interactions from the literature did not...

10.1093/database/baad080 article EN cc-by Database 2023-01-01

In the field of biomedical text mining, the ability to extract relations from literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly those focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing relation extraction (RE) corpora, many of which concentrate...

10.1093/database/baae095 article EN cc-by Database 2024-01-01

Abstract Motivation The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora. Results We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that it makes highly accurate species name recognition possible (F-score = 93.1%), both for deep learning and dictionary-based...

10.1093/bioinformatics/btad369 article EN cc-by Bioinformatics 2023-06-01
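The abstract above contrasts deep learning with dictionary-based species name recognition. A minimal dictionary-based tagger can be sketched as a case-insensitive longest-match search over a name list; the function, the overlap rule, and the example names are illustrative assumptions, not the S1000 tagging pipeline.

```python
def dictionary_tag(text, names):
    """Minimal dictionary-based entity tagger: case-insensitive search
    for known names, preferring longer matches, returning
    (start, end, surface form) spans in text order."""
    lowered = text.lower()
    spans = []
    # Try longer names first so "Escherichia coli" beats "Escherichia".
    for name in sorted(names, key=len, reverse=True):
        needle = name.lower()
        start = 0
        while True:
            i = lowered.find(needle, start)
            if i == -1:
                break
            end = i + len(needle)
            # Keep the span only if it does not overlap an accepted one.
            if not any(i < e and s < end for s, e, _ in spans):
                spans.append((i, end, text[i:end]))
            start = end
    return sorted(spans)
```

Real dictionary taggers additionally need tokenization-aware boundaries, orthographic variant expansion, and disambiguation of ambiguous abbreviations, which this sketch omits.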

Abstract Motivation In the field of biomedical text mining, the ability to extract relations from literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly those focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. Results In this work we present RegulaTome, a corpus that overcomes the limitations of several existing relation extraction (RE) corpora, many...

10.1101/2024.04.30.591824 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2024-05-02

Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

10.18653/v1/2023.emnlp-main.164 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Abstract We present a Finnish web corpus with multiple text sources and rich additional annotations. The corpus is based in large part on a dedicated Internet crawl, supplemented with data from the Common Crawl initiative and Wikipedia. It has a size of 6.2 billion tokens from 9.5 million source documents. The corpus is enriched with morphological analyses, word lemmas, dependency trees, named entities and register (genre) identification. Paragraph-level scores from an n-gram language model, as well as the paragraph duplication rate for each document, are...

10.21203/rs.3.rs-3138153/v1 preprint EN cc-by Research Square (Research Square) 2023-07-14
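The per-document paragraph duplication rate mentioned above can be sketched as the fraction of a document's paragraphs whose (normalized) text has already been seen earlier in the collection. The hashing, normalization, and ordering choices below are illustrative assumptions, not the corpus pipeline's exact method.

```python
import hashlib

def duplication_rates(documents):
    """For each document (a list of paragraph strings), compute the
    fraction of its paragraphs already seen earlier in the collection.
    Paragraphs are normalized (collapsed whitespace, lowercased) and
    tracked via MD5 hashes to keep the seen-set small."""
    seen = set()
    rates = []
    for paragraphs in documents:
        duplicates = 0
        for paragraph in paragraphs:
            normalized = " ".join(paragraph.split()).lower()
            digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                duplicates += 1
            else:
                seen.add(digest)
        rates.append(duplicates / len(paragraphs) if paragraphs else 0.0)
    return rates
```

Such scores are typically used as quality signals: documents consisting mostly of boilerplate repeated across the crawl get high duplication rates and can be down-weighted or filtered.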

Named entity recognition (NER) is frequently addressed as a sequence classification task where each input consists of one sentence of text. It is nevertheless clear that useful information for the task can often be found outside the scope of a single-sentence context. Recently proposed self-attention models such as BERT can both efficiently capture long-distance relationships in input as well as represent inputs consisting of several sentences, creating new opportunities for approaches that incorporate cross-sentence information into natural language...

10.48550/arxiv.2006.01563 preprint EN cc-by arXiv (Cornell University) 2020-01-01

Abstract Motivation The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora. Results We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that it makes highly accurate species name recognition possible (F-score = 93.1%), both for deep learning and dictionary-based...

10.1101/2023.02.20.528934 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2023-02-21

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages, and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B...

10.48550/arxiv.2311.05640 preprint EN other-oa arXiv (Cornell University) 2023-01-01