- Natural Language Processing Techniques
- Topic Modeling
- Authorship Attribution and Profiling
- Text Readability and Simplification
- Semantic Web and Ontologies
- Ancient Near East History
- Speech Recognition and Synthesis
- Speech and dialogue systems
- Linguistics and language evolution
- Semigroups and automata theory
- Lexicography and Language Studies
- Ancient Egypt and Archaeology
- Logic, programming, and type systems
- Algorithms and Data Compression
- Text and Document Classification Technologies
- Advanced Text Analysis Techniques
- Interpreting and Communication in Healthcare
- Privacy, Security, and Data Protection
- European Criminal Justice and Data Protection
- Library Science and Information Systems
- Translation Studies and Practices
- Government, Law, and Information Management
- Mathematics, Computing, and Information Processing
- Music and Audio Processing
- Law, AI, and Intellectual Property
University of Helsinki
2015-2024
Leibniz Institute for the German Language
2021-2022
University of Tartu
2020-2022
Aalto University
2022
TeliaSonera (Finland)
2022
University of Turku
2022
Helsinki Art Museum
2016-2021
Institute for Language and Speech Processing
2021
University of Siena
2021
Institute of the Estonian Language
2021

Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships...
The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed...
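As a point of reference for the CER figures above, character error rate is commonly computed as the Levenshtein edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The following is a minimal sketch under that assumption; the function names `levenshtein` and `cer` are illustrative and are not taken from the cited work, which may have measured CER differently.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

For example, `cer("sanomalehti", "sanomalchti")` has one substituted character in an 11-character reference, giving roughly 0.09, which is within the 8 to 13% range quoted above.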
The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the
This article introduces a corpus of cuneiform texts from which the dataset for use in the Cuneiform Language Identification (CLI) 2019 shared task was derived, as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide baseline results using the CLI dataset. To the best of our knowledge, the experiments detailed here represent the first time that automatic language identification methods have been used on cuneiform data.
We discuss part-of-speech (POS) tagging in the presence of large, fine-grained label sets using conditional random fields (CRFs). We propose improving tagging accuracy by utilizing dependencies within sub-components of the fine-grained labels. These sub-label dependencies are incorporated into the CRF model via a (relatively) straightforward feature extraction scheme. Experiments on five languages show that the approach can yield significant improvement in tagging accuracy in case the labels have sufficiently rich inner structure.
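One way to picture a sub-label feature extraction scheme is sketched below. It is an illustration only, not the paper's actual scheme: it assumes compound labels written with a `+` separator (for example `NOUN+SG+INE`), and the helper names `sub_labels` and `expand_features` are hypothetical. Each observation feature is paired with the full label and with each of its sub-labels, so that a model can share statistics across labels that have components in common.

```python
from typing import List, Tuple


def sub_labels(label: str, sep: str = "+") -> List[str]:
    """Split a compound label into its components (assumed format)."""
    return [part for part in label.split(sep) if part]


def expand_features(word_features: List[str], label: str,
                    sep: str = "+") -> List[Tuple[str, str]]:
    """Pair every observation feature with the full label and with
    each sub-label, yielding (feature, label-unit) tuples."""
    # The full label comes first; an atomic label is not repeated.
    units = [label] + [s for s in sub_labels(label, sep) if s != label]
    return [(feat, unit) for unit in units for feat in word_features]
```

With these assumptions, `expand_features(["suffix=ssa"], "NOUN+SG+INE")` yields one pair for the full label and one for each of `NOUN`, `SG`, and `INE`, while an atomic label such as `PUNCT` yields a single pair.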
This paper describes a Kone Foundation funded project called "The Finno-Ugric Languages and The Internet" together with some of the achieved results. The main activity of the project is to crawl the internet and gather texts written in small Uralic languages. The sentences and words found will be assembled into a freely available corpus. Crawling is done using the open source crawler Heritrix, which is developed by the Internet Archive. Heritrix crawls through the pages and passes the texts to a language identifier. We are using a state of the art language identifier, which has been further...
Sentiment analysis and opinion mining are essential tasks with many prominent application areas, e.g., when researching popular opinions on products or brands. Sentiments expressed in social media can be used in brand name monitoring and in indicating fake news. In our survey of previous work, we note that there is no large-scale data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently by three...
FinnWordNet is a WordNet for Finnish that conforms to the framework given in Fellbaum (1998) and Vossen (1998). FinnWordNet is open source and currently contains 117,000 synsets. A classic WordNet consists of synsets, or sets of partial synonyms whose shared meaning is described and exemplified by a gloss, a common part of speech, and a hyperonym. Synsets are arranged in hierarchical orderings according to semantic relations like hyponymy/hyperonymy. Together, the gloss, part of speech, and hyperonym fix the meaning of a word and constrain the possible translations given...
This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers state-of-the-art results on the task of tweet normalization when compared with the recent AliSeTra system introduced by Eger et al. (2016), even though the system presented in the paper is simpler than AliSeTra because it does not include a model for input segmentation. In addition to experiments...