NFDI4DS | UHH-SEMS - Publication Details

Krister Lindén

ORCID: 0000-0003-2337-303X

About

Contact & Profiles

OpenAlex author ID: A5001408607

Research Areas

Natural Language Processing Techniques
Topic Modeling
Authorship Attribution and Profiling
Text Readability and Simplification
Semantic Web and Ontologies
Ancient Near East History
Speech Recognition and Synthesis
Speech and Dialogue Systems
Linguistics and Language Evolution
Semigroups and Automata Theory
Lexicography and Language Studies
Ancient Egypt and Archaeology
Logic, programming, and type systems
Algorithms and Data Compression
Text and Document Classification Technologies
Advanced Text Analysis Techniques
Interpreting and Communication in Healthcare
Privacy, Security, and Data Protection
European Criminal Justice and Data Protection
Library Science and Information Systems
Translation Studies and Practices
Government, Law, and Information Management
Mathematics, Computing, and Information Processing
Music and Audio Processing
Law, AI, and Intellectual Property

University of Helsinki
2015-2024

Leibniz Institute for the German Language
2021-2022

University of Tartu
2020-2022

Aalto University
2022

TeliaSonera (Finland)
2022

University of Turku
2022

Helsinki Art Museum
2016-2021

Institute for Language and Speech Processing
2021

University of Siena
2021

Institute of the Estonian Language
2021

Automatic Language Identification in Texts: A Survey

OPENALEX - Publications

Tommi Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, Krister Lindén

Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships...

10.1613/jair.1.11675 article EN cc-by Journal of Artificial Intelligence Research 2019-08-25

Optical character recognition with neural networks and post-correction with finite state methods

Senka Drobac, Krister Lindén

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed...

10.1007/s10032-020-00359-9 article EN cc-by International Journal on Document Analysis and Recognition (IJDAR) 2020-08-20

A Finnish news corpus for named entity recognition

Teemu Ruokolainen, Pekka Kauppinen, Miikka Silfverberg, Krister Lindén

10.1007/s10579-019-09471-7 article EN Language Resources and Evaluation 2019-08-01

Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks

Anssi Moisio, Dejan Porjazovski, Aku Rouhe, Yaroslav Getman, Anja Virkkunen, and 5 more

The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the...

10.1007/s10579-022-09606-3 article EN cc-by Language Resources and Evaluation 2022-08-09

FinnPos: an open-source morphological tagging and lemmatization toolkit for Finnish

Miikka Silfverberg, Teemu Ruokolainen, Krister Lindén, Mikko Kurimo

10.1007/s10579-015-9326-3 article EN Language Resources and Evaluation 2015-12-14

Language and Dialect Identification of Cuneiform Texts

Tommi Jauhiainen, Heidi Jauhiainen, Tero Alstola, Krister Lindén

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived, as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide some baseline language identification results using the CLI dataset. To the best of our knowledge, the experiments detailed here represent the first time that automatic language identification methods have been used on cuneiform data.

10.18653/v1/w19-1409 article EN 2019-01-01

Is it possible to create a very large wordnet in 100 days? An evaluation

Krister Lindén, Jyrki Niemi

10.1007/s10579-013-9245-0 article EN Language Resources and Evaluation 2013-07-10

Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy

Miikka Silfverberg, Teemu Ruokolainen, Krister Lindén, Mikko Kurimo

We discuss part-of-speech (POS) tagging in presence of large, fine-grained label sets using conditional random fields (CRFs). We propose improving tagging accuracy by utilizing dependencies within sub-components of the fine-grained labels. These sub-label dependencies are incorporated into the CRF model via a (relatively) straightforward feature extraction scheme. Experiments on five languages show that the approach can yield significant improvement in tagging accuracy in case the labels have sufficiently rich inner structure.

10.3115/v1/p14-2043 article EN 2014-01-01

The Finno-Ugric Languages and The Internet Project

Heidi Jauhiainen, Tommi Jauhiainen, Krister Lindén

This paper describes a Kone Foundation funded project called "The Finno-Ugric Languages and The Internet" together with some of the achieved results. The main activity of the project is to crawl the internet and gather texts written in small Uralic languages. The sentences and words found will be assembled into a freely available corpus. Crawling is done using the open source crawler Heritrix, which is developed by the Internet Archive. Heritrix crawls through the pages and passes the texts to a language identifier. We are using a state of the art language identifier, which has been further...

10.7557/5.3471 article EN cc-by Septentrio Conference Series 2015-06-17

FinnSentiment: a Finnish social media corpus for sentiment polarity annotation

Krister Lindén, Tommi Jauhiainen, Sam Hardwick

Sentiment analysis and opinion mining are essential tasks with many prominent application areas, e.g., when researching popular opinions on products or brands. Sentiments expressed in social media can be used in brand name monitoring and indicating fake news. In our survey of previous work, we note that there is no large-scale social media data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently with sentiment polarity by three...

10.1007/s10579-023-09644-5 article EN cc-by Language Resources and Evaluation 2023-03-03

Finn WordNet - WordNet på finska via översättning [FinnWordNet: WordNet in Finnish via translation]

Krister Lindén, Lauri Carlson

FinnWordNet is a WordNet for Finnish that conforms to the framework given in Fellbaum (1998) and Vossen (1998). FinnWordNet is open source and currently contains 117,000 synsets. A classic WordNet consists of synsets, or sets of partial synonyms whose shared meaning is described and exemplified by a gloss, a common part of speech and a hyperonym. Synsets are arranged in hierarchical orderings according to semantic relations like hyponymy/hyperonymy. Together the gloss, part of speech and hyperonym fix the meaning of a word and constrain the possible translations of a word in a given...

10.7146/ln.v0i17.18627 article EN LexicoNordica 2010-01-01

Data-Driven Spelling Correction using Weighted Finite-State Methods

Miikka Silfverberg, Pekka Kauppinen, Krister Lindén

This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers state-of-the-art results on the task of tweet normalization when compared with the recent AliSeTra system introduced by Eger et al. (2016), even though the system presented in the paper is simpler than AliSeTra because it does not include a model for input segmentation. In addition to experiments...

10.18653/v1/w16-2406 article EN 2016-01-01
