Krister Lindén

ORCID: 0000-0003-2337-303X
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Authorship Attribution and Profiling
  • Text Readability and Simplification
  • Semantic Web and Ontologies
  • Ancient Near East History
  • Speech Recognition and Synthesis
  • Speech and dialogue systems
  • Linguistics and language evolution
  • Semigroups and automata theory
  • Lexicography and Language Studies
  • Ancient Egypt and Archaeology
  • Logic, programming, and type systems
  • Algorithms and Data Compression
  • Text and Document Classification Technologies
  • Advanced Text Analysis Techniques
  • Interpreting and Communication in Healthcare
  • Privacy, Security, and Data Protection
  • European Criminal Justice and Data Protection
  • Library Science and Information Systems
  • Translation Studies and Practices
  • Government, Law, and Information Management
  • Mathematics, Computing, and Information Processing
  • Music and Audio Processing
  • Law, AI, and Intellectual Property

University of Helsinki
2015-2024

Leibniz Institute for the German Language
2021-2022

University of Tartu
2020-2022

Aalto University
2022

TeliaSonera (Finland)
2022

University of Turku
2022

Helsinki Art Museum
2016-2021

Institute for Language and Speech Processing
2021

University of Siena
2021

Institute of the Estonian Language
2021


Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships...

10.1613/jair.1.11675 article EN cc-by Journal of Artificial Intelligence Research 2019-08-25

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed...

10.1007/s10032-020-00359-9 article EN cc-by International Journal on Document Analysis and Recognition (IJDAR) 2020-08-20

The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the...

10.1007/s10579-022-09606-3 article EN cc-by Language Resources and Evaluation 2022-08-09

This article introduces a corpus of cuneiform texts from which the dataset for the use of the Cuneiform Language Identification (CLI) 2019 shared task was derived, as well as some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide baseline language identification results using the CLI dataset. To the best of our knowledge, the experiments detailed here represent the first time that automatic language identification methods have been used on cuneiform data.

10.18653/v1/w19-1409 article EN 2019-01-01

10.1007/s10579-013-9245-0 article EN Language Resources and Evaluation 2013-07-10

We discuss part-of-speech (POS) tagging in the presence of large, fine-grained label sets using conditional random fields (CRFs). We propose improving tagging accuracy by utilizing dependencies within sub-components of the fine-grained labels. These sub-label dependencies are incorporated into the CRF model via a (relatively) straightforward feature extraction scheme. Experiments on five languages show that the approach can yield significant improvement in tagging accuracy in case the labels have sufficiently rich inner structure.

10.3115/v1/p14-2043 article EN 2014-01-01

This paper describes a Kone Foundation funded project called "The Finno-Ugric Languages and The Internet" together with some of the achieved results. The main activity of the project is to crawl the internet and gather texts written in small Uralic languages. The sentences and words found will be assembled into a freely available corpus. Crawling is done using the open source crawler Heritrix, which is developed by the Internet Archive. Heritrix crawls through the pages and passes the texts to a language identifier. We are using a state of the art language identifier, which has been further...

10.7557/5.3471 article EN cc-by Septentrio Conference Series 2015-06-17

Sentiment analysis and opinion mining are essential tasks with many prominent application areas, e.g., when researching popular opinions on products or brands. Sentiments expressed in social media can be used for brand name monitoring and for indicating fake news. In our survey of previous work, we note that there is no large-scale data set with sentiment polarity annotations for Finnish. This publication aims to remedy this shortcoming by introducing a 27,000-sentence data set annotated independently by three...

10.1007/s10579-023-09644-5 article EN cc-by Language Resources and Evaluation 2023-03-03

FinnWordNet is a WordNet for Finnish that conforms to the framework given in Fellbaum (1998) and Vossen (1998). FinnWordNet is open source and currently contains 117,000 synsets. A classic WordNet consists of synsets, or sets of partial synonyms whose shared meaning is described and exemplified by a gloss, a common part of speech and a hyperonym. Synsets are arranged in hierarchical orderings according to semantic relations like hyponymy/hyperonymy. Together the gloss, part of speech and hyperonym fix the meaning of a word and constrain the possible translations of a given...

10.7146/ln.v0i17.18627 article EN LexicoNordica 2010-01-01

This paper presents two systems for spelling correction formulated as a sequence labeling task. One of the systems is an unstructured classifier and the other one is structured. Both systems are implemented using weighted finite-state methods. The structured system delivers state-of-the-art results on the task of tweet normalization when compared with the recent AliSeTra system introduced by Eger et al. (2016), even though the system presented in the paper is simpler than AliSeTra because it does not include a model for input segmentation. In addition to experiments...

10.18653/v1/w16-2406 article EN 2016-01-01