- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Speech and Dialogue Systems
- Authorship Attribution and Profiling
- Web Data Mining and Analysis
- Discourse Analysis in Language Studies
- Advanced Text Analysis Techniques
- Advanced Database Systems and Queries
- Language, Metaphor, and Cognition
- Lexicography and Language Studies
- Sentiment Analysis and Opinion Mining
- Advanced Computational Techniques and Applications
- EFL/ESL Teaching and Learning
- Digital Communication and Language
University of Turku
2021-2025
University of Oulu
2020-2025
The pervasiveness of the internet has given web language use a central role in society. However, the lack of multilingual corpora and scalable methods has led to a focus on English in research. To address this gap, the present paper sets itself in the register research tradition and explores French and Swedish registers from a cross-linguistic angle. Methodologically, we combine keyword analysis with deep learning, suggesting an approach that enables computational comparisons across languages. Specifically, we extract...
This article introduces the Finnish Corpus of Online Registers (FinCORE), representing the full range of registers – situationally defined text varieties such as news and blogs – on the Finnish Internet. The extreme range of language use found online has challenged the study of registers. It has been unclear what registers the entire Internet includes, and if they can be sufficiently defined to allow for their analysis or classification, with previous studies focusing on restricted sets of registers or on English. FinCORE features 10,754 texts from the unrestricted web, manually...
Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. 2021.
In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all the pairs in our corpus, 98% are classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.
A register, defined as a text variety with specific situational characteristics and communicative purpose (Biber & Conrad 2019), is also recognized as a cultural construct (Egbert 2023). Registers merit thorough investigation due to their pivotal role in reflecting linguistic landscapes. However, existing studies predominantly focus on Indo-European languages. This study investigates Turkish web registers through the introduction of the Turkish Corpus of Online Registers (TurCORE). Comprising 2,780 texts,...
In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on the methodology that allows the extraction of challenging examples of paraphrase pairs in their textual context, leading to a dataset potentially more suitable for evaluating the models' ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first...
We present a Finnish web corpus with multiple text sources and rich additional annotations. The corpus is based in large parts on a dedicated Internet crawl, supplementing data from the Common Crawl initiative and the Finnish Wikipedia. The size of the corpus is 6.2 billion tokens from 9.5 million source documents. The text is enriched with morphological analyses, word lemmas, dependency trees, named entities and text register (genre) identification. Paragraph-level scores of an n-gram language model, as well as the paragraph duplication rate in each document, are...
This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. These guidelines were developed together with the corpus annotation, revising and extending the guidelines regularly during the annotation work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases), while labels 3 and 4 are paraphrases at least in the given context, if not everywhere. In addition to base labeling, the scheme is enriched with additional subcategories (flags) for categorizing different types of paraphrases inside the two positive labels,...
The article examines the development of language proficiency across proficiency levels by means of Potential Occasion Analysis (Thewissen, 2015). Development is analyzed from the perspective of accuracy, and it is measured by the number of forms that deviate from the form and usage conventions of the target language. The study is a corpus-based error analysis (Corpus-aided Error Analysis, Dagneaux, Dennes & Granger, 1998), and it is based on the numbers of errors observed at the proficiency levels and classified into nine error categories...
We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE...