Valtteri Skantsi

ORCID: 0000-0002-1230-9983
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Speech and dialogue systems
  • Authorship Attribution and Profiling
  • Web Data Mining and Analysis
  • Discourse Analysis in Language Studies
  • Advanced Text Analysis Techniques
  • Advanced Database Systems and Queries
  • Language, Metaphor, and Cognition
  • Lexicography and Language Studies
  • Sentiment Analysis and Opinion Mining
  • Advanced Computational Techniques and Applications
  • EFL/ESL Teaching and Learning
  • Digital Communication and Language

University of Turku
2021-2025

University of Oulu
2020-2025

The pervasiveness of the internet has given web language use a central role in society. However, the lack of multilingual corpora and scalable methods has led to a focus on English in research. To address this gap, the present paper situates itself in the register research tradition and explores French and Swedish web registers from a cross-linguistic angle. Methodologically, we combine keyword analysis with deep learning, suggesting an approach that enables computational comparisons across languages. Specifically, we extract...

10.1515/cllt-2024-0070 article EN Corpus Linguistics and Linguistic Theory 2025-01-13

This article introduces the Finnish Corpus of Online Registers (FinCORE), representing the full range of registers, that is, situationally defined text varieties such as news and blogs, on the Internet. The extreme range of language use found online has challenged the study of registers. It has been unclear what registers the entire Internet includes and whether they can be defined sufficiently to allow for their analysis or classification, with previous studies focusing on restricted sets of registers or on English. FinCORE features 10,754 texts from the unrestricted web, manually...

10.1017/s0332586523000021 article EN cc-by Nordic Journal of Linguistics 2023-03-13

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. 2021.

10.18653/v1/2021.eacl-srw.24 article EN cc-by 2021-01-01

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all paraphrase pairs in our corpus, 98% are manually classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility in high quality paraphrase selection in terms of both cost and quality.

10.48550/arxiv.2103.13103 preprint EN cc-by-sa arXiv (Cornell University) 2021-01-01

A register, defined as a text variety with specific situational characteristics and communicative purpose (Biber & Conrad 2019), is also recognized as a cultural construct (Egbert 2023). Registers merit thorough investigation due to their pivotal role in reflecting linguistic landscapes. However, existing studies predominantly focus on Indo-European languages. This study investigates Turkish web registers through the introduction of the Turkish Corpus of Online Registers (TurCORE). Comprising 2,780 texts,...

10.1075/rs.24002.ert article EN Register Studies 2024-12-17

In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on the methodology that allows the extraction of challenging examples of paraphrase pairs in their textual context, leading to a dataset potentially more suitable for evaluating the models’ ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first...

10.1017/s1351324923000086 article EN cc-by Natural Language Engineering 2023-03-16

We present a Finnish web corpus with multiple text sources and rich additional annotations. The corpus is based in large parts on a dedicated Internet crawl, supplementing data from the Common Crawl initiative and Wikipedia. The size of the corpus is 6.2 billion tokens from 9.5 million source documents. The corpus is enriched with morphological analyses, word lemmas, dependency trees, named entities and register (genre) identification. Paragraph-level scores of an n-gram language model, as well as the paragraph duplication rate in each document, are...

10.21203/rs.3.rs-3138153/v1 preprint EN cc-by Research Square (Research Square) 2023-07-14

This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. These guidelines were developed together with the corpus annotation, revising and extending the guidelines regularly during the annotation work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases), while labels 3 and 4 are paraphrases at least in the given context, if not everywhere. In addition to base labeling, the scheme is enriched with additional subcategories (flags) for categorizing different types of paraphrases inside the two positive labels,...

10.48550/arxiv.2108.07499 preprint EN cc-by arXiv (Cornell University) 2021-01-01

This article examines the development of language proficiency across proficiency levels using Potential Occasion Analysis (Thewissen, 2015). Development is analyzed from the perspective of accuracy, which is measured by the number of forms that deviate from the form and usage conventions of the target language. The study is a corpus-based error analysis (Corpus-aided Error Analysis; Dagneaux, Dennes & Granger, 1998), and it is based on the numbers of errors, classified into nine error categories, observed at the proficiency levels...

10.23997/pk.76601 article FI Puhe ja kieli 2020-01-10

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state-of-the-art in English and Finnish. Specifically, we show 1) zero-shot transfer from the large CORE...

10.48550/arxiv.2102.07396 preprint EN other-oa arXiv (Cornell University) 2021-01-01