NFDI4DS | UHH-SEMS - Publication Details

Valtteri Skantsi

ORCID: 0000-0002-1230-9983

About

Contact & Profiles

OpenAlex ID: A5066222419

Research Areas

Natural Language Processing Techniques
Topic Modeling
Text Readability and Simplification
Speech and dialogue systems
Authorship Attribution and Profiling
Web Data Mining and Analysis
Discourse Analysis in Language Studies
Advanced Text Analysis Techniques
Advanced Database Systems and Queries
Language, Metaphor, and Cognition
Lexicography and Language Studies
Sentiment Analysis and Opinion Mining
Advanced Computational Techniques and Applications
EFL/ESL Teaching and Learning
Digital Communication and Language

University of Turku
2021-2025

University of Oulu
2020-2025

From keywords to key embeddings – contrasting French and Swedish web registers using multilingual deep learning

OPENALEX - Publications

Saara Hellström, Valtteri Skantsi, Anna Salmela, Veronika Laippala

Abstract: The pervasiveness of the internet has given web language use a central role in society. However, the lack of multilingual corpora and scalable methods has led to a focus on English in research. To address this gap, the present paper sets itself in the register research tradition and explores French and Swedish web registers from a cross-linguistic angle. Methodologically, we combine keyword analysis with deep learning, suggesting an approach that enables computational comparisons across languages. Specifically, we extract...

10.1515/cllt-2024-0070 article EN Corpus Linguistics and Linguistic Theory 2025-01-13

Analyzing the unrestricted web: The Finnish Corpus of Online Registers

Valtteri Skantsi, Veronika Laippala

Abstract: This article introduces the Finnish Corpus of Online Registers (FinCORE), representing the full range of registers (situationally defined text varieties such as news and blogs) on the Internet. The extreme range of language use found online has challenged the study of registers. It has been unclear what registers the entire Internet includes and if they can be sufficiently well defined to allow for their analysis or classification, with previous studies focusing on restricted sets of registers or on English. FinCORE features 10,754 texts from the unrestricted web, manually...

10.1017/s0332586523000021 article EN cc-by Nordic Journal of Linguistics 2023-03-13

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, and 5 more

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. 2021.

10.18653/v1/2021.eacl-srw.24 article EN cc-by 2021-01-01

Finnish Paraphrase Corpus

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, and 5 more

In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Out of all pairs in our corpus, 98% are classified to be paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality selection in terms of both cost and quality.

10.48550/arxiv.2103.13103 preprint EN cc-by-sa arXiv (Cornell University) 2021-01-01

Linguistic variation beyond the Indo-European web

Selcen Erten-Johansson, Valtteri Skantsi, Sampo Pyysalo, Veronika Laippala

Abstract: A register, defined as a text variety with specific situational characteristics and a communicative purpose (Biber & Conrad 2019), is also recognized as a cultural construct (Egbert 2023). Registers merit thorough investigation due to their pivotal role in reflecting linguistic landscapes. However, existing studies predominantly focus on Indo-European languages. This study investigates Turkish web registers through the introduction of the Turkish Corpus of Online Registers (TurCORE). Comprising 2,780 texts,...

10.1075/rs.24002.ert article EN Register Studies 2024-12-17

Towards diverse and contextually anchored paraphrase modeling: A dataset and baselines for Finnish

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, and 6 more

Abstract: In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on the methodology that allows the extraction of challenging examples of paraphrase pairs in their textual context, leading to a dataset potentially more suitable for evaluating the models' ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first...

10.1017/s1351324923000086 article EN cc-by Natural Language Engineering 2023-03-16

Finnish Internet Parsebank

Juhani Luotolahti, Jenna Kanerva, Jouni Luoma, Valtteri Skantsi, Sampo Pyysalo, and 2 more

Abstract: We present a Finnish web corpus with multiple text sources and rich additional annotations. The corpus is based in large parts on a dedicated Internet crawl, supplementing data from the Common Crawl initiative and Wikipedia. The size of the corpus is 6.2 billion tokens from 9.5 million source documents. The text is enriched with morphological analyses, word lemmas, dependency trees, named entities, and register (genre) identification. Paragraph-level scores of an n-gram language model, as well as the paragraph duplication rate in each document, are...

10.21203/rs.3.rs-3138153/v1 preprint EN cc-by Research Square (Research Square) 2023-07-14

Annotation Guidelines for the Turku Paraphrase Corpus

Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Iiro Rastas, Valtteri Skantsi, and 6 more

This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. These guidelines were developed together with the corpus annotation, revising and extending them regularly during the work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are for negative candidates (not paraphrases), while labels 3 and 4 are paraphrases at least in the given context, if not everywhere. In addition to base labeling, the scheme is enriched with additional subcategories (flags) categorizing different types of paraphrases inside the two positive labels,...

10.48550/arxiv.2108.07499 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Korpusavusteinen virheanalyysi tarkkuuden kehityksestä EVK:n taitotasoilla A2–B2 [Corpus-aided error analysis of the development of accuracy at CEFR proficiency levels A2–B2]

Sisko Brunni, Jarmo Harri Jantunen, Valtteri Skantsi

The article examines how language proficiency develops across proficiency levels by means of Potential Occasion Analysis (Thewissen, 2015). Development is analyzed from the perspective of accuracy, which is measured by the number of forms that deviate from the form and usage conventions of the target language. The study is a Corpus-aided Error Analysis (Dagneaux, Dennes & Granger, 1998), and it is based on the numbers of errors observed at the proficiency levels, falling into nine error categories...

10.23997/pk.76601 article FI Puhe ja kieli 2020-01-10

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, and 5 more

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) zero-shot transfer from the large CORE...

10.48550/arxiv.2102.07396 preprint EN other-oa arXiv (Cornell University) 2021-01-01
