- Natural Language Processing Techniques
- Topic Modeling
- Phonetics and Phonology Research
- Speech Recognition and Synthesis
- Text Readability and Simplification
- Multimodal Machine Learning Applications
- Speech and Audio Processing
- Speech and dialogue systems
- Sentiment Analysis and Opinion Mining
- Authorship Attribution and Profiling
- Education and Teacher Training
- Ethics and bioethics in healthcare
- Higher Education Teaching and Evaluation
- Hate Speech and Cyberbullying Detection
- Ultrasonics and Acoustic Wave Propagation
- Computational and Text Analysis Methods
- Music and Audio Processing
- Comparative International Legal Studies
- Children's Physical and Motor Development
- Neurobiology of Language and Bilingualism
- Educational Practices and Policies
- Ergonomics and Musculoskeletal Disorders
- Early Childhood Education and Development
- Domain Adaptation and Few-Shot Learning
- Misinformation and Its Impacts
University of Groningen
2020-2025
Association for Computational Linguistics
2024
The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic...
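As an illustration of how such a monolingual model is typically used, the sketch below loads a Dutch BERT checkpoint for token classification via Hugging Face transformers; the checkpoint name and label count are assumptions, not part of the abstract.

```python
# Minimal sketch: load a monolingual Dutch BERT for token classification (e.g. POS
# tagging). The checkpoint name "GroNLP/bert-base-dutch-cased" and num_labels=17
# (Universal POS tags) are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=17)

# The classification head is freshly initialised and would still need fine-tuning
# on labelled Dutch data; this only shows the forward pass.
inputs = tokenizer("Groningen ligt in het noorden van Nederland.", return_tensors="pt")
logits = model(**inputs).logits  # (1, sequence_length, num_labels)
```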
Peeking into the inner workings of BERT has shown that its layers resemble the classical NLP pipeline, with progressively more complex tasks being concentrated in later layers. To investigate to what extent these results also hold for a language other than English, we probe a Dutch BERT-based model and the multilingual BERT model on Dutch NLP tasks. In addition, through a deeper analysis of part-of-speech tagging, we show that, within a given task, information is spread over different parts of the network and the pipeline might not be as neat as it seems....
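A minimal layer-wise probing setup, in the spirit of the analysis described above but not the paper's exact protocol, could look as follows; the example sentences and labels are invented.

```python
# Illustrative layer-wise probing: collect the hidden state of every layer for one
# token position and fit a separate linear probe per layer, so per-layer accuracy on
# a task such as POS tagging can be compared across the network.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def layer_features(sentence, token_index):
    """One feature vector per layer (embeddings + Transformer layers) for a token."""
    with torch.no_grad():
        out = model(**tokenizer(sentence, return_tensors="pt"))
    return [h[0, token_index].numpy() for h in out.hidden_states]

# Invented probing data: (sentence, token position, POS label).
examples = [("De kat slaapt .", 1, "NOUN"), ("Wij slapen nu .", 2, "VERB"),
            ("De hond blaft .", 1, "NOUN"), ("Zij blaffen hard .", 2, "VERB")]
features = [layer_features(s, i) for s, i, _ in examples]
labels = [lab for _, _, lab in examples]
probes = [LogisticRegression(max_iter=1000).fit(layer, labels)
          for layer in zip(*features)]  # one probe per layer
```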
Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability use datasets with English training data, and test data in a selection of target languages. We explore a more extensive setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training on both the source and target language, as well as matching language families, writing systems, word...
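The zero-shot setup described here boils down to tagging target-language text with a model that only ever saw source-language labels; a minimal sketch, with a placeholder checkpoint path standing in for any POS-fine-tuned multilingual model, is shown below.

```python
# Conceptual zero-shot cross-lingual POS tagging: a multilingual model fine-tuned on
# source-language POS data is applied unchanged to target-language sentences.
# "path/to/pos-model" is a placeholder; substitute any token-classification checkpoint
# fine-tuned for POS tagging.
from transformers import pipeline

tagger = pipeline("token-classification", model="path/to/pos-model")

# No target-language labels are used for training; predictions would be scored
# against a held-out annotated test set in the target language.
for token in tagger("Dit is een zin in de doeltaal."):
    print(token["word"], token["entity"])
```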
Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error prone. As an alternative, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with earlier studies, we evaluate how well these differences match...
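One common way to turn such frame-level embeddings into a pronunciation distance is to align two recordings of the same word with dynamic time warping; the sketch below assumes a publicly available English wav2vec 2.0 checkpoint rather than the paper's exact models.

```python
# Hedged sketch: frame-level wav2vec 2.0 hidden states for two recordings of the same
# word are aligned with dynamic time warping (DTW); the normalised alignment cost is
# used as a word-based pronunciation difference.
import numpy as np
import torch
from scipy.spatial.distance import cdist
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-base-960h"  # assumed checkpoint choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name)

def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Frame-level hidden states (frames x dims) for a mono 16 kHz waveform."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        return model(inputs.input_values).last_hidden_state[0].numpy()

def pronunciation_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain DTW over cosine frame distances, normalised by the two lengths."""
    cost = cdist(a, b, metric="cosine")
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[len(a), len(b)] / (len(a) + len(b))
```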
Automatic Speech Recognition (ASR) performance for low-resource languages is still far behind that of higher-resource languages such as English, due to a lack of sufficient labeled data. State-of-the-art methods deploy self-supervised transfer learning, in which a model pre-trained on large amounts of unlabeled data is fine-tuned using little labeled data in the target language. In this paper, we present and examine a method for fine-tuning an SSL-based model in order to improve performance for Frisian and its regional dialects (Clay Frisian, Wood Frisian, and South Frisian). We show that ASR...
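A standard version of that fine-tuning recipe attaches a fresh CTC head to a multilingual pre-trained wav2vec 2.0 model and trains it on a small amount of transcribed speech; the checkpoint name, vocabulary size, and dummy batch below are assumptions.

```python
# Minimal sketch of SSL-based ASR fine-tuning: a self-supervised wav2vec 2.0 model gets
# a new CTC output layer sized to a (hypothetical) target-language character vocabulary
# and is updated on transcribed speech. Shown with one step on dummy data.
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # assumed multilingual pre-trained checkpoint
    vocab_size=35,                      # assumed: target-language character set size
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice: keep the convolutional encoder frozen

audio = torch.randn(1, 16000)                # one second of dummy 16 kHz audio
labels = torch.tensor([[4, 8, 15, 16, 23]])  # dummy character IDs for the transcript
loss = model(input_values=audio, labels=labels).loss
loss.backward()  # a real setup would wrap this in an optimiser loop over the corpus
```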
Large generative language models have been very successful for English, but other languages lag behind, in part due to data and computational limitations. We propose a method that may overcome these problems by adapting existing pre-trained models to new languages. Specifically, we describe the adaptation of English GPT-2 to Italian and Dutch by retraining lexical embeddings without tuning the Transformer layers. As a result, we obtain lexical embeddings that are aligned with the original English embeddings. Additionally, we scale up complexity by transforming...
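The core of the adaptation can be expressed in a few lines: freeze the Transformer blocks of GPT-2 and leave only the (tied) lexical embeddings trainable. The sketch below is illustrative and skips the new-language tokenizer and training loop.

```python
# Sketch of lexical-embedding-only adaptation: all GPT-2 parameters are frozen except
# the token embedding matrix; because the LM head is tied to it, relearning the
# embeddings against a new-language tokenizer also adapts the output layer.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

for param in model.parameters():
    param.requires_grad = False
for param in model.transformer.wte.parameters():
    param.requires_grad = True  # only the lexical embeddings are updated

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters")
```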
For many (minority) languages, the resources needed to train large models are not available. We investigate the performance of zero-shot transfer learning with as little data as possible, and the influence of language similarity in this process. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language. By combining the new lexical layers and fine-tuned Transformer layers, we achieve high performance for both target languages. With high language similarity, 10MB...
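The recombination step this abstract describes amounts to plugging a retrained lexical layer into a tagger whose Transformer layers were fine-tuned in the source language; a rough sketch with placeholder checkpoint paths (both hypothetical) is given below.

```python
# Illustrative recombination of independently trained parts: copy the word-embedding
# matrix retrained on the target variety into a model fine-tuned for POS tagging in the
# source language. Both paths are placeholders; the models must share architecture,
# hidden size, and vocabulary size for this to work.
from transformers import AutoModel, AutoModelForTokenClassification

pos_tagger = AutoModelForTokenClassification.from_pretrained("path/to/source-pos-tagger")
target_lexical = AutoModel.from_pretrained("path/to/target-adapted-encoder")

pos_tagger.get_input_embeddings().weight.data.copy_(
    target_lexical.get_input_embeddings().weight.data
)
# The resulting model can now tag target-variety text zero-shot.
```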
The COVID-19 pandemic has witnessed the implementation of exceptional measures by governments across the world to counteract its impact. This work presents the initial results of an on-going project, EXCEPTIUS, aiming to automatically identify, classify and compare exceptional measures across 32 countries in Europe. To this goal, we created a corpus of legal documents with sentence-level annotations of eight different classes of measures that are implemented in these countries. We evaluated multiple multi-label classifiers on the manually...
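As a toy illustration of the multi-label formulation (each sentence may describe several measure types at once), a simple one-vs-rest classifier can be set up as follows; the sentences and class names are invented, not taken from the EXCEPTIUS corpus.

```python
# Toy multi-label sentence classification: one binary classifier per measure class,
# trained on TF-IDF features. Data and class names below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

sentences = [
    "Schools shall remain closed until further notice.",
    "Gatherings of more than ten people are prohibited and schools are closed.",
    "Travellers entering the country must self-isolate for ten days.",
]
labels = [{"school_closure"}, {"school_closure", "gathering_ban"}, {"quarantine"}]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

classifier = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
classifier.fit(sentences, y)

prediction = classifier.predict(["All schools are to be shut next week."])[0]
print([cls for cls, on in zip(binarizer.classes_, prediction) if on])
```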
This paper contributes to ongoing scholarly debates on the merits and limitations of computational legal text analysis by reflecting on the results of a research project documenting exceptional COVID-19 management measures in Europe. The variety of measures adopted by countries characterized by different legal systems and natural languages, as well as the rapid evolution of such measures, pose considerable challenges for the manual textual analysis methods traditionally used in the social sciences. To address these challenges, we develop a supervised...
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total of nine tasks includes four tasks that were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of language models to a strong baseline and can be referred to in the future, even when assessing different sets of models. Through a comparison of 14 pre-trained language models (mono- and multi-lingual, of varying sizes),...
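A plausible reading of the RER metric is the average reduction in error relative to the baseline's error on each task; the small sketch below follows that reading with invented scores and may differ from DUMB's exact definition in detail.

```python
# Sketch of a Relative Error Reduction (RER) computation: per task, convert scores to
# errors, measure the reduction relative to the baseline error, and average over tasks.
# Scores are invented; positive RER means the model beats the baseline on average.
def relative_error_reduction(model_scores, baseline_scores):
    reductions = []
    for task, score in model_scores.items():
        baseline_error = 1.0 - baseline_scores[task]
        model_error = 1.0 - score
        reductions.append((baseline_error - model_error) / baseline_error)
    return sum(reductions) / len(reductions)

baseline = {"pos": 0.95, "ner": 0.88, "sentiment": 0.93}   # hypothetical baseline scores
candidate = {"pos": 0.96, "ner": 0.90, "sentiment": 0.92}  # hypothetical model scores
print(f"RER: {relative_error_reduction(candidate, baseline):+.2%}")
```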
Acoustic-to-articulatory inversion (AAI) is the process of inferring vocal tract movements from acoustic speech signals. Despite its diverse potential applications, AAI research in languages other than English is scarce due to the challenges of collecting articulatory data. In recent years, self-supervised learning (SSL) based representations have shown great potential for addressing low-resource tasks. We utilize wav2vec 2.0 representations and English articulatory data to train AAI systems and investigate their effectiveness on a different language:...
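A bare-bones version of an SSL-based AAI system is a regression head on top of frame-level wav2vec 2.0 features; the sketch below uses dummy audio and an assumed number of articulatory channels.

```python
# Hedged AAI sketch: map frame-level wav2vec 2.0 representations to articulator
# positions (e.g. EMA sensor coordinates) with a small regression head. The checkpoint
# and the 12 output channels (6 sensors x 2 coordinates) are assumptions.
import torch
from torch import nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
regressor = nn.Linear(encoder.config.hidden_size, 12)

audio = torch.randn(1, 16000)  # one second of dummy 16 kHz speech
with torch.no_grad():
    frames = encoder(audio).last_hidden_state  # (1, num_frames, hidden_size)
trajectories = regressor(frames)               # (1, num_frames, 12) articulatory targets
# Training would minimise e.g. MSE between predicted and measured trajectories.
```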