- Natural Language Processing Techniques
- Topic Modeling
- Authorship Attribution and Profiling
- Advanced Text Analysis Techniques
- Speech Recognition and Synthesis
- Speech and Dialogue Systems
- Language and Cultural Evolution
- Biomedical Text Mining and Ontologies
- Education, Psychology, and Social Research
- Semantic Web and Ontologies
- Text Readability and Simplification
- Text and Document Classification Technologies
- Software Engineering Research
- Sentiment Analysis and Opinion Mining
- Linguistics, Language Diversity, and Identity
- Web Data Mining and Analysis
- Discourse Analysis and Cultural Communication
- Advanced Database Systems and Queries
- Advanced Proteomics Techniques and Applications
- Expert Finding and Q&A Systems
- Mental Health via Writing
- Educational Technology and Assessment
- Spam and Phishing Detection
- Computational and Text Analysis Methods
- Theology and Canon Law Studies
University of West Bohemia
2015-2023
Pilsen Tools (Czechia)
2023
In this paper, we describe our method for the detection of lexical semantic change, i.e., of word sense changes over time. We examine differences in the use of specific words in two corpora chosen from different time periods, for English, German, Latin, and Swedish. Our method was created for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. It ranked 1st in Sub-task 1, binary change detection, and 4th in Sub-task 2. We present a method which is completely unsupervised and language independent. It consists of preparing a vector...
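The abstract truncates before the vector-preparation details, but the comparison it describes can be illustrated with a generic sketch, assuming word vectors trained separately on each time-period corpus and already aligned to a shared space; the words and vectors below are invented for illustration, not the paper's actual pipeline:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical aligned embeddings of the same word in two periods.
plane_1800s = [0.90, 0.10, 0.05]   # "flat surface" sense dominates
plane_1990s = [0.10, 0.15, 0.95]   # "aircraft" sense dominates

# A change score in [0, 2]: higher means more semantic change.
change = 1 - cosine(plane_1800s, plane_1990s)
```

Ranking all target words by such a score gives the graded output; thresholding it gives the binary changed/unchanged decision.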
This paper describes the training process of the first Czech monolingual language representation models, based on the BERT and ALBERT architectures. We pre-train our models on more than 340K sentences, which is 50 times more than the multilingual models that include Czech data. We outperform the multilingual models on 9 out of 11 datasets. In addition, we establish new state-of-the-art results on nine datasets. Finally, we discuss the properties of the models based upon the results. We publish all pre-trained and fine-tuned models freely for the research community.
Zdeněk Žabokrtský, Miloslav Konopik, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondrej Prazak, Jakub Sido, Daniel Zeman. Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution. 2023.
We introduce a system focused on solving SemEval 2016 Task 2 ‐ Interpretable Semantic Textual Similarity. The system explores machine learning and rule-based approaches to the task. We focus on experimenting with a wide variety of algorithms as well as with several types of features. The core of our system consists in exploiting distributional semantics to compare the similarity of sentence chunks. We won the competition in the “Gold standard chunk scenario”. We have not participated in the “System...
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was the main evaluation metric. There were 8 prediction systems submitted by 5...
In this paper, we present coreference resolution experiments with a newly created multilingual corpus, CorefUD (Nedoluzhko et al., 2021). We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan. In addition to monolingual experiments, we combine the training data to train two joined models: one for the Slavic languages and one for all the languages together. We rely on an end-to-end deep learning model that we slightly adapted for the corpus. Our results show that we can profit from harmonized annotations, and that using the joined models helps...
In this paper, we introduce a cross-lingual Semantic Role Labeling (SRL) system with language-independent features based upon Universal Dependencies. We propose two methods to convert SRL annotations from monolingual dependency trees into universal dependency trees. Our approach is based on supervised learning and utilizes a maximum entropy classifier. We design experiments to verify whether Universal Dependencies are suitable for SRL. The results are very promising, and they open new and interesting research paths for the future.
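The maximum entropy classifier named in the abstract is multinomial logistic regression over hand-crafted features. A minimal from-scratch sketch follows; the two-feature toy task and its labels are placeholders for illustration, not the paper's actual SRL feature set:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class MaxEnt:
    """Multinomial logistic regression trained with plain SGD."""

    def __init__(self, n_features, n_classes, lr=0.5, epochs=300):
        self.w = [[0.0] * n_features for _ in range(n_classes)]
        self.lr, self.epochs = lr, epochs

    def _scores(self, x):
        return [sum(wi * xi for wi, xi in zip(w, x)) for w in self.w]

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, label in zip(X, y):
                probs = softmax(self._scores(x))
                for c, w in enumerate(self.w):
                    # Gradient of log-likelihood: (indicator - probability).
                    grad = (1.0 if c == label else 0.0) - probs[c]
                    for j, xj in enumerate(x):
                        w[j] += self.lr * grad * xj

    def predict(self, x):
        scores = self._scores(x)
        return scores.index(max(scores))

# Toy binary role decision driven by two placeholder features.
X = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
y = [0, 0, 1, 1]
clf = MaxEnt(n_features=2, n_classes=2)
clf.fit(X, y)
```

In a real SRL setting the feature vector would encode dependency-path and lexical indicators extracted from the universal trees.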
This paper describes our approach to the CRAC 2022 Shared Task on Multilingual Coreference Resolution. Our model is based on a state-of-the-art end-to-end coreference resolution system. Apart from joined multilingual training, we improved our results with mention head prediction. We also tried to integrate dependency information into the model. Our system ended up in 3rd place. Moreover, it reached the best performance on two datasets out of 13.
Coreference resolution, the task of identifying expressions in text that refer to the same entity, is a critical component of various natural language processing (NLP) applications. This paper presents our end-to-end neural coreference resolution system, which utilizes the CorefUD 1.1 dataset, spanning 17 datasets across 12 languages. Our model builds on an end-to-end neural architecture. We first establish strong baseline models, including monolingual and cross-lingual variations, and then propose several extensions to enhance...
The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. Similarly to the previous two editions, participants were challenged to develop systems capable of identifying mentions and clustering them based on identity coreference. This year's edition took another step towards real-world application by not providing participants with gold slots for zero anaphora, increasing the task's complexity and realism. In addition, the task was expanded to include a more diverse...
This paper introduces a Czech dataset for semantic similarity and relatedness. The dataset contains word pairs with hand-annotated scores that indicate the relatedness of the words. The 953 word pairs are compiled from 9 different sources. The dataset includes words in their contexts taken from real text corpora, including extra examples when the words are ambiguous. The dataset is annotated by 5 independent annotators. The average Spearman correlation coefficient of the annotation agreement is r = 0.81. We provide reference evaluation experiments with several methods for computing semantic relatedness.
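The r = 0.81 agreement figure is a Spearman rank correlation between annotator scores. A small self-contained implementation (with average ranks for ties), shown here only to illustrate the measure, not the paper's evaluation code:

```python
import math

def rank(xs):
    """1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1        # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)
```

Two annotators who rank the word pairs identically score 1.0; fully reversed rankings score -1.0.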
This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without the surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each score as an average of 9 individual annotation scores. We evaluate the quality of the dataset by measuring inter...
This paper describes the process of collecting, maintaining, and exploiting an English dataset of web discussions. The dataset consists of many discussions with hand-annotated posts in the context of the tree structure of a page. Each post contains a username, date, text, and the citations used by its author. The dataset covers 79 different websites, with at least 500 pages from each. The page HTML tags and texts are taken from the selected pages. In the paper, we also describe algorithms trained on the dataset. We employ basic architectures (such as a bag-of-words SVM classifier and an LSTM...
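A bag-of-words representation like the one fed to the SVM baseline can be sketched in a few lines; the tokenizer, vocabulary, and example post below are invented for illustration, not taken from the dataset:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocab):
    """Map a post to a fixed-length count vector over `vocab`."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vocab = ["great", "thanks", "click", "free"]
post = "Thanks, great answer - thanks again!"
vector = bag_of_words(post, vocab)  # one count per vocabulary word
```

In the paper's setup such vectors would be the input to an SVM (or, token by token, to an LSTM); only the vectorization step is shown here to keep the sketch dependency-free.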