- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Web Data Mining and Analysis
- Information Retrieval and Search Behavior
- Software Engineering Research
- Semantic Web and Ontologies
- Advanced Graph Neural Networks
- Authorship Attribution and Profiling
- Business Process Modeling and Analysis
- Advanced Data Processing Techniques
- Scientific Computing and Data Management
- Interpreting and Communication in Healthcare
- Speech and Dialogue Systems
- Romani and Gypsy Studies
- Computational Physics and Python Applications
- Neural Networks and Applications
- Software System Performance and Reliability
- Advanced Text Analysis Techniques
- Fault Detection and Control Systems
O2 Czech Republic (Czechia)
2024
Snam (Italy)
2022-2024
Charles University
2018-2022
Center for Applied Linguistics
2019-2021
Grammatical error correction in English is a long studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present AKCES-GEC, a new dataset for grammatical error correction in Czech. We then run experiments on Czech, German and Russian and show that, when utilizing a synthetic parallel corpus, a Transformer neural machine translation model can reach state-of-the-art results on these languages. The dataset is published under the CC BY-NC-SA 4.0 license at...
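As a rough illustration of the synthetic-parallel-corpus idea (GEC framed as translation from noisy to clean text), the sketch below builds source/target files from clean monolingual sentences. The single adjacent-character swap and the file names are placeholders, not the paper's actual noising procedure.

```python
import random

def corrupt(sentence, rng):
    """Introduce a single adjacent-character swap as a toy corruption."""
    chars = list(sentence)
    if len(chars) > 3:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(1)
clean_sentences = ["Dnes svítí slunce.", "Zítra pojedeme na výlet."]

# Write the pair of files a standard NMT toolkit would consume:
# noisy text as the source side, the original clean text as the target side.
with open("synthetic.src", "w", encoding="utf-8") as src, \
     open("synthetic.tgt", "w", encoding="utf-8") as tgt:
    for sentence in clean_sentences:
        src.write(corrupt(sentence, rng) + "\n")
        tgt.write(sentence + "\n")
```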
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC), with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including Transformer-based ones, setting...
Web search engines focus on serving highly relevant results within hundreds of milliseconds. Pre-trained transformer language models such as BERT are therefore hard to use in this scenario due to their high computational demands. We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture. The model is already deployed in a commercial search engine and improves production performance by more than 3%. For further research and evaluation, we release DaReCzech,...
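A minimal sketch of siamese-style relevance scoring: query and document are encoded independently by the same BERT encoder and compared by cosine similarity. The model name, mean pooling, and example texts are illustrative assumptions, not the deployed production setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    # Tokenize a batch and mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

query_vec = embed(["nejlepší recept na svíčkovou"])
doc_vecs = embed(["Recept na svíčkovou omáčku ...", "Jízdní řády autobusů ..."])
scores = torch.nn.functional.cosine_similarity(query_vec, doc_vecs)
print(scores)  # higher score = more relevant document
```

Because the two sides are encoded independently, document vectors can be precomputed offline, which is what makes this family of models feasible under a hundreds-of-milliseconds latency budget.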
Sensitivity of deep neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise have so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems' robustness in...
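A minimal sketch of the noising step, assuming character-level error rates (substitution, deletion, insertion) have already been estimated from an aligned GEC corpus. The probabilities and alphabet below are placeholders, not values from the paper.

```python
import random

ERROR_RATES = {"substitute": 0.02, "delete": 0.01, "insert": 0.01}  # assumed estimates
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž "

def add_noise(text, rates=ERROR_RATES, rng=random.Random(0)):
    out = []
    for ch in text:
        r = rng.random()
        if r < rates["delete"]:
            continue                              # drop the character
        if r < rates["delete"] + rates["substitute"]:
            out.append(rng.choice(ALPHABET))      # replace with a random character
        else:
            out.append(ch)
        if rng.random() < rates["insert"]:
            out.append(rng.choice(ALPHABET))      # spurious extra character
    return "".join(out)

print(add_noise("Tohle je ukázková věta bez chyb."))
```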
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We...
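For illustration of the task framing only: diacritics restoration can be cast as per-character classification, where each undiacritized character receives a label such as KEEP, ADD_ACUTE, ADD_CARON, or ADD_RING. The label inventory and the pair-extraction helper below are assumptions, not the paper's exact architecture.

```python
import unicodedata

LABELS = {"\u0301": "ADD_ACUTE", "\u030c": "ADD_CARON", "\u030a": "ADD_RING"}

def char_labels(diacritized):
    """Derive (stripped_char, label) training pairs from a diacritized string."""
    pairs = []
    for ch in diacritized:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        label = LABELS.get(marks, "KEEP") if marks else "KEEP"
        pairs.append((base, label))
    return pairs

print(char_labels("žlutý kůň"))
# [('z', 'ADD_CARON'), ('l', 'KEEP'), ('u', 'KEEP'), ('t', 'KEEP'), ('y', 'ADD_ACUTE'), ...]
```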
This article focuses on the development and evaluation of small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to...
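A toy distillation sketch, assuming a large "teacher" embedding model and a small "student" trained to reproduce its sentence vectors with an MSE objective. The student architecture, dimensions, and toy inputs are stand-ins; the article's actual setups may differ.

```python
import torch
import torch.nn as nn

TEACHER_DIM, STUDENT_DIM = 768, 256

class TinyStudent(nn.Module):
    def __init__(self, vocab_size=30000, dim=STUDENT_DIM):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)   # bag-of-subwords encoder
        self.project = nn.Linear(dim, TEACHER_DIM)      # map into the teacher's space

    def forward(self, token_ids, offsets):
        return self.project(self.embed(token_ids, offsets))

student = TinyStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

# One toy training step with fake token ids and fake teacher embeddings.
token_ids = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 3])                          # two "sentences"
teacher_vectors = torch.randn(2, TEACHER_DIM)           # would come from the teacher model

loss = mse(student(token_ids, offsets), teacher_vectors)
loss.backward()
optimizer.step()
print(float(loss))
```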
Our submitted models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting subwords, averaging model checkpoints, and using the trained models iteratively for correcting intermediate translations. The system in the Restricted Track is trained on the provided corpora with oversampled "cleaner" sentences and reaches a 59.39 F0.5 score on the test set. The Low-Resource Track system, trained on Wikipedia revision histories, reaches a 44.13 F0.5 score. Finally, we finetune...
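A generic sketch of checkpoint averaging, one of the enhancements listed above: the parameters of the last few saved checkpoints are averaged element-wise before inference. The file names are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Load several state dicts and return their element-wise average."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged

# avg = average_checkpoints(["ckpt_08.pt", "ckpt_09.pt", "ckpt_10.pt"])
# model.load_state_dict(avg)
```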
We propose a character-based non-autoregressive GEC approach with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder systems. We show that word-replacement edits may be suboptimal and lead to an explosion of rules for spelling and diacritization errors in morphologically rich languages, and we propose a method for generating character transformations from a GEC corpus. Finally, we train transformation models for Czech,...
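A rough sketch of deriving a character-level transformation from an aligned (erroneous word, corrected word) pair, in the spirit of replacing whole-word substitution rules with reusable character edits. The edit format shown is illustrative, not the paper's rule inventory.

```python
import difflib

def char_transformation(src, tgt):
    """Return (op, src_span, tgt_span) character edits turning src into tgt."""
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

# "mestske" -> "městské": two small character edits instead of one
# whole-word substitution rule.
print(char_transformation("mestske", "městské"))
# e.g. [('replace', 'e', 'ě'), ('replace', 'e', 'é')]
```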
We present CWRCzech, a Click Web Ranking dataset for Czech: a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from the search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated test set for the relevance task, containing nearly 50k pairs, each annotated by at least 2...
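A hypothetical sketch of deriving weak relevance labels from click-log fields of the kind CWRCzech provides (click, dwell time). The threshold and the three-grade label scheme are assumptions for illustration, not part of the dataset definition.

```python
def weak_label(clicked, dwell_seconds):
    """Map user behavior to a coarse relevance grade: 0 = bad, 1 = fair, 2 = good."""
    if not clicked:
        return 0
    if dwell_seconds is not None and dwell_seconds >= 30:  # assumed dwell threshold
        return 2
    return 1

log_rows = [
    {"doc": "d1", "clicked": True, "dwell": 45.0},
    {"doc": "d2", "clicked": True, "dwell": 4.0},
    {"doc": "d3", "clicked": False, "dwell": None},
]
for row in log_rows:
    print(row["doc"], weak_label(row["clicked"], row["dwell"]))
```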
In this paper, we describe our systems submitted to the Building Educational Applications (BEA) 2019 Shared Task (Bryant et al., 2019). We participated in all three tracks. Our models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting subwords, averaging model checkpoints, and using the trained models iteratively for correcting intermediate translations. The system in the Restricted Track is trained on the provided corpora with...