- Natural Language Processing Techniques
- Topic Modeling
- Text Readability and Simplification
- Web Data Mining and Analysis
- Information Retrieval and Search Behavior
- Software Engineering Research
- Semantic Web and Ontologies
- Advanced Graph Neural Networks
- Authorship Attribution and Profiling
- Business Process Modeling and Analysis
- Advanced Data Processing Techniques
- Scientific Computing and Data Management
- Interpreting and Communication in Healthcare
- Speech and Dialogue Systems
- Romani and Gypsy Studies
- Computational Physics and Python Applications
- Neural Networks and Applications
- Software System Performance and Reliability
- Advanced Text Analysis Techniques
- Fault Detection and Control Systems
O2 Czech Republic (Czechia)
2024
Snam (Italy)
2022-2024
Charles University
2018-2022
Center for Applied Linguistics
2019-2021
Grammatical error correction in English is a long studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present AKCES-GEC, a new dataset for grammatical error correction in Czech. We then run experiments on Czech, German and Russian and show that, when utilizing a synthetic parallel corpus, a Transformer neural machine translation model can reach state-of-the-art results on these languages. The dataset is published under the CC BY-NC-SA 4.0 license at...
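As a rough illustration of the synthetic-parallel-corpus idea (GEC framed as translation from noisy to clean text), the sketch below builds source/target files from clean monolingual sentences. The single adjacent-character swap and the file names are placeholders, not the paper's actual noising procedure.

```python
import random

def corrupt(sentence, rng):
    """Introduce a single adjacent-character swap as a toy corruption."""
    chars = list(sentence)
    if len(chars) > 3:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(1)
clean_sentences = ["Dnes svítí slunce.", "Zítra pojedeme na výlet."]

# Write the pair of files a standard NMT toolkit would consume:
# noisy text as the source side, the original clean text as the target side.
with open("synthetic.src", "w", encoding="utf-8") as src, \
     open("synthetic.tgt", "w", encoding="utf-8") as tgt:
    for sentence in clean_sentences:
        src.write(corrupt(sentence, rng) + "\n")
        tgt.write(sentence + "\n")
```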
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC), with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including Transformer-based ones, setting...
Web search engines focus on serving highly relevant results within hundreds of milliseconds. Pre-trained transformer language models such as BERT are therefore hard to use in this scenario due to their high computational demands. We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture. The model is already deployed in a commercial search engine and improves production performance by more than 3%. For further research and evaluation, we release DaReCzech,...
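A minimal sketch of siamese-style relevance scoring: query and document are encoded independently by the same BERT encoder and compared by cosine similarity. The model name, mean pooling, and example texts are illustrative assumptions, not the deployed production setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    # Tokenize a batch and mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

query_vec = embed(["nejlepší recept na svíčkovou"])
doc_vecs = embed(["Recept na svíčkovou omáčku ...", "Jízdní řády autobusů ..."])
scores = torch.nn.functional.cosine_similarity(query_vec, doc_vecs)
print(scores)  # higher score = more relevant document
```

Because the two sides are encoded independently, document vectors can be precomputed offline, which is what makes this family of models feasible under a hundreds-of-milliseconds latency budget.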
Sensitivity of deep neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise have so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems' robustness in...
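A minimal sketch of the noising step, assuming character-level error rates (substitution, deletion, insertion) have already been estimated from an aligned GEC corpus. The probabilities and alphabet below are placeholders, not values from the paper.

```python
import random

ERROR_RATES = {"substitute": 0.02, "delete": 0.01, "insert": 0.01}  # assumed estimates
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž "

def add_noise(text, rates=ERROR_RATES, rng=random.Random(0)):
    out = []
    for ch in text:
        r = rng.random()
        if r < rates["delete"]:
            continue                              # drop the character
        if r < rates["delete"] + rates["substitute"]:
            out.append(rng.choice(ALPHABET))      # replace with a random character
        else:
            out.append(ch)
        if rng.random() < rates["insert"]:
            out.append(rng.choice(ALPHABET))      # spurious extra character
    return "".join(out)

print(add_noise("Tohle je ukázková věta bez chyb."))
```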
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We...
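For illustration of the task framing only: diacritics restoration can be cast as per-character classification, where each undiacritized character receives a label such as KEEP, ADD_ACUTE, ADD_CARON, or ADD_RING. The label inventory and the pair-extraction helper below are assumptions, not the paper's exact architecture.

```python
import unicodedata

LABELS = {"\u0301": "ADD_ACUTE", "\u030c": "ADD_CARON", "\u030a": "ADD_RING"}

def char_labels(diacritized):
    """Derive (stripped_char, label) training pairs from a diacritized string."""
    pairs = []
    for ch in diacritized:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        label = LABELS.get(marks, "KEEP") if marks else "KEEP"
        pairs.append((base, label))
    return pairs

print(char_labels("žlutý kůň"))
# [('z', 'ADD_CARON'), ('l', 'KEEP'), ('u', 'KEEP'), ('t', 'KEEP'), ('y', 'ADD_ACUTE'), ...]
```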
This article focuses on the development and evaluation of small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to...
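A toy distillation sketch, assuming a large "teacher" embedding model and a small "student" trained to reproduce its sentence vectors with an MSE objective. The student architecture, dimensions, and toy inputs are stand-ins; the article's actual setups may differ.

```python
import torch
import torch.nn as nn

TEACHER_DIM, STUDENT_DIM = 768, 256

class TinyStudent(nn.Module):
    def __init__(self, vocab_size=30000, dim=STUDENT_DIM):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)   # bag-of-subwords encoder
        self.project = nn.Linear(dim, TEACHER_DIM)      # map into the teacher's space

    def forward(self, token_ids, offsets):
        return self.project(self.embed(token_ids, offsets))

student = TinyStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

# One toy training step with fake token ids and fake teacher embeddings.
token_ids = torch.tensor([3, 17, 42, 7, 99])
offsets = torch.tensor([0, 3])                          # two "sentences"
teacher_vectors = torch.randn(2, TEACHER_DIM)           # would come from the teacher model

loss = mse(student(token_ids, offsets), teacher_vectors)
loss.backward()
optimizer.step()
print(float(loss))
```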
Our submitted models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting subwords, averaging model checkpoints, and using the trained models iteratively for correcting intermediate translations. The system in the Restricted Track is trained on the provided corpora with oversampled "cleaner" sentences and reaches a 59.39 F0.5 score on the test set. The Low-Resource Track system, trained on Wikipedia revision histories, reaches a 44.13 F0.5 score. Finally, we finetune...
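A generic sketch of checkpoint averaging, one of the enhancements listed above: the parameters of the last few saved checkpoints are averaged element-wise before inference. The file names are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Load several state dicts and return their element-wise average."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged

# avg = average_checkpoints(["ckpt_08.pt", "ckpt_09.pt", "ckpt_10.pt"])
# model.load_state_dict(avg)
```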
We propose a character-based non-autoregressive GEC approach with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder systems. We show that word-replacement edits may be suboptimal and lead to an explosion of rules for spelling and diacritization errors in morphologically rich languages, and we propose a method for generating character transformations from a GEC corpus. Finally, we train transformation models for Czech,...
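A rough sketch of deriving a character-level transformation from an aligned (erroneous word, corrected word) pair, in the spirit of replacing whole-word substitution rules with reusable character edits. The edit format shown is illustrative, not the paper's rule inventory.

```python
import difflib

def char_transformation(src, tgt):
    """Return (op, src_span, tgt_span) character edits turning src into tgt."""
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

# "mestske" -> "městské": two small character edits instead of one
# whole-word substitution rule.
print(char_transformation("mestske", "městské"))
# e.g. [('replace', 'e', 'ě'), ('replace', 'e', 'é')]
```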
We present CWRCzech, a Click Web Ranking dataset for Czech: a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from the search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated test set for the relevance task, containing nearly 50k pairs, each annotated by at least 2...
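A hypothetical sketch of deriving weak relevance labels from click-log fields of the kind CWRCzech provides (click, dwell time). The threshold and the three-grade label scheme are assumptions for illustration, not part of the dataset definition.

```python
def weak_label(clicked, dwell_seconds):
    """Map user behavior to a coarse relevance grade: 0 = bad, 1 = fair, 2 = good."""
    if not clicked:
        return 0
    if dwell_seconds is not None and dwell_seconds >= 30:  # assumed dwell threshold
        return 2
    return 1

log_rows = [
    {"doc": "d1", "clicked": True, "dwell": 45.0},
    {"doc": "d2", "clicked": True, "dwell": 4.0},
    {"doc": "d3", "clicked": False, "dwell": None},
]
for row in log_rows:
    print(row["doc"], weak_label(row["clicked"], row["dwell"]))
```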
In this paper, we describe our systems submitted to the Building Educational Applications (BEA) 2019 Shared Task (Bryant et al., 2019). We participated in all three tracks. Our models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting subwords, averaging model checkpoints, and using the trained models iteratively for correcting intermediate translations. The system in the Restricted Track is trained on the provided corpora with...