Jakub Náplava

ORCID: 0000-0003-2259-1377
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Web Data Mining and Analysis
  • Information Retrieval and Search Behavior
  • Software Engineering Research
  • Semantic Web and Ontologies
  • Advanced Graph Neural Networks
  • Authorship Attribution and Profiling
  • Business Process Modeling and Analysis
  • Advanced Data Processing Techniques
  • Scientific Computing and Data Management
  • Interpreting and Communication in Healthcare
  • Speech and dialogue systems
  • Romani and Gypsy Studies
  • Computational Physics and Python Applications
  • Neural Networks and Applications
  • Software System Performance and Reliability
  • Advanced Text Analysis Techniques
  • Fault Detection and Control Systems

O2 Czech Republic (Czechia)
2024

Snam (Italy)
2022-2024

Charles University
2018-2022

Center for Applied Linguistics
2019-2021

Grammatical error correction in English is a long-studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present AKCES-GEC, a new dataset on grammatical error correction for Czech. We then make experiments on Czech, German and Russian and show that when utilizing a synthetic parallel corpus, the Transformer neural machine translation model can reach new state-of-the-art results on these languages. AKCES-GEC is published under the CC BY-NC-SA 4.0 license at...
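The synthetic-parallel-corpus idea can be illustrated with a minimal sketch (my own illustration, not the paper's actual noising pipeline): clean sentences serve as correction targets, and artificially corrupted copies serve as sources.

```python
import random

def corrupt(sentence, p=0.1, rng=None):
    """Inject simple character-level noise (delete, duplicate, swap)
    into a clean sentence to simulate learner errors."""
    rng = rng or random.Random(0)
    chars, out, i = list(sentence), [], 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < p:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                pass  # drop this character
            elif op == "duplicate":
                out.extend([c, c])
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])  # transpose with the next char
                i += 1
            else:
                out.append(c)
        else:
            out.append(c)
        i += 1
    return "".join(out)

def make_synthetic_corpus(clean_sentences, p=0.1):
    """Pair each clean sentence (target) with a corrupted copy (source),
    yielding synthetic training pairs for an error-correction model."""
    rng = random.Random(42)
    return [(corrupt(s, p, rng), s) for s in clean_sentences]
```

A real pipeline would draw error types and rates from observed data rather than the uniform toy operations above.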

10.18653/v1/d19-5545 article EN cc-by 2019-01-01

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC), with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including Transformer-based ones, setting...

10.1162/tacl_a_00470 article EN cc-by Transactions of the Association for Computational Linguistics 2022-01-01

Web search engines focus on serving highly relevant results within hundreds of milliseconds. Pre-trained language transformer models such as BERT are therefore hard to use in this scenario due to their high computational demands. We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture. The model is already deployed in a commercial search engine and it improves production performance by more than 3%. For further research and evaluation, we release DaReCzech,...
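As a rough illustration of why a siamese architecture is fast at query time (my own sketch; the toy `embed` trigram-hash encoder below stands in for the paper's BERT tower and is not the actual model): both towers share one encoder, so document vectors can be precomputed offline and online scoring reduces to a dot product.

```python
import math

def _bucket(ngram, dim):
    # Deterministic hash of a character n-gram into one of `dim` buckets.
    return sum(ord(c) * 31 ** i for i, c in enumerate(ngram)) % dim

def embed(text, dim=32):
    """Toy stand-in for a sentence encoder: hash character trigrams
    into a fixed-size, L2-normalized vector."""
    v = [0.0] * dim
    for i in range(len(text) - 2):
        v[_bucket(text[i:i + 3], dim)] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def rank(query, docs):
    """Siamese setup: one shared encoder embeds query and documents
    independently; at query time scoring is a cheap dot product, so
    document vectors could be precomputed offline."""
    q = embed(query)
    return sorted(docs,
                  key=lambda d: sum(a * b for a, b in zip(q, embed(d))),
                  reverse=True)
```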

10.1609/aaai.v36i11.21502 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28

Sensitivity of deep neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise has so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical-error-correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems' robustness in...
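The core idea of estimating noise statistics from a GEC corpus instead of choosing them arbitrarily can be sketched like this (an illustrative simplification; the paper models error types in far more detail):

```python
def estimate_token_error_rate(parallel):
    """Estimate the probability that a token is erroneous, from
    (noisy, clean) sentence pairs of a GEC corpus. The estimated rate
    can then parameterize an artificial noiser, so the amount of
    injected noise matches what occurs naturally."""
    changed = total = 0
    for noisy, clean in parallel:
        # zip truncates unaligned tails; a real implementation
        # would align tokens first
        for a, b in zip(noisy.split(), clean.split()):
            total += 1
            changed += a != b
    return changed / total if total else 0.0
```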

10.18653/v1/2021.wnut-1.38 preprint EN cc-by 2021-01-01

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or the system's corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We...
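Diacritics restoration can be framed as per-character classification; the sketch below (my own toy, using a 3-character context and count-based statistics in place of the paper's BERT model) shows that framing:

```python
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(text):
    """Remove combining marks: NFD-decompose, drop combining codepoints."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

def train_restorer(corpus):
    """Count, for each (bare character, 3-char bare context), the most
    frequent diacritized form. A contextual model such as BERT predicts
    the same per-character labels, but from a much richer context."""
    counts = defaultdict(Counter)
    for sent in corpus:
        bare = strip_diacritics(sent)
        if len(bare) != len(sent):
            continue  # skip sentences where stripping changes the length
        for i, (b, orig) in enumerate(zip(bare, sent)):
            counts[(b, bare[max(0, i - 1):i + 2])][orig] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def restore(text, model):
    """Relabel each bare character with its most likely diacritized form."""
    return "".join(model.get((ch, text[max(0, i - 1):i + 2]), ch)
                   for i, ch in enumerate(text))
```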

10.14712/00326585.013 article EN The Prague Bulletin of Mathematical Linguistics 2021-04-01

This article focuses on the development and evaluation of small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to...
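Of the approaches mentioned, knowledge distillation has the simplest core objective; a minimal sketch follows (an illustrative MSE variant over embedding vectors, not necessarily the loss used in the article):

```python
def distillation_loss(student_vec, teacher_vec):
    """Mean squared error between the small student's sentence embedding
    and the large teacher's; minimizing it over a corpus pulls the
    student's embedding space toward the teacher's."""
    assert len(student_vec) == len(teacher_vec)
    return sum((s - t) ** 2
               for s, t in zip(student_vec, teacher_vec)) / len(student_vec)
```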

10.1609/aaai.v38i21.30307 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

Our submitted models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting target subwords, averaging model checkpoints, and using the trained model iteratively for correcting the intermediate translations. The system in the Restricted Track is trained on the provided corpora with oversampled “cleaner” sentences and reaches an F0.5 score of 59.39 on the test set. The Low-Resource Track system, trained from Wikipedia revision histories, reaches an F0.5 score of 44.13. Finally, we finetune...
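Two of the enhancements listed above, checkpoint averaging and iterative correction, are simple enough to sketch generically (illustrative toy implementations, not the submission's actual code):

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across saved checkpoints (here,
    dicts mapping parameter names to floats); a cheap ensemble that
    often beats the final checkpoint alone."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

def iterative_correct(text, correct_fn, max_rounds=3):
    """Re-apply a correction model to its own output until it stops
    changing (or a round limit is hit), catching errors the first
    pass missed."""
    for _ in range(max_rounds):
        corrected = correct_fn(text)
        if corrected == text:
            break
        text = corrected
    return text
```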

10.18653/v1/w19-4419 article EN cc-by 2019-01-01

We propose a character-based non-autoregressive GEC approach with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder GEC systems. We show that word replacement edits may be suboptimal and lead to an explosion of rules for spelling and diacritization errors in morphologically rich languages, and we propose a method for generating character transformations from a GEC corpus. Finally, we train character transformation models for Czech,...
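The advantage of character transformations over whole-word replacements can be seen in a small sketch (my own illustration of rule extraction via longest common prefix/suffix; the paper's generation procedure may differ):

```python
def derive_rule(src, tgt):
    """Derive a compact character transformation from a (wrong, correct)
    word pair: strip the longest common prefix and suffix and keep only
    the differing cores. One such rule ("ie" -> "ei") can then cover many
    word forms, unlike one whole-word replacement rule per form."""
    p = 0
    while p < min(len(src), len(tgt)) and src[p] == tgt[p]:
        p += 1
    s = 0
    while s < min(len(src), len(tgt)) - p and src[-1 - s] == tgt[-1 - s]:
        s += 1
    return (src[p:len(src) - s], tgt[p:len(tgt) - s])
```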

10.18653/v1/2021.wnut-1.46 article EN cc-by 2021-01-01

We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking, with user behavior data collected from the search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated test set for the relevance task, containing nearly 50k query-document pairs, each annotated by at least 2...
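To illustrate the kind of weak supervision such click logs enable (my own toy aggregation, not part of the dataset's tooling), click-through rate per query-document pair is one simple relevance signal:

```python
from collections import defaultdict

def ctr_labels(logs):
    """Aggregate (query, doc, clicked) log records into click-through
    rates per query-document pair, a noisy but abundant relevance
    label for training rankers."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc, was_clicked in logs:
        shown[(query, doc)] += 1
        clicked[(query, doc)] += int(was_clicked)
    return {pair: clicked[pair] / shown[pair] for pair in shown}
```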

10.1145/3626772.3657851 preprint EN arXiv (Cornell University) 2024-05-31

We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking, with user behavior data collected from the search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M clicked documents and 10.8M dwell times. In addition, we also publish a manually annotated test set for the relevance task, containing nearly 50k query-document pairs, each annotated by at least 2...

10.1145/3626772.3657851 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

This article focuses on the development and evaluation of small-sized Czech sentence embedding models. Small models are important components for real-time industry applications in resource-constrained environments. Given the limited availability of labeled data, alternative approaches, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning, are investigated. Comprehensive intrinsic and extrinsic analyses are conducted, showcasing the competitive performance of our models compared to...

10.48550/arxiv.2311.13921 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Grammatical error correction in English is a long-studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present AKCES-GEC, a new dataset on grammatical error correction for Czech. We then make experiments on Czech, German and Russian and show that when utilizing a synthetic parallel corpus, the Transformer neural machine translation model can reach new state-of-the-art results on these languages. AKCES-GEC is published under the CC BY-NC-SA 4.0 license at...

10.48550/arxiv.1910.00353 preprint EN other-oa arXiv (Cornell University) 2019-01-01

In this paper, we describe our systems submitted to the Building Educational Applications (BEA) 2019 Shared Task (Bryant et al., 2019). We participated in all three tracks. Our models are NMT systems based on the Transformer model, which we improve by incorporating several enhancements: applying dropout to whole source and target words, weighting target subwords, averaging model checkpoints, and using the trained model iteratively for correcting the intermediate translations. The system in the Restricted Track is trained on the provided corpora with...

10.48550/arxiv.1909.05553 preprint EN other-oa arXiv (Cornell University) 2019-01-01