NFDI4DS | UHH-SEMS - Publication Details

Milan Straka

ORCID: 0000-0003-3295-5576

About

Contact & Profiles

A5060081515

Research Areas

Natural Language Processing Techniques
Topic Modeling
Text Readability and Simplification
Semantic Web and Ontologies
Biomedical Text Mining and Ontologies
Speech and dialogue systems
Translation Studies and Practices
Authorship Attribution and Profiling
linguistics and terminology studies
Linguistics, Language Diversity, and Identity
Lexicography and Language Studies
Algorithms and Data Compression
Speech and Audio Processing
Advanced Text Analysis Techniques
Literature, Language, and Rhetoric Studies
Advanced Data Storage Technologies
Software Engineering Research
Music and Audio Processing
Data Quality and Management
Web Data Mining and Analysis
Mathematics, Computing, and Information Processing
Distributed systems and fault tolerance
Neural Networks and Applications
Advanced Graph Neural Networks
Control and Dynamics of Mobile Robots

Charles University
2013-2024

University of Žilina
2023

Center for Applied Linguistics
2018-2021

University of Copenhagen
2019

Linköping University
2019

University of Oslo
2019

Hebrew University of Jerusalem
2019

Brandeis University
2019

Czech Academy of Sciences, Czech Language Institute
2019

Slovenská Elektrizačná Prenosová Sústava (Slovakia)
2018

Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe

OPENALEX - Publications

Milan Straka, Jana Straková

Many natural language processing tasks, including the most advanced ones, routinely start with several basic steps: tokenization and segmentation, likely also POS tagging and lemmatization, and commonly parsing as well. A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in its latest release, UD 2.0. We present an update to UDPipe, a simple-to-use pipeline processing CoNLL-U version 2.0 files, which performs these tasks for multiple languages without...

10.18653/v1/k17-3009 article EN cc-by 2017-01-01

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, and 57 more

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi...

10.18653/v1/k17-3001 article EN cc-by 2017-01-01

Neural Architectures for Nested NER through Linearization

Jana Straková, Milan Straka, Jan Hajič

We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and also be labeled with more than one label. We encode the nested labels using a linearized scheme. In our first proposed approach, the nested labels are modeled as multilabels corresponding to the Cartesian product of the nested labels in a standard LSTM-CRF architecture. In the second one, nested NER is viewed as a sequence-to-sequence problem, in which the input sequence consists of the tokens and the output sequence of the labels, using hard attention on the word whose label is being...

10.18653/v1/p19-1527 article EN 2019-01-01

75 Languages, 1 Model: Parsing Universal Dependencies Universally

Dan Kondratyuk, Milan Straka

Dan Kondratyuk, Milan Straka. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1279 article EN cc-by 2019-01-01

Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition

Jana Straková, Milan Straka, Jan Hajič

We present two recently released open-source taggers: NameTag is free software for named entity recognition (NER) which achieves state-of-the-art performance on Czech; MorphoDiTa (Morphological Dictionary and Tagger) performs morphological analysis (with lemmatization), morphological generation, tagging and tokenization with state-of-the-art results for Czech and a throughput of around 10-200K words per second. The taggers can be trained for any language for which annotated data exist, but they are specifically designed to be efficient for inflective...

10.3115/v1/p14-5003 article EN 2014-01-01

UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task

Milan Straka

UDPipe is a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We present a prototype for UDPipe 2.0 and evaluate it in the CoNLL 2018 UD Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, which employs three metrics for submission ranking. Out of 26 participants, the prototype placed first in the MLAS ranking, third in the LAS ranking and third in the BLEX ranking. In the extrinsic parser evaluation EPE 2018, the system ranked first in the overall score.

10.18653/v1/k18-2020 article EN cc-by Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies 2018-01-01

MRP 2019: Cross-Framework Meaning Representation Parsing

Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, and 5 more

Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, Zdeňka Urešová. Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning. 2019.

10.18653/v1/k19-2001 article EN cc-by 2019-01-01

CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, and 3 more

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, Slav Petrov. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. 2018.

10.18653/v1/k18-2001 article EN cc-by Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies 2018-01-01

Grammatical Error Correction in Low-Resource Scenarios

Jakub Náplava, Milan Straka

Grammatical error correction in English is a long-studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present a new dataset, AKCES-GEC, for grammatical error correction in Czech. We then run experiments on Czech, German and Russian and show that, when utilizing a synthetic parallel corpus, a Transformer neural machine translation model can reach new state-of-the-art results on these datasets. AKCES-GEC is published under the CC BY-NC-SA 4.0 license at...

10.18653/v1/d19-5545 article EN cc-by 2019-01-01

Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing

Milan Straka, Jana Straková, Jan Hajič

We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization, and dependency parsing. Employing BERT, Flair and ELMo as pretrained embedding inputs in a strong baseline of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task and the overall winner of EPE 2018, we present a one-to-one comparison of the three contextualized word embedding methods, as well as a comparison with word2vec-like pretrained embeddings and with end-to-end character-level word embeddings. We report...

10.48550/arxiv.1908.07448 preprint EN other-oa arXiv (Cornell University) 2019-01-01

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically Rich Languages with BRNNs

Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, Jan Hajič

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, predicting tag subcategories, and using the tagger output as an input to the lemmatizer. We evaluate our model across several languages with complex morphology, where it surpasses state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.

10.18653/v1/d18-1532 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

Neural Networks for Multi-Word Expression Detection

Natalia Klyueva, Antoine Doucet, Milan Straka

In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks, using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages, and it was one of the few systems that could categorize the VMWE type in nearly all languages.

10.18653/v1/w17-1707 preprint EN cc-by 2017-01-01

Czech Grammar Error Correction with a Large and Diverse Corpus

Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting...

10.1162/tacl_a_00470 article EN cc-by Transactions of the Association for Computational Linguistics 2022-01-01

Factors influencing the perceived value of travel time in European urban areas

Ghadir Pourhashem, Christina Georgouli, Eva Malichová, Milan Straka, Tatiana Kováčiková

This research aims at expanding the scope of travel satisfaction by incorporating subjective elements in the evaluation of the worthwhileness of travel time proposed in the H2020 MoTiV project, using a European-wide mobility dataset collected in 2019. Trip characteristics, mood, socio-demographic and travel experience factors, activities and weather were analysed to explore their influence on travellers’ perception of travel time. The analysis was performed separately for five different transport mode categories using Structural Equation...

10.1007/s11116-023-10376-2 article EN cc-by Transportation 2023-02-25

UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging

Milan Straka, Jana Straková, Jan Hajič

We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and the overall winner of the 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features...

10.18653/v1/w19-4212 article EN cc-by 2019-01-01

ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN

David Samuel, Milan Straka

We present PERIN, a novel permutation-invariant approach to sentence-to-graph semantic parsing. PERIN is a versatile, cross-framework and language-independent architecture for universal modeling of semantic structures. Our system participated in the CoNLL 2020 shared task, Cross-Framework Meaning Representation Parsing (MRP 2020), where it was evaluated on five different frameworks (AMR, DRG, EDS, PTG and UCCA) across four languages. PERIN was one of the winners of the shared task. The source code and pretrained models are available at...

10.18653/v1/2020.conll-shared.5 article EN cc-by 2020-01-01
