Milan Straka

ORCID: 0000-0003-3295-5576
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Semantic Web and Ontologies
  • Biomedical Text Mining and Ontologies
  • Speech and dialogue systems
  • Translation Studies and Practices
  • Authorship Attribution and Profiling
  • linguistics and terminology studies
  • Linguistics, Language Diversity, and Identity
  • Lexicography and Language Studies
  • Algorithms and Data Compression
  • Speech and Audio Processing
  • Advanced Text Analysis Techniques
  • Literature, Language, and Rhetoric Studies
  • Advanced Data Storage Technologies
  • Software Engineering Research
  • Music and Audio Processing
  • Data Quality and Management
  • Web Data Mining and Analysis
  • Mathematics, Computing, and Information Processing
  • Distributed systems and fault tolerance
  • Neural Networks and Applications
  • Advanced Graph Neural Networks
  • Control and Dynamics of Mobile Robots

Charles University
2013-2024

University of Žilina
2023

Center for Applied Linguistics
2018-2021

University of Copenhagen
2019

Linköping University
2019

University of Oslo
2019

Hebrew University of Jerusalem
2019

Brandeis University
2019

Czech Academy of Sciences, Czech Language Institute
2019

Slovenská Elektrizačná Prenosová Sústava (Slovakia)
2018

Many natural language processing tasks, including the most advanced ones, routinely start with several basic processing steps – tokenization and segmentation, most likely also POS tagging and lemmatization, and commonly parsing as well. A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in the latest release, UD 2.0. We present an update to UDPipe, a simple-to-use pipeline processing CoNLL-U version 2.0 files, which performs these tasks for multiple languages without...

10.18653/v1/k17-3009 article EN cc-by 2017-01-01
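The abstract above refers to CoNLL-U files, the plain-text format used by Universal Dependencies: one token per line with ten tab-separated columns, `#` comment lines, and a blank line after each sentence. As a minimal sketch of reading that format (this is not UDPipe's own code; the function name `parse_conllu` and the sample sentence are made up for illustration):

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Hypothetical helper for illustration; values are kept as strings.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):      # sentence-level comment, skipped here
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            current.append(dict(zip(FIELDS, columns)))
    if current:                         # input may lack a trailing blank line
        sentences.append(current)
    return sentences


# Made-up three-token sentence in CoNLL-U layout.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
print([(tok["form"], tok["upos"], tok["head"]) for tok in sentences[0]])
```

The sketch keeps multiword-token and empty-node lines (IDs such as `1-2` or `1.1`) as ordinary rows; a fuller reader would treat them separately.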

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajič, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Francis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinková, Jan Hajič jr., Jaroslava Hlaváčová, Václava Kettnerová, Zdeňka Urešová, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi...

10.18653/v1/k17-3001 article EN cc-by 2017-01-01

We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and also be labeled with more than one label. We encode the nested labels using a linearized scheme. In our first proposed approach, the nested labels are modeled as multilabels corresponding to the Cartesian product of the nested labels in a standard LSTM-CRF architecture. In the second one, the nested NER is viewed as a sequence-to-sequence problem, in which the input sequence consists of the tokens and the output sequence of the labels, using hard attention on the word whose label is being...

10.18653/v1/p19-1527 article EN 2019-01-01
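To make the idea of linearizing nested labels concrete, the following is a minimal sketch of one possible encoding, not necessarily the scheme used in the paper: every token receives the BIO tags of all entities that cover it, ordered from the outermost span to the innermost and joined with `|`. The function name `linearize`, the separator, and the example entities are assumptions made for illustration.

```python
from typing import List, Tuple

# An entity is (start, end_exclusive, label) over token indices.
Entity = Tuple[int, int, str]


def linearize(n_tokens: int, entities: List[Entity]) -> List[str]:
    """Return one combined label per token for possibly nested entities.

    Hypothetical encoding: tags of all covering entities, outermost first,
    joined with '|'; tokens outside every entity get 'O'.
    """
    # Longer spans are treated as outer; ties are broken by start and label.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0], e[2]))
    labels = []
    for i in range(n_tokens):
        tags = []
        for start, end, label in ordered:
            if start <= i < end:
                prefix = "B" if i == start else "I"
                tags.append(f"{prefix}-{label}")
        labels.append("|".join(tags) if tags else "O")
    return labels


# Made-up example: a location nested inside an organization name.
tokens = ["Bank", "of", "China", "opened"]
entities = [(0, 3, "ORG"), (2, 3, "LOC")]
encoded = linearize(len(tokens), entities)
print(list(zip(tokens, encoded)))
```

Each distinct combined string can then be treated as a single class by a sequence tagger, which is the sense in which nested labels become "multilabels".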

Dan Kondratyuk, Milan Straka. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1279 article EN cc-by 2019-01-01

We present two recently released open-source taggers: NameTag is a free software for named entity recognition (NER) which achieves state-of-the-art performance on Czech; MorphoDiTa (Morphological Dictionary and Tagger) performs morphological analysis (with lemmatization), morphological generation, tagging and tokenization with state-of-the-art results for Czech and a throughput around 10-200K words per second. The taggers can be trained for any language for which annotated data exist, but they are specifically designed to be efficient for inflective...

10.3115/v1/p14-5003 article EN 2014-01-01

UDPipe is a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We present a prototype for UDPipe 2.0 and evaluate it in the CoNLL 2018 UD Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, which employs three metrics for submission ranking. Out of 26 participants, the prototype placed first in the MLAS ranking, third in the LAS ranking and third in the BLEX ranking. In the extrinsic parser evaluation EPE 2018, the system ranked first in the overall score.

10.18653/v1/k18-2020 article EN cc-by Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies 2018-01-01
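One of the ranking metrics named above, LAS (labeled attachment score), can be illustrated with a small sketch. It assumes gold and predicted analyses share the same tokenization, in which case LAS is the fraction of tokens whose head and dependency relation are both correct, and UAS is the fraction whose head is correct. The shared task itself reports LAS as an F1 score over aligned words because system tokenization may differ from the gold one; MLAS and BLEX add further conditions that are not shown here. The function name and the example arcs are hypothetical.

```python
from typing import List, Tuple

# One dependency arc per token: (head index, dependency relation).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two analyses over identical tokenization."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts correct heads; LAS also requires the correct relation.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    labeled_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), labeled_hits / len(gold)


# Made-up three-token sentence: the last relation label is predicted wrongly.
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "punct")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```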

Stephan Oepen, Omri Abend, Jan Hajič, Daniel Hershcovich, Marco Kuhlmann, Tim O'Gorman, Nianwen Xue, Jayeol Chun, Milan Straka, Zdeňka Urešová. Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning. 2019.

10.18653/v1/k19-2001 article EN cc-by 2019-01-01

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, Slav Petrov. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. 2018.

10.18653/v1/k18-2001 article EN cc-by Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies 2018-01-01

Grammatical error correction in English is a long-studied problem with many existing systems and datasets. However, there has been only limited research on error correction of other languages. In this paper, we present a new dataset, AKCES-GEC, on grammatical error correction for Czech. We then make experiments on Czech, German and Russian and show that when utilizing a synthetic parallel corpus, a Transformer neural machine translation model can reach new state-of-the-art results on these datasets. AKCES-GEC is published under the CC BY-NC-SA 4.0 license at...

10.18653/v1/d19-5545 article EN cc-by 2019-01-01

We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization, and dependency parsing. Employing BERT, Flair and ELMo as pretrained embedding inputs in a strong baseline of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task and an overall winner of the EPE 2018, we present a one-to-one comparison of the three contextualized word embedding methods, as well as a comparison with word2vec-like pretrained embeddings and with end-to-end character-level word embeddings. We report...

10.48550/arxiv.1908.07448 preprint EN other-oa arXiv (Cornell University) 2019-01-01

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, predicting tag subcategories, and using the tagger output as an input to the lemmatizer. We evaluate our model across several languages with complex morphology, which surpasses state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.

10.18653/v1/d18-1532 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

In this paper we describe the MUMULS system that participated in the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages, and it was one of the few systems that could categorize VMWE type in nearly all languages.

10.18653/v1/w17-1707 preprint EN cc-by 2017-01-01

We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting...

10.1162/tacl_a_00470 article EN cc-by Transactions of the Association for Computational Linguistics 2022-01-01

This research aims at expanding the scope of travel satisfaction by incorporating subjective elements in the evaluation of worthwhileness of travel time proposed in the H2020 MoTiV project, using a European-wide mobility dataset collected in 2019. Trip characteristics, mood, socio-demographic and experience factors, activities and weather were analysed to explore their influence on travellers’ perception of travel time. The analysis was performed separately for five different transport mode categories using Structural Equation...

10.1007/s11116-023-10376-2 article EN cc-by Transportation 2023-02-25

We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and an overall winner of the 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features...

10.18653/v1/w19-4212 article EN cc-by 2019-01-01

We present PERIN, a novel permutation-invariant approach to sentence-to-graph semantic parsing. PERIN is a versatile, cross-framework and language-independent architecture for universal modeling of semantic structures. Our system participated in the CoNLL 2020 shared task, Cross-Framework Meaning Representation Parsing (MRP 2020), where it was evaluated on five different frameworks (AMR, DRG, EDS, PTG and UCCA) across four languages. PERIN was one of the winners of the shared task. The source code and pretrained models are available at...

10.18653/v1/2020.conll-shared.5 article EN cc-by 2020-01-01