- Natural Language Processing Techniques
- Topic Modeling
- Semantic Web and Ontologies
- Authorship Attribution and Profiling
- Hand Gesture Recognition Systems
- Speech Recognition and Synthesis
- Hearing Impairment and Communication
- Linguistic Variation and Morphology
- Linguistics and Terminology Studies
- Spanish Linguistics and Language Studies
- Distributed and Parallel Computing Systems
- Speech and Dialogue Systems
- Microbial Infections and Disease Research
- Scientific Computing and Data Management
- Biomedical Text Mining and Ontologies
- Vector-Borne Animal Diseases
- Lexicography and Language Studies
- Human Pose and Action Recognition
- Multilingual Education and Policy
- Historical Linguistics and Language Studies
- Animal Disease Management and Epidemiology
- Linguistic Education and Pedagogy
- Sentiment Analysis and Opinion Mining
- Second Language Acquisition and Learning
- Hate Speech and Cyberbullying Detection
University of Zurich
2009-2023
University of Crete
2011
Eurospider Information Technology (Switzerland)
2003
Text normalization is the task of mapping non-canonical language, typical of speech transcription and computer-mediated communication, to a standardized writing. It is an up-stream task necessary to enable the subsequent direct employment of standard natural language processing tools, and it is indispensable for languages such as Swiss German, with strong regional variation and no written standard. Text normalization has been addressed with a variety of methods, most successfully with character-level statistical machine translation (CSMT). In the meantime, changed...
Mathias Müller, Malihe Alikhani, Eleftherios Avramidis, Richard Bowden, Annelies Braffort, Necati Cihan Camgöz, Sarah Ebling, Cristina España-Bonet, Anne Göhring, Roman Grundkiewicz, Mert Inan, Zifan Jiang, Oscar Koller, Amit Moryossef, Annette Rios, Dimitar Shterionov, Sandra Sidler-Miserez, Katja Tissi, Davy Van Landuyt. Proceedings of the Eighth Conference on Machine Translation. 2023.
Most treebank work in the past has focused on European and Asian languages. The Wikipedia Treebank page lists treebanks (or treebank projects) for about 20 modern European languages (ranging from Basque to Swedish), five Asian languages (Chinese, Japanese, Hindi, Korean, Thai), two ancient languages (Greek and Latin), plus Arabic and Hebrew. Almost no treebanking has been done on African or American indigenous languages. In the past we have explored parallel treebanks for English, German and Swedish [7]. Now we would like to explore to what extent our tools and guidelines will work when...
In this paper we argue that harmonization is not the preferred way to produce a gold standard in all cases. Neither does a majority-vote-based gold standard provide an appropriate centroid, nor would a mere centroid be a good basis for training a system that reproduces prototypical user reactions given some understanding task. We discuss these claims in the context of sentiment inference.
This article presents a French-German parallel corpus of more than 4 million tokens which we have compiled as part of the digitization of a large multilingual heritage corpus of alpine texts. This corpus is a valuable resource for cultural and cross-linguistic studies as well as for the development of domain-specific machine translation systems. We turned a small fraction of the corpus into a high-quality parallel treebank with manually checked syntactic annotations and cross-language word and phrase alignments. This is the first freely available French-German parallel treebank. It complements other...
Text normalization is the task of mapping noncanonical language, typical of speech transcription and computer-mediated communication, to a standardized writing. This is especially important for languages such as Swiss German, with strong regional variation and no written standard. In this paper, we propose a novel solution for normalizing Swiss German WhatsApp messages using the encoder–decoder neural machine translation (NMT) framework. We enhance the performance of a plain character-level NMT model with the integration...
This paper describes the development of a Spanish-German dictionary used in our hybrid MT system. The compilation process relies entirely on open source tools and freely available language resources. Our bilingual dictionary of around 33,700 entries may thus be used, distributed and further enhanced as convenient.
Pathology data have been reported to be important for surveillance, as they are crucial for correctly recognizing and identifying new or re-emerging diseases in animal populations. However, there are no reports in the literature of necropsy data being compared or complemented with other data. In our study, we compared cattle necropsy data extracted from 3 laboratories with the Swiss fallen stock data and clinical data collected by the association of Swiss Cattle Breeders. The objective was to assess the completeness, validity and representativeness of the necropsy data, as well as to evaluate the potential...
We have implemented a rule-based prototype of a Spanish-to-Cuzco Quechua MT system enhanced through the addition of statistical components. The greatest difficulty during the translation process is to generate the correct Quechua verb form in subordinated clauses. The prototype has several rules that decide which verb form should be used in a given context. However, matching the context in order to apply the correct rule depends crucially on the parsing quality of the Spanish input. As the verb form depends heavily on the conjunction in the subordinated clause and the semantics of the main verb, we extracted this information from...
This paper describes the opportunities that arise from automatic word alignment for bilingual concordances and contrastive language studies. We introduce our parallel corpus of Alpine texts in French and German and our web-based search system. We explain how we have reduced the number of erroneous alignments in the output by distinguishing between dominant and miscellaneous translations. We are currently in the process of extending the system to a new language pair, Spanish-Quechua. This pair poses special problems because of the scarcity of resources for Quechua but also...
Parallel treebanking is greatly facilitated by automatic word alignment. We work on building a trilingual treebank for German, Spanish and Quechua. We ran different alignment experiments on parallel Spanish-Quechua texts, measured the alignment quality, and compared these results to the figures we obtained aligning a comparable corpus of Spanish-German texts. This preliminary work has shown us the best segmentation to use for the agglutinative language Quechua with respect to alignment. We also acquired a first impression about how well Quechua can be aligned...
Sign language translation systems are complex and require many components. As a result, it is very hard to compare methods across publications. We present an open-source implementation of a text-to-gloss-to-pose-to-video pipeline approach, demonstrating conversion from German to Swiss German Sign Language, French to French Sign Language of Switzerland, and Italian to Italian Sign Language of Switzerland. We propose three different components for the text-to-gloss translation: a lemmatizer, a rule-based word reordering and dropping component, and a neural machine translation system....
In this paper, we introduce the first corpus specifying negative entities within sentences. We discuss indicators for their presence, namely particular verbs, but also the linguistic conditions when their prediction should be suppressed. We further show that a fine-tuned BERT-based baseline model outperforms an over-generating rule-based approach which is not aware of these restrictions. If a perfect filter were applied, both would be on par.
We introduce deInStance, a corpus of 1000 politicians’ answers in German (de) containing sentences labeled with explicitly expressed and inferred stances (pro and con relations) by 3 annotators. They achieved an acceptable inter-rater agreement given the inherent subjective nature of the task. A first baseline, a fine-tuned BERT-based token classifier, achieved F1-scores of around 70%. Our focus is on the difficult subclass of sentences comprising only non-polar words, but still expressing the (implicit) pro or con perspective of the writer.