- Natural Language Processing Techniques
- Topic Modeling
- Semantic Web and Ontologies
- Speech and Dialogue Systems
- Lexicography and Language Studies
- Speech Recognition and Synthesis
- Text Readability and Simplification
- Linguistics and Terminology Studies
- Algorithms and Data Compression
- Multi-Agent Systems and Negotiation
- Service-Oriented Architecture and Web Services
- Artificial Intelligence in Law
- Advanced Text Analysis Techniques
- Translation Studies and Practices
- Biomedical Text Mining and Ontologies
- Authorship Attribution and Profiling
- Power Systems and Technologies
- Web Data Mining and Analysis
- Linguistic Research and Analysis
- Robotic Path Planning Algorithms
- Linguistics, Language Diversity, and Identity
- Robotics and Automated Systems
- Information Retrieval and Search Behavior
- Constraint Satisfaction and Optimization
- Mathematics, Computing, and Information Processing
Romanian Academy, 2014-2024
Artificial Intelligence Research Institute, 2004-2023
Academy of Romanian Scientists, 2010
University of Sheffield, 2010
University of Stuttgart, 2008
Alexandru Ioan Cuza University, 2004
National Institute for Research and Development in Informatics - ICI Bucharest, 1989-1994
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according...
This paper describes an experiment that uses translation equivalents derived from parallel corpora to determine sense distinctions that can be used for automatic sense-tagging and other disambiguation tasks. Our results show that sense distinctions derived from cross-lingual information are at least as reliable as those made by human annotators. Because our approach is fully automated through all its steps, it could provide a means to obtain large samples of "sense-tagged" data without the high cost of human annotation.
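The idea of deriving sense distinctions from translation equivalents can be illustrated with a small sketch. The abstract does not say how occurrences are grouped, so the single-link clustering, the `min_agree` threshold, the function name `cluster_by_translations`, and the toy data below are all assumptions made for illustration, not the authors' procedure.

```python
def cluster_by_translations(occurrences, min_agree=2):
    """Group occurrences of one source word by how they are translated.

    occurrences: list of dicts mapping a language code to the translation
    equivalent observed for that occurrence in a parallel corpus.
    Two occurrences are linked when at least `min_agree` languages
    translate them with the same word (hypothetical criterion).
    Returns a sorted list of clusters, each a list of occurrence indices.
    """
    parent = list(range(len(occurrences)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(occurrences)):
        for j in range(i + 1, len(occurrences)):
            agree = sum(
                1
                for lang, word in occurrences[i].items()
                if occurrences[j].get(lang) == word
            )
            if agree >= min_agree:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(occurrences)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: four occurrences of English "bank" with invented translations.
bank = [
    {"ro": "bancă", "fr": "banque", "de": "Bank"},
    {"ro": "bancă", "fr": "banque", "de": "Kreditinstitut"},
    {"ro": "mal", "fr": "rive", "de": "Ufer"},
    {"ro": "mal", "fr": "berge", "de": "Ufer"},
]
print(cluster_by_translations(bank))
```

Each resulting cluster can be read as one candidate sense: occurrences that several languages translate in the same way are assumed to share a meaning.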
Transformer models produce advanced text representations that have been used to break through the hard challenge of natural language understanding. Using the Transformer’s attention mechanism, which acts as a learning memory, trained on tens of billions of words, a word sense disambiguation (WSD) algorithm can now construct a more faithful vectorial representation of the context to be disambiguated. Working with a set of 34 lemmas of nouns, verbs, adjectives and adverbs selected from the National Reference Corpus of Romanian...
The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally...
The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents, and is supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by EuroWordNet. The evaluation of the WSD system implementing the method described herein showed very encouraging results. The same system, used in a validation mode, can be used to check and spot alignment errors in multilingually aligned wordnets such as BalkaNet
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and letter casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges...
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties, including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI...
The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, the baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by presenting some of our applications of the multilingual lexicons extracted by the method described herein.
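As a rough illustration of statistical lexicon extraction from a sentence-aligned parallel corpus, the sketch below scores word pairs with the Dice coefficient over co-occurrence counts. This is a generic association measure chosen for brevity; the abstract does not specify the paper's baseline or final algorithm, so the function name `extract_lexicon`, the `min_score` threshold, and the toy sentence pairs are assumptions.

```python
from collections import Counter
from itertools import product


def extract_lexicon(pairs, min_score=0.5):
    """Build a source -> (target, score) lexicon from aligned sentence pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    Each candidate word pair is scored with the Dice coefficient
    2 * cooc(s, t) / (count(s) + count(t)), counting each word at most
    once per sentence. The best-scoring target above `min_score` is kept.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))

    lexicon = {}
    for (s, t), n in joint.items():
        score = 2 * n / (src_count[s] + tgt_count[t])
        best = lexicon.get(s)
        if score >= min_score and (best is None or score > best[1]):
            lexicon[s] = (t, score)
    return lexicon


# Toy English-Romanian corpus, invented for illustration.
corpus = [
    ("the house", "casa"),
    ("the dog", "câinele"),
    ("white house", "casa albă"),
    ("white dog", "câinele alb"),
]
lexicon = extract_lexicon(corpus)
print(lexicon["house"], lexicon["dog"])
```

On a corpus this small, function words and inflected forms produce ties and spurious pairs, which is why real systems add pre-processing (tokenization, lemmatization, POS filtering) and are evaluated on precision and recall as the abstract describes.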