Dan Tufiş

ORCID: 0000-0002-8280-9852
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Semantic Web and Ontologies
  • Speech and dialogue systems
  • Lexicography and Language Studies
  • Speech Recognition and Synthesis
  • Text Readability and Simplification
  • Linguistics and Terminology Studies
  • Algorithms and Data Compression
  • Multi-Agent Systems and Negotiation
  • Service-Oriented Architecture and Web Services
  • Artificial Intelligence in Law
  • Advanced Text Analysis Techniques
  • Translation Studies and Practices
  • Biomedical Text Mining and Ontologies
  • Authorship Attribution and Profiling
  • Power Systems and Technologies
  • Web Data Mining and Analysis
  • Linguistic research and analysis
  • Robotic Path Planning Algorithms
  • Linguistics, Language Diversity, and Identity
  • Robotics and Automated Systems
  • Information Retrieval and Search Behavior
  • Constraint Satisfaction and Optimization
  • Mathematics, Computing, and Information Processing

Romanian Academy
2014-2024

Artificial Intelligence Research Institute
2004-2023

Academy of Romanian Scientists
2010

University of Sheffield
2010

University of Stuttgart
2008

Alexandru Ioan Cuza University
2004

National Institute for Research and Development in Informatics - ICI Bucharest
1989-1994

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according...

10.48550/arxiv.cs/0609058 preprint EN other-oa arXiv (Cornell University) 2006-01-01

This paper describes an experiment that uses translation equivalents derived from parallel corpora to determine sense distinctions that can be used for automatic sense-tagging and other disambiguation tasks. Our results show that sense distinctions derived from cross-lingual information are at least as reliable as those made by human annotators. Because our approach is fully automated through all its steps, it could provide a means to obtain large samples of "sense-tagged" data without the high cost of human annotation.

10.3115/1118675.1118683 article EN 2002-01-01

Transformer models produce advanced text representations that have been used to break through the hard challenge of natural language understanding. Using the Transformer’s attention mechanism, which acts as a learning memory, trained on tens of billions of words, a word sense disambiguation (WSD) algorithm can now construct a more faithful vectorial representation of the context to be disambiguated. Working with a set of 34 lemmas of nouns, verbs, adjectives and adverbs selected from the National Reference Corpus of Romanian...

10.3390/make7010010 article EN cc-by Machine Learning and Knowledge Extraction 2025-01-18

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally...

10.3115/980845.980897 article EN 1998-01-01

The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents, and is supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by EuroWordNet. The evaluation of the WSD system implementing the method described herein showed very encouraging results. The same system, used in a validation mode, can be used to check and spot alignment errors in multilingually aligned wordnets such as BalkaNet

10.3115/1220355.1220547 article EN 2004-01-01

Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and letter casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and letter casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring and correcting punctuation and word casing. Furthermore, current challenges...

10.1007/s10462-021-10051-x article EN cc-by-nc-nd Artificial Intelligence Review 2021-07-23

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI...

10.48550/arxiv.2003.13833 preprint EN other-oa arXiv (Cornell University) 2020-01-01

The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by presenting some of our applications of the multilingual lexicons extracted by the method described herein.

10.3115/1072228.1072230 article EN Proceedings of the 17th international conference on Computational linguistics - 2002-01-01

10.1007/s10579-013-9230-7 article EN Language Resources and Evaluation 2013-05-08