NFDI4DS | UHH-SEMS - Publication Details

Dan Tufiş

ORCID: 0000-0002-8280-9852

About

Contact & Profiles

A5062940611

Research Areas

Natural Language Processing Techniques
Topic Modeling
Semantic Web and Ontologies
Speech and Dialogue Systems
Lexicography and Language Studies
Speech Recognition and Synthesis
Text Readability and Simplification
Linguistics and Terminology Studies
Algorithms and Data Compression
Multi-Agent Systems and Negotiation
Service-Oriented Architecture and Web Services
Artificial Intelligence in Law
Advanced Text Analysis Techniques
Translation Studies and Practices
Biomedical Text Mining and Ontologies
Authorship Attribution and Profiling
Power Systems and Technologies
Web Data Mining and Analysis
Linguistic Research and Analysis
Robotic Path Planning Algorithms
Linguistics, Language Diversity, and Identity
Robotics and Automated Systems
Information Retrieval and Search Behavior
Constraint Satisfaction and Optimization
Mathematics, Computing, and Information Processing

Romanian Academy
2014-2024

Artificial Intelligence Research Institute
2004-2023

Academy of Romanian Scientists
2010

University of Sheffield
2010

University of Stuttgart
2008

Alexandru Ioan Cuza University
2004

National Institute for Research and Development in Informatics - ICI Bucharest
1989-1994

The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

OPENALEX - Publications

Ralf Steinberger Bruno Pouliquen Anna Widiger Camelia Ignat Tomaž Erjavec and 2 more

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according...

10.48550/arxiv.cs/0609058 preprint EN other-oa arXiv (Cornell University) 2006-01-01

Sense discrimination with parallel corpora

OPENALEX - Publications

Nancy Ide Tomaž Erjavec Dan Tufiş

This paper describes an experiment that uses translation equivalents derived from parallel corpora to determine sense distinctions that can be used for automatic sense-tagging and other disambiguation tasks. Our results show that sense distinctions derived from cross-lingual information are at least as reliable as those made by human annotators. Because our approach is fully automated through all its steps, it could provide a means to obtain large samples of "sense-tagged" data without the high cost of human annotation.

10.3115/1118675.1118683 article EN 2002-01-01

Unsupervised Word Sense Disambiguation Using Transformer’s Attention Mechanism

OPENALEX - Publications

Radu Ion Vasile Păiș Verginica Barbu Mititelu Elena Irimia Maria Mitrofan and 2 more

Transformer models produce advanced text representations that have been used to break through the hard challenge of natural language understanding. Using the Transformer’s attention mechanism, which acts as a learning memory, trained on tens of billions of words, a word sense disambiguation (WSD) algorithm can now construct a more faithful vectorial representation of the context to be disambiguated. Working with a set of 34 lemmas of nouns, verbs, adjectives and adverbs selected from the National Reference Corpus of Romanian...

10.3390/make7010010 article EN cc-by Machine Learning and Knowledge Extraction 2025-01-18

Multext-East

OPENALEX - Publications

Ludmiła Dimitrova Nancy Ide Vladimír Petkevič Tomaž Erjavec Heiki Jaan Kaalep and 1 more

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell's Nineteen Eighty-Four, with versions in all six languages tagged for part-of-speech and aligned to English (also tagged for POS). We describe the encoding format and data architecture designed especially for this corpus, which is generally...

10.3115/980845.980897 article EN 1998-01-01

Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets

OPENALEX - Publications

Dan Tufiş Radu Ion Nancy Ide

The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents, and is supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by EuroWordNet. The evaluation of the WSD system implementing the method described herein showed very encouraging results. The same system, used in a validation mode, can be used to check and spot alignment errors in multilingually aligned wordnets such as BalkaNet and EuroWordNet.

10.3115/1220355.1220547 article EN 2004-01-01

Capitalization and punctuation restoration: a survey

OPENALEX - Publications

Vasile Păiș Dan Tufiş

Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural language processing algorithms. This is especially significant for textual sources where punctuation and letter casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text messages and micro-blogging platforms offer unreliable and often wrong punctuation and letter casing. This survey offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges...

10.1007/s10462-021-10051-x article EN cc-by-nc-nd Artificial Intelligence Review 2021-07-23

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

OPENALEX - Publications

Georg Rehm Katrin Marheinecke Stefanie Hegele Stelios Piperidis Kalina Bontcheva and 42 more

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI...

10.48550/arxiv.2003.13833 preprint EN other-oa arXiv (Cornell University) 2020-01-01

A cheap and fast way to build useful translation lexicons

OPENALEX - Publications

Dan Tufiş

The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by presenting some of our applications of the multilingual lexicons extracted by the method described herein.

10.3115/1072228.1072230 article EN Proceedings of the 17th international conference on Computational linguistics - 2002-01-01

The Romanian wordnet in a nutshell

OPENALEX - Publications

Dan Tufiş Verginica Barbu Mititelu Dan Ştefănescu Radu Ion

10.1007/s10579-013-9230-7 article EN Language Resources and Evaluation 2013-05-08

Experiments with a differential semantics annotation for WordNet 3.0

OPENALEX - Publications

Dan Tufiş Dan Ştefănescu

10.1016/j.dss.2012.05.026 article EN Decision Support Systems 2012-05-23
