- Natural Language Processing Techniques
- Topic Modeling
- Algorithms and Data Compression
- Multimodal Machine Learning Applications
- Advanced Text Analysis Techniques
- Speech Recognition and Synthesis
- Semigroups and Automata Theory
- Machine Learning in Bioinformatics
- Machine Learning and Algorithms
- DNA and Biological Computing
- Web Data Mining and Analysis
- Biomedical Text Mining and Ontologies
- Machine Learning and Data Classification
- Sentiment Analysis and Opinion Mining
- Speech and Dialogue Systems
- Semantic Web and Ontologies
- Neural Networks and Applications
- Video Analysis and Summarization
- Network Packet Processing and Optimization
- Linguistic Research and Analysis
- Statistical and Computational Modeling
- Blind Source Separation Techniques
- Media, Gender, and Advertising
- Software Reliability and Analysis Research
- Genomics and Phylogenetic Studies
IT University of Copenhagen, 2023
Tokyo Institute of Technology, 2023
Administration for Community Living, 2023
American Jewish Committee, 2023
RIKEN Center for Advanced Intelligence Project, 2023
Mongolia International University, 2023
Naver (South Korea), 2019-2022
CentraleSupélec, 2021
Bar-Ilan University, 2021
University of Helsinki, 2021
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In...
Hady Elsahar, Matthias Gallé. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
User-generated reviews of products or services provide valuable information to customers. However, it is often impossible to read each of the potentially thousands of reviews: it would therefore save time to have short summaries of their contents. We address opinion summarization, a multi-document summarization task, with an unsupervised abstractive neural system. Our system is based on (i) a language model that is meant to encode reviews into a vector space and to generate fluent sentences from that same space, and (ii) a clustering step that groups...
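A minimal sketch of the encode-cluster-generate pipeline described above, assuming hypothetical `encode` and `generate` placeholders for the system's language model (not the authors' actual implementation):

```python
# Sketch: encode reviews, cluster them, decode one sentence per cluster centroid.
import numpy as np
from sklearn.cluster import KMeans

def encode(sentence: str) -> np.ndarray:
    """Hypothetical LM encoder: maps a review sentence to a vector."""
    raise NotImplementedError

def generate(vector: np.ndarray) -> str:
    """Hypothetical LM decoder: generates a fluent sentence from a vector."""
    raise NotImplementedError

def summarize(reviews: list[str], n_clusters: int = 5) -> list[str]:
    vectors = np.stack([encode(r) for r in reviews])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    # One summary sentence per cluster, decoded from its centroid.
    return [generate(c) for c in clusters.cluster_centers_]
```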
We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data by using auxiliary parallel language pairs. For this problem, the standard procedure so far to leverage the monolingual data is _back-translation_, which is computationally costly and hard to tune. In this paper we propose instead to use _denoising adapters_, adapter layers with a denoising objective, on top of pre-trained mBART-50. In addition to the modularity and flexibility of such an approach, we show that the resulting translations...
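A hedged sketch of what a denoising objective over monolingual data can look like: corrupt the input and train only the adapter parameters to reconstruct the original sentence. The noising scheme below (random word drops) is an illustrative assumption, not necessarily the corruption used in the paper:

```python
import random

def add_noise(sentence: str, drop_prob: float = 0.15) -> str:
    """Corrupt a sentence by randomly dropping words."""
    words = sentence.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

def training_pairs(monolingual_corpus):
    # (corrupted input, reconstruction target) pairs for the denoising loss.
    for sentence in monolingual_corpus:
        yield add_noise(sentence), sentence

# During training, the pre-trained mBART-50 weights stay frozen and only the
# inserted adapter layers receive gradients from the reconstruction loss.
```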
Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary budget, the fewer tokens an algorithm needs to cover the test set,...
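To make the link to dictionary-based compression concrete, here is a minimal sketch of the standard BPE merge-learning loop: the learned merges form a dictionary of frequent substrings later used to cover (tokenize) new text. This is a textbook illustration, not the exact implementation used in the paper:

```python
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Each word starts as a sequence of characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, w in enumerate(words):              # apply the merge everywhere
            j, out = 0, []
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged); j += 2
                else:
                    out.append(w[j]); j += 1
            words[i] = out
    return merges
```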
We propose a novel adapter layer formalism for adapting multilingual models. They are more parameter-efficient than existing adapter layers while obtaining as good or better performance. The adapters are specific to one language (as opposed to bilingual adapters), which allows composing them and generalizing to unseen language pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language Transformer baseline trained on TED talks.
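A hedged PyTorch-style sketch of such a language-specific adapter: a small residual bottleneck inserted after a frozen Transformer layer. For an unseen pair, one would plug the source-language adapters into the encoder and the target-language adapters into the decoder. Dimensions and placement are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 512, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: hidden + up(relu(down(norm(hidden))))
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

# One adapter per language; composition for an unseen pair such as de->ko:
# encoder uses adapters["de"], decoder uses adapters["ko"].
adapters = {lang: Adapter() for lang in ["en", "de", "ko"]}
```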
We address the problem of unsupervised abstractive summarization of collections of user-generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches, by relying only on the standard log-likelihood loss and mainstream models. To address hallucinations, we use control codes to steer generation towards more coherent and relevant summaries.
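A small sketch of how such a self-supervised setup can be assembled: each review serves as the pseudo-summary of its most similar reviews, and a control code is prepended to the input to steer generation. The similarity function and the control-code string below are hypothetical, not the paper's exact choices:

```python
def build_examples(reviews, similarity, k=8, control_code="<relevant>"):
    """Build (source, target) pairs for standard log-likelihood training."""
    examples = []
    for i, target in enumerate(reviews):
        others = [r for j, r in enumerate(reviews) if j != i]
        support = sorted(others, key=lambda r: similarity(target, r), reverse=True)[:k]
        source = control_code + " " + " </s> ".join(support)
        examples.append((source, target))
    return examples
```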
Alexandre Duval, Thomas Lamson, Gaël de Léséleuc Kérouara, Matthias Gallé. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 2021.
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms existing publicly released models. We believe that this release will help the large-scale analysis of the digital content of the COVID-19...
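For illustration, domain tags are typically just special tokens prepended to the source sentence so that a single model can be steered towards one domain at decoding time. The tag strings below are hypothetical placeholders, not the model's actual vocabulary:

```python
def tag_source(sentence: str, domain: str) -> str:
    """Prepend a domain tag token to a source sentence."""
    tags = {"generic": "<news>", "biomedical": "<bio>"}
    return f"{tags[domain]} {sentence}"

print(tag_source("Le vaccin a été approuvé.", "biomedical"))  # "<bio> Le vaccin a été approuvé."
```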
Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based models are more robust than their BPE counterpart, both when translating noisy text and when translating text from a different domain. To obtain comparable BLEU scores on clean, in-domain data and to close the gap with the BPE-based baseline, we use known techniques to train...
The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one given sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time, parsing longer constituents with smaller ones. We propose algorithms based...
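The second task can be illustrated with a simple dynamic program: given a fixed set of constituents, cover the sequence with as few constituent occurrences as possible (single symbols are always allowed). This polynomial-time sketch conveys the idea but is not the paper's exact algorithm:

```python
def minimal_parse(sequence: str, constituents: set[str]) -> list[str]:
    n = len(sequence)
    best = [None] * (n + 1)          # best[i] = minimal parse of sequence[:i]
    best[0] = []
    for i in range(1, n + 1):
        for c in constituents | set(sequence):   # single symbols always allowed
            j = i - len(c)
            if j >= 0 and best[j] is not None and sequence[j:i] == c:
                if best[i] is None or len(best[j]) + 1 < len(best[i]):
                    best[i] = best[j] + [c]
    return best[n]

print(minimal_parse("abababab", {"ab", "abab"}))  # ['abab', 'abab']
```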
n-gram representations of documents may improve over a simple bag-of-words representation by relaxing the independence assumption between words and introducing context. However, this comes at the cost of adding features which are non-descriptive, increasing the dimension of the vector space model exponentially.
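A small illustration of this blow-up: the number of distinct n-gram features grows quickly with n, and most higher-order features occur only once. The toy text below is a placeholder; running the same count over a real tokenized corpus shows the effect at scale:

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "to be or not to be that is the question".split()
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct features, {singletons} occur only once")
```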
As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios, and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with regard to a terminology. We perform studies on the COVID-19 domain over 5 languages, also performing terminology-targeted human evaluation. We open-source the code for...
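As a rough sketch of what such a metric can look like, the function below computes the fraction of terminology entries whose source term is triggered by the source sentence and whose target term then appears in the MT output. The paper's actual metrics may be more elaborate (e.g. lemmatized or window-based matching); this is an exact-match approximation:

```python
def term_consistency(src_sents, mt_sents, terminology):
    """terminology: list of (source_term, target_term) pairs."""
    triggered, satisfied = 0, 0
    for src, mt in zip(src_sents, mt_sents):
        for s_term, t_term in terminology:
            if s_term.lower() in src.lower():
                triggered += 1
                if t_term.lower() in mt.lower():
                    satisfied += 1
    return satisfied / triggered if triggered else 1.0
```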
The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human- or machine-authored. The problem has so far been framed in a standard supervised way and consists in training a classifier on annotated data to predict the origin of a given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those...
The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset used to train BLOOM, one of the largest language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of publications spanning topics from ethics to law, data governance, modeling choices and distributed training. ...
Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that takes the graph of tweet authors into account as well, to better tease apart similar languages. This results in state-of-the-art shared task performance of $76.63\%$, $1.4\%$ higher than the top system.
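A hedged sketch of label propagation over the author graph: each unlabeled node repeatedly adopts the label distribution averaged over its neighbours, while nodes with confident content-based predictions stay clamped. The graph format and fixed iteration count are illustrative assumptions, not the paper's exact procedure:

```python
from collections import defaultdict

def propagate(edges, seed_labels, languages, iters=20):
    """edges: (author_a, author_b) pairs; seed_labels: node -> {language: prob}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    uniform = {l: 1 / len(languages) for l in languages}
    labels = {n: dict(seed_labels.get(n, uniform)) for n in graph}
    for _ in range(iters):
        for node in graph:
            if node in seed_labels:          # clamp confident content-based predictions
                continue
            agg = {l: 0.0 for l in languages}
            for nb in graph[node]:
                for l, p in labels[nb].items():
                    agg[l] += p
            total = sum(agg.values()) or 1.0
            labels[node] = {l: v / total for l, v in agg.items()}
    return {n: max(d, key=d.get) for n, d in labels.items()}
```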