Matthias Gallé

ORCID: 0000-0001-5677-5911
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Algorithms and Data Compression
  • Multimodal Machine Learning Applications
  • Advanced Text Analysis Techniques
  • Speech Recognition and Synthesis
  • Semigroups and Automata Theory
  • Machine Learning in Bioinformatics
  • Machine Learning and Algorithms
  • DNA and Biological Computing
  • Web Data Mining and Analysis
  • Biomedical Text Mining and Ontologies
  • Machine Learning and Data Classification
  • Sentiment Analysis and Opinion Mining
  • Speech and Dialogue Systems
  • Semantic Web and Ontologies
  • Neural Networks and Applications
  • Video Analysis and Summarization
  • Network Packet Processing and Optimization
  • Linguistic Research and Analysis
  • Statistical and Computational Modeling
  • Blind Source Separation Techniques
  • Media, Gender, and Advertising
  • Software Reliability and Analysis Research
  • Genomics and Phylogenetic Studies

IT University of Copenhagen
2023

Tokyo Institute of Technology
2023

Administration for Community Living
2023

American Jewish Committee
2023

RIKEN Center for Advanced Intelligence Project
2023

Mongolia International University
2023

Naver (South Korea)
2019-2022

CentraleSupélec
2021

Bar-Ilan University
2021

University of Helsinki
2021

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas J. Wang, Benoît Sagot, Niklas Muennighoff, A. Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Du Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Chinenye Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar González-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jianguo Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, Loubna Ben Allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Tze Han Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad Ali Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona De Gibert, Paulo Villegas, Peter Henderson

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...

10.48550/arxiv.2211.05100 preprint EN cc-by arXiv (Cornell University) 2022-01-01
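The released checkpoints can be queried with standard causal-LM tooling. Below is a minimal sketch using the Hugging Face transformers library, assuming the public hub ids bigscience/bloom-560m (a small variant, used here so the example runs on modest hardware) and bigscience/bloom (the 176B model); it is illustrative usage only, not the authors' training or evaluation code.

    # Prompt a BLOOM checkpoint as a plain decoder-only language model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "bigscience/bloom-560m"  # assumed hub id; swap for "bigscience/bloom" given enough hardware
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Few-shot / instruction-style prompting: the model simply continues the text.
    prompt = "Translate to French: I like coffee.\nTranslation:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))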

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road a character-level model or byte-level processing? In...

10.48550/arxiv.2112.10508 preprint EN other-oa arXiv (Cornell University) 2021-01-01
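To make the granularity question concrete, the toy sketch below shows the same string segmented into bytes, characters, words, and a hand-picked subword segmentation; the subword split is an invented example of what a BPE-style vocabulary might produce, not output from the paper.

    # The same text at different modeling granularities.
    text = "unbelievable results"

    byte_units = list(text.encode("utf-8"))   # integer byte values
    char_units = list(text)                   # characters
    word_units = text.split()                 # whitespace-separated words
    subword_units = ["un", "believ", "able", "results"]  # hypothetical BPE-style split

    for name, units in [("bytes", byte_units), ("chars", char_units),
                        ("words", word_units), ("subwords", subword_units)]:
        print(f"{name:9s} -> {len(units):2d} units: {units}")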

Hady Elsahar, Matthias Gallé. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1222 article EN cc-by 2019-01-01

User-generated reviews of products or services provide valuable information to customers. However, it is often impossible to read each of the potentially thousands of reviews: short summaries of their contents would therefore save readers' time. We address opinion summarization, a multi-document summarization task, with an unsupervised abstractive neural system. Our system is based on (i) a language model that is meant to encode reviews into a vector space and to generate fluent sentences from that same space, and (ii) a clustering step that groups...

10.18653/v1/d19-5405 preprint EN cc-by 2019-01-01

We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data by using auxiliary parallel language pairs. For this problem, the standard procedure so far to leverage the monolingual data is _back-translation_, which is computationally costly and hard to tune. In this paper we propose instead to use _denoising adapters_, adapter layers with a denoising objective, on top of pre-trained mBART-50. In addition to the modularity and flexibility of such an approach, we show that the resulting translations...

10.18653/v1/2021.emnlp-main.533 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01
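A small sketch of the denoising idea referenced above: monolingual sentences are corrupted and the model (with only adapter parameters trainable) learns to reconstruct the original. The noise function below, random word dropping plus light local shuffling, is an illustrative stand-in, not the exact mBART-style noise used in the paper.

    import random

    def add_noise(tokens, drop_prob=0.1, shuffle_window=3, rng=random.Random(0)):
        # Random word dropping followed by light local shuffling.
        kept = [t for t in tokens if rng.random() > drop_prob]
        noisy = kept[:]
        for i in range(0, len(noisy), shuffle_window):
            window = noisy[i:i + shuffle_window]
            rng.shuffle(window)
            noisy[i:i + shuffle_window] = window
        return noisy

    sentence = "the cat sat on the mat".split()
    print("target:", sentence)
    print("input :", add_noise(sentence))  # the model is trained to map the noisy input back to the target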

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of that family. Our experiments across datasets, language pairs, models, and vocabulary sizes show that - given a fixed vocabulary budget - the fewer tokens an algorithm needs to cover the test set,...

10.18653/v1/d19-1141 article EN cc-by 2019-01-01
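The compression view can be illustrated with a toy BPE learner: greedily merge the most frequent adjacent pair, then count how many tokens are needed to cover a held-out string. Fewer tokens for the same merge budget is the quantity the paper relates to translation quality. This is a deliberately simplified re-implementation, not the authors' code.

    from collections import Counter

    def merge_word(word, pair):
        # Replace every occurrence of the adjacent pair inside one word.
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        return out

    def learn_bpe(corpus, num_merges):
        words = [list(w) for line in corpus for w in line.split()]
        merges = []
        for _ in range(num_merges):
            pairs = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            words = [merge_word(w, best) for w in words]
        return merges

    def segment(line, merges):
        words = [list(w) for w in line.split()]
        for pair in merges:                       # apply merges in learned order
            words = [merge_word(w, pair) for w in words]
        return [t for w in words for t in w]

    merges = learn_bpe(["low lower lowest", "new newer newest"], num_merges=10)
    tokens = segment("lowest newest", merges)
    print(tokens, "->", len(tokens), "tokens to cover the held-out string")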

We propose a novel adapter layer formalism for adapting multilingual models. These adapters are more parameter-efficient than existing layers while obtaining as good or better performance. They are specific to one language (as opposed to bilingual adapters), allowing us to compose them and generalize to unseen language-pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language Transformer baseline trained on TED talks.

10.18653/v1/2020.emnlp-main.361 preprint EN cc-by 2020-01-01
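A minimal PyTorch sketch of the kind of bottleneck adapter such a formalism builds on: a small down-projection / up-projection block with a residual connection, instantiated once per language so that only these parameters are trained while the shared multilingual model stays frozen. Dimensions and placement are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, hidden_dim=512, bottleneck_dim=64):
            super().__init__()
            self.layer_norm = nn.LayerNorm(hidden_dim)
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)

        def forward(self, hidden_states):
            residual = hidden_states
            x = self.layer_norm(hidden_states)
            x = self.up(torch.relu(self.down(x)))
            return residual + x                  # residual keeps the frozen model's signal

    # One adapter per language; composing, say, a French encoder adapter with a
    # German decoder adapter is what enables unseen (zero-shot) language pairs.
    adapters = nn.ModuleDict({lang: Adapter() for lang in ["fr", "de", "en"]})
    states = torch.randn(2, 7, 512)              # (batch, sequence, hidden)
    print(adapters["fr"](states).shape)          # torch.Size([2, 7, 512])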

We address the problem of unsupervised abstractive summarization of collections of user-generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on a standard log-likelihood loss and mainstream models. To address hallucinations, we use control codes to steer the generation towards more coherent and relevant summaries.

10.18653/v1/2021.eacl-main.141 preprint EN cc-by 2021-01-01
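A small sketch of the control-code mechanism mentioned above: codes are plain prefix tokens attached to the input, so at inference time they can be set to values the summary should respect. The code names and format below are invented for illustration; they are not the paper's actual codes.

    def with_control_codes(review_set, codes):
        # Prepend control codes, then concatenate the documents to summarize.
        prefix = " ".join(f"<{key}={value}>" for key, value in codes.items())
        return prefix + " " + " </s> ".join(review_set)

    reviews = ["Great battery life.", "Battery lasts two days.", "Screen is a bit dim."]
    print(with_control_codes(reviews, {"entity": "battery", "sentiment": "positive"}))
    # <entity=battery> <sentiment=positive> Great battery life. </s> Battery lasts two days. </s> Screen is a bit dim.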

Alexandre Duval, Thomas Lamson, Gaël de Léséleuc de Kérouara, Matthias Gallé. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 2021.

10.18653/v1/2021.eacl-demos.33 article EN cc-by 2021-01-01

We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model translates from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art on both news (generic domain) and biomedical test sets, and that it outperforms existing publicly released models. We believe this release will help the large-scale analysis of digital content on COVID-19...

10.18653/v1/2020.nlpcovid19-2.16 article EN cc-by 2020-01-01

Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based models are more robust than their BPE counterpart, both when translating noisy text and when translating text from a different domain. To obtain comparable BLEU scores on clean, in-domain data and to close the gap with BPE-based models, we use known techniques to train...

10.48550/arxiv.1911.04997 preprint EN cc-by-nc-sa arXiv (Cornell University) 2019-01-01

The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one given sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time by parsing longer constituents with smaller ones. Algorithms based...

10.3390/a4040262 article EN cc-by Algorithms 2011-10-26
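A toy sketch of the greedy, RePair-style strategy common in this literature: repeatedly replace the most frequent adjacent pair of symbols with a fresh non-terminal, producing a straight-line grammar for the sequence. It illustrates the constituent-choice task only; the paper's polynomial-time minimal grammar parsing step is not reproduced here.

    from collections import Counter

    def build_grammar(sequence):
        seq, rules, next_id = list(sequence), {}, 0
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            pair, count = pairs.most_common(1)[0] if pairs else (None, 0)
            if count < 2:
                break                            # no pair repeats: stop
            nt = f"N{next_id}"
            next_id += 1
            rules[nt] = pair                     # new constituent
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    axiom, rules = build_grammar("abcabcabcab")
    print("axiom:", axiom)                       # compressed top-level sequence
    print("rules:", rules)                       # each non-terminal expands to a pair of symbols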

n-gram representations of documents may improve over a simple bag-of-words representation by relaxing the independence assumption between words and introducing context. However, this comes at the cost of adding features which are non-descriptive, increasing the dimension of the vector space model exponentially.

10.1145/2484028.2484142 article EN Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval 2013-07-28
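The dimensionality issue can be shown on a tiny corpus: as n grows, the number of distinct n-gram features explodes while most of the new features occur only once. The snippet below is purely illustrative.

    from collections import Counter

    def ngram_features(docs, n):
        feats = Counter()
        for doc in docs:
            words = doc.split()
            feats.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        return feats

    docs = ["the cat sat on the mat", "the dog sat on the rug", "the cat chased the dog"]
    for n in (1, 2, 3):
        feats = ngram_features(docs, n)
        singletons = sum(1 for count in feats.values() if count == 1)
        print(f"n={n}: {len(feats):2d} distinct features, {singletons:2d} occur only once")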

As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios, and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with regards to a domain terminology. We perform studies on the COVID-19 domain over 5 languages, also performing terminology-targeted human evaluation. We open-source the code for...

10.48550/arxiv.2106.11891 preprint EN other-oa arXiv (Cornell University) 2021-01-01
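One simple way to make the consistency idea concrete is the sketch below: for every source sentence containing a source-side term, check whether the required target-side term appears in the MT output. The exact-string-match criterion and the toy terminology are simplifications for illustration, not the metrics defined in the paper.

    def terminology_consistency(sources, hypotheses, terminology):
        # Fraction of required target terms that appear in the MT output
        # whenever the corresponding source term occurs in the source.
        hits, total = 0, 0
        for src, hyp in zip(sources, hypotheses):
            for src_term, tgt_term in terminology.items():
                if src_term.lower() in src.lower():
                    total += 1
                    hits += tgt_term.lower() in hyp.lower()
        return hits / total if total else 0.0

    terminology = {"face mask": "masque", "lockdown": "confinement"}
    sources = ["Wear a face mask indoors.", "The lockdown was extended."]
    hypotheses = ["Portez un masque à l'intérieur.", "Le verrouillage a été prolongé."]
    print(terminology_consistency(sources, hypotheses, terminology))  # 0.5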

The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human- or machine-authored. The problem has so far been framed in a standard supervised way and consists of training a classifier on annotated data to predict the origin of a single given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those...

10.48550/arxiv.2111.02878 preprint EN other-oa arXiv (Cornell University) 2021-01-01

The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the largest multilingual language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of research publications spanning topics from ethics and law to data governance, modeling choices and distributed training...

10.48550/arxiv.2212.04960 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that also takes the social graph of tweet authors into account, to better tease apart similar languages. This results in state-of-the-art shared task performance of 76.63%, 1.4% higher than the top system.

10.48550/arxiv.1607.05408 preprint EN other-oa arXiv (Cornell University) 2016-01-01
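A minimal sketch of label propagation over an author graph: a few nodes carry seed language labels (e.g. from a content classifier), and unlabelled nodes repeatedly adopt the majority label of their neighbours. The graph, labels and update rule are invented toy data, not the shared-task system.

    from collections import Counter

    def propagate(edges, seed_labels, iterations=10):
        neighbours = {}
        for a, b in edges:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
        labels = dict(seed_labels)
        for _ in range(iterations):
            for node in neighbours:
                if node in seed_labels:          # keep confident seed labels fixed
                    continue
                votes = Counter(labels[n] for n in neighbours[node] if n in labels)
                if votes:
                    labels[node] = votes.most_common(1)[0][0]
        return labels

    edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u5")]
    seeds = {"u1": "bs", "u5": "hr"}             # e.g. seed labels for two similar languages
    print(propagate(edges, seeds))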