- Natural Language Processing Techniques
- Topic Modeling
- Algorithms and Data Compression
- Multimodal Machine Learning Applications
- Advanced Text Analysis Techniques
- Speech Recognition and Synthesis
- Semigroups and Automata Theory
- Machine Learning in Bioinformatics
- Machine Learning and Algorithms
- DNA and Biological Computing
- Web Data Mining and Analysis
- Biomedical Text Mining and Ontologies
- Machine Learning and Data Classification
- Sentiment Analysis and Opinion Mining
- Speech and Dialogue Systems
- Semantic Web and Ontologies
- Neural Networks and Applications
- Video Analysis and Summarization
- Network Packet Processing and Optimization
- Linguistic Research and Analysis
- Statistical and Computational Modeling
- Blind Source Separation Techniques
- Media, Gender, and Advertising
- Software Reliability and Analysis Research
- Genomics and Phylogenetic Studies
IT University of Copenhagen, 2023
Tokyo Institute of Technology, 2023
Administration for Community Living, 2023
American Jewish Committee, 2023
RIKEN Center for Advanced Intelligence Project, 2023
Mongolia International University, 2023
Naver (South Korea), 2019-2022
CentraleSupélec, 2021
Bar-Ilan University, 2021
University of Helsinki, 2021
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS...
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In...
Hady Elsahar, Matthias Gallé. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
User-generated reviews of products or services provide valuable information to customers. However, it is often impossible to read each of the potentially thousands of reviews: it would therefore save time to have short summaries of their contents. We address opinion summarization, a multi-document summarization task, with an unsupervised abstractive neural system. Our system is based on (i) a language model that is meant to encode reviews into a vector space and to generate fluent sentences from that same space, and (ii) a clustering step that groups...
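A minimal sketch of the encode-cluster-generate pipeline described above, assuming hypothetical `encode` and `generate` placeholders for the system's language model (not the authors' actual implementation):

```python
# Sketch: encode reviews, cluster them, decode one sentence per cluster centroid.
import numpy as np
from sklearn.cluster import KMeans

def encode(sentence: str) -> np.ndarray:
    """Hypothetical LM encoder: maps a review sentence to a vector."""
    raise NotImplementedError

def generate(vector: np.ndarray) -> str:
    """Hypothetical LM decoder: generates a fluent sentence from a vector."""
    raise NotImplementedError

def summarize(reviews: list[str], n_clusters: int = 5) -> list[str]:
    vectors = np.stack([encode(r) for r in reviews])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit(vectors)
    # One summary sentence per cluster, decoded from its centroid.
    return [generate(c) for c in clusters.cluster_centers_]
```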
We consider the problem of multilingual unsupervised machine translation, translating to and from languages that only have monolingual data by using auxiliary parallel language pairs. For this problem, the standard procedure so far to leverage the monolingual data is _back-translation_, which is computationally costly and hard to tune. In this paper we propose instead to use _denoising adapters_, adapter layers with a denoising objective, on top of pre-trained mBART-50. In addition to the modularity and flexibility of such an approach, we show that the resulting translations...
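A hedged sketch of what a denoising objective over monolingual data can look like: corrupt the input and train only the adapter parameters to reconstruct the original sentence. The noising scheme below (random word drops) is an illustrative assumption, not necessarily the corruption used in the paper:

```python
import random

def add_noise(sentence: str, drop_prob: float = 0.15) -> str:
    """Corrupt a sentence by randomly dropping words."""
    words = sentence.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else sentence

def training_pairs(monolingual_corpus):
    # (corrupted input, reconstruction target) pairs for the denoising loss.
    for sentence in monolingual_corpus:
        yield add_noise(sentence), sentence

# During training, the pre-trained mBART-50 weights stay frozen and only the
# inserted adapter layers receive gradients from the reconstruction loss.
```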
Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that, given a fixed vocabulary budget, the fewer tokens an algorithm needs to cover the test set,...
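To make the link to dictionary-based compression concrete, here is a minimal sketch of the standard BPE merge-learning loop: the learned merges form a dictionary of frequent substrings later used to cover (tokenize) new text. This is a textbook illustration, not the exact implementation used in the paper:

```python
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Each word starts as a sequence of characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for i, w in enumerate(words):              # apply the merge everywhere
            j, out = 0, []
            while j < len(w):
                if j + 1 < len(w) and (w[j], w[j + 1]) == best:
                    out.append(merged); j += 2
                else:
                    out.append(w[j]); j += 1
            words[i] = out
    return merges
```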
We propose a novel adapter layer formalism for adapting multilingual models. They are more parameter-efficient than existing adapter layers while obtaining as good or better performance. The adapters are specific to one language (as opposed to bilingual adapters), which allows composing them and generalizing to unseen language pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language Transformer baseline trained on TED talks.
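A hedged PyTorch-style sketch of such a language-specific adapter: a small residual bottleneck inserted after a frozen Transformer layer. For an unseen pair, one would plug the source-language adapters into the encoder and the target-language adapters into the decoder. Dimensions and placement are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 512, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: hidden + up(relu(down(norm(hidden))))
        return hidden + self.up(torch.relu(self.down(self.norm(hidden))))

# One adapter per language; composition for an unseen pair such as de->ko:
# encoder uses adapters["de"], decoder uses adapters["ko"].
adapters = {lang: Adapter() for lang in ["en", "de", "ko"]}
```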
We address the problem of unsupervised abstractive summarization of collections of user-generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches, by relying only on the standard log-likelihood loss and mainstream models. To address hallucinations, we use control codes to steer generation towards more coherent and relevant summaries.
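A small sketch of how such a self-supervised setup can be assembled: each review serves as the pseudo-summary of its most similar reviews, and a control code is prepended to the input to steer generation. The similarity function and the control-code string below are hypothetical, not the paper's exact choices:

```python
def build_examples(reviews, similarity, k=8, control_code="<relevant>"):
    """Build (source, target) pairs for standard log-likelihood training."""
    examples = []
    for i, target in enumerate(reviews):
        others = [r for j, r in enumerate(reviews) if j != i]
        support = sorted(others, key=lambda r: similarity(target, r), reverse=True)[:k]
        source = control_code + " " + " </s> ".join(support)
        examples.append((source, target))
    return examples
```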
Alexandre Duval, Thomas Lamson, Gaël de Léséleuc Kérouara, Matthias Gallé. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 2021.
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms existing publicly released models. We believe that this release will help the large-scale analysis of the digital content of the COVID-19...
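For illustration, domain tags are typically just special tokens prepended to the source sentence so that a single model can be steered towards one domain at decoding time. The tag strings below are hypothetical placeholders, not the model's actual vocabulary:

```python
def tag_source(sentence: str, domain: str) -> str:
    """Prepend a domain tag token to a source sentence."""
    tags = {"generic": "<news>", "biomedical": "<bio>"}
    return f"{tags[domain]} {sentence}"

print(tag_source("Le vaccin a été approuvé.", "biomedical"))  # "<bio> Le vaccin a été approuvé."
```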
Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based models are more robust than their BPE counterpart, both when translating noisy text and when translating text from a different domain. To obtain comparable BLEU scores on clean, in-domain data and to close the gap with the BPE-based baseline, we use known techniques to train...
The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one given sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression and pattern discovery. We propose a new perspective on this problem by splitting it into two tasks: (1) choosing which words will be the constituents of the grammar and (2) searching for the smallest grammar given this set of constituents. We show how to solve the second task in polynomial time, parsing longer constituents with smaller ones. We propose algorithms based...
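The second task can be illustrated with a simple dynamic program: given a fixed set of constituents, cover the sequence with as few constituent occurrences as possible (single symbols are always allowed). This polynomial-time sketch conveys the idea but is not the paper's exact algorithm:

```python
def minimal_parse(sequence: str, constituents: set[str]) -> list[str]:
    n = len(sequence)
    best = [None] * (n + 1)          # best[i] = minimal parse of sequence[:i]
    best[0] = []
    for i in range(1, n + 1):
        for c in constituents | set(sequence):   # single symbols always allowed
            j = i - len(c)
            if j >= 0 and best[j] is not None and sequence[j:i] == c:
                if best[i] is None or len(best[j]) + 1 < len(best[i]):
                    best[i] = best[j] + [c]
    return best[n]

print(minimal_parse("abababab", {"ab", "abab"}))  # ['abab', 'abab']
```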
n-gram representations of documents may improve over a simple bag-of-words representation by relaxing the independence assumption between words and introducing context. However, this comes at the cost of adding features which are non-descriptive, increasing the dimension of the vector space model exponentially.
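A small illustration of this blow-up: the number of distinct n-gram features grows quickly with n, and most higher-order features occur only once. The toy text below is a placeholder; running the same count over a real tokenized corpus shows the effect at scale:

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "to be or not to be that is the question".split()
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct features, {singletons} occur only once")
```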
As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios, and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with regard to a terminology. We perform studies on the COVID-19 domain over 5 languages, also performing terminology-targeted human evaluation. We open-source the code for...
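As a rough sketch of what such a metric can look like, the function below computes the fraction of terminology entries whose source term is triggered by the source sentence and whose target term then appears in the MT output. The paper's actual metrics may be more elaborate (e.g. lemmatized or window-based matching); this is an exact-match approximation:

```python
def term_consistency(src_sents, mt_sents, terminology):
    """terminology: list of (source_term, target_term) pairs."""
    triggered, satisfied = 0, 0
    for src, mt in zip(src_sents, mt_sents):
        for s_term, t_term in terminology:
            if s_term.lower() in src.lower():
                triggered += 1
                if t_term.lower() in mt.lower():
                    satisfied += 1
    return satisfied / triggered if triggered else 1.0
```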
The power of natural language generation models has provoked a flurry of interest in automatic methods to detect if a piece of text is human- or machine-authored. The problem has so far been framed in a standard supervised way and consists in training a classifier on annotated data to predict the origin of a given new document. In this paper, we frame the problem in an unsupervised and distributional way: we assume that we have access to a large collection of unannotated documents, a big fraction of which is machine-generated. We propose a method to detect those...
The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset used to train BLOOM, one of the largest language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of publications spanning topics from ethics to law, data governance, modeling choices and distributed training. ...
Identifying the language of social media messages is an important first step in linguistic processing. Existing models for Twitter focus on content analysis, which is successful for dissimilar language pairs. We propose a label propagation approach that takes the graph of tweet authors into account as well, to better tease apart similar languages. This results in state-of-the-art shared task performance of $76.63\%$, $1.4\%$ higher than the top system.
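A hedged sketch of label propagation over the author graph: each unlabeled node repeatedly adopts the label distribution averaged over its neighbours, while nodes with confident content-based predictions stay clamped. The graph format and fixed iteration count are illustrative assumptions, not the paper's exact procedure:

```python
from collections import defaultdict

def propagate(edges, seed_labels, languages, iters=20):
    """edges: (author_a, author_b) pairs; seed_labels: node -> {language: prob}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    uniform = {l: 1 / len(languages) for l in languages}
    labels = {n: dict(seed_labels.get(n, uniform)) for n in graph}
    for _ in range(iters):
        for node in graph:
            if node in seed_labels:          # clamp confident content-based predictions
                continue
            agg = {l: 0.0 for l in languages}
            for nb in graph[node]:
                for l, p in labels[nb].items():
                    agg[l] += p
            total = sum(agg.values()) or 1.0
            labels[node] = {l: v / total for l, v in agg.items()}
    return {n: max(d, key=d.get) for n, d in labels.items()}
```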