- Topic Modeling
- Natural Language Processing Techniques
- Text Readability and Simplification
- Misinformation and Its Impacts
- Hate Speech and Cyberbullying Detection
- Biomedical Text Mining and Ontologies
- Sentiment Analysis and Opinion Mining
- Spam and Phishing Detection
- Speech and Dialogue Systems
- Vaccine Coverage and Hesitancy
- Advanced Text Analysis Techniques
- Semantic Web and Ontologies
- Software Engineering Research
- Multimodal Machine Learning Applications
- Text and Document Classification Technologies
- Interpreting and Communication in Healthcare
- Neurobiology of Language and Bilingualism
- Machine Learning and Data Classification
- Language, Metaphor, and Cognition
- Translation Studies and Practices
- Social Media and Politics
- Media Influence and Politics
- Adversarial Robustness in Machine Learning
- Rough Sets and Fuzzy Logic
- Water Systems and Optimization
University of Sheffield
2016-2025
Computational Physics (United States)
2014
Hospital Universitário da Universidade de São Paulo
2013
Universidade de São Paulo
2010
Université Paris-Saclay
1999
Centre National de la Recherche Scientifique
1999
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. 2016.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, Marco Turchi. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.
This paper presents QUEST++, an open source tool for quality estimation which can predict quality for texts at word, sentence and document level. It also provides pipelined processing, whereby predictions made at a lower level (e.g. for words) can be used as input to build models for predictions at a higher level (e.g. sentences). QUEST++ allows the extraction of a variety of features, and provides machine learning algorithms to build and test quality estimation models. Results on recent datasets show that QUEST++ achieves state-of-the-art performance.
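The pipelined idea described above, where word-level predictions feed a sentence-level model, can be illustrated with a minimal sketch. This is not QUEST++ code: the function names, the toy word-level rule, and the fixed weights are all hypothetical, chosen only to show how lower-level outputs become higher-level features.

```python
from statistics import mean


def predict_word_labels(words, bad_words):
    """Toy word-level predictor (hypothetical rule): 1 = OK, 0 = BAD."""
    return [0 if w.lower() in bad_words else 1 for w in words]


def sentence_features(words, word_labels):
    """Aggregate word-level predictions into sentence-level features."""
    return {
        "length": len(words),
        "ok_ratio": mean(word_labels) if word_labels else 0.0,
    }


def predict_sentence_quality(features):
    """Toy sentence-level model: a fixed linear score with made-up weights."""
    length_term = min(features["length"], 10) / 10
    return 0.9 * features["ok_ratio"] + 0.1 * length_term


def pipeline(sentence, bad_words):
    """Run the word-level step, then feed its output to the sentence-level step."""
    words = sentence.split()
    labels = predict_word_labels(words, bad_words)
    features = sentence_features(words, labels)
    return labels, predict_sentence_quality(features)


if __name__ == "__main__":
    labels, score = pipeline("the cat sat", {"cat"})
    print(labels, round(score, 2))
```

In a real system both steps would be trained models; the sketch only fixes the data flow between levels.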
Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified...
In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting....
In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, the evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., gain based on lexical replacements). In this article, we investigate how well existing metrics assess...
Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 2022.
Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming...
Text simplification (TS) is a monolingual text-to-text transformation task where an original (complex) text is transformed into a target (simpler) text. Most recent work is based on sequence-to-sequence neural models similar to those used for machine translation (MT). Different from MT, TS data comprises more elaborate transformations, such as sentence splitting. It can also contain multiple simplifications of the same original text targeting different audiences, such as school grade levels. We explore these two features...
Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence computation costs compared to full fine-tuning. We focus specifically on multilingual text classification tasks (genre, framing, persuasion detection; with different input lengths,...
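As a rough illustration of why LoRA is parameter-efficient, the sketch below compares trainable parameter counts for one weight matrix and shows the low-rank forward pass. It is a generic illustration of the LoRA idea (a frozen weight W plus a trainable update B·A), not the setup used in the paper. The layer size and rank are assumed values, and the usual scaling factor on the update is omitted for brevity.

```python
def full_finetune_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable weights for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """Compute W x + B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Assumed sizes: one 768 x 768 projection with rank 8.
    full = full_finetune_params(768, 768)
    lora = lora_params(768, 768, 8)
    print(full, lora, lora / full)

    # Tiny numeric example with rank 1.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]      # rank x d_in
    B = [[2], [3]]    # d_out x rank
    print(lora_forward(W, A, B, [1, 2]))
```

The count covers a single matrix only; actual savings in compute and memory depend on which layers are adapted and on the training setup.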
Fernando Alva-Manchego, Louis Martin, Carolina Scarton, Lucia Specia. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. 2019.
Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, the minority in these platforms, they are still capable of causing harm. Therefore, identifying these comments is an important task for studying and preventing the proliferation of toxicity in social media. Previous work on automatically detecting toxic comments focuses mainly on English, with very little work on languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset of Brazilian Portuguese tweets annotated as either toxic or non-toxic, or in different...
Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences...
Many applications within natural language processing involve performing text-to-text transformations, i.e., given a text in natural language as input, systems are required to produce a version of this
In this work, we explore the application of Large Language Models to zero-shot Lay Summarisation. We propose a novel two-stage framework for Lay Summarisation based on real-life processes, and find that summaries generated with this method are increasingly preferred by human judges for larger models. To help establish best practices for employing LLMs in zero-shot settings, we also assess the ability of LLMs as judges, finding that they are able to replicate the preferences of human judges. Finally, we take initial steps towards Natural Language Processing (NLP)...