Carolina Scarton

ORCID: 0000-0002-0103-4072
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Text Readability and Simplification
  • Misinformation and Its Impacts
  • Hate Speech and Cyberbullying Detection
  • Biomedical Text Mining and Ontologies
  • Sentiment Analysis and Opinion Mining
  • Spam and Phishing Detection
  • Speech and dialogue systems
  • Advanced Text Analysis Techniques
  • Vaccine Coverage and Hesitancy
  • Semantic Web and Ontologies
  • Text and Document Classification Technologies
  • Software Engineering Research
  • Multimodal Machine Learning Applications
  • Translation Studies and Practices
  • Neurobiology of Language and Bilingualism
  • Interpreting and Communication in Healthcare
  • Language, Metaphor, and Cognition
  • Machine Learning and Data Classification
  • Social Media and Politics
  • Media Influence and Politics
  • Data Quality and Management
  • Machine Learning in Bioinformatics
  • Opinion Dynamics and Social Influence

University of Sheffield
2016-2025

Computational Physics (United States)
2014

Hospital Universitário da Universidade de São Paulo
2013

Universidade de São Paulo
2010

Université Paris-Saclay
1999

Centre National de la Recherche Scientifique
1999

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. 2016.

10.18653/v1/w16-2301 article EN 2016-01-01

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, Marco Turchi. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.

10.18653/v1/w15-3001 article EN cc-by 2015-01-01

This paper presents QUEST++, an open source tool for quality estimation which can predict quality for texts at word, sentence and document level. It also provides pipelined processing, whereby predictions made at a lower level (e.g. for words) can be used as input to build models for predictions at a higher level (e.g. sentences). QUEST++ allows the extraction of a variety of features, and provides machine learning algorithms to build and test quality estimation models. Results on recent datasets show that QUEST++ achieves state-of-the-art performance.

10.3115/v1/p15-4020 article EN cc-by 2015-01-01

Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified...

10.1162/coli_a_00370 article EN cc-by-nc-nd Computational Linguistics 2020-01-02

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting....

10.18653/v1/2020.acl-main.424 preprint EN cc-by 2020-01-01

In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, the evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics assess...

10.1162/coli_a_00418 article EN cc-by-nc-nd Computational Linguistics 2021-08-13

Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 2022.

10.18653/v1/2022.semeval-1.13 article EN cc-by Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) 2022-01-01

Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming...

10.18653/v1/2022.emnlp-main.724 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Text simplification (TS) is a monolingual text-to-text transformation task where an original (complex) text is transformed into a target (simpler) text. Most recent work is based on sequence-to-sequence neural models similar to those used for machine translation (MT). Different from MT, TS data comprises more elaborate transformations, such as sentence splitting. It can also contain multiple simplifications of the same original text targeting different audiences, such as school grade levels. We explore these two features...

10.18653/v1/p18-2113 article EN cc-by 2018-01-01

Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.310 article EN cc-by 2021-01-01

Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements existing research by investigating how these methods influence computation costs compared to full fine-tuning. We focus specifically on multilingual text classification tasks (genre, framing, and persuasion detection; with different input lengths,...

10.1371/journal.pone.0301738 article EN cc-by PLoS ONE 2024-05-03

In this work, we explore the application of Large Language Models to zero-shot Lay Summarisation. We propose a novel two-stage framework for Lay Summarisation based on real-life processes, and find that summaries generated with this method are increasingly preferred by human judges for larger models. To help establish best practices for employing LLMs in zero-shot settings, we also assess the ability of LLMs as judges, finding that they are able to replicate the preferences of human judges. Finally, we take initial steps towards Lay Summarisation for Natural Language Processing (NLP)...

10.48550/arxiv.2501.05224 preprint EN arXiv (Cornell University) 2025-01-09

Classifying the stance of individuals on controversial topics and uncovering their concerns is crucial for social scientists and policymakers. Data from Online Social Networks (OSNs), which serve as a proxy to a representative sample of society, offers an opportunity to classify these stances, discover society's concerns regarding controversial topics, and track their evolution over time. Consequently, stance classification in OSNs has garnered significant attention from researchers. However, most existing methods for this task often rely on labelled...

10.48550/arxiv.2501.12272 preprint EN arXiv (Cornell University) 2025-01-21

Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their...

10.48550/arxiv.2501.17654 preprint EN arXiv (Cornell University) 2025-01-29

Credibility signals represent a wide range of heuristics typically used by journalists and fact-checkers to assess the veracity of online content. Automating the extraction of credibility signals presents significant challenges due to the necessity of training high-accuracy, signal-specific extractors, coupled with the lack of sufficiently large annotated datasets. This paper introduces Pastel (Prompted weAk Supervision wiTh crEdibility signaLs), a weakly supervised approach that leverages language models...

10.1140/epjds/s13688-025-00534-0 article EN cc-by EPJ Data Science 2025-02-21

Fernando Alva-Manchego, Louis Martin, Carolina Scarton, Lucia Specia. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. 2019.

10.18653/v1/d19-3009 preprint EN cc-by 2019-01-01

Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, the minority in these platforms, they are still capable of causing harm. Therefore, identifying these comments is an important task for studying and preventing the proliferation of toxicity in social media. Previous work on automatically detecting toxic comments focuses mainly on English, with very little work on languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset for Brazilian Portuguese with tweets annotated as either toxic or non-toxic, or in different...

10.18653/v1/2020.aacl-main.91 article EN 2020-01-01

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences...

10.18653/v1/2021.findings-emnlp.294 preprint EN cc-by 2021-01-01

Many applications within natural language processing involve performing text-to-text transformations, i.e., given a text in natural language as input, systems are required to produce a version of this...

10.2200/s00854ed1v01y201805hlt039 article EN Synthesis lectures on human language technologies 2018-09-21