NFDI4DS | UHH-SEMS - Publication Details

Carolina Scarton

ORCID: 0000-0002-0103-4072

About

Contact & Profiles

A5065368839

Research Areas

Topic Modeling
Natural Language Processing Techniques
Text Readability and Simplification
Misinformation and Its Impacts
Hate Speech and Cyberbullying Detection
Biomedical Text Mining and Ontologies
Sentiment Analysis and Opinion Mining
Spam and Phishing Detection
Speech and dialogue systems
Vaccine Coverage and Hesitancy
Advanced Text Analysis Techniques
Semantic Web and Ontologies
Software Engineering Research
Multimodal Machine Learning Applications
Text and Document Classification Technologies
Interpreting and Communication in Healthcare
Neurobiology of Language and Bilingualism
Machine Learning and Data Classification
Language, Metaphor, and Cognition
Translation Studies and Practices
Social Media and Politics
Media Influence and Politics
Adversarial Robustness in Machine Learning
Rough Sets and Fuzzy Logic
Water Systems and Optimization

University of Sheffield
2016-2025

Computational Physics (United States)
2014

Hospital Universitário da Universidade de São Paulo
2013

Universidade de São Paulo
2010

Université Paris-Saclay
1999

Centre National de la Recherche Scientifique
1999

Findings of the 2016 Conference on Machine Translation

OPENALEX - Publications

Ondřej Bojar Rajen Chatterjee Christian Federmann Yvette Graham Barry Haddow and 16 more

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, Marcos Zampieri. Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. 2016.

10.18653/v1/w16-2301 article EN 2016-01-01

Findings of the 2015 Workshop on Statistical Machine Translation

OPENALEX - Publications

Ondřej Bojar Rajen Chatterjee Christian Federmann Barry Haddow Matthias Huck and 9 more

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, Marco Turchi. Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.

10.18653/v1/w15-3001 article EN cc-by 2015-01-01

Multi-level Translation Quality Prediction with QuEst++

OPENALEX - Publications

Lucia Specia Gustavo Henrique Paetzold Carolina Scarton

This paper presents QUEST++, an open source tool for quality estimation which can predict quality for texts at word, sentence and document level. It also provides pipelined processing, whereby predictions made at a lower level (e.g. for words) can be used as input to build models for predictions at a higher level (e.g. sentences). QUEST++ allows the extraction of a variety of features, and provides machine learning algorithms to build and test quality estimation models. Results on recent datasets show that QUEST++ achieves state-of-the-art performance.

10.3115/v1/p15-4020 article EN cc-by 2015-01-01

Data-Driven Sentence Simplification: Survey and Benchmark

OPENALEX - Publications

Fernando Alva-Manchego Carolina Scarton Lucia Specia

Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified...

10.1162/coli_a_00370 article EN cc-by-nc-nd Computational Linguistics 2020-01-02

Overview of the BioLaySumm 2024 Shared Task on the Lay Summarization of Biomedical Research Articles

OPENALEX - Publications

Tomas Goldsack Carolina Scarton Matthew Shardlow Chenghua Lin

10.18653/v1/2024.bionlp-1.10 article EN 2024-01-01

ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations

OPENALEX - Publications

Fernando Alva-Manchego Louis Martin Antoine Bordes Carolina Scarton Benoît Sagot and 1 more

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite this varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting...

10.18653/v1/2020.acl-main.424 preprint EN cc-by 2020-01-01

The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification

OPENALEX - Publications

Fernando Alva-Manchego Carolina Scarton Lucia Specia

In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess...

10.1162/coli_a_00418 article EN cc-by-nc-nd Computational Linguistics 2021-08-13

SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

OPENALEX - Publications

Harish Tayyar Madabushi Edward Gow-Smith Marcos García Carolina Scarton Marco Idiart and 1 more

Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 2022.

10.18653/v1/2022.semeval-1.13 article EN cc-by Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) 2022-01-01

Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

OPENALEX - Publications

Tomas Goldsack Zhihao Zhang Chenghua Lin Carolina Scarton

Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming...

10.18653/v1/2022.emnlp-main.724 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Learning Simplifications for Specific Target Audiences

OPENALEX - Publications

Carolina Scarton Lucia Specia

Text simplification (TS) is a monolingual text-to-text transformation task where an original (complex) text is transformed into a target (simpler) text. Most recent work is based on sequence-to-sequence neural models similar to those used for machine translation (MT). Different from MT, TS data comprises more elaborate transformations, such as sentence splitting. It can also contain multiple simplifications of the same original text targeting different audiences, such as school grade levels. We explore these two features...

10.18653/v1/p18-2113 article EN cc-by 2018-01-01

Probing for idiomaticity in vector space models

OPENALEX - Publications

Marcos García Tiago Kramer Vieira Carolina Scarton Marco Idiart Aline Villavicencio

Marcos Garcia, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, Aline Villavicencio. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.310 article EN cc-by 2021-01-01

NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese

OPENALEX - Publications

Sidney Evaldo Leal Magali Sanches Duran Carolina Scarton Nathan Siegle Hartmann Sandra Maria Aluísio

10.1007/s10579-023-09693-w article EN Language Resources and Evaluation 2023-10-17

Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification

OPENALEX - Publications

Olesya Razuvayevskaya Benjamin M. Wu João Leite Freddy Heppell Ivan Srba and 3 more

Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning techniques designed to make the training of language models more efficient. Previous results demonstrated that these methods can even improve performance on some classification tasks. This paper complements the existing research by investigating how these techniques influence classification performance and computation costs compared to full fine-tuning. We focus specifically on multilingual text classification tasks (genre, framing, and persuasion techniques detection; with different input lengths,...

10.1371/journal.pone.0301738 article EN cc-by PLoS ONE 2024-05-03

EASSE: Easier Automatic Sentence Simplification Evaluation

OPENALEX - Publications

Fernando Alva-Manchego Louis Martin Carolina Scarton Lucia Specia

Fernando Alva-Manchego, Louis Martin, Carolina Scarton, Lucia Specia. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. 2019.

10.18653/v1/d19-3009 preprint EN cc-by 2019-01-01

Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis

OPENALEX - Publications

João Augusto Leite Diego Furtado Silva Kalina Bontcheva Carolina Scarton

Hate speech and toxic comments are a common concern of social media platform users. Although these comments are, fortunately, the minority in these platforms, they are still capable of causing harm. Therefore, identifying these comments is an important task for studying and preventing the proliferation of toxicity in social media. Previous work on automatically detecting toxic comments focuses mainly on English, with very little work on languages like Brazilian Portuguese. In this paper, we propose a new large-scale dataset for Brazilian Portuguese with tweets annotated as either toxic or non-toxic or in different...

10.18653/v1/2020.aacl-main.91 article EN 2020-01-01

AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

OPENALEX - Publications

Harish Tayyar Madabushi Edward Gow-Smith Carolina Scarton Aline Villavicencio

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences...

10.18653/v1/2021.findings-emnlp.294 preprint EN cc-by 2021-01-01

Wounds: Biology and Management.

OPENALEX - Publications

François Yvon Lucia Specia Carolina Scarton Gustavo Henrique Paetzold

10.1016/s1072-7515(98)00293-2 article EN Journal of the American College of Surgeons 1999-01-01

Quality Estimation for Machine Translation

OPENALEX - Publications

Lucia Specia Carolina Scarton Gustavo Henrique Paetzold

Many applications within natural language processing involve performing text-to-text transformations, i.e., given a text in natural language as input, systems are required to produce a version of this...

10.2200/s00854ed1v01y201805hlt039 article EN Synthesis lectures on human language technologies 2018-09-21

Toxic Language Detection in Social Media for Brazilian Portuguese: New Dataset and Multilingual Analysis

OPENALEX - Publications

João Leite Diego Furtado Silva Kalina Bontcheva Carolina Scarton

10.48550/arxiv.2010.04543 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Leveraging Large Language Models for Zero-shot Lay Summarisation in Biomedicine and Beyond

OPENALEX - Publications

Tomas Goldsack Carolina Scarton Chenghua Lin

In this work, we explore the application of Large Language Models to zero-shot Lay Summarisation. We propose a novel two-stage framework for Lay Summarisation based on real-life processes, and find that summaries generated with this method are increasingly preferred by human judges for larger models. To help establish best practices for employing LLMs in zero-shot settings, we also assess the ability of LLMs as judges, finding that they are able to replicate the preferences of human judges. Finally, we take initial steps towards Lay Summarisation for Natural Language Processing (NLP)...

10.48550/arxiv.2501.05224 preprint EN arXiv (Cornell University) 2025-01-09
