Lucie-Aimée Kaffee

ORCID: 0000-0002-1514-8505
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Wikis in Education and Collaboration
  • Semantic Web and Ontologies
  • Advanced Graph Neural Networks
  • Digital Rights Management and Security
  • Ethics and Social Impacts of AI
  • Biomedical Text Mining and Ontologies
  • Data Quality and Management
  • Social Media and Politics
  • Educator Training and Historical Pedagogy
  • Explainable Artificial Intelligence (XAI)
  • Cancer-related gene regulation
  • Hate Speech and Cyberbullying Detection
  • Open Source Software Innovations
  • Computational and Text Analysis Methods
  • Sentiment Analysis and Opinion Mining
  • Multimodal Machine Learning Applications
  • Scientific Computing and Data Management
  • Library Science and Information Systems
  • Data Stream Mining Techniques
  • Translation Studies and Practices
  • Financial Markets and Investment Strategies
  • FinTech, Crowdfunding, Digital Finance
  • Mobile Crowdsensing and Crowdsourcing

Hasso Plattner Institute
2023-2024

University of Potsdam
2024

University of Copenhagen
2023

University of Southampton
2017-2021

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-trained Language Models (PLMs). However, there has been no systematic study investigating how the values embedded in these models vary across cultures. In this paper, we introduce probes to study which cross-cultural values are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PLMs capture differences in values across cultures, but those only weakly align with established value surveys. We discuss...

10.18653/v1/2023.c3nlp-1.12 article EN cc-by 2023-01-01
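The probing approach described in the abstract above can be illustrated with a short sketch. This is a minimal, hypothetical example, assuming the Hugging Face transformers library is available; the prompt, the model name, and the candidate tokens are illustrative stand-ins, not the paper's actual probes.

# Hedged sketch: probe a masked language model with a survey-style value
# statement and compare the scores of opposing completions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Hypothetical probe loosely modelled on survey items about values.
probe = "Religion is [MASK] important in my life."
for candidate in fill_mask(probe, targets=["very", "not"]):
    print(candidate["token_str"], round(candidate["score"], 4))

Repeating such probes across languages gives a rough signal of how the model's preferences vary, which can then be compared against survey data.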

Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore different approaches that enable our models to verbalise entities from the input triples in the generated text. The systems are trained...

10.1016/j.websem.2018.07.002 article EN cc-by Journal of Web Semantics 2018-07-30
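The encoder-decoder design sketched in the abstract above can be made concrete with a small PyTorch example. This is a minimal sketch under assumed dimensions and vocabularies, not the paper's exact architecture: each triple is embedded, the set of triples is pooled into one fixed-size vector, and a GRU decoder is conditioned on that vector through its initial hidden state.

import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    # Embeds (subject, predicate, object) ids and pools a set of triples
    # into a single fixed-size vector.
    def __init__(self, n_symbols, dim):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, dim)

    def forward(self, triples):              # triples: (batch, n_triples, 3)
        parts = self.embed(triples)          # (batch, n_triples, 3, dim)
        vecs = parts.flatten(start_dim=2)    # concatenate s, p, o embeddings
        return vecs.mean(dim=1)              # (batch, 3 * dim)

class SummaryDecoder(nn.Module):
    # Generates a summary conditioned on the encoded triple set.
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, 3 * dim, batch_first=True)
        self.out = nn.Linear(3 * dim, vocab_size)

    def forward(self, tokens, triple_vec):
        h0 = triple_vec.unsqueeze(0)         # initial state carries the triples
        hidden, _ = self.gru(self.embed(tokens), h0)
        return self.out(hidden)              # next-token logits

In the published systems the pooling and the entity verbalisation are more involved; this sketch only shows the conditioning mechanism.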

Multilinguality is an important topic for knowledge bases, especially for Wikidata, which was built to serve the multilingual requirements of an international community. Its labels are the way humans interact with the data. In this paper, we explore the current state of languages in Wikidata, with regard to its ontology and its relationship to Wikipedia. Furthermore, we set the multilinguality of Wikidata in the context of the real world by comparing it to the distribution of native speakers. We find an existing language maldistribution, which is made less urgent by promising results...

10.1145/3125433.3125465 article EN 2017-08-23

Names are deeply tied to human identity. They can serve as markers of individuality, cultural heritage, and personal history. However, using names as a core indicator of identity can lead to an over-simplification of complex identities. When interacting with LLMs, a user's name is an important point of information for personalisation. Names can enter chatbot conversations through direct user input (requested by chatbots), as part of task contexts such as CV reviews, or through built-in memory features that store user information. We study the biases associated with names by measuring...

10.48550/arxiv.2502.11995 preprint EN arXiv (Cornell University) 2025-02-17

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.

10.18653/v1/n18-2101 preprint EN cc-by 2018-01-01

Wikidata is unique both as a knowledge base and as a community, given that its users contribute together to one cross-lingual project. To create a truly multilingual knowledge base, contributors from a variety of languages are needed. In this paper, we investigate the language distribution among Wikidata's editors, and how it relates to the content and to the users' label editing. This gives us an insight that can help in supporting users working on multilingual projects.

10.1145/3233391.3233965 article EN 2018-07-26

Wikidata is a community-driven knowledge graph, strongly linked to Wikipedia. However, the connection between the two projects has only been sporadically explored. We investigated the relationship between the two projects in terms of the information they contain by looking at their external references. Our findings show that while only a small number of sources is directly reused across Wikidata and Wikipedia, references often point to the same domain. Furthermore, Wikidata appears to use less Anglo-American-centred sources. These results deserve further in-depth...

10.1145/3125433.3125445 article EN 2017-08-23

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end,...

10.1145/3484828 article EN Journal of Data and Information Quality 2021-10-15

Dual use, the intentional, harmful reuse of technology and scientific artefacts, is an ill-defined problem within the context of Natural Language Processing (NLP). As large language models (LLMs) have advanced in their capabilities and become more accessible, the risk of intentional misuse becomes more prevalent. To prevent such malicious use, it is necessary for NLP researchers and practitioners to understand and mitigate the risks of their research. Hence, we present an NLP-specific definition of dual use informed by researchers in the field. Further, we propose a...

10.18653/v1/2023.findings-emnlp.932 article EN cc-by 2023-01-01

References are an essential part of Wikipedia. Each statement in Wikipedia should be referenced. In this paper, we explore the creation and collection of references for new Wikipedia articles from an editor's perspective. We map out the workflow of editors when creating a new article, with an emphasis on how they select references.

10.1145/3442442.3452337 article EN Companion Proceedings of the Web Conference 2021 2021-04-19

Stability in Wikidata's schema is essential for the reuse of its data. In this paper, we analyze the stability of the data based on changes to the labels of properties in six languages. We find that the schema is overall stable, making it a reliable resource for external usage.

10.1145/3184558.3191643 preprint EN 2018-01-01

The quality and maintainability of a knowledge graph are determined by the process in which it is created. There are different approaches to such processes: extraction or conversion of data already available on the web (automated, as in the case of DBpedia's extraction from Wikipedia), community-created knowledge graphs, often built by a group of experts, and hybrid approaches where humans maintain the knowledge graph alongside bots. In this work, we focus on the latter, hybrid approach: human-edited knowledge graphs supported by automated tools. In particular, we analyse the editing of natural language data, i.e. labels. Labels are the entry point...

10.1145/3306446.3340826 article EN 2019-08-20

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-Trained Language models (PTLMs). However, there has been no systematic study investigating how the values embedded in these models vary across cultures. In this paper, we introduce probes to study which values across cultures are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PTLMs capture differences in values across cultures, but those only weakly...

10.48550/arxiv.2203.13722 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore different approaches that enable our models to verbalise entities from the input triples in the generated text. The systems are trained...

10.2139/ssrn.3248712 article EN SSRN Electronic Journal 2018-01-01

Labels in the web of data are a key element for humans to access the data. We introduce a framework to measure the coverage of information with labels. The framework is based on a set of metrics including completeness, unambiguity, multilinguality, labeled object usage, and monolingual islands. We apply this framework to seven diverse datasets, drawn from linked data, a collaborative knowledge base, open governmental data, and the GLAM sector, to gain an insight into the current state of labels and multilinguality. Comparing differently sourced datasets can help publishers understand what...

10.1016/j.procs.2018.09.007 article EN Procedia Computer Science 2018-01-01
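Two of the metrics listed above translate directly into simple queries over a dataset. The following is a minimal sketch using rdflib, with simplified readings of completeness (the share of subjects carrying an rdfs:label) and multilinguality (the language tags used on labels); the file name is a hypothetical placeholder, and the paper's formal metric definitions are more precise.

# Hedged sketch: approximate label completeness and multilinguality
# for an RDF dataset.
from rdflib import Graph, Literal, RDFS

g = Graph()
g.parse("dataset.ttl")   # hypothetical input dataset

subjects = set(g.subjects())
labelled = set(g.subjects(RDFS.label))
languages = {o.language for o in g.objects(predicate=RDFS.label)
             if isinstance(o, Literal) and o.language}

print("label completeness:", len(labelled) / len(subjects))
print("languages used:", sorted(languages))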

Capturing knowledge about the multilinguality of a knowledge graph is of supreme importance to understand its applicability across multiple languages. Several metrics have been proposed for describing multilinguality at the level of a whole knowledge graph. Albeit enabling an understanding of the ecosystem of knowledge graphs in terms of the utilized languages, they are unable to capture a fine-grained description of the languages in which the different entities and properties are represented. This lack of representation prevents the comparison of existing knowledge graphs in order to decide on the most appropriate multilingual...

10.1145/3360901.3364443 article EN 2019-09-23

Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we use neural networks. Our system encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on...

10.48550/arxiv.1711.00155 preprint EN other-oa arXiv (Cornell University) 2017-01-01

In our continuously evolving world, entities change over time and new, previously non-existing or unknown, entities appear. We study how this evolutionary scenario impacts the performance on a well established entity linking (EL) task. For that study, we introduce TempEL, an entity linking dataset that consists of time-stratified English Wikipedia snapshots from 2013 to 2022, from which we collect both anchor mentions of entities and these target entities' descriptions. By capturing such temporal aspects, the newly introduced TempEL...

10.48550/arxiv.2302.02500 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge...

10.3233/sw-210431 article EN other-oa Semantic Web 2021-04-30

Human values play a vital role as an analytical tool in the social sciences, enabling the study of diverse dimensions within society as a whole and among individual communities. This paper addresses the limitations of traditional survey-based studies of human values by proposing a computational application of Schwartz's values framework to Reddit, a platform organized into distinct online communities. After ensuring the reliability of automated value extraction tools for Reddit content, we automatically annotate six million posts across 10,000...

10.48550/arxiv.2402.14177 preprint EN arXiv (Cornell University) 2024-02-21