Lucie-Aimée Kaffee

ORCID: 0000-0002-1514-8505
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Wikis in Education and Collaboration
  • Semantic Web and Ontologies
  • Advanced Graph Neural Networks
  • Digital Rights Management and Security
  • Ethics and Social Impacts of AI
  • Biomedical Text Mining and Ontologies
  • Data Quality and Management
  • Social Media and Politics
  • Educator Training and Historical Pedagogy
  • Explainable Artificial Intelligence (XAI)
  • Cancer-related gene regulation
  • Hate Speech and Cyberbullying Detection
  • Open Source Software Innovations
  • Computational and Text Analysis Methods
  • Sentiment Analysis and Opinion Mining
  • Multimodal Machine Learning Applications
  • Scientific Computing and Data Management
  • Library Science and Information Systems
  • Data Stream Mining Techniques
  • Translation Studies and Practices
  • Financial Markets and Investment Strategies
  • FinTech, Crowdfunding, Digital Finance
  • Mobile Crowdsensing and Crowdsourcing

Hasso Plattner Institute
2023-2024

University of Potsdam
2024

University of Copenhagen
2023

University of Southampton
2017-2021

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-trained Language Models (PLMs). However, there has been no systematic study investigating how the values embedded in these models vary across cultures. In this paper, we introduce probes to study which cross-cultural values are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PLMs capture differences in values across cultures, but those only weakly align with established value surveys. We discuss...

10.18653/v1/2023.c3nlp-1.12 article EN cc-by 2023-01-01
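The probing approach described in the abstract above can be illustrated with a short sketch. This is a minimal, hypothetical example, assuming the Hugging Face transformers library is available; the prompt, the model name, and the candidate tokens are illustrative stand-ins, not the paper's actual probes.

# Hedged sketch: probe a masked language model with a survey-style value
# statement and compare the scores of opposing completions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Hypothetical probe loosely modelled on survey items about values.
probe = "Religion is [MASK] important in my life."
for candidate in fill_mask(probe, targets=["very", "not"]):
    print(candidate["token_str"], round(candidate["score"], 4))

Repeating such probes across languages gives a rough signal of how the model's preferences vary, which can then be compared against survey data.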

Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore different approaches that enable our models to verbalise entities from the input triples in the generated text. The systems are trained...

10.1016/j.websem.2018.07.002 article EN cc-by Journal of Web Semantics 2018-07-30
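The encoder-decoder design sketched in the abstract above can be made concrete with a small PyTorch example. This is a minimal sketch under assumed dimensions and vocabularies, not the paper's exact architecture: each triple is embedded, the set of triples is pooled into one fixed-size vector, and a GRU decoder is conditioned on that vector through its initial hidden state.

import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    # Embeds (subject, predicate, object) ids and pools a set of triples
    # into a single fixed-size vector.
    def __init__(self, n_symbols, dim):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, dim)

    def forward(self, triples):              # triples: (batch, n_triples, 3)
        parts = self.embed(triples)          # (batch, n_triples, 3, dim)
        vecs = parts.flatten(start_dim=2)    # concatenate s, p, o embeddings
        return vecs.mean(dim=1)              # (batch, 3 * dim)

class SummaryDecoder(nn.Module):
    # Generates a summary conditioned on the encoded triple set.
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, 3 * dim, batch_first=True)
        self.out = nn.Linear(3 * dim, vocab_size)

    def forward(self, tokens, triple_vec):
        h0 = triple_vec.unsqueeze(0)         # initial state carries the triples
        hidden, _ = self.gru(self.embed(tokens), h0)
        return self.out(hidden)              # next-token logits

In the published systems the pooling and the entity verbalisation are more involved; this sketch only shows the conditioning mechanism.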

Multilinguality is an important topic for knowledge bases, especially for Wikidata, which was built to serve the multilingual requirements of an international community. Its labels are the way humans interact with the data. In this paper, we explore the current state of languages in Wikidata, with regard to its ontology and its relationship to Wikipedia. Furthermore, we set the multilinguality of Wikidata in the context of the real world by comparing it to the distribution of native speakers. We find an existing language maldistribution, which is made less urgent by promising results...

10.1145/3125433.3125465 article EN 2017-08-23

Names are deeply tied to human identity. They can serve as markers of individuality, cultural heritage, and personal history. However, using names as a core indicator of identity can lead to an over-simplification of complex identities. When interacting with LLMs, a user's name is an important point of information for personalisation. Names can enter chatbot conversations through direct user input (requested by chatbots), as part of task contexts such as CV reviews, or through built-in memory features that store user information. We study the biases associated with names by measuring...

10.48550/arxiv.2502.11995 preprint EN arXiv (Cornell University) 2025-02-17

Lucie-Aimée Kaffee, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, Elena Simperl. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.

10.18653/v1/n18-2101 preprint EN cc-by 2018-01-01

Wikidata is unique both as a knowledge base and as a community, given that its users contribute together to one cross-lingual project. To create a truly multilingual knowledge base, contributors from a variety of languages are needed. In this paper, we investigate the language distribution among Wikidata's editors, and how it relates to the content and to the users' label editing. This gives us an insight that can help in supporting users working on multilingual projects.

10.1145/3233391.3233965 article EN 2018-07-26

Wikidata is a community-driven knowledge graph, strongly linked to Wikipedia. However, the connection between the two projects has only been sporadically explored. We investigated the relationship between the two projects in terms of the information they contain by looking at their external references. Our findings show that while only a small number of sources is directly reused across Wikidata and Wikipedia, references often point to the same domain. Furthermore, Wikidata appears to use less Anglo-American-centred sources. These results deserve further in-depth...

10.1145/3125433.3125445 article EN 2017-08-23

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important as Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Nevertheless, despite this essential link between content and references, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To this end,...

10.1145/3484828 article EN Journal of Data and Information Quality 2021-10-15

Dual use, the intentional, harmful reuse of technology and scientific artefacts, is an ill-defined problem within the context of Natural Language Processing (NLP). As large language models (LLMs) have advanced in their capabilities and become more accessible, the risk of intentional misuse becomes more prevalent. To prevent such malicious use, it is necessary for NLP researchers and practitioners to understand and mitigate the risks of their research. Hence, we present an NLP-specific definition of dual use informed by researchers in the field. Further, we propose a...

10.18653/v1/2023.findings-emnlp.932 article EN cc-by 2023-01-01

References are an essential part of Wikipedia. Each statement in Wikipedia should be referenced. In this paper, we explore the creation and collection of references for new Wikipedia articles from an editor's perspective. We map out the workflow of editors when creating a new article, with an emphasis on how they select references.

10.1145/3442442.3452337 article EN Companion Proceedings of the Web Conference 2021 2021-04-19

Stability in Wikidata's schema is essential for the reuse of its data. In this paper, we analyze the stability of the data based on changes to the labels of properties in six languages. We find that the schema is overall stable, making it a reliable resource for external usage.

10.1145/3184558.3191643 preprint EN 2018-01-01

The quality and maintainability of a knowledge graph are determined by the process in which it is created. There are different approaches to such processes: extraction or conversion of data already available on the web (automated, as in the case of DBpedia's extraction from Wikipedia), community-created knowledge graphs, often built by a group of experts, and hybrid approaches where humans maintain the knowledge graph alongside bots. In this work, we focus on the latter, hybrid approach: human-edited knowledge graphs supported by automated tools. In particular, we analyse the editing of natural language data, i.e. labels. Labels are the entry point...

10.1145/3306446.3340826 article EN 2019-08-20

Language embeds information about social, cultural, and political values people hold. Prior work has explored potentially harmful social biases encoded in Pre-Trained Language models (PTLMs). However, there has been no systematic study investigating how the values embedded in these models vary across cultures. In this paper, we introduce probes to study which values across cultures are embedded in these models, and whether they align with existing theories and cross-cultural value surveys. We find that PTLMs capture differences in values across cultures, but those only weakly...

10.48550/arxiv.2203.13722 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Most people need textual or visual interfaces in order to make sense of Semantic Web data. In this paper, we investigate the problem of generating natural language summaries for Semantic Web data using neural networks. Our end-to-end trainable architecture encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on the encoded vector. We explore different approaches that enable our models to verbalise entities from the input triples in the generated text. The systems are trained...

10.2139/ssrn.3248712 article EN SSRN Electronic Journal 2018-01-01

Labels in the web of data are a key element for humans to access the data. We introduce a framework to measure the coverage of information with labels. The framework is based on a set of metrics including completeness, unambiguity, multilinguality, labeled object usage, and monolingual islands. We apply this framework to seven diverse datasets, drawn from linked data, a collaborative knowledge base, open governmental data, and the GLAM sector, to gain an insight into the current state of labels and multilinguality. Comparing differently sourced datasets can help publishers understand what...

10.1016/j.procs.2018.09.007 article EN Procedia Computer Science 2018-01-01
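Two of the metrics listed above translate directly into simple queries over a dataset. The following is a minimal sketch using rdflib, with simplified readings of completeness (the share of subjects carrying an rdfs:label) and multilinguality (the language tags used on labels); the file name is a hypothetical placeholder, and the paper's formal metric definitions are more precise.

# Hedged sketch: approximate label completeness and multilinguality
# for an RDF dataset.
from rdflib import Graph, Literal, RDFS

g = Graph()
g.parse("dataset.ttl")   # hypothetical input dataset

subjects = set(g.subjects())
labelled = set(g.subjects(RDFS.label))
languages = {o.language for o in g.objects(predicate=RDFS.label)
             if isinstance(o, Literal) and o.language}

print("label completeness:", len(labelled) / len(subjects))
print("languages used:", sorted(languages))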

Capturing knowledge about the multilinguality of a knowledge graph is of supreme importance to understand its applicability across multiple languages. Several metrics have been proposed for describing multilinguality at the level of a whole knowledge graph. Albeit enabling an understanding of the ecosystem of knowledge graphs in terms of the utilized languages, they are unable to capture a fine-grained description of the languages in which the different entities and properties are represented. This lack of representation prevents the comparison of existing knowledge graphs in order to decide on the most appropriate multilingual...

10.1145/3360901.3364443 article EN 2019-09-23

Most people do not interact with Semantic Web data directly. Unless they have the expertise to understand the underlying technology, they need textual or visual interfaces to help them make sense of it. We explore the problem of generating natural language summaries for Semantic Web data. This is non-trivial, especially in an open-domain context. To address this problem, we use neural networks. Our system encodes the information from a set of triples into a vector of fixed dimensionality and generates a textual summary by conditioning the output on...

10.48550/arxiv.1711.00155 preprint EN other-oa arXiv (Cornell University) 2017-01-01

In our continuously evolving world, entities change over time and new, previously non-existing or unknown, entities appear. We study how this evolutionary scenario impacts the performance on a well established entity linking (EL) task. For that study, we introduce TempEL, an entity linking dataset that consists of time-stratified English Wikipedia snapshots from 2013 to 2022, from which we collect both anchor mentions of entities and these target entities' descriptions. By capturing such temporal aspects, the newly introduced TempEL...

10.48550/arxiv.2302.02500 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Nowadays natural language generation (NLG) is used in everything from news reporting and chatbots to social media management. Recent advances in machine learning have made it possible to train NLG systems that seek to achieve human-level performance in text writing and summarisation. In this paper, we propose such a system in the context of Wikipedia and evaluate it with readers and editors. Our solution builds upon the ArticlePlaceholder, a tool used in 14 under-resourced Wikipedia language versions, which displays structured data from the Wikidata knowledge...

10.3233/sw-210431 article EN other-oa Semantic Web 2021-04-30

Human values play a vital role as an analytical tool in the social sciences, enabling the study of diverse dimensions within society as a whole and among individual communities. This paper addresses the limitations of traditional survey-based studies of human values by proposing a computational application of Schwartz's values framework to Reddit, a platform organized into distinct online communities. After ensuring the reliability of automated value extraction tools for Reddit content, we automatically annotate six million posts across 10,000...

10.48550/arxiv.2402.14177 preprint EN arXiv (Cornell University) 2024-02-21