- Topic Modeling
- Wikis in Education and Collaboration
- Natural Language Processing Techniques
- Misinformation and Its Impacts
- Synthesis and characterization of novel inorganic/organometallic compounds
- Social Media and Politics
- Hate Speech and Cyberbullying Detection
- Organometallic Complex Synthesis and Catalysis
- Complex Network Analysis Techniques
- Advanced Text Analysis Techniques
- Media Influence and Politics
- Opinion Dynamics and Social Influence
- Sentiment Analysis and Opinion Mining
- Digital Marketing and Social Media
- Multimodal Machine Learning Applications
- Open Source Software Innovations
- Privacy-Preserving Technologies in Data
- Spam and Phishing Detection
- Molecular Junctions and Nanostructures
- Digital Games and Media
- Organoboron and organosilicon chemistry
- Cancer-related gene regulation
- Mobile Crowdsensing and Crowdsourcing
- Web Data Mining and Analysis
- Software Engineering Research
École Polytechnique Fédérale de Lausanne
2017-2025
Swiss Data Science Center
2024
ETH Zurich
2021-2023
University of Cambridge
2021-2023
University of Chicago
2023
Institute of Software
2022
Chinese Academy of Sciences
2022
Vrije Universiteit Amsterdam
2022
Laboratoire d'Informatique Fondamentale de Lille
2021-2022
University of Florida
2022
Non-profits, as well as the media, have hypothesized the existence of a radicalization pipeline on YouTube, claiming that users systematically progress towards more extreme content on the platform. Yet, there is to date no substantial quantitative evidence of this alleged pipeline. To close this gap, we conduct a large-scale audit of user radicalization on YouTube. We analyze 330,925 videos posted on 349 channels, which we broadly classified into four types: Media, the Alt-lite, the Intellectual Dark Web (I.D.W.), and the Alt-right. According...
Vibrant online communities are in constant flux. As members join and depart, the interactional norms evolve, stimulating further changes to the membership and its social dynamics. Linguistic change, in the sense of innovation that becomes accepted as the norm, is essential to this dynamic process: it both facilitates individual expression and fosters the emergence of a collective identity.
Wikipedia is a major source of information for many people. However, false information on Wikipedia raises concerns about its credibility. One way in which false information may be presented on Wikipedia is in the form of hoax articles, i.e., articles containing fabricated facts about nonexistent entities or events. In this paper we study false information on Wikipedia by focusing on the hoax articles that have been created throughout its history. We make several contributions. First, we assess the real-world impact of hoax articles by measuring how long they survive before being debunked, how many pageviews they receive, and how heavily they are referred to...
Over the past few years, massive amounts of world knowledge have been accumulated in publicly available knowledge bases, such as Freebase, NELL, and YAGO. Yet despite their seemingly huge size, these knowledge bases are greatly incomplete. For example, over 70% of people included in Freebase have no known place of birth, and 99% have no known ethnicity. In this paper, we propose a way to leverage existing Web-search-based question-answering technology to fill in the gaps in knowledge bases in a targeted way. In particular, for each entity attribute, we learn the best set of queries to ask,...
Person-to-person evaluations are prevalent in all kinds of discourse and important for establishing reputations, building social bonds, and shaping public opinion. Such evaluations can be analyzed separately using signed social networks and textual sentiment analysis, but this misses the rich interactions between language and social context. To capture such interactions, we develop a model that predicts individual A’s opinion of individual B by synthesizing information from the signed social network in which A and B are embedded with sentiment analysis of the evaluative texts relating to...
Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this...
Can large language models (LLMs) create tailor-made, convincing arguments to promote false or misleading narratives online? Early work has found that LLMs can generate content perceived as on par with, or even more persuasive than, human-written messages. However, there is still limited evidence regarding LLMs' persuasive capabilities in direct conversations with humans, the scenario in which these models are usually deployed. In this pre-registered study, we analyze the power of AI-driven...
Navigating information spaces is an essential part of our everyday lives, and in order to design efficient and user-friendly information systems, it is important to understand how humans navigate and find the information they are looking for. We perform a large-scale study of human wayfinding, in which, given a network of links between the concepts of Wikipedia, people play a game of finding a short path from a given start to a given target concept by following hyperlinks. What distinguishes our setup from other studies of human Web-browsing behavior is that in our case people navigate a graph of connections between concepts,...
Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia. The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity. Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users' motivations to visit Wikipedia, the depth...
It is urgent to understand how to effectively communicate public health messages during the COVID-19 pandemic. Previous work has focused on how to formulate messages in terms of style and content, rather than on who should send them. In particular, little is known about the impact of spokesperson selection on message propagation in times of crisis. We report on the effectiveness of different public figures at promoting social distancing among 12,194 respondents from six countries that were severely affected by the pandemic at the time of data collection. Across...
Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences, e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial context and lead to incorrect final predictions. Here we introduce REFINER, a framework for finetuning LMs to explicitly generate intermediate reasoning steps while interacting with a critic model that provides automated feedback on the reasoning. Specifically, the critic provides structured feedback that the reasoning LM uses to iteratively improve...
Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed...
In recent years, critics of online platforms have raised concerns about the ability of recommendation algorithms to amplify problematic content, with potentially radicalizing consequences. However, attempts to evaluate the effect of recommenders have suffered from a lack of appropriate counterfactuals (what a user would have viewed in the absence of algorithmic recommendations) and hence cannot disentangle the effects of the algorithm from a user’s intentions. Here we propose a method that we call “counterfactual bots” to causally estimate the role...
Nutrition is a key factor in people's overall health. Hence, understanding the nature and dynamics of population-wide dietary preferences over time and space can be valuable in public health. To date, studies have leveraged small samples of participants via food intake logs or treatment data. We propose a complementary source of population data on nutrition obtained via Web logs. Our main contribution is a spatiotemporal analysis of population-wide dietary preferences through the lens of logs gathered by a widely distributed Web-browser add-on, using the access volume of recipes...
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically...
Researchers have suggested that "the Manosphere," a conglomerate of men-centered online communities, may serve as a gateway to far right movements. In that context, this paper quantitatively studies the migratory patterns between a variety of groups within the Manosphere and the Alt-right, a loosely connected far right movement that has been particularly active in mainstream social networks. Our analysis leverages over 300 million comments spread through Reddit (in 115 subreddits) and YouTube (in 526 channels) to investigate whether...
People regularly face tasks that can be understood as navigation in information networks, where the goal is to find a path between two given nodes. In many such situations, the navigator only gets local access to the node currently under inspection and its immediate neighbors. This lack of global information about the network notwithstanding, humans tend to be good at finding short paths, despite the fact that real-world networks are typically very large. One potential reason for this could be that humans possess vast amounts of background knowledge...
Political polarization appears to be on the rise, as measured by voting behavior, general affect towards opposing partisans and their parties, and contents posted and consumed online. Research over the years has focused on the role of the Web as a driver of polarization. In order to further our understanding of the factors behind online polarization, in the present work we collect and analyze Web browsing histories of tens of thousands of users alongside careful measurements of the time spent browsing various news sources. We show that online news consumption follows a polarized...
Martin Josifoski, Nicola De Cao, Maxime Peyrard, Fabio Petroni, Robert West. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.
Online social media platforms use automated moderation systems to remove or reduce the visibility of rule-breaking content. While previous work has documented the importance of manual content moderation, the effects of automated content moderation remain largely unknown. Here, in a large study of Facebook comments (n = 412M), we used a fuzzy regression discontinuity design to measure the impact of automated content moderation on subsequent rule-breaking behavior (number of comments hidden/deleted) and engagement (number of additional comments posted). We found that comment deletion decreased subsequent rule-breaking behavior in shorter threads (20 or fewer...
Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size, but so has the monetary cost of querying the ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework...