- Privacy-Preserving Technologies in Data
- Mobile Crowdsensing and Crowdsourcing
- Privacy, Security, and Data Protection
- Adversarial Robustness in Machine Learning
- Cryptography and Data Security
- Topic Modeling
- Auction Theory and Applications
- Internet Traffic Analysis and Secure E-voting
- Anomaly Detection Techniques and Applications
- Machine Learning and Data Classification
- Stochastic Gradient Optimization Techniques
- Data Quality and Management
- Authorship Attribution and Profiling
- Open Source Software Innovations
- Experimental Behavioral Economics Studies
- Explainable Artificial Intelligence (XAI)
- Machine Learning and Algorithms
- Natural Language Processing Techniques
- Digital and Cyber Forensics
- Data Stream Mining Techniques
- Web Data Mining and Analysis
- Software Testing and Debugging Techniques
- Electronic Health Records Systems
- Advanced Neural Network Applications
- VLSI and Analog Circuit Testing
Amazon (Germany)
2019-2021
Amazon (United States)
2020-2021
University of Southampton
2014-2019
King's College London
2011
Accurately learning from user data while providing quantifiable privacy guarantees provides an opportunity to build better ML models while maintaining user trust. This paper presents a formal approach to carrying out privacy-preserving text perturbation using the notion of d_χ-privacy, originally designed to achieve geo-indistinguishability in location data. Our approach applies carefully calibrated noise to the vector representation of words in a high-dimensional space as defined by word embedding models. We present a proof that the mechanism satisfies d_χ-privacy, where the privacy parameter...
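The mechanism described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the noise here has density proportional to exp(-eps * ||z||) (uniform direction, Gamma-distributed magnitude), and the function and variable names are invented for the example.

```python
import numpy as np

def sample_dx_noise(eps, d, rng):
    # Noise with density proportional to exp(-eps * ||z||):
    # a uniform direction on the unit sphere scaled by a
    # Gamma(d, 1/eps)-distributed magnitude.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=1.0 / eps)
    return magnitude * direction

def perturb_word(word, embeddings, vocab, eps, rng):
    # Add calibrated noise to the word's embedding, then release the
    # nearest vocabulary word to the noised vector.
    noised = embeddings[vocab.index(word)] + sample_dx_noise(
        eps, embeddings.shape[1], rng
    )
    dists = np.linalg.norm(embeddings - noised, axis=1)
    return vocab[int(np.argmin(dists))]
```

With a large eps the noise is small and the mechanism usually returns the input word; with a small eps the output is a nearby, but possibly different, word.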
Crowdsourcing via paid microtasks has been successfully applied in a plethora of domains and tasks. Previous efforts for making such crowdsourcing more effective have considered aspects as diverse as task workflow design, spam detection, quality control, and pricing models. Our work expands upon these efforts by examining the potential of adding gamification to microtask interfaces as a means of improving both worker engagement and effectiveness. We run a series of experiments in image labeling, one of the most common use cases...
Guaranteeing a certain level of user privacy in an arbitrary piece of text is a challenging issue. However, with this challenge comes the potential of unlocking access to vast data stores for training machine learning models and supporting data-driven decisions. We address this problem through the lens of d_χ-privacy, a generalization of Differential Privacy to non-Hamming distance metrics. In this work, we explore word representations in Hyperbolic space as a means of preserving privacy in text. We provide a proof of satisfying d_χ-privacy, then define a probability...
Balancing the privacy-utility tradeoff is a crucial requirement of many practical machine learning systems that deal with sensitive customer data. A popular approach for privacy-preserving text analysis is noise injection, in which text data is first mapped into a continuous embedding space, perturbed by sampling spherical noise from an appropriate distribution, and then projected back to the discrete vocabulary space. While this allows the perturbation to admit the required metric differential privacy, often the utility...
Paid microtask crowdsourcing has traditionally been approached as an individual activity, with units of work created and completed independently by the members of the crowd. Other forms of crowdsourcing have, however, embraced more varied models, which allow for a greater level of participant interaction and collaboration. This article studies the feasibility and uptake of such an approach in the context of paid microtasks. Specifically, we compare the engagement, task output, and accuracy of a paired-worker model against the traditional, single-worker version. Our...
Privacy-preserving data analysis has become essential in Machine Learning (ML), where access to vast amounts of data can provide large gains in the accuracies of tuned models. A proportion of user-contributed data comes from natural language, e.g., text transcriptions from voice assistants. It is therefore important for curated datasets to preserve the privacy of the users whose data is collected, and for models trained on sensitive data to only retain non-identifying (i.e., generalizable) information. The workshop aims to bring together researchers...
Ensuring strong theoretical privacy guarantees on text data is a challenging problem which is usually attained at the expense of utility. However, to improve the practicality of privacy-preserving analyses, it is essential to design algorithms that better optimize this tradeoff. To address this challenge, we propose a release mechanism that takes any (text) embedding vector as input and releases a corresponding private vector. The mechanism satisfies an extension of differential privacy to metric spaces. Our idea is based on first randomly projecting the vectors...
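The release mechanism described above, project randomly to a lower dimension and then privatize, can be sketched as follows. This is a hedged illustration, not the paper's algorithm: the Johnson-Lindenstrauss-style projection, the metric-DP noise distribution, and all names are assumptions for the example.

```python
import numpy as np

def private_release(x, k, eps, rng):
    # Randomly project the d-dimensional embedding x down to k dimensions
    # (Johnson-Lindenstrauss-style Gaussian projection), then add noise with
    # density proportional to exp(-eps * ||z||) in the low-dimensional space
    # before releasing the private vector.
    d = x.shape[0]
    P = rng.normal(size=(k, d)) / np.sqrt(k)  # random projection matrix
    z = rng.normal(size=k)
    z /= np.linalg.norm(z)                    # uniform direction
    z *= rng.gamma(shape=k, scale=1.0 / eps)  # Gamma-distributed magnitude
    return P @ x + z
```

Projecting first reduces the dimension in which noise must be calibrated, which is one way such mechanisms trade privacy cost against utility.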
Differentially-private mechanisms for text generation typically add carefully calibrated noise to input words and use the nearest neighbor of the noised embedding as the output word. When the noise is small in magnitude, these mechanisms are susceptible to reconstruction of the original sensitive text. This is because the nearest neighbor of the noised embedding is likely to be the input word itself. To mitigate this empirical privacy risk, we propose a novel class of differentially private mechanisms that parameterizes the nearest-neighbor selection criterion of traditional mechanisms. Motivated by the Vickrey auction, where only the second highest...
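The selection step this abstract motivates, sometimes releasing the second-nearest word instead of the nearest one, can be sketched as below. This is a minimal sketch under assumptions: the weighting formula `p_second` and all names are hypothetical, chosen only so that a tuning parameter t=0 recovers the traditional nearest-neighbor choice and t=1 always picks the runner-up.

```python
import numpy as np

def vickrey_select(noised, embeddings, vocab, t, rng):
    # Instead of always releasing the nearest neighbor of the noised vector,
    # flip a biased coin between the two nearest words; t in [0, 1] tunes how
    # often the second-nearest word (the "runner-up") is chosen.
    dists = np.linalg.norm(embeddings - noised, axis=1)
    first, second = np.argsort(dists)[:2]
    d1, d2 = dists[first], dists[second]
    # Hypothetical weighting: t = 0 -> always nearest, t = 1 -> always second.
    p_second = t * d1 / (t * d1 + (1 - t) * d2)
    idx = second if rng.random() < p_second else first
    return vocab[int(idx)]
```

Pushing probability mass away from the exact nearest neighbor reduces the chance that the released word is simply the sensitive input word.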
Ensuring the privacy of users whose data are used to train Natural Language Processing (NLP) models is necessary to build and maintain customer trust. Differential Privacy (DP) has emerged as the most successful method to protect the privacy of individuals. However, applying DP to the NLP domain comes with unique challenges. The previous methods use a generalization of DP for metric spaces, and apply privatization by adding noise to inputs in the space of word embeddings. However, these methods assume that one specific distance measure is being used, and ignore...
In this article, we aim to gain a better understanding of how paid microtask crowdsourcing could leverage its appeal and scaling power by using contests to boost crowd performance and engagement. We introduce our microtask-based annotation platform Wordsmith, which features incentives such as points, leaderboards, and badges on top of financial remuneration. Our analysis focuses on a particular type of incentive, contests, as a means to apply crowdsourcing in near-real-time scenarios, in which requesters need labels quickly. We model...
Deep Neural Networks, despite their success in diverse domains, are provably sensitive to small perturbations, which cause the models to return erroneous predictions under minor transformations of the input. Recently, it was proposed that this effect can be addressed in the text domain by optimizing for the worst-case loss function over all possible word substitutions within the training examples. However, this approach is prone to weighing semantically unlikely replacements higher, resulting in accuracy loss. In this paper, we study the robustness...
Hybrid annotation techniques have emerged as a promising approach to carry out named entity recognition on noisy microposts. In this paper, we identify a set of content and crowdsourcing-related features (number and type of entities in a post, average length and sentiment of tweets, composition of skipped tweets, time spent to complete the tasks, and interaction with the user interface) and analyse their impact on correct and incorrect human annotations. We then carried out further studies on extended instructions and disambiguation guidelines as factors...
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a representative sample of microposts by discovering tweets that are likely to refer to new entities. Our approach is able to significantly speed up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining the information content.
Ivan Habernal, Fatemehsadat Mireshghallah, Patricia Thaine, Sepideh Ghanavati, Oluwaseyi Feyisetan. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts. 2023.
The existing algorithms for identifying neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of conversation. In this work, we show that confounders can create spurious correlations, and we propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.
We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without the provided context, and (2) a mixture distribution model that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing...
Deep Neural Networks, despite their great success in diverse domains, are provably sensitive to small perturbations on correctly classified examples that lead to erroneous predictions. Recently, it was proposed that this behavior can be combatted by optimizing the worst-case loss function over all possible word substitutions of training examples. However, this approach is prone to weighing unlikely substitutions higher, limiting the accuracy gain. In this paper, we study adversarial robustness through randomized perturbations, which has two...