Joshua Snoke

ORCID: 0000-0003-0906-4396
Research Areas
  • Privacy-Preserving Technologies in Data
  • Privacy, Security, and Data Protection
  • Internet Traffic Analysis and Secure E-voting
  • Cryptography and Data Security
  • Advanced Causal Inference Techniques
  • Healthcare Policy and Management
  • Probability and Risk Models
  • COVID-19 and Mental Health
  • Survey Methodology and Nonresponse
  • Topic Modeling
  • Mobile Crowdsensing and Crowdsourcing
  • Credit Risk and Financial Regulations
  • Public Policy and Administration Research
  • Mental Health Treatment and Access
  • Dispute Resolution and Class Actions
  • Medication Adherence and Compliance
  • Mental Health and Patient Involvement
  • Digital Platforms and Economics
  • Census and Population Estimation
  • Auditing, Earnings Management, Governance
  • Network Security and Intrusion Detection
  • Patient Dignity and Privacy
  • Hate Speech and Cyberbullying Detection
  • Security in Wireless Sensor Networks
  • Insurance and Financial Risk Management

RAND Corporation
2019-2025

Urban Institute
2021

Innovations for Poverty Action
2020

New York University
2020

Pennsylvania State University
2016-2018

Park University
2018

University of Oklahoma Health Sciences Center
2018

Data holders can produce synthetic versions of data sets when concerns about potential disclosure restrict the availability of the original records. The paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable with that of the original data: what we term general utility. We consider how general utility compares with specific utility: the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measure of data utility, the propensity score mean-squared error pMSE, to the specific case of synthetic data and derive its distribution for the case when the correct synthesis model is used...

10.1111/rssa.12358 article EN cc-by Journal of the Royal Statistical Society Series A (Statistics in Society) 2018-03-07

We present methodology for creating synthetic data and an application to create a publicly releasable version of the Longitudinal Aging Study in India (LASI). The LASI, a health and retirement survey, is used for research and educational purposes, but it can only be shared under restricted access due to privacy considerations. We present novel methods to synthesize the data, maintaining three nested levels of observation (individuals, couples, and households) with both continuous and categorical variables and survey weights. We show that...

10.1093/jssam/smae047 article EN cc-by-nc-nd Journal of Survey Statistics and Methodology 2025-01-09

Differentially private synthetic data generation offers a recent solution to release analytically useful data while preserving the privacy of individuals in the data. In order to utilize these algorithms for public policy decisions, policymakers need an accurate understanding of these algorithms' comparative performance. Correspondingly, data practitioners also require standard metrics for evaluating the analytic qualities of the synthetic data. In this paper, we present an in-depth evaluation of several differentially private synthetic data algorithms using actual synthetic data sets created by...

10.29012/jpc.748 article EN cc-by-nc-nd Journal of Privacy and Confidentiality 2021-02-03

Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This article studies the feasibility of using differentially private (DP) methods to make certain queries while preserving privacy. We also include new methodological adaptations to existing DP regression...

10.1080/01621459.2023.2270795 article EN Journal of the American Statistical Association 2023-10-17

Policymakers often rely on official statistics and administrative data to make essential public policy decisions, such as using tax data to broaden our understanding of individuals' and firms' responses to economic incentives through quantitative research. However, direct access to federal confidential data is limited to a select few researchers. Recently, privacy researchers and policymakers proposed formally private validation servers to provide another tier of access, but little is known about the expectations and needs of users for...

10.1162/99608f92.a8fb0371 article EN cc-by 2024-08-21

The authors discuss their experience applying differential privacy with a complex data set with the goal of enabling standard approaches to statistical data analysis. They highlight lessons learned and roadblocks encountered, distilling them into incompatibilities between current practices in statistical data analysis and differential privacy that go beyond issues which can be solved with a noisy measurements file. They discuss how overcoming these incompatibilities requires compromise and a change in either our approach to statistical data analysis or differential privacy, which should be addressed head-on.

10.29012/jpc.872 article EN cc-by-nc-nd Journal of Privacy and Confidentiality 2024-08-27

This paper focuses on the privacy paradigm of providing access to researchers to remotely carry out analyses of sensitive data stored behind separate firewalls. We address the situation where the analysis demands data from multiple physically separate databases which cannot be combined. Motivating this work is a real model based on research data on kinship foster placement that came from multiple sources and could only be combined through a lengthy process with a trusted network. We develop and demonstrate a method for accurate calculation of the multivariate normal...

10.1214/18-aoas1171 article EN The Annals of Applied Statistics 2018-06-01

Suppose you had a data set that contained records of individuals, including demographics such as their age, sex, and race. Suppose also that these data included additional in-depth personal inform...

10.1080/09332480.2020.1847947 article EN CHANCE 2020-10-01

We present a method for generating synthetic versions of Twitter data using neural generative models. The goal is protecting individuals in the source data from stylometric re-identification attacks while still releasing data that carries research value. Specifically, we generate tweet corpora that maintain user-level word distributions by augmenting the neural language models with user-specific components. We compare our approach to two standard text data protection methods: redaction and iterative translation. We evaluate...

10.48550/arxiv.1606.01151 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Data holders can produce synthetic versions of datasets when concerns about potential disclosure restrict the availability of the original records. This paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable to that of the original data, what we will term general utility. We consider how general utility compares with specific utility, the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measure of data utility, the propensity score mean-squared-error (pMSE), to the specific case of synthetic data and derive its distribution for the case when the correct synthesis model is used...

10.48550/arxiv.1604.06651 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Health care decisions are increasingly informed by clinical decision support algorithms, but these algorithms may perpetuate or increase racial and ethnic disparities in access to and quality of health care. Further complicating the problem, clinical data often have missing or poor quality racial and ethnic information, which can lead to misleading assessments of algorithmic bias. We present novel statistical methods that allow for the use of probabilities of racial/ethnic group membership in assessments of algorithm performance and quantify the statistical bias that results from error...

10.48550/arxiv.2402.13391 preprint EN arXiv (Cornell University) 2024-02-20

Census data are vital to health care research but must also protect respondents' confidentiality. The 2020 decennial Census employs a new Differential Privacy framework; this study examines its effect on the accuracy of an important tool for measuring health disparities, the Bayesian Improved Surname and Geocoding (BISG) algorithm, which uses Census Block Group data to estimate race and ethnicity when self-reported data are unavailable. Using self-reported data as our standard, we compared BISG estimates calculated using the original 2010 Census counts with in...

10.1177/10775587241251870 article EN Medical Care Research and Review 2024-05-14

Accessing data collected by federal statistical agencies is essential for public policy research and improving evidence-based decision making, such as evaluating the effectiveness of social programs, understanding demographic shifts, or addressing public health challenges. Differentially private interactive systems, or validation servers, can form a crucial part of the data-sharing infrastructure. They may allow researchers to query targeted statistics, providing flexible, efficient access to specific insights,...

10.48550/arxiv.2412.11794 preprint EN arXiv (Cornell University) 2024-12-16

Health care decisions are increasingly informed by clinical decision support algorithms, but these algorithms may perpetuate or increase racial and ethnic disparities in access to and quality of health care. Further complicating the problem, clinical data often have missing or poor quality racial and ethnic information, which can lead to misleading assessments of algorithmic bias. We present novel statistical methods that allow for the use of probabilities of racial/ethnic group membership in assessments of algorithm performance and quantify the statistical bias that results from error...

10.1093/biomtc/ujae155 article EN Biometrics 2024-10-03