- Statistical Methods and Bayesian Inference
- Privacy-Preserving Technologies in Data
- Statistical Methods and Inference
- Advanced Causal Inference Techniques
- Firm Innovation and Growth
- Survey Methodology and Nonresponse
- Bayesian Methods and Mixture Models
- Manufacturing Process and Optimization
- Census and Population Estimation
- Advanced Statistical Process Monitoring
- Data Quality and Management
- Global Trade and Economics
- Data-Driven Disease Surveillance
- Bayesian Modeling and Causal Inference
- Scheduling and Optimization Algorithms
- Industrial Vision Systems and Defect Detection
- Cryptography and Data Security
- Healthcare Policy and Management
- Privacy, Security, and Data Protection
- Advanced Statistical Methods and Models
- Data Analysis with R
- Survey Sampling and Estimation Techniques
- Computational Physics and Python Applications
- Scientific Computing and Data Management
- Demographic Modeling and Climate Adaptation
Duke University
2015-2024
United States Census Bureau
2014-2023
Statistical and Applied Mathematical Sciences Institute
2018-2023
Office of the National Coordinator for Health Information Technology
2018
Emory University
2018
Social Science Research Council
2017-2018
National Bureau of Economic Research
2011-2016
University of Minnesota
2011-2016
Colorado State University
2016
University of South Carolina
2012
Multiple imputation is particularly well suited to deal with missing data in large epidemiologic studies, because typically these studies support a wide range of analyses by many users. Some of these analyses may involve complex modeling, including interactions and nonlinear relations. Identifying such relations and encoding them in imputation models, for example, in the conditional regressions for multiple imputation via chained equations, can be daunting tasks with large numbers of categorical and continuous variables. The authors present a nonparametric...
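As a toy illustration of the chained-equations setup this abstract refers to (a single linear conditional model on simulated data, not the authors' nonparametric approach), one imputation pass regresses the incomplete variable on the others and draws imputations from the fitted conditional model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two correlated continuous variables with ~30% of y missing at random.
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < 0.3
y_obs = y.copy()
y_obs[miss] = np.nan

def impute_once(x, y_obs, rng):
    """One chained-equations pass: fit y ~ x on complete cases,
    then draw imputations from the fitted conditional distribution."""
    obs = ~np.isnan(y_obs)
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, res, *_ = np.linalg.lstsq(X, y_obs[obs], rcond=None)
    sigma = np.sqrt(res[0] / (obs.sum() - 2))
    y_imp = y_obs.copy()
    X_mis = np.column_stack([np.ones((~obs).sum()), x[~obs]])
    y_imp[~obs] = X_mis @ beta + rng.normal(scale=sigma, size=(~obs).sum())
    return y_imp

# m completed datasets, as in multiple imputation; pool the slope estimates.
m = 5
completed = [impute_once(x, y_obs, rng) for _ in range(m)]
estimates = [np.polyfit(x, yc, 1)[0] for yc in completed]
pooled_slope = float(np.mean(estimates))
```

In realistic surveys each variable with missingness gets its own conditional model, and those models must be cycled over repeatedly, which is exactly where hand-specifying regressions becomes daunting.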
In many observational studies, analysts estimate treatment effects using propensity scores, e.g. by matching or sub-classifying on the scores. When some values of the covariates are missing, analysts can use multiple imputation to fill in the missing data, estimate propensity scores based on the m completed datasets, and use those scores to estimate treatment effects. We compare two approaches to implement this process. In the first, the analyst estimates the treatment effect using propensity score matching within each completed data set, and averages the m treatment effect estimates. In the second approach, the analyst averages the propensity scores for each record across the completed datasets, and performs matching with these averaged scores to estimate the treatment effect....
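The two pooling orders compared in this abstract can be sketched on simulated data. This sketch uses subclassification on score quintiles and a plug-in score (the assignment model's known form) rather than a fitted logistic regression and matching, purely to contrast "estimate within each dataset, then average" against "average the scores, then estimate once":

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 5

# Simulated complete data: one confounder x, treatment probability expit(x),
# constant treatment effect 2.
expit = lambda v: 1.0 / (1.0 + np.exp(-v))
x_true = rng.normal(size=n)
t = (rng.random(n) < expit(x_true)).astype(int)
y = 2.0 * t + x_true + rng.normal(size=n)

# Pretend x had missing values: m "completed" versions of x.
completed_x = [x_true + rng.normal(scale=0.3, size=n) for _ in range(m)]

def subclass_effect(score, t, y, k=5):
    """Treatment effect via subclassification on score quintiles."""
    edges = np.quantile(score, np.linspace(0, 1, k + 1))
    strata = np.clip(np.searchsorted(edges, score, side="right") - 1, 0, k - 1)
    effs = []
    for s in range(k):
        in_s = strata == s
        if t[in_s].sum() > 0 and (1 - t[in_s]).sum() > 0:
            effs.append(y[in_s & (t == 1)].mean() - y[in_s & (t == 0)].mean())
    return float(np.mean(effs))

# Approach 1: estimate the effect within each completed dataset, then average.
scores = [expit(xc) for xc in completed_x]   # plug-in score, not a fitted model
est1 = float(np.mean([subclass_effect(s, t, y) for s in scores]))

# Approach 2: average each record's score across datasets, then estimate once.
avg_score = np.mean(np.column_stack(scores), axis=1)
est2 = subclass_effect(avg_score, t, y)
```

Both estimates should land near the true effect of 2 here; the paper's interest is in how the two orders behave more generally.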
Abstract. Multiple imputation was first conceived as a tool that statistical agencies could use to handle nonresponse in large-sample public surveys. In the last two decades, the multiple-imputation framework has been adapted for other contexts. For example, individual researchers use multiple imputation to handle missing data in small samples, statistical agencies disseminate multiply-imputed data sets for purposes of protecting confidentiality, and survey methodologists and epidemiologists use multiple imputation to correct for measurement errors. In some of these settings, Rubin's original...
Summary. The paper presents an illustration and empirical study of releasing multiply imputed, fully synthetic public use microdata. Simulations based on data from the US Current Population Survey are used to evaluate the potential validity of inferences for a variety of descriptive and analytic estimands, to assess the degree of protection of confidentiality that is afforded by fully synthetic data, and to illustrate the specification of imputation models. Benefits and limitations of releasing fully synthetic data sets are discussed.
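Inferences from fully synthetic datasets are obtained with their own combining rules. A minimal sketch of one standard version (point estimate q̄_m and variance T = (1 + 1/m) b_m − ū_m, where b_m is the between-dataset variance and ū_m the average within-dataset variance; the truncation of T at zero is a simplification):

```python
import numpy as np

def combine_fully_synthetic(q, u):
    """Combine point estimates q[l] and variance estimates u[l] from m
    fully synthetic datasets: q_bar = mean(q), T = (1 + 1/m) * b_m - u_bar.
    T can be negative in small samples; it is truncated at zero here."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    q_bar = q.mean()
    b_m = q.var(ddof=1)          # between-dataset variance
    u_bar = u.mean()             # average within-dataset variance
    T = (1.0 + 1.0 / m) * b_m - u_bar
    return float(q_bar), float(max(T, 0.0))

q_bar, T = combine_fully_synthetic([10.1, 9.8, 10.4, 10.0, 9.9], [0.04] * 5)
```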
When releasing data to the public, statistical agencies and survey organizations typically alter data values in order to protect the confidentiality of respondents' identities and attribute values. To select among the wide variety of alteration methods, agencies require tools for evaluating the utility of proposed data releases. Such utility measures can be combined with disclosure risk measures to gauge the risk-utility tradeoffs of competing methods. This article presents utility measures focused on differences in inferences obtained from the altered data and corresponding inferences obtained from the original data....
"Data Quality and Record Linkage Techniques." Journal of the American Statistical Association, 103(482), p. 881
Abstract. Regularly occurring flood events have a history in Santiago de Chile, the capital city of Chile and the study area for this research. The analysis of flood events, the resulting damage, and its causes are crucial prerequisites for the development of risk prevention measures. The goal of this research is to empirically investigate vulnerability towards floods as one component of flood risk. The assessment is based on the application of a multi-scale (individual, household, and municipal level) set of indicators and the use of a broad range of data. A case-specific...
Objective: To focus on the relationship between pregnancy-related anxiety and spontaneous preterm birth. Psychosocial factors have been the subject of inquiries about the etiology of preterm birth; a factor of recent interest is maternal prenatal anxiety (worries and concerns related to the pregnancy). Methods: From 1991 to 1993, a total of 1820 women completed a study questionnaire during their first visit to prenatal clinics in Baltimore, Maryland. Pregnancy-related anxiety was assessed using six questions from the Prenatal Social Environment Inventory; scores...
Objectives: Depressive symptoms are common among women, especially those who are of childbearing age or pregnant. Prior studies have suggested that an increased burden of depressive symptoms is associated with diminished health and functional status, but these studies were conducted primarily among middle-aged and older adults. In the current study, we investigated the relationship between depressive symptoms and health and functional status among pregnant women. Methods: Women enrolled in the study at their first prenatal visit to hospital-based clinics and were administered an interview that contained the Center...
When releasing microdata to the public, data disseminators typically alter the original data to protect the confidentiality of database subjects' identities and sensitive attributes. However, such alteration negatively impacts the utility (quality) of the released data. In this paper, we present quantitative measures of data utility for masked microdata, with the aim of improving disseminators' evaluations of competing masking strategies. The measures, which are global in that they reflect similarities between the entire distributions of the original and released data,...
In most countries, national statistical institutes do not release business microdata, because doing so would pose too high a risk of breaching confidentiality. This risk can be avoided by resorting to synthetic data---data simulated from statistical models that reproduce the distribution of the true microdata. In this article, we describe an application of this strategy to the creation of such a database from the results of the annual economic census of American businesses....
In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian, joint modeling approach to multiple imputation based on Dirichlet process mixtures of multinomial distributions. The approach automatically captures complex dependencies while being computationally expedient. The prior...
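The generative side of such a mixture model can be sketched with a truncated stick-breaking mixture of product-multinomials. The parameters below are fixed at arbitrary illustrative values (in the actual method they would be sampled within an MCMC); given an observed entry, a record's missing entry is drawn by first sampling a latent class and then sampling from that class's multinomial:

```python
import numpy as np

rng = np.random.default_rng(2)

# Truncated stick-breaking weights for K mixture components.
K = 3
v = rng.beta(1.0, 2.0, size=K)
v[-1] = 1.0                                    # close the stick at truncation
sticks = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
weights = v * sticks                           # sums to 1

# Two categorical variables (3 and 4 levels); class-specific probabilities.
phi1 = rng.dirichlet(np.ones(3), size=K)
phi2 = rng.dirichlet(np.ones(4), size=K)

def impute_record(x1, x2, rng):
    """Fill in whichever of (x1, x2) is None: sample a latent class from its
    posterior given the observed entry, then sample the missing entry from
    that class's multinomial."""
    post = weights.copy()
    if x1 is not None:
        post *= phi1[:, x1]
    if x2 is not None:
        post *= phi2[:, x2]
    post /= post.sum()
    k = rng.choice(K, p=post)
    if x1 is None:
        x1 = int(rng.choice(3, p=phi1[k]))
    if x2 is None:
        x2 = int(rng.choice(4, p=phi2[k]))
    return x1, x2

filled = [impute_record(x1, x2, rng) for x1, x2 in [(0, None), (None, 3), (2, None)]]
```

The appeal in high dimensions is that dependence among many categorical variables is induced through the shared latent class rather than through explicitly specified interactions.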
In causal studies without random assignment of treatment, causal effects can be estimated using matched treated and control samples, where matches are obtained using propensity scores. Propensity score matching can reduce bias in treatment effect estimators in cases where the matched samples have overlapping covariate distributions. Despite its application in many applied problems, there is no universally employed approach to interval estimation when using propensity score matching. In this article, we present and evaluate approaches
Reluctance of data owners to share their possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting mutually beneficial data mining analysis. We address the case of vertically partitioned data -- multiple data owners/agencies each possess a few attributes of every data record. We focus on agencies wanting to conduct linear regression analysis on the complete records without disclosing the values of their own attributes. This paper describes an algorithm that enables such agencies to compute the exact regression coefficients...
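Why exact coefficients are possible at all can be seen from the normal equations: the full-data cross-products X'X and X'y decompose into blocks that involve each agency's columns. The sketch below forms those blocks directly (the paper's contribution is computing the cross-agency blocks *securely*, which is omitted here) and verifies that the blockwise solution matches an ordinary least-squares fit on the never-actually-shared full matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Vertically partitioned data: agency A holds x_a, agency B holds x_b and y.
x_a = rng.normal(size=(n, 2))
x_b = rng.normal(size=(n, 1))
y = x_a @ np.array([1.0, -2.0]) + 0.5 * x_b[:, 0] + rng.normal(scale=0.1, size=n)

# Blockwise normal equations; off-diagonal blocks are what the secure
# protocol would compute without revealing raw attribute values.
XtX = np.block([[x_a.T @ x_a, x_a.T @ x_b],
                [x_b.T @ x_a, x_b.T @ x_b]])
Xty = np.concatenate([x_a.T @ y, x_b.T @ y])
beta_blocks = np.linalg.solve(XtX, Xty)

# Direct fit on the pooled matrix, for comparison only.
X = np.hstack([x_a, x_b])
beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)
```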
This article presents several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, the confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in a manner such that no database owner can determine the origin of any records other than its own. Regression,...
When statistical agencies release microdata to the public, malicious users (intruders) may be able to link records in the released data to records in external databases. Releasing data in ways that fail to prevent such identifications can discredit the agency or, for some data, constitute a breach of law. To limit disclosures, agencies often release altered versions of the data; however, there usually remain risks of identification. This article applies and extends the framework developed by Duncan and Lambert for computing probabilities of identification of sampled units....
Abstract. This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed datasets in settings where the posterior distributions of the parameters of interest are not approximately Gaussian. We seek to steer them away from a naive approach to inference, namely estimating the posterior distribution from each completed dataset and averaging functionals of these distributions. We demonstrate that this approach results in unreliable inferences. A better approach is to mix the posterior draws from each completed dataset, and use the mixed draws to summarize the posterior distribution. Using simulations,...
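The contrast between the naive and the recommended approach fits in a few lines. In this illustrative simulation (skewed lognormal "posteriors" standing in for non-Gaussian completed-data posteriors), averaging a per-dataset quantile differs from taking the quantile of the mixed draws, because only the mixture respects between-dataset variability:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_draws = 5, 2000

# Pretend each of m completed datasets yields non-Gaussian posterior draws
# for a parameter (different lognormal posteriors per dataset).
per_dataset_draws = [rng.lognormal(mean=mu, sigma=0.5, size=n_draws)
                     for mu in [0.0, 0.2, 0.4, 0.6, 0.8]]

# Naive approach: compute a functional (the 97.5% quantile) per dataset,
# then average the functionals.
naive_upper = float(np.mean([np.quantile(d, 0.975) for d in per_dataset_draws]))

# Recommended approach: pool ("mix") all draws across datasets, then take
# the quantile of the mixture.
mixed = np.concatenate(per_dataset_draws)
mixed_upper = float(np.quantile(mixed, 0.975))
```

Here the mixture's upper quantile exceeds the averaged per-dataset quantiles, so the naive interval would be too short.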
To limit disclosures, statistical agencies and other data disseminators can release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example, sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple draws from statistical models. Because the original records remain on the file, there remain risks of identifications. In this paper, we describe how to evaluate identification risks in partially synthetic data, accounting for the released...
Several national statistical agencies are now releasing partially synthetic, public use microdata. These comprise the units in the original database with sensitive or identifying values replaced with values simulated from statistical models. Specifying synthesis models can be daunting for databases that include many variables of diverse types. The variables may be related in ways that are difficult to capture with standard parametric tools. In this article, we describe how random forests can be adapted to generate partially synthetic data for categorical variables....