- Electoral Systems and Political Participation
- Internet Traffic Analysis and Secure E-voting
- Advanced Causal Inference Techniques
- Scientific Computing and Data Management
- Statistics Education and Methodologies
- Communication in Education and Healthcare
- Health Systems, Economic Evaluations, Quality of Life
- Big Data and Business Intelligence
- Hate Speech and Cyberbullying Detection
- Statistical Methods and Inference
- Algorithms and Data Compression
- Advanced Statistical Methods and Models
- Evaluation of Teaching Practices
- Chaos-based Image/Signal Encryption
- Healthcare Policy and Management
- Game Theory and Voting Systems
- Media Influence and Politics
- Cellular Automata and Applications
- Advanced Steganography and Watermarking Techniques
- Cognitive and Psychological Constructs Research
- Statistical Methods in Clinical Trials
- Optimal Experimental Design Methods
- Sports Analytics and Performance
- Innovations in Educational Methods
- Management and Marketing Education
University of California, Berkeley (2016-2020)
QB3 (2018)
Contra Costa County Library (2018)
Student evaluations of teaching (SET) are widely used in academic personnel decisions as a measure of teaching effectiveness. We show that SET are biased against female instructors by an amount that is large and statistically significant. The bias affects how students rate even putatively objective aspects of teaching, such as how promptly assignments are graded. The bias varies by discipline and by student gender, among other things. It is not possible to adjust for the bias, because it depends on so many factors. SET are more sensitive...
What are the challenges and best practices for doing data-intensive research in teams, labs, and other groups? This paper reports on a discussion in which researchers from many different disciplines and departments shared their experiences of doing data science in their domains. The issues we discuss range from the technical to the social, including getting everyone onto the same computational stack, workflow and pipeline management, handoffs, composing a well-balanced team, dealing with fluid membership, fostering coordination and communication, and not...
What actions can we take to foster diverse and inclusive workplaces in the broad fields around data science? This paper reports on a discussion in which researchers from many different disciplines and departments raised questions and shared their experiences with various aspects of diversity, inclusion, and equity. The issues we discuss include fostering inclusive interpersonal and small-group dynamics, rules and codes of conduct, increasing the diversity of less-represented groups and disciplines, and organizing events for long-term efforts...
Hypothesis tests based on linear models are widely accepted by organizations that regulate clinical trials. These tests are derived under strong assumptions about the data-generating process so that the resulting inference can be based on parametric distributions. Because these methods are well understood and robust, they are sometimes applied to data that depart from the assumptions, such as ordinal integer scores. Permutation tests are a nonparametric alternative that require minimal assumptions, which are often guaranteed by the randomization that was conducted. We compare...
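To illustrate the kind of nonparametric alternative described above, here is a minimal sketch (not code from the paper) of a two-sample permutation test for a difference in means; the data are hypothetical ordinal scores.

```python
import numpy as np

def permutation_test(treated, control, reps=10_000, seed=1):
    """Two-sided permutation test for a difference in group means.

    Under the randomization null hypothesis the group labels are
    exchangeable, so we re-randomize the labels and recompute the statistic.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    observed = treated.mean() - control.mean()

    hits = 0
    for _ in range(reps):
        perm = rng.permutation(pooled)
        stat = perm[:n_t].mean() - perm[n_t:].mean()
        if abs(stat) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (reps + 1)

# Hypothetical ordinal integer scores from a two-arm trial.
treated = np.array([3, 4, 4, 5, 2, 4, 3, 5])
control = np.array([2, 3, 3, 4, 2, 3, 2, 3])
print(permutation_test(treated, control))
```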
The pseudo-random number generators (PRNGs), sampling algorithms, and algorithms for generating random integers in some common statistical packages and programming languages are unnecessarily inaccurate, by an amount that may matter for statistical inference. Most use PRNGs with state spaces that are too small for contemporary problems and methods such as the bootstrap and permutation tests. Many also rely on the false assumption that PRNGs produce IID $U[0, 1)$ outputs. The discreteness of PRNG outputs and the limited state space cause those algorithms to perform poorly...
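To make the state-space concern concrete, the following back-of-envelope sketch (an illustration, not taken from the paper) computes the largest number of items whose permutations a PRNG with a given state size could possibly all produce: a 32-bit state falls short at about a dozen items, and even the Mersenne Twister's 19937-bit state falls short at roughly two thousand.

```python
import math

def max_items_fully_permutable(state_bits):
    """Largest n such that n! does not exceed 2**state_bits.

    A PRNG whose state space has 2**state_bits states can produce at most
    that many distinct output streams, so it cannot generate every
    permutation of n items once n! exceeds the number of states.
    """
    n, log2_fact = 1, 0.0
    while True:
        log2_fact += math.log2(n + 1)
        if log2_fact > state_bits:
            return n
        n += 1

# Illustration: a 32-bit state versus the Mersenne Twister's 19937-bit state.
print(max_items_fully_permutable(32))     # about a dozen items
print(max_items_fully_permutable(19937))  # on the order of two thousand items
```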
Randomized control trials (RCTs) are the gold standard for estimating causal effects, but they often use samples that are non-representative of the actual population of interest. We propose a reweighting method for estimating average treatment effects in settings with noncompliance. Simulations show that the proposed compliance-adjusted estimator outperforms its unadjusted counterpart when compliance is relatively low and can be predicted by observed covariates. We apply the method to evaluate the effect of Medicaid coverage on health care...
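As a rough illustration of reweighting a trial sample toward a target population (a generic sketch, not necessarily the estimator proposed in the paper, and without the compliance adjustment), one can weight trial units by the estimated odds of belonging to the target population given covariates and take a weighted difference in means:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweighted_ate(X_trial, y, treat, X_target):
    """Reweight an RCT sample toward a target population, then estimate the
    average treatment effect as a weighted difference in means.

    Generic sketch only: the paper's compliance adjustment is not shown.
    """
    # Stack trial and target covariates; label membership (1 = target).
    X = np.vstack([X_trial, X_target])
    s = np.concatenate([np.zeros(len(X_trial)), np.ones(len(X_target))])
    model = LogisticRegression(max_iter=1000).fit(X, s)
    p = model.predict_proba(X_trial)[:, 1]
    w = p / (1 - p)  # odds of belonging to the target population

    treated, control = treat == 1, treat == 0
    mu_t = np.average(y[treated], weights=w[treated])
    mu_c = np.average(y[control], weights=w[control])
    return mu_t - mu_c

# Hypothetical synthetic usage.
rng = np.random.default_rng(0)
X_trial = rng.normal(size=(200, 2))
X_target = rng.normal(0.5, 1.0, size=(500, 2))
treat = rng.integers(0, 2, size=200)
y = X_trial[:, 0] + treat * 1.0 + rng.normal(size=200)
print(reweighted_ate(X_trial, y, treat, X_target))
```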
Colorado conducted risk-limiting tabulation audits (RLAs) across the state in 2017, including both ballot-level comparison audits and ballot-polling audits. Those audits only covered contests restricted to a single county; methods to efficiently audit contests that cross county boundaries and to combine ballot polling and ballot-level comparisons have not been available. Colorado's current audit software (RLATool) needs to be improved along these lines and to audit small contests efficiently. This paper addresses those needs. It presents extremely simple but inefficient methods,...
There are many recommendations of "best practices" for those doing data science, data-intensive research, and research in general. These documents usually present a particular vision of how people should work with computing, recommending specific tools, activities, mechanisms, and sensibilities. However, the implementation of best (or better) practices in any setting is often met with resistance from individuals and groups, who perceive some drawbacks to the proposed changes to everyday practice. We offer definitions...
We present a method and software for ballot-polling risk-limiting audits (RLAs) based on Bernoulli sampling: ballots are included in the sample with probability $p$, independently. Bernoulli sampling has several advantages: (1) it does not require a ballot manifest; (2) it can be conducted independently at different locations, rather than requiring a central authority to select the sample from the whole population of cast ballots or to use stratified sampling; (3) it can start in polling places on election night, before margins are known. If the reported...
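A minimal sketch of how Bernoulli sampling can be carried out reproducibly and independently at each location, by hashing a shared public seed together with a ballot identifier (the seed, identifiers, and sampling rate here are hypothetical, and this is not the paper's software):

```python
import hashlib

def include_ballot(seed, ballot_id, p, precision=8):
    """Decide whether a ballot enters the sample, independently with
    probability approximately p.

    Hashes a shared seed together with a ballot identifier and compares the
    leading bytes, scaled to [0, 1), against p. Sketch of one reproducible,
    decentralized implementation of Bernoulli sampling.
    """
    digest = hashlib.sha256(f"{seed}|{ballot_id}".encode()).digest()
    u = int.from_bytes(digest[:precision], "big") / 256 ** precision
    return u < p

# Hypothetical usage: every polling place applies the same rule locally.
sample = [b for b in ("batch1-0001", "batch1-0002", "batch2-0001")
          if include_ballot(seed="public-seed-2018", ballot_id=b, p=0.05)]
print(sample)
```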
Risk-limiting audits (RLAs) offer a statistical guarantee: if a full manual tally of the paper ballots would show that the reported election outcome is wrong, an RLA has a known minimum chance of leading to a full manual tally. RLAs generally rely on random samples. Stratified sampling--partitioning the population of ballots into disjoint strata and sampling independently from the strata--may simplify logistics or increase efficiency compared with simpler sampling designs, but it makes risk calculations harder. We present SUITE, a new method for...
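The following toy sketch gestures at the general idea of combining independent stratum-level p-values with Fisher's combining function and maximizing over ways of allocating error across two strata; it is only schematic, and the actual stratum tests and optimization used by SUITE are defined in the paper. The function p_stratum is a hypothetical placeholder.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined_pvalue(pvalues):
    """Fisher's combining function for independent p-values."""
    stat = -2 * np.sum(np.log(pvalues))
    return chi2.sf(stat, df=2 * len(pvalues))

def max_combined_pvalue(p_stratum, allocations):
    """Schematic union-intersection step: the overall risk is bounded by the
    maximum combined p-value over ways of splitting the outcome-changing
    error between two strata.

    p_stratum(lam, k) should return the p-value for the hypothesis that
    stratum k holds at least a lam share of the error (placeholder API).
    """
    return max(
        fisher_combined_pvalue([p_stratum(lam, 0), p_stratum(1 - lam, 1)])
        for lam in allocations
    )

# Hypothetical usage with a dummy stratum test.
dummy = lambda lam, k: min(1.0, 0.02 + 0.5 * lam)
print(max_combined_pvalue(dummy, np.linspace(0, 1, 101)))
```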
R (Version 3.5.1 patched) has an issue with its random sampling functionality. R generates random integers between $1$ and $m$ by multiplying random floats by $m$, taking the floor, and adding $1$ to the result. Well-known quantization effects in this approach result in a non-uniform distribution on $\{ 1, \ldots, m\}$. The difference, which depends on $m$, can be substantial. Because the sample function relies on generating random integers, it is biased. There is an easy fix: construct random integers directly from random bits, rather than multiplying a random float by $m$. That is the strategy taken...
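A toy demonstration of the quantization effect described above, using a small grid of equally spaced values in $[0, 1)$ to stand in for the finitely many values a PRNG can emit (the grid size and $m$ here are arbitrary choices for illustration; when $m$ is close to the grid size the probability ratio can approach 2):

```python
from collections import Counter
from fractions import Fraction

def floor_method_counts(m, grid_bits):
    """Count how many equally spaced grid points u = j / 2**grid_bits map to
    each integer via floor(u * m) + 1.

    Unless m divides the grid size, some integers receive more grid points
    than others, so the floor method makes them more probable.
    """
    grid = 2 ** grid_bits
    return Counter(int(Fraction(j, grid) * m) + 1 for j in range(grid))

counts = floor_method_counts(m=5, grid_bits=10)  # 1024 grid points, 5 bins
print(counts)                                    # bins get 204 or 205 points, not equal
print(max(counts.values()) / min(counts.values()))
```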