- Gaussian Processes and Bayesian Inference
- Statistical Methods and Inference
- Bayesian Methods and Mixture Models
- Genomics and Phylogenetic Studies
- RNA and protein synthesis mechanisms
- Galaxies: Formation, Evolution, Phenomena
- Random Matrices and Applications
- Algorithms and Data Compression
- Stochastic Gradient Optimization Techniques
- Genetic diversity and population structure
- Machine Learning in Bioinformatics
- RNA modifications and cancer
- Limits and Structures in Graph Theory
- Machine Learning and Algorithms
- Chromosomal and Genetic Variations
- Markov Chains and Monte Carlo Methods
- Parallel Computing and Optimization Techniques
- Sparse and Compressive Sensing Techniques
- Domain Adaptation and Few-Shot Learning
- Face and Expression Recognition
- Cancer-related molecular mechanisms research
- Evolution and Genetic Dynamics
- Blind Source Separation Techniques
- Computational and Text Analysis Methods
- Advanced Statistical Methods and Models
University of California, Berkeley
2006-2021
Carnegie Mellon University
2020
Lawrence Berkeley National Laboratory
2019
Columbia University
2019
Princeton University
2019
Worcester Polytechnic Institute
2014
Al Akhawayn University
2014
University of Pennsylvania
2006-2010
Massachusetts Institute of Technology
2010
Biotechnology Institute
2005
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this...
Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes....
Type I collagen, the predominant protein of vertebrates, polymerizes with type III and V collagens and non-collagenous molecules into large cable-like fibrils, yet how the fibril interacts with cells and other binding partners remains poorly understood. To help reveal insights into the collagen structure-function relationship, a data base was assembled including hundreds of ligand sites and mutations on a two-dimensional model of the fibril. Visual examination of the distribution of functional sites, and statistical analysis of mutation...
Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large data sets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior...
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S....
A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential probability ratio test. Our confidence sequences are time-uniform extensions of the first; provide tight, nonasymptotic characterizations of the second; and generalize the third to nonparametric settings, including...
High-pressure liquid chromatography–tandem mass spectrometry was used to obtain a protein profile of Escherichia coli strain MG1655 grown in minimal medium with glycerol as the carbon source. By using cell lysate from only 3 × 10^8 cells, at least four different tryptic peptides were detected for each of 404 proteins in a short 4-h experiment. At least one peptide with a high reliability score was detected for 986 proteins. Because membrane proteins were underrepresented, a second experiment was performed with a preparation enriched in membranes. An...
We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960–80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980–2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several...
We determined global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth in adapted batch cultures and cultures subjected to nutrient shifts. Using two limitations helped distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate. Both homeostatic and slow growth responses were amplified upon shifts. This made detection of these responses more reliable and increased the number of genes that were differentially expressed. We analyzed microarray data in several ways: by determining expression changes after...
Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete...
The creation of an acceptable codebook, as defined by three methods of measuring performance (peak signal-to-noise ratio, image quality, and entropy), is discussed, and how the Linde-Buzo-Gray (LBG) and Kohonen neural network (KNN) methods differ is detailed. The results show that codebooks generated by these two methods both enable low bits-per-pixel coding with low distortion. When using fewer training vectors, and when given a suboptimal initial codebook, the KNN method outperformed LBG. For a theoretical lower bound, mean square error comparisons...
We previously characterized nutrient-specific transcriptional changes in Escherichia coli upon limitation of nitrogen (N) or sulfur (S). These global homeostatic responses presumably minimize the slowing of growth under a particular condition. Here, we characterize responses to slow growth per se that are not nutrient-specific. The latter help to coordinate the slowing of growth and, in the case of down-regulated genes, to conserve scarce N or S for other purposes. Three effects were particularly striking. First, although many genes under control...
Background: Bronchoscopy for suspected lung cancer has low diagnostic sensitivity, rendering many inconclusive results. The Bronchial Genomic Classifier (BGC) was developed to help with patient management by identifying those with low risk of lung cancer when bronchoscopy is inconclusive. The BGC was trained and validated on patients in the Airway Epithelial Gene Expression in the Diagnosis of Lung Cancer (AEGIS) trials. A modern cohort, the BGC Registry, showed differences in key clinical factors from the AEGIS cohorts, with less smoking...
Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments...
Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading...
Motivation: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity between as well as within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination. Results: We developed a mixed-membership classification model, called glad, that simultaneously learns...
Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows...
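As a small illustration of the idea in the classification abstract above, that support vector machines and boosting minimize a convex surrogate of the 0–1 loss, the following sketch compares three common surrogates with the 0–1 loss as functions of the margin y·f(x). It is only a sketch under stated assumptions: the function names, the base-2 scaling of the logistic loss, and the sample margins are illustrative choices, not taken from the paper.

```python
import math

def zero_one_loss(margin):
    # margin = y * f(x); a non-positive margin is counted as an error here
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    # surrogate associated with support vector machines
    return max(0.0, 1.0 - margin)

def exponential_loss(margin):
    # surrogate associated with boosting (AdaBoost)
    return math.exp(-margin)

def logistic_loss(margin):
    # base-2 logarithm (an illustrative scaling) so the loss equals 1 at margin 0
    return math.log2(1.0 + math.exp(-margin))

# Hypothetical margins, chosen only for illustration
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(m, zero_one_loss(m), hinge_loss(m),
          round(exponential_loss(m), 3), round(logistic_loss(m), 3))
```

Each surrogate is convex in the margin and sits on or above the 0–1 loss, which is why minimizing it is computationally tractable yet still relevant to classification error.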