Jon McAuliffe

ORCID: 0000-0003-2626-7320
Research Areas
  • Gaussian Processes and Bayesian Inference
  • Statistical Methods and Inference
  • Bayesian Methods and Mixture Models
  • Genomics and Phylogenetic Studies
  • RNA and protein synthesis mechanisms
  • Galaxies: Formation, Evolution, Phenomena
  • Random Matrices and Applications
  • Algorithms and Data Compression
  • Stochastic Gradient Optimization Techniques
  • Genetic diversity and population structure
  • Machine Learning in Bioinformatics
  • RNA modifications and cancer
  • Limits and Structures in Graph Theory
  • Machine Learning and Algorithms
  • Chromosomal and Genetic Variations
  • Markov Chains and Monte Carlo Methods
  • Parallel Computing and Optimization Techniques
  • Sparse and Compressive Sensing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Face and Expression Recognition
  • Cancer-related molecular mechanisms research
  • Evolution and Genetic Dynamics
  • Blind Source Separation Techniques
  • Computational and Text Analysis Methods
  • Advanced Statistical Methods and Models

University of California, Berkeley
2006-2021

Carnegie Mellon University
2020

Lawrence Berkeley National Laboratory
2019

Columbia University
2019

Princeton University
2019

Worcester Polytechnic Institute
2014

Al Akhawayn University
2014

University of Pennsylvania
2006-2010

Massachusetts Institute of Technology
2010

Biotechnology Institute
2005

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this...

10.1198/016214505000000907 article EN Journal of the American Statistical Association 2006-02-15

Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes....

10.1126/science.1081331 article EN Science 2003-02-27

Type I collagen, the predominant protein of vertebrates, polymerizes with type III and V collagens and non-collagenous molecules into large cable-like fibrils, yet how the fibril interacts with cells and other binding partners remains poorly understood. To help reveal insights into the collagen structure-function relationship, a data base was assembled including hundreds of ligand binding sites and mutations on a two-dimensional model of the fibril. Visual examination of the distribution of functional sites, and statistical analysis of mutation...

10.1074/jbc.m709319200 article EN cc-by Journal of Biological Chemistry 2008-05-17

Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large data sets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior...

10.1198/jasa.2009.tm08030 article EN Journal of the American Statistical Association 2010-03-01

We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S....

10.48550/arxiv.1003.0783 preprint EN other-oa arXiv (Cornell University) 2010-01-01

A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential probability ratio test -- our confidence sequences are time-uniform extensions of the first; provide tight, nonasymptotic characterizations of the second; and generalize the third to nonparametric settings, including...

10.1214/20-aos1991 article EN The Annals of Statistics 2021-04-01

High-pressure liquid chromatography–tandem mass spectrometry was used to obtain a protein profile of Escherichia coli strain MG1655 grown in minimal medium with glycerol as the carbon source. By using cell lysate from only 3 × 10^8 cells, at least four different tryptic peptides were detected for each of 404 proteins in a short 4-h experiment. At least one peptide with a high reliability score was detected for 986 proteins. Because membrane proteins were underrepresented, a second experiment was performed with a preparation enriched in membranes. An...

10.1073/pnas.1533294100 article EN Proceedings of the National Academy of Sciences 2003-07-23

We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960–80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980–2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several...

10.1214/18-ps321 article EN cc-by Probability Surveys 2020-01-01

We determined global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth in adapted batch cultures and cultures subjected to nutrient shifts. Using two limitations helped to distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate. Both homeostatic and slow growth responses were amplified upon shifts. This made detection of these responses more reliable and increased the number of genes that were differentially expressed. We analyzed microarray data in several ways: by determining expression changes after...

10.1128/jb.187.3.1074-1090.2005 article EN Journal of Bacteriology 2005-01-19

Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete...

10.1093/bioinformatics/bth153 article EN Bioinformatics 2004-02-26

The creation of an acceptable codebook, as defined by three methods of measuring performance (peak signal-to-noise ratio, image quality, and entropy), is discussed, and how the Linde-Buzo-Gray (LBG) and Kohonen neural network (KNN) methods differ is detailed. The results show that codebooks generated by these two methods both enable low bits-per-pixel coding with low distortion. When using fewer training vectors, and when given a suboptimal initial codebook, the KNN method outperformed LBG. For a theoretical lower bound, mean square error comparisons...

10.1109/icassp.1990.116035 article EN International Conference on Acoustics, Speech, and Signal Processing 2002-12-04

We previously characterized nutrient-specific transcriptional changes in Escherichia coli upon limitation of nitrogen (N) or sulfur (S). These global homeostatic responses presumably minimize the slowing of growth under a particular condition. Here, we characterize responses to slow growth per se that are not nutrient-specific. The latter help to coordinate the slowing of growth, and in the case of down-regulated genes, to conserve scarce N or S for other purposes. Three effects were particularly striking. First, although many genes control...

10.1073/pnas.0500141102 article EN Proceedings of the National Academy of Sciences 2005-02-16

Background: Bronchoscopy for suspected lung cancer has low diagnostic sensitivity, rendering many inconclusive results. The Bronchial Genomic Classifier (BGC) was developed to help with patient management by identifying those with low risk of lung cancer when bronchoscopy is inconclusive. The BGC was trained and validated on patients in the Airway Epithelial Gene Expression in the Diagnosis of Lung Cancer (AEGIS) trials. A modern patient cohort, the BGC Registry, showed differences in key clinical factors from the AEGIS cohorts, with less smoking...

10.1186/s12920-020-00782-1 article EN cc-by BMC Medical Genomics 2020-10-01

Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments...

10.1371/journal.pgen.1001099 article EN cc-by PLoS Genetics 2010-09-09

Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading...

10.1109/ipdps.2018.00015 article EN 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

Motivation: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity between as well as within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination. Results: We developed a mixed-membership classification model, called glad, that simultaneously learns...

10.1093/bioinformatics/btu618 article EN Bioinformatics 2014-09-29

Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows...

10.1073/pnas.0502790102 article EN Proceedings of the National Academy of Sciences 2005-05-23