NFDI4DS | UHH-SEMS - Publication Details

Jon McAuliffe

ORCID: 0000-0003-2626-7320

About

Contact & Profiles

OpenAlex author ID: A5010768151

Research Areas

Gaussian Processes and Bayesian Inference
Statistical Methods and Inference
Bayesian Methods and Mixture Models
Genomics and Phylogenetic Studies
RNA and protein synthesis mechanisms
Galaxies: Formation, Evolution, Phenomena
Random Matrices and Applications
Algorithms and Data Compression
Stochastic Gradient Optimization Techniques
Genetic diversity and population structure
Machine Learning in Bioinformatics
RNA modifications and cancer
Limits and Structures in Graph Theory
Machine Learning and Algorithms
Chromosomal and Genetic Variations
Markov Chains and Monte Carlo Methods
Parallel Computing and Optimization Techniques
Sparse and Compressive Sensing Techniques
Domain Adaptation and Few-Shot Learning
Face and Expression Recognition
Cancer-related molecular mechanisms research
Evolution and Genetic Dynamics
Blind Source Separation Techniques
Computational and Text Analysis Methods
Advanced Statistical Methods and Models

University of California, Berkeley
2006-2021

Carnegie Mellon University
2020

Lawrence Berkeley National Laboratory
2019

Columbia University
2019

Princeton University
2019

Worcester Polytechnic Institute
2014

Al Akhawayn University
2014

University of Pennsylvania
2006-2010

Massachusetts Institute of Technology
2010

Biotechnology Institute
2005

Convexity, Classification, and Risk Bounds

OPENALEX - Publications

Peter L. Bartlett, Michael I. Jordan, Jon McAuliffe

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0–1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0–1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this...

10.1198/016214505000000907 article EN Journal of the American Statistical Association 2006-02-15

Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome

Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, and 2 more

Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes....

10.1126/science.1081331 article EN Science 2003-02-27

Candidate Cell and Matrix Interaction Domains on the Collagen Fibril, the Predominant Protein of Vertebrates

Shawn M. Sweeney, Joseph Orgel, Andrzej Fertala, Jon McAuliffe, Kevin Turner, and 10 more

Type I collagen, the predominant protein of vertebrates, polymerizes with type III and V collagens and non-collagenous molecules into large cable-like fibrils, yet how the fibril interacts with cells and other binding partners remains poorly understood. To help reveal insights into the collagen structure-function relationship, a data base was assembled including hundreds of type I collagen ligand binding sites and mutations on a two-dimensional model of the fibril. Visual examination of the distribution of functional sites, and statistical analysis of mutation...

10.1074/jbc.m709319200 article EN cc-by Journal of Biological Chemistry 2008-05-17

Variational Inference for Large-Scale Models of Discrete Choice

Michael Braun, Jon McAuliffe

Discrete choice models are commonly used by applied statisticians in numerous fields, such as marketing, economics, finance, and operations research. When agents in discrete choice models are assumed to have differing preferences, exact inference is often intractable. Markov chain Monte Carlo techniques make approximate inference possible, but the computational cost is prohibitive on the large data sets now becoming routinely available. Variational methods provide a deterministic alternative for approximation of the posterior...

10.1198/jasa.2009.tm08030 article EN Journal of the American Statistical Association 2010-03-01

Supervised Topic Models

David M. Blei, Jon McAuliffe

We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S....

10.48550/arxiv.1003.0783 preprint EN other-oa arXiv (Cornell University) 2010-01-01

Time-uniform, nonparametric, nonasymptotic confidence sequences

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet S. Sekhon

A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic coverage guarantees under nonparametric conditions. We draw connections between the Cramér-Chernoff method for exponential concentration, the law of the iterated logarithm (LIL), and the sequential probability ratio test: our confidence sequences are time-uniform extensions of the first; provide tight, nonasymptotic characterizations of the second; and generalize the third to nonparametric settings, including...

10.1214/20-aos1991 article EN The Annals of Statistics 2021-04-01

Toward a protein profile of Escherichia coli: Comparison to its transcription profile

Rebecca W. Corbin, Oleg Paliy, Feng Yang, Jeffrey Shabanowitz, Mark Platt, and 7 more

High-pressure liquid chromatography–tandem mass spectrometry was used to obtain a protein profile of Escherichia coli strain MG1655 grown in minimal medium with glycerol as the carbon source. By using cell lysate from only 3 × 10⁸ cells, at least four different tryptic peptides were detected for each of 404 proteins in a short 4-h experiment. At least one peptide with a high reliability score was detected for 986 proteins. Because membrane proteins were underrepresented, a second experiment was performed with a preparation enriched in membranes. An...

10.1073/pnas.1533294100 article EN Proceedings of the National Academy of Sciences 2003-07-23

Nonparametric empirical Bayes for the Dirichlet process mixture model

Jon McAuliffe, David M. Blei, Michael I. Jordan

10.1007/s11222-006-5196-2 article EN Statistics and Computing 2006-01-01

Time-uniform Chernoff bounds via nonnegative supermartingales

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, Jasjeet S. Sekhon

We develop a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We illustrate this point by presenting a single assumption and theorem that together unify and strengthen many tail bounds for martingales, including classical inequalities (1960–80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980–2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Peña; and several...

10.1214/18-ps321 article EN cc-by Probability Surveys 2020-01-01

Sulfur and Nitrogen Limitation in Escherichia coli K-12: Specific Homeostatic Responses

Prasad Gyaneshwar, Oleg Paliy, Jon McAuliffe, David L. Popham, Michael I. Jordan, and 1 more

We determined global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth in adapted batch cultures and cultures subjected to nutrient shifts. Using two limitations helped to distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate. Both homeostatic and slow growth responses were amplified upon shifts. This made detection of these responses more reliable and increased the number of genes that were differentially expressed. We analyzed microarray data in several ways: by determining expression changes after...

10.1128/jb.187.3.1074-1090.2005 article EN Journal of Bacteriology 2005-01-19

Multiple-sequence functional annotation and the generalized hidden Markov phylogeny

Jon McAuliffe, Lior Pachter, Michael I. Jordan

Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete...

10.1093/bioinformatics/bth153 article EN Bioinformatics 2004-02-26

A comparison of the LBG algorithm and Kohonen neural network paradigm for image vector quantization

Jon McAuliffe, Les Atlas, Carlos Rivera

The creation of an acceptable codebook, as defined by three methods of measuring performance (peak signal-to-noise ratio, image quality, and entropy), is discussed, and how the Linde-Buzo-Gray (LBG) and Kohonen neural network (KNN) methods differ is detailed. The results show that the codebooks generated by these two methods both enable low bits-per-pixel coding with low distortion. When using fewer training vectors, and when given a suboptimal initial codebook, the KNN method outperformed LBG. For the theoretical lower bound, mean square error comparisons...

10.1109/icassp.1990.116035 article EN International Conference on Acoustics, Speech, and Signal Processing 2002-12-04

Lessons from Escherichia coli genes similarly regulated in response to nitrogen and sulfur limitation

Prasad Gyaneshwar, Oleg Paliy, Jon McAuliffe, Adriane C. Jones, Michael I. Jordan, and 1 more

We previously characterized nutrient-specific transcriptional changes in Escherichia coli upon limitation of nitrogen (N) or sulfur (S). These global homeostatic responses presumably minimize the slowing of growth under a particular condition. Here, we characterize responses to slow growth per se that are not nutrient-specific. The latter help to coordinate the slowing of growth, and in the case of down-regulated genes, to conserve scarce N or S for other purposes. Three effects were particularly striking. First, although many genes under control...

10.1073/pnas.0500141102 article EN Proceedings of the National Academy of Sciences 2005-02-16

Improving lung cancer risk stratification leveraging whole transcriptome RNA sequencing and machine learning across multiple cohorts

Yoonha Choi, Jianghan Qu, Esther Wu, Yangyang Hao, Jiarui Zhang, and 11 more

Background: Bronchoscopy for suspected lung cancer has low diagnostic sensitivity, rendering many inconclusive results. The Bronchial Genomic Classifier (BGC) was developed to help with patient management by identifying those with low risk of lung cancer when bronchoscopy is inconclusive. The BGC was trained and validated on patients in the Airway Epithelial Gene Expression in the Diagnosis of Lung Cancer (AEGIS) trials. A modern patient cohort, the BGC Registry, showed differences in key clinical factors from the AEGIS cohorts, with less smoking...

10.1186/s12920-020-00782-1 article EN cc-by BMC Medical Genomics 2020-10-01

Long- and Short-Term Selective Forces on Malaria Parasite Genomes

Sanne Nygaard, Alexander Braunstein, Gareth Malsen, Stijn van Dongen, Paul P. Gardner, and 7 more

Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments...

10.1371/journal.pgen.1001099 article EN cc-by PLoS Genetics 2010-09-09

Cataloging the Visible Universe Through Bayesian Inference at Petascale

Jeffrey Regier, Jon McAuliffe, R. C. Thomas, Prabhat, Kiran Pamnany, and 7 more

Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading...

10.1109/ipdps.2018.00015 article EN 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

GLAD: a mixed-membership model for heterogeneous tumor subtype classification

Hachem Saddiki, Jon McAuliffe, Patrick Flaherty

Motivation: Genomic analyses of many solid cancers have demonstrated extensive genetic heterogeneity between as well as within individual tumors. However, statistical methods for classifying tumors by subtype based on genomic biomarkers generally entail an all-or-none decision, which may be misleading for clinical samples containing a mixture of subtypes and/or normal cell contamination. Results: We have developed a mixed-membership classification model, called glad, that simultaneously learns...

10.1093/bioinformatics/btu618 article EN Bioinformatics 2014-09-29

Cataloging the visible universe through Bayesian inference in Julia at petascale

Jeffrey Regier, Keno Fischer, Kiran Pamnany, Andreas Noack, Jarrett Revels, and 7 more

10.1016/j.jpdc.2018.12.008 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2019-01-21

Subtree power analysis and species selection for comparative genomics

Jon McAuliffe, Michael I. Jordan, Lior Pachter

Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows...

10.1073/pnas.0502790102 article EN Proceedings of the National Academy of Sciences 2005-05-23