- Statistical Methods and Inference
- Natural Language Processing Techniques
- Bayesian Modeling and Causal Inference
- Topic Modeling
- Bayesian Methods and Mixture Models
- Neural Networks and Applications
- Machine Learning and Algorithms
- Sparse and Compressive Sensing Techniques
- Hemoglobinopathies and Related Disorders
- Algorithms and Data Compression
- Advanced Statistical Methods and Models
- Face and Expression Recognition
- Text and Document Classification Technologies
- Information Retrieval and Search Behavior
- Machine Learning and Data Classification
- Distributed Sensor Networks and Detection Algorithms
- Iron Metabolism and Disorders
- Statistical Methods and Bayesian Inference
- Stochastic Gradient Optimization Techniques
- Gene expression and cancer classification
- Blood groups and transfusion
- Advanced Text Analysis Techniques
- Speech Recognition and Synthesis
- Markov Chains and Monte Carlo Methods
- Gaussian Processes and Bayesian Inference
Yale University
2015-2024
Carnegie Mellon University
2007-2020
Johns Hopkins University
2012-2020
University of Chicago
2012-2017
New Mexico Institute of Mining and Technology
2016
Amazon (United States)
2016
University of Pennsylvania
2016
University of Illinois Chicago
2015
Princeton University
2007-2014
Stanford University
2014
A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection....
In this paper, we present a statistical approach to machine translation. We describe the application of our approach to translation from French to English and give preliminary results.
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study...
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its...
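The query-likelihood ranking described in the two abstracts above can be sketched as follows. This is only an illustration under stated assumptions: it uses Dirichlet-prior smoothing as one common smoothing choice (the abstracts do not say which methods are studied), and the function name, the toy documents, and the value of `mu` are hypothetical.

```python
import math
from collections import Counter

def dirichlet_query_log_likelihood(query, doc, coll_counts, coll_len, mu=2000.0):
    """log p(query | doc) under a unigram model with Dirichlet-prior smoothing.

    Assumption: smoothed p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu),
    where p(w | C) is the relative frequency of w in the whole collection.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # assumption: terms unseen in the collection are skipped
        p = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p)
    return score

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "language model smoothing for retrieval".split(),
    "d2": "speech recognition with neural networks".split(),
}
coll_counts = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll_counts.values())
query = "language model".split()

# Rank documents by the smoothed likelihood of the query.
ranking = sorted(
    docs,
    key=lambda d: dirichlet_query_log_likelihood(
        query, docs[d], coll_counts, coll_len, mu=10.0
    ),
    reverse=True,
)
```

Without smoothing, a document missing any query term would receive zero likelihood; the collection model gives such terms a small nonzero probability instead.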
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions...
We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field, and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques...
We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. The method is analyzed under high-dimensional scaling in which both the number of nodes p and the maximum neighborhood size d are allowed to grow as a function of the number of observations n. Our main results provide sufficient conditions on the triple (n, p, d) and the model parameters for the method to succeed in consistently...
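The neighborhood-estimation step described above can be illustrated with a small sketch. The assumptions are loud ones: the solver below is a plain proximal-gradient (ISTA) loop chosen for brevity, not the procedure from the paper; there is no intercept; and the function names, the penalty `lam`, and the toy data are hypothetical.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def l1_logistic_neighborhood(X, y, lam, step=1.0, iters=500):
    """l1-penalized logistic regression of one node (y in {-1, +1}) on the others.

    Minimizes mean(log(1 + exp(-y * theta.x))) + lam * ||theta||_1 by ISTA.
    The nonzero entries of theta are the estimated neighbors of the node.
    """
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            margin = yi * sum(t * v for t, v in zip(theta, xi))
            w = -yi / (1.0 + math.exp(margin))  # d/d(theta.x) of the logistic loss
            for j in range(p):
                grad[j] += w * xi[j] / n
        theta = [soft_threshold(t - step * g, step * lam)
                 for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: the target node agrees with node 0 in 9 of 10 samples
# and is unrelated to node 1.
X, y = [], []
for x1 in (-1, 1):
    for x2 in (-1, 1):
        for k in range(10):
            X.append([x1, x2])
            y.append(x1 if k < 9 else -x1)

theta = l1_logistic_neighborhood(X, y, lam=0.1)
neighbors = [j for j, t in enumerate(theta) if t != 0.0]
```

Repeating the regression with each node in turn as the response, and combining the estimated neighborhoods, yields an estimate of the whole graph.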
The language modeling approach to retrieval has been shown to perform well empirically. One advantage of this new approach is its statistical foundations. However, feedback, as one important component in a retrieval system, has only been dealt with heuristically in this new retrieval approach: the original query is usually literally expanded by adding additional terms to it. Such expansion-based feedback creates an inconsistent interpretation of the original and the expanded query. In this paper, we present a more principled approach to feedback in the language modeling approach. Specifically, we treat feedback as updating the query language model based on the extra...
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method...
Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula, or nonparanormal, for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many...
We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well motivated model of the document-to-query translation process, and describe an...
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval that generalizes the traditional precision and recall metrics by accounting...
We present a new class of methods for high dimensional non-parametric regression and classification called sparse additive models. Our methods combine ideas from sparse linear modelling and additive non-parametric regression. We derive an algorithm for fitting the models that is practical and effective even when the number of covariates is larger than the sample size. Sparse additive models are essentially a functional version of the grouped lasso of Yuan and Lin. They are also closely related to the COSSO model of Lin and Zhang but decouple smoothing and sparsity, enabling the use of arbitrary non-parametric smoothers....
We propose a semiparametric approach called the nonparanormal SKEPTIC for efficiently and robustly estimating high-dimensional undirected graphical models. To achieve modeling flexibility, we consider the nonparanormal graphical models proposed by Liu, Lafferty and Wasserman [J. Mach. Learn. Res. 10 (2009) 2295–2328]. To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman’s rho and Kendall’s tau. We prove that the nonparanormal SKEPTIC achieves the optimal parametric rates of convergence for both graph...
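The rank-based correlation estimators named above can be sketched in a few lines. Two assumptions go beyond the abstract: the sine transforms are the standard Gaussian-copula relationships between rank correlations and the latent correlation, and the simple formulas below assume no ties. All function names and the sample data are hypothetical.

```python
import math

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant pairs) / total pairs. Assumes no ties."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

def ranks(v):
    """Ranks 1..n of the entries of v. Assumes no ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho from squared rank differences. Assumes no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def latent_corr_from_tau(tau):
    """Gaussian-copula relationship: latent correlation = sin(pi/2 * tau)."""
    return math.sin(math.pi / 2.0 * tau)

def latent_corr_from_rho(rho):
    """Gaussian-copula relationship: latent correlation = 2 * sin(pi/6 * rho)."""
    return 2.0 * math.sin(math.pi / 6.0 * rho)

# Hypothetical sample: the estimates are unchanged by a monotone transform of y.
x = [0.3, 1.2, -0.5, 2.0, 0.9]
y = [0.1, 0.8, -0.2, 1.1, 1.5]
y_transformed = [math.exp(v) for v in y]
tau = kendall_tau(x, y)
rho = spearman_rho(x, y)
```

Because both statistics depend only on ranks, they are invariant to the unknown monotone marginal transformations of a nonparanormal model, which is what makes them usable in place of the sample correlation.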
We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like...