Xin Bing

ORCID: 0000-0001-7462-9360
Research Areas
  • Statistical Methods and Inference
  • Bayesian Methods and Mixture Models
  • Sparse and Compressive Sensing Techniques
  • Advanced Statistical Methods and Models
  • Gene expression and cancer classification
  • Bioinformatics and Genomic Networks
  • Random Matrices and Applications
  • Single-cell and spatial transcriptomics
  • Domain Adaptation and Few-Shot Learning
  • Machine Learning and Algorithms
  • Blind Source Separation Techniques
  • Biosensors and Analytical Detection
  • Algorithms and Data Compression
  • Face and Expression Recognition
  • Advanced biosensing and bioanalysis techniques
  • Molecular Biology Techniques and Applications
  • Text and Document Classification Technologies
  • Advanced Optimization Algorithms Research
  • Topic Modeling
  • Advanced Image and Video Retrieval Techniques
  • Computational and Text Analysis Methods
  • Stochastic Gradient Optimization Techniques
  • Geophysical and Geoelectrical Methods
  • Gene Regulatory Network Analysis
  • Machine Learning in Bioinformatics

University of Toronto
2022-2024

Cornell University
2018-2022

This work introduces a novel estimation method, called LOVE, of the entries and the structure of the loading matrix $A$ in the latent factor model $X=AZ+E$, for an observable random vector $X\in \mathbb{R}^{p}$, with correlated, unobservable factors $Z\in \mathbb{R}^{K}$, $K$ unknown, and uncorrelated noise $E$. Each row of $A$ is scaled and allowed to be sparse. In order to identify $A$, we require the existence of pure variables, which are components of $X$ that are associated, via $A$, with one and only one factor. Despite the fact that the number $K$ and the pure variables...

10.1214/19-aos1877 article EN The Annals of Statistics 2020-08-01
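As a hedged illustration of the model above (a simulation sketch, not the LOVE algorithm itself; all dimensions and values are hypothetical), the following generates $X=AZ+E$ with two pure variables per factor and checks that pure variables of the same factor are strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n = 6, 2, 5000

# Loading matrix with two pure variables per factor: rows 0-1 load only
# on factor 0, rows 2-3 only on factor 1; rows 4-5 are mixed.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.5, 0.5],
              [0.3, 0.7]])

Z = rng.normal(size=(n, K))          # factors (uncorrelated here for simplicity)
E = 0.3 * rng.normal(size=(n, p))    # uncorrelated noise
X = Z @ A.T + E                      # latent factor model X = AZ + E

# Pure variables sharing a factor have covariance equal to that factor's
# variance, here close to 1; this is the structure LOVE exploits.
C = np.cov(X, rowvar=False)
print(round(C[0, 1], 2), round(C[2, 3], 2))
```

The covariance pattern, rather than the raw data, is what identifies the pure-variable pairs.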

Topic models have become popular for the analysis of data that consists in a collection of $n$ independent multinomial observations, with parameters $N_{i}\in\mathbb{N}$ and $\Pi_{i}\in[0,1]^{p}$ for $i=1,\ldots,n$. The model links all cell probabilities, collected in a $p\times n$ matrix $\Pi$, via the assumption that $\Pi$ can be factorized as the product of two nonnegative matrices $A\in[0,1]^{p\times K}$ and $W\in[0,1]^{K\times n}$. The model was originally developed in text mining, where one browses through $n$ documents, based on...

10.3150/19-bej1166 article EN Bernoulli 2020-04-27
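A minimal simulation sketch of the data-generating mechanism described above (all sizes are hypothetical): build column-stochastic $A$ and $W$, form $\Pi = AW$, and draw multinomial counts per document.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, n, N = 8, 3, 4, 10000

# Column-stochastic word-topic matrix A (p x K) and topic-document matrix W (K x n).
A = rng.dirichlet(np.ones(p), size=K).T   # each column sums to 1
W = rng.dirichlet(np.ones(K), size=n).T   # each column sums to 1
Pi = A @ W                                # p x n matrix of cell probabilities

# Document i is a multinomial draw with N trials and probabilities Pi[:, i].
counts = np.column_stack([rng.multinomial(N, Pi[:, i]) for i in range(n)])
freq = counts / N                         # empirical frequencies approximate Pi
print(np.abs(freq - Pi).max() < 0.05)
```

Because each column of $A$ and of $W$ lies on a probability simplex, each column of $\Pi$ does too, so the multinomial sampling is well defined.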

We consider the multivariate response regression problem with a coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion differs notably from the one proposed in Bunea, She and Wegkamp (Ann. Statist. 39 (2011) 1282–1309) in that it does not require estimation of the variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure, that adapts to the signal-to-noise ratio. The procedure finds the true rank in a few...

10.1214/18-aos1774 article EN The Annals of Statistics 2019-10-31
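To make the rank-selection task concrete, here is a simulation sketch with a generic singular-value elbow rule standing in for the paper's criterion (the threshold below is a hypothetical placeholder, not the proposed method):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m, r = 200, 10, 8, 3

# Rank-3 coefficient matrix built from orthonormal factors (hypothetical sizes).
Q1, _ = np.linalg.qr(rng.normal(size=(p, r)))
Q2, _ = np.linalg.qr(rng.normal(size=(m, r)))
B = 5.0 * Q1 @ Q2.T

X = rng.normal(size=(n, p))
Y = X @ B + 0.1 * rng.normal(size=(n, m))

# Least-squares fit, then inspect the singular values of the fitted mean:
B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
s = np.linalg.svd(X @ B_hat, compute_uv=False)

# Generic elbow rule: keep singular values above a small fraction of the
# largest one; the paper's criterion is data-driven and tuning-free instead.
rank_est = int((s > 0.05 * s[0]).sum())
print(rank_est)
```

With a strong signal the spectrum of the fitted mean drops sharply after the true rank, which is the phenomenon any rank-selection criterion must detect.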

High-dimensional cellular and molecular profiling of biological samples highlights the need for analytical approaches that can integrate multi-omic datasets to generate prioritized causal inferences. Current methods are limited by the high dimensionality of the combined datasets, the differences in their data distributions, and their integration to infer causal relationships. Here, we present Essential Regression (ER), a novel latent-factor-regression-based interpretable machine-learning approach that addresses these problems...

10.1016/j.patter.2022.100473 article EN cc-by-nc-nd Patterns 2022-03-25

This article studies the inference of the regression coefficient matrix under multivariate response linear regressions in the presence of hidden variables. A novel procedure for constructing confidence intervals of entries of the coefficient matrix is proposed. Our method first uses the multivariate nature of the responses by estimating and adjusting for the hidden effect to construct an initial estimator of the coefficient matrix. By further deploying a low-dimensional projection procedure to reduce the bias introduced by the regularization in the previous step, a refined estimator is proposed and shown to be asymptotically...

10.1080/01621459.2023.2241701 article EN Journal of the American Statistical Association 2023-09-26

A prominent concern of scientific investigators is the presence of unobserved hidden variables in association analysis. Ignoring hidden variables often yields biased statistical results and misleading scientific conclusions. Motivated by this practical issue, this paper studies the multivariate response regression with hidden variables, $Y=(\Psi^{*})^{T}X+(B^{*})^{T}Z+E$, where $Y\in\mathbb{R}^{m}$ is the response vector, $X\in\mathbb{R}^{p}$ is the observable feature vector, $Z\in\mathbb{R}^{K}$ represents the vector of hidden variables, possibly correlated with $X$, and $E$ is an independent error. The number of hidden variables $K$ is unknown and both $m$ and $p$ are allowed, but not required, to grow...

10.1214/21-aos2059 article EN The Annals of Statistics 2022-04-01
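The bias from ignoring hidden variables can be seen in a small simulation (a sketch under hypothetical parameter values, not the paper's estimator): when $Z$ is correlated with $X$, naively regressing $Y$ on $X$ alone absorbs part of $Z$'s effect into the estimate of $\Psi^{*}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m, K = 5000, 3, 2, 1

# Hidden-variable regression Y = Psi^T X + B^T Z + E, with Z correlated with X.
Z = rng.normal(size=(n, K))
X = 0.8 * Z @ np.ones((K, p)) + rng.normal(size=(n, p))   # X correlated with Z
Psi = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]])     # true p x m coefficients
B = np.array([[2.0, -2.0]])                               # hidden effect, K x m
Y = X @ Psi + Z @ B + 0.1 * rng.normal(size=(n, m))

# Naive least squares on X alone is biased because Z is omitted yet
# correlated with X (classic omitted-variable bias).
Psi_naive = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.abs(Psi_naive - Psi).max() > 0.3)
```

The bias does not vanish as $n$ grows, which is why the adjustment procedures studied in this line of work are needed.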

Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consists in observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$, which is possibly element-wise sparse, with the number of topics $K$ unknown. In this under-explored context, we derive new minimax...

10.48550/arxiv.2001.07861 preprint EN other-oa arXiv (Cornell University) 2020-01-01

We propose a new method of estimation in topic models that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive finite sample minimax lower bounds for the estimation of A, as well as upper bounds for our proposed estimator. We describe scenarios where our estimator is adaptive. Our analysis is valid for any number of documents (n), individual document lengths (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled by previous analyses. We complement...

10.48550/arxiv.1805.06837 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Regression models, in which the observed features $X\in\mathbb{R}^{p}$ and the response $Y\in\mathbb{R}$ depend, jointly, on a lower dimensional, unobserved, latent vector $Z\in\mathbb{R}^{K}$, with $K\ll p$, are popular in a large array of applications, and are mainly used for predicting a response from correlated features. In contrast, methodology and theory for inference on the regression coefficient $\beta\in\mathbb{R}^{K}$ relating $Y$ to $Z$ are scarce, since typically the unobservable factor $Z$ is hard to interpret. Furthermore, the determination of the asymptotic variance of an estimator of $\beta$ is a long-standing problem, with solutions...

10.3150/21-bej1374 article EN Bernoulli 2022-03-04

This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in the context of topic models. The data consists of observed multinomial counts of p words across n independent documents. In topic models, the p×n expected word frequency matrix is assumed to be factorized as the product of a p×K word-topic matrix A and a K×n topic-document matrix T. Since the columns of both matrices represent conditional probabilities belonging to probability simplices, the columns of A are viewed as p-dimensional mixture components that are common to all documents, while the columns of T...

10.1214/22-aos2229 article EN The Annals of Statistics 2022-12-01

Mixed multinomial logits are discrete mixtures introduced several decades ago to model the probability of choosing an attribute from $p$ possible candidates, in heterogeneous populations. The model has recently attracted attention in the AI literature, under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number of vectors in $\mathbb{R}^L$ to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via...

10.48550/arxiv.2409.09903 preprint EN arXiv (Cornell University) 2024-09-15
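A minimal sketch of a softmax mixture (mixed multinomial logit) choice probability, with all parameter values hypothetical: each mixture component applies a softmax to candidate scores $x\beta_k$, and the components are combined with mixing weights $\alpha_k$.

```python
import numpy as np

def softmax(u):
    u = u - u.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
p, L, K = 4, 3, 2                 # candidates, feature dimension, components

x = rng.normal(size=(p, L))       # feature vector of each of the p candidates
betas = rng.normal(size=(K, L))   # component-specific coefficients (hypothetical)
alpha = np.array([0.6, 0.4])      # mixing weights summing to 1

# Mixture of softmax choice probabilities: sum_k alpha_k * softmax(x @ beta_k).
probs = sum(a * softmax(x @ b) for a, b in zip(alpha, betas))
print(round(probs.sum(), 6))
```

Since each softmax output and the mixing weights lie on probability simplices, the mixture is again a valid probability vector over the $p$ candidates.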

High-dimensional feature vectors are likely to contain sets of measurements that are approximate replicates of one another. In complex applications, or in automated data collection, these sets are not known a priori, and need to be determined. This work proposes a class of latent factor models on the observed, high-dimensional, random vector $X\in\mathbb{R}^{p}$, for defining, identifying and estimating the index set of its approximately replicate components. The model is parametrized by a $p\times K$ loading matrix $A$ that contains a hidden sub-matrix whose...

10.3150/22-bej1502 article EN Bernoulli 2023-02-20
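The notion of approximately replicate components can be illustrated with a toy simulation (a sketch, not the paper's estimation procedure; the 0.9 correlation cutoff is a hypothetical choice): replicates of a common latent factor are nearly perfectly correlated, while unrelated components are not.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 5000, 5

# Components 0-2 are approximate replicates of one latent factor z.
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.1 * rng.normal(size=n),   # replicate 1
    z + 0.1 * rng.normal(size=n),   # replicate 2
    z + 0.1 * rng.normal(size=n),   # replicate 3
    rng.normal(size=n),             # unrelated component
    rng.normal(size=n),             # unrelated component
])

# Approximate replicates show pairwise correlation near 1.
R = np.corrcoef(X, rowvar=False)
group = [j for j in range(p) if R[0, j] > 0.9]
print(group)
```

In practice the replicate sets, their number, and their sizes are all unknown, which is what the latent factor model in the paper is designed to handle.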

This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and nonvanishing noise. A generalized least-squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated, as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple...

10.1093/biomet/asad037 article EN Biometrika 2023-06-08

This work introduces a novel estimation method, called LOVE, of the entries and the structure of the loading matrix A in the sparse latent factor model X = AZ + E, for an observable random vector X \in R^p, with correlated, unobservable factors Z \in R^K, K unknown, and independent noise E. Each row of A is scaled and allowed to be sparse. In order to identify A, we require the existence of pure variables, which are components of X that are associated, via A, with one and only one factor. Despite the fact that the number K, the pure variables, and their location are unknown, under a mild condition on the covariance of Z and a minimum of two...

10.48550/arxiv.1704.06977 preprint EN other-oa arXiv (Cornell University) 2017-01-01

This work is devoted to the finite sample prediction risk analysis of a class of linear predictors of a response $Y\in \mathbb{R}$ from a high-dimensional random vector $X\in \mathbb{R}^p$ when $(X,Y)$ follows a latent factor regression model generated by an unobservable latent vector $Z$ of dimension less than $p$. Our primary contribution is in establishing finite sample risk bounds for prediction with the ubiquitous Principal Component Regression (PCR) method, under the factor regression model, with the number of principal components adaptively selected from the data -- a form of theoretical guarantee...

10.48550/arxiv.2007.10050 preprint EN other-oa arXiv (Cornell University) 2020-01-01
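A simulation sketch of PCR under a factor regression model (hypothetical sizes and a simple explained-variance rule for choosing the number of components, standing in for the paper's adaptive selection): the top principal components of $X$ recover the latent space, and regressing on them predicts $Y$ well.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 500, 20, 3

# Factor regression model: X = Z A^T + noise, and Y depends on X only through Z.
Z = rng.normal(size=(n, K))
A = rng.normal(size=(p, K))
X = Z @ A.T + 0.1 * rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
Y = Z @ beta + 0.1 * rng.normal(size=n)

# PCR: regress Y on the top-k principal components of X, with k chosen by a
# simple data-driven rule (retain PCs explaining more than 1% of variance).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = int((s**2 / (s**2).sum() > 0.01).sum())
W = U[:, :k] * s[:k]                        # scores of the retained PCs
theta = np.linalg.lstsq(W, Y - Y.mean(), rcond=None)[0]
r2 = 1 - ((Y - Y.mean() - W @ theta)**2).sum() / ((Y - Y.mean())**2).sum()
print(k, round(r2, 2))
```

When the factor signal dominates the noise, the variance rule recovers the latent dimension and PCR attains near-oracle in-sample fit.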

High-dimensional feature vectors are likely to contain sets of measurements that are approximate replicates of one another. In complex applications, or in automated data collection, these sets are not known a priori, and need to be determined. This work proposes a class of latent factor models on the observed high-dimensional random vector $X \in \mathbb{R}^p$, for defining, identifying and estimating the index set of its approximately replicate components. The model is parametrized by a $p \times K$ loading matrix $A$ that contains...

10.48550/arxiv.2010.02288 preprint EN other-oa arXiv (Cornell University) 2020-01-01

In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower-dimensional space, and to base the classification on the resulting lower-dimensional projections. In this paper, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general...

10.1214/23-aos2289 article EN The Annals of Statistics 2023-06-01
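The two-step procedure can be sketched as follows (a toy simulation with hypothetical sizes and a fixed number of retained PCs, whereas the paper selects it in a data-driven way): project onto the top PCs of $X$, then fit a linear classifier in the projected space.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, K = 400, 50, 2

# Latent Gaussian mixture: the class label shifts the low-dimensional latent mean.
y = rng.integers(0, 2, size=n)
mu = np.array([[2.0, 0.0], [-2.0, 0.0]])       # class means in latent space
Z = mu[y] + rng.normal(size=(n, K))
A = rng.normal(size=(p, K)) / np.sqrt(p)
X = Z @ A.T + 0.1 * rng.normal(size=(n, p))

# Step 1: project onto the top-K principal components of X.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:K].T

# Step 2: linear classifier (least squares on +/-1 labels) in the projected space.
D = np.column_stack([proj, np.ones(n)])
w = np.linalg.lstsq(D, 2.0 * y - 1.0, rcond=None)[0]
acc = ((D @ w > 0).astype(int) == y).mean()
print(acc > 0.9)
```

Because the class separation lives in the latent space spanned by the columns of $A$, the leading PCs of $X$ capture it, and a simple linear rule on the projections classifies accurately.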