- Single-cell and spatial transcriptomics
- Gene Regulatory Network Analysis
- Cell Image Analysis Techniques
- Gene expression and cancer classification
- Bioinformatics and Genomic Networks
- EEG and Brain-Computer Interfaces
- RNA modifications and cancer
- MicroRNA in disease regulation
- Mathematical Biology Tumor Growth
- Extracellular vesicles in disease
- Pluripotent Stem Cells Research
- ECG Monitoring and Analysis
- Atrial Fibrillation Management and Outcomes
- Genomics and Phylogenetic Studies
- AI in cancer detection
- Data Mining Algorithms and Applications
- Scientific Computing and Data Management
- Biomedical Text Mining and Ontologies
- Immune cells in cancer
- RNA and protein synthesis mechanisms
- Molecular Biology Techniques and Applications
- Epigenetics and DNA Methylation
- Cancer Genomics and Diagnostics
- Domain Adaptation and Few-Shot Learning
- Congenital heart defects research
Tsinghua University
2020-2025
Abstract Large-scale pretrained models have become foundation models, leading to breakthroughs in natural language processing and related fields. Developing foundation models in life science for deciphering the “languages” of cells and facilitating biomedical research is promising yet challenging. We developed a large-scale pretrained model, scFoundation, with 100M parameters for this purpose. scFoundation was trained on over 50 million human single-cell transcriptomics data, which contain high-throughput observations on the complex molecular features in all...
Recent developments of spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue microenvironments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task. Their high complexity limited their scalability to the latest and future large-scale data. We present SOMDE, an efficient method for identifying SVgenes in large-scale data. SOMDE...
The accumulation of massive single-cell omics data provides growing resources for building biomolecular atlases of all cells of human organs or the whole body. The true assembly of a cell atlas should be cell-centric rather than file-centric. We developed a unified informatics framework for seamless cell-centric data assembly and built the human Ensemble Cell Atlas (hECA) from scattered data. hECA v1.0 assembled 1,093,299 labeled cells from 116 published datasets, covering 38 organs and 11 systems. We invented three new methods of applications based on the cell-centric assembly: "in data"...
Cell-cell communication events (CEs) are mediated by multiple ligand-receptor (LR) pairs. Usually only a particular subset of CEs directly works for a specific downstream response in a particular microenvironment. We name them functional communication events (FCEs) of the target responses. Decoding the FCE-target gene relations is important for understanding the mechanisms of many biological processes, but has been intractable due to the mixing of multiple factors and the lack of direct observations. We developed a method, HoloNet, for decoding FCEs using spatial...
Abstract Profiling spatial variations of cellular composition and transcriptomic characteristics is important for understanding the physiology and pathology of tissues. Spatial transcriptomics (ST) data depict spatial gene expression, but the currently dominating high-throughput technology is not yet at single-cell resolution. Single-cell RNA-sequencing (SC) data provide information at the single-cell level but lack spatial information. Integrating these two types of data would be ideal for revealing transcriptomic landscapes at single-cell resolution. We develop the method STEM (SpaTially aware EMbedding)...
Abstract The advances in high-throughput sequencing technology have led to significant progress in measuring gene expression at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for human, with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we...
Large language models (LLMs) have made breakthroughs in natural language processing (NLP) and understanding, and brought revolutions in many other fields [1-4]. Inspired by those successes, several large cellular models (LCMs) adopting similar structures to LLMs have been developed for single-cell transcriptomics, including (but not limited to) scBERT [5], Geneformer [6], scGPT [7], scFoundation [8], and GeneCompass [9]. The practices of these models have shown LCMs' power and potential in various biological tasks and illustrated the possibilities...
Abstract Motivation: Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, diffusion models have shown their power in generating data with high...
Abstract Single-cell multi-omics data have a high potential for deciphering complex cellular mechanisms. But simultaneously measuring multi-omics data from the same cells is still challenging, which calls for computational methods to integrate data of multiple modalities and generate unobserved data. In this paper, we present scDiffusion-X, a latent diffusion model tailored for this task. The model uses autoencoders to map multi-modalities into low-dimensional latent spaces, coupled with a Dual-Cross-Attention (DCA) module we invented to learn the hidden...
Advances in AI are transforming scientific discovery, yet spatial biology, a field that deciphers the molecular organization within tissues, remains constrained by labor-intensive workflows. Here, we present SpatialAgent, a fully autonomous AI agent dedicated to spatial-biology research. SpatialAgent integrates large language models with dynamic tool execution and adaptive reasoning. SpatialAgent spans the entire research pipeline, from experimental design to multimodal data analysis and hypothesis generation....
Reprogramming cell state transitions provides the potential for cell engineering and regenerative therapy for many diseases. Finding the reprogramming transcription factors (TFs) and their combinations that can direct the desired state transition is a crucial task. Computational methods have been developed to identify such reprogramming TFs. However, most of them only generate a ranked list of individual TFs and ignore the identification of TF combinations. Even for individual TF identification, current methods often fail to put the real effective reprogramming TFs at the top rankings. To address...
Learning the spatial context of cells through pretraining on spatial transcriptomics (ST) data may empower us to systematically decipher tissue organization and cellular interactions. Yet, transformer-based generative models often focus on modeling individual cells, neglecting the intricate spatial relationships among them. We develop GeST, a deep transformer model that is pretrained by a novel spatially informed generation task: predict the cellular expression profile of a given location based on the information from its neighboring...
Abstract Gene expression could be perceived as a form of cell language, with underlying regulatory mechanisms akin to biological grammar. Decoding this “language” is critical in understanding cellular functions and behaviors, but presents significant challenges. Several works have attempted to learn the biological language by pre-training large foundation models based on single-cell transcriptomic data, inspired by the success of large language models in natural language processing. In this study, we further enrich the pre-training paradigm by integrating an abundance...
Abstract Recent development of large language models (LLMs) in AI has inspired scientists to develop a few large-scale foundation models for single-cell transcriptomics, or large cellular models (LCMs), pretrained on massive single-cell RNA-seq data. They illustrated superior performance in a wide spectrum of tasks, although the models were only pretrained in a self-supervised manner, without any specific design for downstream tasks. The success opened a promising new route toward using AI to grasp underlying biological knowledge from data of a scale that cannot be...
Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not very realistic yet, especially when we need to generate data with controlled conditions. In the meantime, diffusion models have shown their power in generating data with high fidelity, providing a new...
Abstract Cell–cell communication events (CEs) are mediated by multiple ligand–receptor pairs. Usually only a particular subset of CEs directly works for a specific downstream response in a particular microenvironment. We name them functional communication events (FCEs) of the target responses. Decoding the FCE-target gene relations is important for understanding the mechanisms of many biological processes, but has been intractable due to the mixing of multiple factors and the lack of direct observations. We developed a method, HoloNet, for decoding FCEs using spatial...
Abstract Recent developments of spatial transcriptomic sequencing technologies provide powerful tools for understanding cells in the physical context of tissue micro-environments. A fundamental task in spatial gene expression analysis is to identify genes with spatially variable expression patterns, or spatially variable genes (SVgenes). Several computational methods have been developed for this task. Their high complexity limited their scalability to the latest and future large-scale data. We present SOMDE, an efficient method for identifying SVgenes...
Abstract Cell state transitions are complicated processes that occur in various life activities. Understanding and artificially manipulating them have been longstanding challenges. Substantial experiments reveal that cell state transitions could be directed by several key transcription factors (TFs). Here we present scDirect, a computational framework to identify key TFs based on single-cell RNA-seq and ATAC-seq data. scDirect models the TF identification task as a linear inverse problem, and solves it with gene regulatory networks...
The functional or structural spatial regions within tissues, referred to as spatial niches, are elements for illustrating the spatial contexts of multicellular organisms. A key challenge is querying shared niches across diverse tissues, which is crucial for achieving a comprehensive understanding of the organization and phenotypes of cell populations. However, current data analysis methods predominantly focus on creating spatial-aware embeddings for cells, neglecting the development of niche-level representations for effective querying. To address...
Single-cell RNA-seq (scRNA-seq) has become a prominent tool for studying human biology and disease. The availability of massive scRNA-seq datasets and advanced machine learning techniques has recently driven the development of single-cell foundation models that provide informative and versatile cell representations based on expression profiles. However, to understand disease states, we need to consider entire tissue ecosystems, simultaneously considering many different interacting cells. Here, we tackle this...