- Biomedical Text Mining and Ontologies
- Machine Learning in Bioinformatics
- Bioinformatics and Genomic Networks
- Genomics and Phylogenetic Studies
- Topic Modeling
- Particle physics theoretical and experimental studies
- Quantum Chromodynamics and Particle Interactions
- Vaccines and immunoinformatics approaches
- Gene expression and cancer classification
- Computational Drug Discovery Methods
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Immunotherapy and Immune Responses
- High-Energy Particle Collisions Research
- Text and Document Classification Technologies
- Monoclonal and Polyclonal Antibodies Research
- Data Management and Algorithms
- Protein Structure and Dynamics
- Machine Learning in Materials Science
- Web Data Mining and Analysis
- Antimicrobial Peptides and Activities
- Metabolomics and Mass Spectrometry Studies
- Advanced Database Systems and Queries
- Advanced Clustering Algorithms Research
- CO2 Reduction Techniques and Catalysts
Fudan University
2016-2025
Shanghai Institute for Science of Science
2019-2025
Shanghai Center for Brain Science and Brain-Inspired Technology
2019-2025
Nanjing University
2019-2025
Shanghai Innovative Research Center of Traditional Chinese Medicine
2021-2024
ShangHai JiAi Genetics & IVF Institute
2022-2024
Anhui Science and Technology University
2024
Anhui University of Science and Technology
2024
Institute of Science and Technology
2023-2024
Institute of Art
2020-2024
We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs and those over targets. This setting has been considered by many methods, which however have a common problem of allowing only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where weights over the multiple similarity matrices are estimated from data to automatically select similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named...
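A minimal sketch of the weighted-combination idea mentioned in this abstract, assuming the weights are already given (the abstract says they are estimated from data, which this sketch does not attempt). The function name `combine_similarities` and the toy matrices are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_similarities(matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return the weighted average of equally sized similarity matrices."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per similarity matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the combined entries stay on the same scale as the inputs.
    normalized = [w / total for w in weights]
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    combined = [[0.0] * n_cols for _ in range(n_rows)]
    for weight, matrix in zip(normalized, matrices):
        for i in range(n_rows):
            for j in range(n_cols):
                combined[i][j] += weight * matrix[i][j]
    return combined


# Toy drug-drug similarity matrices (made-up values, for illustration only).
chemical_sim = [[1.0, 0.2], [0.2, 1.0]]
side_effect_sim = [[1.0, 0.6], [0.6, 1.0]]
combined = combine_similarities([chemical_sim, side_effect_sim], [0.75, 0.25])
```

A matrix that receives a weight near zero contributes almost nothing, which is how a learned weighting can act as similarity selection.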
Abstract Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due...
Abstract Motivation: Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of >70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem due to one protein having a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input)....
Abstract Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory, as they usually do not fully use different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multiple types of features by k-means and uses single-copy gene information for initialization. It then employs a two-stage ensemble strategy based on single-copy genes to integrate the component results efficiently and effectively....
As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve performance. However, it mainly utilizes proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we...
Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains the similarity prediction model with a large number of real structure similarities. This enables PLMSearch to capture the remote homology information concealed behind the sequences....
Identifying drug-target interactions is an important task in drug discovery. To reduce the heavy time and financial cost of the experimental way, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug-target interactions of new candidate drugs or targets. Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank is the most powerful technique...
Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler, a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve performance by incorporating massive protein-protein network information. Specifically,...
Abstract Motivation: Medical Subject Headings (MeSH) indexing, which is to assign a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and the MeSH side. For the citation side, all existing methods, including Medical Text Indexer (MTI) by the National Library of Medicine and the state-of-the-art method, MeSHLabeler, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well. Methods: We...
Abstract Motivation: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. Results: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP,...
Abstract With the explosive growth of protein sequences, large-scale automated protein function prediction (AFP) is becoming challenging. A protein is usually associated with dozens of gene ontology (GO) terms. Therefore, AFP is regarded as a problem of large-scale multi-label classification. Under the learning to rank (LTR) framework, our previous NetGO tool integrated massive networks and multi-type information about protein sequences to achieve good performance by dealing with all possible GO terms (>44 000). In this work, we propose...
Extreme multi-label text classification (XMTC) is an important problem in the era of big data, for tagging a given text with the most relevant multiple labels from an extremely large-scale label set. XMTC can be found in many applications, such as item categorization, web page tagging, and news annotation. Traditionally most methods used bag-of-words (BOW) as inputs, ignoring word context as well as deep semantic information. Recent attempts to overcome the problems of BOW by deep learning still suffer from 1) failing to capture subtext...
Abstract Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings...
Motivation: Accurate identification of peptides binding to specific Major Histocompatibility Complex Class II (MHC-II) molecules is of great importance for elucidating the underlying mechanism of immune recognition, as well as for developing effective epitope-based vaccines and promising immunotherapies for many severe diseases. Due to the extreme polymorphism of MHC-II alleles and the high cost of biochemical experiments, the development of computational methods for accurate prediction of binding peptides of MHC-II molecules, particularly for the ones with few or no...
Medical Subject Headings (MeSHs) are used by the National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates the applications of biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), for assisting MeSH annotation, which uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as prediction by MeSH classifiers (trained separately), can also be...
The authors study the problem of how news summarization can help stock price prediction, proposing a generic stock price prediction framework to enable the use of different external signals to predict stock prices. Experiments were conducted on five years of Hong Kong Stock Exchange data, with news reported by Finet; evaluations were performed at individual stock, sector index, and market index levels. The authors' results show that prediction based on news article summarization can effectively outperform prediction based on full-length articles on both validation and independent testing sets.
Abstract Motivation: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods do not make full use of the additional biological information except the coverage and sequence composition of the contigs. Results: We developed a novel contig binning method, Semi-supervised Spectral...
Abstract Drug–drug interactions (DDIs) are one of the major concerns in pharmaceutical research, and a number of computational methods have been developed to predict whether two drugs interact or not. Recently, more attention has been paid to events caused by DDIs, which is more useful for investigating the mechanism hidden behind combined drug usage or adverse reactions. However, some rare events may have only a few examples, hindering them from being precisely predicted. To address the above issues, we present a few-shot computational method...
The importance of chemical compounds has been emphasized more in molecular biology, and 'chemical genomics' has attracted a great deal of attention in recent years. Thus an important issue in current molecular biology is to identify biological-related chemical compounds (more specifically, drugs) and genes. Co-occurrence of biological entities in the literature is a simple, comprehensive and popular technique to find the association of these entities. Our focus is to mine implicit 'chemical compound and gene' relations from the co-occurrence in the literature. We propose a probabilistic model,...
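The plain co-occurrence technique that this abstract takes as its starting point can be sketched as below. This is only the direct counting baseline, not the probabilistic model the abstract goes on to propose; the function name `cooccurrence_counts`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Sequence, Tuple


def cooccurrence_counts(
    documents: Iterable[str],
    compounds: Sequence[str],
    genes: Sequence[str],
) -> Counter:
    """Count documents in which a compound name and a gene name both appear."""
    counts: Counter = Counter()
    for text in documents:
        # Naive whitespace tokenization; real text needs entity recognition.
        tokens = set(text.lower().split())
        found_compounds = [c for c in compounds if c.lower() in tokens]
        found_genes = [g for g in genes if g.lower() in tokens]
        for pair in product(found_compounds, found_genes):
            counts[pair] += 1
    return counts


# Toy sentences written for this example, not drawn from any corpus.
docs = [
    "aspirin inhibits ptgs2 in vitro",
    "aspirin and ptgs2 expression",
    "caffeine binds adora2a",
]
counts = cooccurrence_counts(docs, ["aspirin", "caffeine"], ["ptgs2", "adora2a"])
```

Pairs that never share a document get a count of zero here, which is the gap an implicit-relation model aims to fill.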
Abstract Motivation: Clustering MEDLINE documents is usually conducted by the vector space model, which computes the content similarity between two documents by basically using the inner-product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus is being applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors to be clustered. However, current approaches of using the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text has been discarded....
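As a rough illustration of the vector space model described in this abstract, the sketch below builds raw term-count vectors and compares two documents by their inner product. The cosine normalization and the helper names (`word_vector`, `inner_product`, `cosine_similarity`) are assumptions added here; the abstract only mentions the inner product of word vectors.

```python
import math
from collections import Counter


def word_vector(text: str) -> Counter:
    """Map a document to raw term counts (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def inner_product(a: Counter, b: Counter) -> int:
    """Sum the products of counts for the words both documents share."""
    return sum(count * b[word] for word, count in a.items() if word in b)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Inner product divided by the vector lengths; 0.0 for empty documents."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0


# Toy documents written for this example.
doc_a = word_vector("gene expression in tumor cells")
doc_b = word_vector("tumor gene expression profiling")
shared = inner_product(doc_a, doc_b)
similarity = cosine_similarity(doc_a, doc_b)
```

A clustering step would use such pairwise scores as its similarity measure; a MeSH concept vector could be compared the same way once documents are mapped to concepts.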