Shanfeng Zhu

ORCID: 0000-0002-6067-5312
Research Areas
  • Biomedical Text Mining and Ontologies
  • Machine Learning in Bioinformatics
  • Bioinformatics and Genomic Networks
  • Genomics and Phylogenetic Studies
  • Topic Modeling
  • Particle physics theoretical and experimental studies
  • Quantum Chromodynamics and Particle Interactions
  • vaccines and immunoinformatics approaches
  • Gene expression and cancer classification
  • Computational Drug Discovery Methods
  • Natural Language Processing Techniques
  • Advanced Text Analysis Techniques
  • Immunotherapy and Immune Responses
  • High-Energy Particle Collisions Research
  • Text and Document Classification Technologies
  • Monoclonal and Polyclonal Antibodies Research
  • Data Management and Algorithms
  • Protein Structure and Dynamics
  • Machine Learning in Materials Science
  • Web Data Mining and Analysis
  • Antimicrobial Peptides and Activities
  • Metabolomics and Mass Spectrometry Studies
  • Advanced Database Systems and Queries
  • Advanced Clustering Algorithms Research
  • CO2 Reduction Techniques and Catalysts

Fudan University
2016-2025

Shanghai Institute for Science of Science
2019-2025

Shanghai Center for Brain Science and Brain-Inspired Technology
2019-2025

Nanjing University
2019-2025

Shanghai Innovative Research Center of Traditional Chinese Medicine
2021-2024

ShangHai JiAi Genetics & IVF Institute
2022-2024

Anhui Science and Technology University
2024

Anhui University of Science and Technology
2024

Institute of Science and Technology
2023-2024

Institute of Art
2020-2024

We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs, and those over targets. This setting has been considered by many methods, which however have a common problem of allowing only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where weights over the multiple similarity matrices are estimated from data to automatically select similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named...

10.1145/2487575.2487670 article EN 2013-08-11
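
The central idea in the abstract above, using several similarity matrices with per-matrix weights, can be sketched as a weighted sum. This is only an illustration under stated assumptions: the function name `combine_similarities`, the toy matrices, and the fixed weights are hypothetical, and the step the abstract describes as estimating the weights from data is not shown.

```python
from typing import List

Matrix = List[List[float]]


def combine_similarities(matrices: List[Matrix], weights: List[float]) -> Matrix:
    """Return the weighted sum of equally sized similarity matrices."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per similarity matrix")
    # Assumption: weights are nonnegative and sum to 1 (a convex combination).
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    combined = [[0.0] * n_cols for _ in range(n_rows)]
    for matrix, weight in zip(matrices, weights):
        if len(matrix) != n_rows or any(len(row) != n_cols for row in matrix):
            raise ValueError("all matrices must share the same shape")
        for i in range(n_rows):
            for j in range(n_cols):
                combined[i][j] += weight * matrix[i][j]
    return combined


# Two hypothetical drug-drug similarity matrices with toy values.
chemical = [[1.0, 0.2], [0.2, 1.0]]
side_effect = [[1.0, 0.6], [0.6, 1.0]]
combined = combine_similarities([chemical, side_effect], [0.75, 0.25])
```

A larger weight lets one similarity source dominate the combined matrix, which is how weighting can act as a soft selection among similarities.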
Fernando Meyer, Adrian Fritz, Zhi-Luo Deng, David Koslicki, Till Robin Lesker, Alexey Gurevich, Gary Robertson, Mohammed Alser, Dmitry Antipov, Francesco Beghini, Denis Bertrand, Jaqueline Brito, C. Titus Brown, Jan P. Buchmann, Aydın Buluç, Bo Chen, Rayan Chikhi, Philip T. L. C. Clausen, Alexandru Cristian, Piotr Wojciech Dąbrowski, Aaron E. Darling, Rob Egan, Eleazar Eskin, Evangelos Georganas, Eugene Goltsman, Melissa A. Gray, Lars Hestbjerg Hansen, Steven Hofmeyr, Pingqin Huang, Luiz Irber, Huijue Jia, Tue Sparholt Jørgensen, Silas Kieser, Terje Klemetsen, Axel Kola, Mikhail Kolmogorov, Anton Korobeynikov, Jason C. Kwan, Nathan LaPierre, Claire Lemaitre, Chenhao Li, Antoine Limasset, Fábio Malcher Miranda, Serghei Mangul, Vanessa R. Marcelino, Camille Marchet, Pierre Marijon, Dmitry Meleshko, Daniel R. Mende, Alessio Milanese, Niranjan Nagarajan, Jakob Nybo Nissen, Sergey Nurk, Leonid Oliker, Lucas Paoli, Pierre Peterlongo, Vitor C. Piro, Jacob S. Porter, Simon Rasmussen, Evan Rees, Knut Reinert, Bernhard Y. Renard, Espen Mikal Robertsen, Gail Rosen, Hans‐Joachim Ruscheweyh, Varuni Sarwal, Nicola Segata, Enrico Seiler, Lizhen Shi, Fengzhu Sun, Shinichi Sunagawa, Søren J. Sørensen, Ashleigh Thomas, Chengxuan Tong, Mirko Trajkovski, Julien Tremblay, Gherman Uritskiy, Riccardo Vicedomini, Zhengyang Wang, Ziye Wang, Zhong Wang, Andrew Warren, Nils Peder Willassen, Katherine Yelick, Ronghui You, Georg Zeller, Zhengqiao Zhao, Shanfeng Zhu, Jie Zhu, Rubén Garrido‐Oter, Petra Gastmeier, Stéphane Hacquard, Susanne Häußler, Ariane Khaledi, Friederike Maechler, Fantin Mesny, Simona Radutoiu, Paul Schulze‐Lefert, Nathiana Smit, Till Strowig

Evaluating metagenomic software is key for optimizing metagenome interpretation and the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due...

10.1038/s41592-022-01431-4 article EN cc-by Nature Methods 2022-04-01

Motivation: Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of the >70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem due to one protein having a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input)....

10.1093/bioinformatics/bty130 article EN Bioinformatics 2018-03-06

Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory and usually do not fully use different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multiple types of features by k-means and uses single-copy gene information for initialization. It then employs a two-stage ensemble strategy based on single-copy genes to integrate the component results efficiently and effectively....

10.1186/s13059-022-02832-6 article EN cc-by Genome biology 2023-01-06

As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we...

10.1016/j.gpb.2023.04.001 article EN cc-by Genomics Proteomics & Bioinformatics 2023-04-01

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains the similarity prediction model with a large number of real structure similarity. This enables PLMSearch to capture the remote homology information concealed behind the sequences....

10.1038/s41467-024-46808-5 article EN cc-by Nature Communications 2024-03-30

Identifying drug-target interactions is an important task in drug discovery. To reduce the heavy time and financial cost of the experimental way, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug-target interactions of new candidate drugs or targets. Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank is the most powerful technique...

10.1093/bioinformatics/btw244 article EN cc-by-nc Bioinformatics 2016-06-11

Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler, a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically,...

10.1093/nar/gkz388 article EN cc-by-nc Nucleic Acids Research 2019-05-01

Motivation: Medical Subject Headings (MeSH) indexing, which is to assign a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and the MeSH side. For the citation side, all existing methods, including Medical Text Indexer (MTI) by the National Library of Medicine and the state-of-the-art method, MeSHLabeler, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well. Methods: We...

10.1093/bioinformatics/btw294 article EN cc-by-nc Bioinformatics 2016-06-11

Motivation: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. Results: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP,...

10.1093/bioinformatics/btab270 article EN Bioinformatics 2021-04-23

With the explosive growth of protein sequences, large-scale automated protein function prediction (AFP) is becoming challenging. A protein is usually associated with dozens of gene ontology (GO) terms. Therefore, AFP is regarded as a problem of large-scale multi-label classification. Under the learning to rank (LTR) framework, our previous NetGO tool integrated massive networks and multi-type information about protein sequences to achieve good performance by dealing with all possible GO terms (>44 000). In this work, we propose...

10.1093/nar/gkab398 article EN cc-by-nc Nucleic Acids Research 2021-05-04

Extreme multi-label text classification (XMTC) is an important problem in the era of big data, for tagging a given text with the most relevant multiple labels from an extremely large-scale label set. XMTC can be found in many applications, such as item categorization, web page tagging, and news annotation. Traditionally most methods used bag-of-words (BOW) as inputs, ignoring word context as well as deep semantic information. Recent attempts to overcome the problems of BOW by deep learning still suffer from 1) failing to capture subtext...

10.48550/arxiv.1811.01727 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings...

10.1038/s41467-023-44290-z article EN cc-by Nature Communications 2024-01-17

Motivation: Accurate identification of peptides binding to specific Major Histocompatibility Complex Class II (MHC-II) molecules is of great importance for elucidating the underlying mechanism of immune recognition, as well as for developing effective epitope-based vaccines and promising immunotherapies for many severe diseases. Due to the extreme polymorphism of MHC-II alleles and the high cost of biochemical experiments, the development of computational methods for accurate prediction of binding peptides of MHC-II molecules, particularly the ones with few or no...

10.1371/journal.pone.0030483 article EN cc-by PLoS ONE 2012-02-23

Medical Subject Headings (MeSHs) are used by the National Library of Medicine (NLM) to index almost all citations in MEDLINE, which greatly facilitates the applications of biomedical information retrieval and text mining. To reduce the time and financial cost of manual annotation, NLM has developed a software package, Medical Text Indexer (MTI), for assisting MeSH annotation, which uses k-nearest neighbors (KNN), pattern matching and indexing rules. Other types of information, such as prediction by MeSH classifiers (trained separately), can also be...

10.1093/bioinformatics/btv237 article EN cc-by-nc Bioinformatics 2015-06-10

The authors study the problem of how news summarization can help stock price prediction, proposing a generic stock price prediction framework to enable the use of different external signals to predict stock prices. Experiments were conducted on five years of Hong Kong Stock Exchange data, with news reported by Finet; evaluations were performed at the individual stock, sector index, and market index levels. The authors' results show that prediction based on news article summarization can effectively outperform prediction based on full-length articles on both validation and independent testing sets.

10.1109/mis.2015.1 article EN IEEE Intelligent Systems 2015-01-12

Motivation: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods do not make full use of the additional biological information except the coverage and sequence composition of the contigs. Results: We developed a novel contig binning method, Semi-supervised Spectral...

10.1093/bioinformatics/btz253 article EN Bioinformatics 2019-04-05

Drug–drug interactions (DDIs) are one of the major concerns in pharmaceutical research, and a number of computational methods have been developed to predict whether two drugs interact or not. Recently, more attention has been paid to events caused by the DDIs, which is more useful for investigating the mechanism hidden behind the combined drug usage or adverse reactions. However, some rare events may only have a few examples, hindering them from being precisely predicted. To address the above issues, we present a few-shot computational method...

10.1093/bib/bbab514 article EN Briefings in Bioinformatics 2021-11-10

The importance of chemical compounds has been emphasized more in molecular biology, and 'chemical genomics' has attracted a great deal of attention in recent years. Thus an important issue in current molecular biology is to identify biological-related chemical compounds (more specifically, drugs) and genes. Co-occurrence of biological entities in the literature is a simple, comprehensive and popular technique to find the association of these entities. Our focus is to mine implicit 'chemical compound and gene' relations from the co-occurrence in the literature. We propose a probabilistic model,...

10.1093/bioinformatics/bti1141 article EN Bioinformatics 2005-09-01

Motivation: Clustering MEDLINE documents is usually conducted by the vector space model, which computes the content similarity between two documents by basically using the inner-product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus is being applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors to be clustered. However, current approaches of using the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text has been discarded....

10.1093/bioinformatics/btp338 article EN Bioinformatics 2009-06-03
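
The vector space model mentioned in the abstract above scores two documents by the inner product of their word vectors. A minimal sketch follows, assuming raw term counts and whitespace tokenization; the helper names and example texts are hypothetical, and the cosine normalization is a common variant rather than something the abstract specifies.

```python
from collections import Counter
from math import sqrt


def word_vector(text: str) -> Counter:
    """Map a document to a bag-of-words vector of raw term counts."""
    return Counter(text.lower().split())


def inner_product(a: Counter, b: Counter) -> int:
    """Sum the products of counts for words the two vectors share."""
    return sum(count * b[word] for word, count in a.items())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Inner product normalized by vector lengths; 0.0 for empty vectors."""
    norm = sqrt(inner_product(a, a)) * sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0


doc_a = word_vector("protein function prediction")
doc_b = word_vector("protein structure prediction")
shared = inner_product(doc_a, doc_b)       # two shared words
score = cosine_similarity(doc_a, doc_b)    # 2 / 3 for these toy texts
```

Because only literal word overlap counts here, documents that use different terms for the same concept score low, which is the kind of gap that concept-level (MeSH) information is meant to address.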