Siqi Sun

ORCID: 0000-0001-7240-8724
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • RNA and protein synthesis mechanisms
  • Protein Structure and Dynamics
  • Machine Learning in Bioinformatics
  • Multimodal Machine Learning Applications
  • Bayesian Modeling and Causal Inference
  • Genomics and Phylogenetic Studies
  • Adversarial Robustness in Machine Learning
  • Statistical Methods and Inference
  • Bayesian Methods and Mixture Models
  • Machine Learning and Algorithms
  • RNA modifications and cancer
  • Context-Aware Activity Recognition Systems
  • Glycosylation and Glycoproteins Research
  • Domain Adaptation and Few-Shot Learning
  • Bioinformatics and Genomic Networks
  • Face and Expression Recognition
  • Speech and dialogue systems
  • Computational Drug Discovery Methods
  • Human Pose and Action Recognition
  • Explainable Artificial Intelligence (XAI)
  • Monoclonal and Polyclonal Antibodies Research
  • Microbial Metabolic Engineering and Bioproduction
  • Anomaly Detection Techniques and Applications

Fudan University
2011-2025

Beijing Academy of Artificial Intelligence
2025

Tianjin Medical University
2025

Shanghai Artificial Intelligence Laboratory
2022-2025

Peking Union Medical College Hospital
2024

Chinese Academy of Medical Sciences & Peking Union Medical College
2024

Sichuan University
2023

University of Liverpool
2023

Microsoft (United States)
2019-2022

Microsoft Research (United Kingdom)
2019-2022

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2020.

10.18653/v1/2020.acl-demos.30 preprint EN cc-by 2020-01-01

Protein contacts contain key information for the understanding of protein structure and function, and thus contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two residual...

10.1371/journal.pcbi.1005324 article EN cc-by PLoS Computational Biology 2017-01-05

ChatGPT, an artificial intelligence generated content (AIGC) model developed by OpenAI, has attracted world-wide attention for its capability of dealing with challenging language understanding and generation tasks in the form of conversations. This paper briefly provides an overview of the history, status quo and potential future development of ChatGPT, helping to provide an entry point to think about ChatGPT. Specifically, from the limited open-accessed resources, we conclude the core techniques of ChatGPT, mainly including large-scale...

10.1109/jas.2023.123618 article EN IEEE/CAA Journal of Automatica Sinica 2023-05-01

Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1441 article EN cc-by 2019-01-01

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural...

10.48550/arxiv.1909.11764 preprint EN other-oa arXiv (Cornell University) 2019-01-01

In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes on different levels of granularity (questions, paragraphs, sentences, entities), the representations of which are initialized with pre-trained contextual encoders. Given this hierarchical graph, the initial node representations are updated through graph propagation, and multi-hop reasoning is performed via traversing through the graph edges for each subsequent sub-task (e.g.,...

10.18653/v1/2020.emnlp-main.710 article EN cc-by 2020-01-01

Non-coding RNA structure and function are essential to understanding various biological processes, such as cell signaling, gene expression, and post-transcriptional regulations. These are all among the core problems in the field. With the rapid growth of sequencing technology, we have accumulated a massive amount of unannotated sequences. On the other hand, expensive experimental observatory results in only limited numbers of annotated data and 3D structures. Hence, it is still challenging to design computational...

10.1101/2022.08.06.503062 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2022-08-07

We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses...

10.48550/arxiv.1911.00536 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Here we present the results of protein contact prediction achieved in CASP12 by our RaptorX‐Contact server, which is an early implementation of our deep learning method for contact prediction. On a set of 38 free‐modeling target domains with a median family size of around 58 effective sequences, our server obtained an average top L/5 long‐ and medium‐range contact accuracy of 47% and 44%, respectively (L = length). A complete implementation has an average accuracy of 59% and 57%, respectively. Our deep learning method formulates contact prediction as a pixel‐level image labeling problem and simultaneously...

10.1002/prot.25377 article EN publisher-specific-oa Proteins Structure Function and Bioinformatics 2017-08-28

Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, Jingjing Liu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.

10.18653/v1/2021.naacl-main.77 article EN cc-by Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2021-01-01

Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods use only single-language input for LM finetuning, without leveraging the intrinsic cross-lingual alignment between different languages that proves essential for multilingual tasks. In this paper, we propose FILTER, an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. Specifically, FILTER first encodes...

10.1609/aaai.v35i14.17512 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Accurate prediction of RNA three-dimensional (3D) structures remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to the scarcity of experimentally determined data, complicates computational prediction efforts. Here we present RhoFold+, an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences. By integrating...

10.1038/s41592-024-02487-0 article EN cc-by-nc-nd Nature Methods 2024-11-21

This proposal introduces a Dialogue Challenge for building end-to-end task-completion dialogue systems, with the goal of encouraging the dialogue research community to collaborate and benchmark on standard datasets and a unified experimental environment. In this special session, we will release human-annotated conversational data in three domains (movie-ticket booking, restaurant reservation, and taxi booking), as well as an experiment platform with built-in simulators in each domain, for training and evaluation purposes. The final...

10.48550/arxiv.1807.11125 preprint EN other-oa arXiv (Cornell University) 2018-01-01

Existing language model compression methods mostly use a simple L_2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through intermediate layers of the teacher via...

10.18653/v1/2020.emnlp-main.36 article EN cc-by 2020-01-01

The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed, and the protein language model incorporates rich evolutionary and structural...

10.1038/s41587-024-02353-6 article EN cc-by-nc-nd Nature Biotechnology 2024-08-09

Motivation: Protein contacts contain key information for the understanding of protein structure and function, and thus contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not extremely useful for de novo structure prediction. Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural...

10.1101/073239 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2016-09-03

In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains...

10.1093/bib/bbae245 article EN cc-by-nc Briefings in Bioinformatics 2024-05-23

Dietary l-carnitine produces γ-butylbetaine (γBB) in a gut-microbiota-dependent manner in humans, and γBB has been proven to be an intermediate product possibly associated with incident cardiovascular diseases or major adverse events. Eliminating or reducing the production of microbiota-dependent γBB may contribute to adjuvant therapy for cardiovascular diseases. However, to date, our understanding of the metabolic gene clusters (MGCs) in microorganisms remains limited. To solve this problem, we constructed a manually curated cluster...

10.3390/microorganisms13020225 article EN cc-by Microorganisms 2025-01-21

Homo-oligomerization of biological macromolecules leads to functional assemblies that are critical to understanding various cellular processes. However, RNA quaternary structures have been rarely reported. Comparative genomics analysis has identified RNA families containing hundreds of sequences that adopt conserved secondary structures and likely fold into complex three-dimensional (3D) structures. We use cryo-electron microscopy (cryo-EM) to determine structures from four RNA families, including ARRPOF and OLE forming dimers, and ROOL and GOLLD...

10.1126/science.adv3451 article EN Science 2025-03-13

We present SpliceTransformer (SpTransformer), a deep-learning framework that predicts tissue-specific RNA splicing alterations linked to human diseases based on genomic sequence. SpTransformer outperforms all previous methods on splicing prediction. Application to approximately 1.3 million genetic variants in the ClinVar database reveals that splicing alterations account for 60% of intronic and synonymous pathogenic mutations, and occur at different frequencies across tissue types. Importantly, the tissue-specific splicing alterations match their clinical manifestations...

10.1038/s41467-024-53088-6 article EN cc-by-nc-nd Nature Communications 2024-10-23

10.1007/978-3-319-46227-1_1 article EN Lecture notes in computer science 2016-01-01