- Genomics and Phylogenetic Studies
- RNA modifications and cancer
- Chromosomal and Genetic Variations
- RNA and protein synthesis mechanisms
- Epigenetics and DNA Methylation
- Plant Virus Research Studies
- Environmental DNA in Biodiversity Studies
- Advanced Graph Theory Research
- Complexity and Algorithms in Graphs
- Cancer-related molecular mechanisms research
- Machine Learning in Bioinformatics
- Algorithms and Data Compression
- Advanced Data Storage Technologies
- Graph Labeling and Dimension Problems
- Enzyme Structure and Function
- Sparse and Compressive Sensing Techniques
- Plant Disease Resistance and Genetics
- Optimization and Packing Problems
- Protist diversity and phylogeny
- Cancer-related gene regulation
- Parallel Computing and Optimization Techniques
- Genomics and Rare Diseases
- graph theory and CDMA systems
- Radiation Effects in Electronics
- Protein Structure and Dynamics
Central South University
2017-2024
Harvard University
2023
Dana-Farber Cancer Institute
2023
Abstract. Motivation: Oxford Nanopore sequencing makes it possible to detect the methylation states of bases in DNA directly from reads, without extra laboratory techniques. Novel computational methods are required to improve the accuracy and robustness of methylation state prediction using Nanopore reads. Results: In this study, we develop DeepSignal, a deep learning method to detect DNA methylation states from Nanopore reads. Testing on Homo sapiens (H. sapiens), Escherichia coli (E. coli), and pUC19 shows that DeepSignal can achieve higher performance at both the read level and the genome level in detecting...
Abstract. Motivation: Evaluating gene completeness is critical to measuring the quality of a genome assembly. An incomplete assembly can lead to errors in gene predictions, annotation, and other downstream analyses. Benchmarking Universal Single-Copy Orthologs (BUSCO) is a widely used tool for assessing the completeness of genome assemblies by testing the presence of a set of single-copy orthologs conserved across a wide range of taxa. However, BUSCO is slow, particularly for large genome assemblies. It is cumbersome to apply BUSCO to a large number of assemblies. Results: Here, we present compleasm, an efficient...
Abstract. In plants, cytosine DNA methylation (5mC) can occur in three sequence contexts, CpG, CHG, and CHH (where H = A, C, or T), which play different roles in the regulation of biological processes. Although long Nanopore reads are advantageous for the detection of 5mCs compared with short-read bisulfite sequencing, existing methods can only detect 5mCs in the CpG context, which limits their application in plants. Here, we develop DeepSignal-plant, a deep learning tool to detect genome-wide 5mCs of all three contexts in plants from Nanopore reads. We sequence Arabidopsis...
Long single-molecular sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction-treated and M.SssI-methyltransferase-treated DNA of one human sample using CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth...
Abstract. The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers fail to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent...
Abstract. Motivation: Evaluating the completeness of a genome assembly is a critical assessment of the accuracy and reliability of genomic data. An incomplete assembly can lead to errors in gene predictions, annotation, and other downstream analyses. BUSCO is one of the most widely used tools for assessing the completeness of genome assemblies by comparing the presence of a set of single-copy orthologs conserved across a wide range of taxa. However, the runtime of BUSCO can be long, particularly for some large genome assemblies. It is a challenge for researchers to quickly iterate on assemblies or analyze a large number...
The highly portable Oxford Nanopore sequencer, which produces long reads in real time at low cost, has enabled many breakthroughs in genomics studies. However, a major limitation of nanopore sequencing is its high error rate when deciphering DNA sequences from noisy and complex raw data. In this paper, we developed an end-to-end basecaller, SACall, based on convolution layers, transformer self-attention layers, and a CTC decoder. The convolution layers are used to downsample the signals and capture local patterns. To achieve contextual relevance...
Abstract. Long single-molecular sequencing, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, is advantageous in detecting DNA 5-methylcytosine (5mC) in CpGs, especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence PCR-treated and M.SssI-treated DNA of one human sample using CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 AUC on...
Oxford Nanopore sequencing, which produces long reads at low cost, has enabled many breakthroughs in genomics studies. However, the large number of errors in Nanopore genome assemblies affects the accuracy of genome analysis. Polishing is a procedure to correct the errors in a genome assembly and can improve the reliability of downstream analysis. However, the performances of existing polishing methods are still not satisfactory. We developed a novel polishing method, NeuralPolish, to correct the errors in assemblies based on alignment matrix construction and orthogonal Bi-GRU networks. In this method, we designed an alignment feature matrix for representing...
Compared with second-generation sequencing technologies, third-generation sequencing technologies allow us to obtain longer reads (average ∼10 kbps, maximum 900 kbps) but bring a higher error rate (∼15% error rate). Nanopolish is a variant and methylation detection tool based on a hidden Markov model, which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve the accuracy of assembly, but it is limited by its long running time, since most of its executive parts are serial and computationally expensive...
Determining the structures of proteins is a critical step in understanding their biological functions. The crystallography-based X-ray diffraction technique is the main method for experimental protein structure determination. However, the underlying crystallization process, which needs multiple time-consuming and costly steps, has a high attrition rate. To overcome this issue, a series of in silico methods have been developed with the primary aim of selecting the protein sequences that are promising to be crystallized. predictive...
Abstract. Motivation: Oxford Nanopore sequencing has great potential and advantages in population-scale studies. Due to the cost of sequencing, the depth of whole-genome sequencing per individual sample must be small. However, existing single nucleotide polymorphism (SNP) callers are aimed at high-coverage Nanopore sequencing reads. Detecting SNP variants on low-coverage Nanopore sequencing data is still a challenging problem. Results: We developed a novel deep learning-based SNP calling method, NanoSNP, to identify the SNP sites (excluding short indels) based on low-coverage Nanopore sequencing reads. In...
The highly portable Oxford Nanopore sequencer, which produces long reads in real time at low cost, has enabled many breakthroughs in genomics studies. However, a major limitation of nanopore sequencing is its high error rate when deciphering DNA sequences from noisy and complex raw data. Here we develop SACall, an end-to-end basecaller based on convolution layers, transformer self-attention layers, and a CTC decoder. From the perspective of read accuracy, SACall yields better performance on the benchmark than the ONT official...
Abstract. Oxford Nanopore sequencing makes it possible to detect methylation sites in DNA directly from reads, without extra laboratory techniques. In this study, we develop DeepSignal, a deep learning method to detect DNA methylated sites from Nanopore sequencing reads. DeepSignal constructs features from both raw electrical signals and signal sequences in Nanopore reads. Testing on Nanopore reads of pUC19, E. coli, and human, we show that DeepSignal can achieve both higher read-level and genome-level accuracy in detecting 6mA and 5mC methylation compared with previous HMM-based methods. Moreover, DeepSignal achieves similar performance cross...
Abstract. Methylation states of DNA bases can be detected directly from native Nanopore reads. At present, there are many computational methods that can detect 5mCs in CpG contexts accurately by Nanopore sequencing. However, there is currently a lack of methods to detect 5mCs in non-CpG contexts. In this study, we propose a pipeline which can detect 5mC sites in both CpG and non-CpG contexts of plant genomes using Nanopore sequencing. We sequenced two model plants, Arabidopsis thaliana (A. thaliana) and Oryza sativa (O. sativa), using both Nanopore sequencing and bisulfite sequencing. The results of our proposed pipeline achieved high correlations with...
The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers fail to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in...
The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads for improving the results of error correction and assembly. In this study, we...
Abstract. Long-read sequencing technology enables significant progress in de novo genome assembly. However, the high error rate and wide error distribution of raw reads result in a large number of errors in the assembly. Polishing is a procedure to fix errors in the draft assembly and improve the reliability of genomic analysis. However, existing methods treat all regions of the assembly equally, while there are fundamental differences between the error distributions of these regions. How to achieve very high accuracy in genome assembly is still a challenging problem. Motivated by the uneven errors in different regions of the assembly, we...
Abstract. Third-generation sequencing technology has advanced genome analysis with its long read length, but the reads need error correction due to the high error rate. Error correction is a time-consuming process, especially when the sequencing coverage is high. Generally, for a pair of overlapping reads A and B, existing error correction methods perform a base-level alignment from B to A when correcting read A, and another base-level alignment from A to B is performed when correcting read B. However, based on our observation, the base-level alignment information can be reused. In this article, we present a fast error correction tool, Fec, using two-rounds overlapping and caching....