- RNA and protein synthesis mechanisms
- Genomics and Phylogenetic Studies
- Machine Learning in Bioinformatics
- Chromosomal and Genetic Variations
- Genomics and Chromatin Dynamics
- RNA modifications and cancer
- Liver Disease and Transplantation
- Glycosylation and Glycoproteins Research
- Bacteriophages and microbial interactions
- Metabolomics and Mass Spectrometry Studies
- Advanced Proteomics Techniques and Applications
- Liver Disease Diagnosis and Treatment
- Molecular Biology Techniques and Applications
- Advanced biosensing and bioanalysis techniques
- Cellular Automata and Applications
- Genetic Mapping and Diversity in Plants and Animals
- DNA and Nucleic Acid Chemistry
- Plant Molecular Biology Research
- Music and Audio Processing
- Vaccines and immunoinformatics approaches
- Gene Regulatory Network Analysis
- Genomic variations and chromosomal abnormalities
- Handwritten Text Recognition Techniques
- Genomics and Rare Diseases
- Molecular Junctions and Nanostructures
Pennsylvania State University
2023-2025
Penn State Milton S. Hershey Medical Center
2023-2025
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large datasets, offering several advantages in speed and memory efficiency and carrying the potential for intrinsic biological functionality....
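To make the k-mer idea in the abstract above concrete, here is a minimal sketch of overlapping k-mer counting. The function name `count_kmers` and the toy sequence are illustrative assumptions, not code from the paper.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence of length n has n - k + 1 overlapping k-mers (none if n < k).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy input, chosen only for illustration.
counts = count_kmers("ACGTACGT", 3)
print(counts.most_common())
```

A hash-based counter like this is simple but memory-hungry on real genomes; dedicated k-mer counters use more compact encodings.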
Repetitive DNA sequences can form noncanonical structures such as H-DNA. The new telomere-to-telomere genome assembly for the human genome has eliminated gaps, enabling the examination of highly repetitive regions, including centromeric and pericentromeric repeats and ribosomal DNA arrays. We find that H-DNA appears once every 25 000 base pairs in the human genome. Its distribution is inhomogeneous, with H-DNA motif hotspots being detectable in acrocentric chromosomes. Ribosomal DNA arrays are the genomic element with a 40.94-fold enrichment....
Determining the organisms present in a biosample has many important applications in agriculture, wildlife conservation, and healthcare. Here, we develop a universal fingerprint based on the identification of short peptides that are unique to a specific organism. We define quasi-prime peptides as sequences that are found in only one species, and we analyzed proteomes from 21 875 species, from viruses to humans, and annotated the smallest peptide kmer sequences that are unique to a species and absent from all other proteomes. We also perform simulations across reference proteomes and observe lower than expected...
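The quasi-prime definition above (a peptide k-mer found in exactly one species) can be sketched as follows. This is a toy in-memory version under stated assumptions: the function name `quasi_primes`, the input layout (species name mapped to a list of protein strings), and the example proteomes are all hypothetical, and the paper's actual pipeline is not described here.

```python
from collections import defaultdict


def quasi_primes(proteomes: dict[str, list[str]], k: int) -> dict[str, set[str]]:
    """Return, per species, the peptide k-mers seen in that species only."""
    # Map each k-mer to the set of species whose proteome contains it.
    owners: dict[str, set[str]] = defaultdict(set)
    for species, proteins in proteomes.items():
        for protein in proteins:
            for i in range(len(protein) - k + 1):
                owners[protein[i:i + k]].add(species)

    # Keep only k-mers owned by exactly one species.
    result: dict[str, set[str]] = {species: set() for species in proteomes}
    for kmer, species_set in owners.items():
        if len(species_set) == 1:
            (species,) = species_set
            result[species].add(kmer)
    return result


# Hypothetical two-species example.
toy = {"species_a": ["MKVL"], "species_b": ["KVLA"]}
print(quasi_primes(toy, 3))
```

Running the function for increasing `k` and stopping at the first length that yields a non-empty set would give the smallest species-specific k-mers the abstract mentions.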
Whole Genome and Proteome Alignments, represented by the Multiple Alignment File (MAF) format, have become a standard approach in comparative genomics and proteomics. These alignments often require identifying conserved motifs, which is crucial for understanding functional and evolutionary relationships. However, current approaches lack a direct method for motif detection within MAF files. We present MAFin, a novel tool that enables efficient motif detection and conservation analysis in MAF files to address this gap, streamlining genomic...
Despite the exponential increase in sequencing information driven by massively parallel DNA sequencing technologies, universal and succinct genomic fingerprints for each organism are still missing. Identifying the shortest species-specific nucleotide sequences offers insights into species evolution and holds potential for practical applications in agriculture, wildlife conservation, and healthcare. We propose a new method for sequence analysis termed nucleic “quasi-primes,” the shortest sequences occurring in each of 45,076 organismal reference genomes,...
The identification of succinct, universal fingerprints that enable the characterization of individual taxonomies can reveal insights into trait development and have widespread applications in pathogen diagnostics, human healthcare, ecology, and biomes. Here, we investigated the existence of peptide k-mer sequences that are exclusively present in a specific taxonomy and absent in every other taxonomic level, termed quasi-primes. By analyzing proteomes across 24,073 species, we identified quasi-prime peptides specific to...
Z-DNA is an alternative left-handed helical form of DNA with a zigzag-shaped backbone that differs from the right-handed canonical B-DNA helix. Z-DNA has been implicated in various biological processes, including transcription, replication, and DNA repair, and can induce genetic instability. Repetitive sequences with alternating purines and pyrimidines have the potential to adopt Z-DNA structures. ZSeeker is a novel computational tool developed for the accurate detection of Z-DNA-forming sequences in genomes, addressing limitations of prior methods....
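The abstract above notes that runs of alternating purines and pyrimidines can adopt Z-DNA. The sketch below only locates such alternating runs; it is not ZSeeker's method, whose scoring is not described in this text. The function names and the `min_len` threshold are assumptions for illustration.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def _alternates(a: str, b: str) -> bool:
    """True if one base is a purine and the other a pyrimidine."""
    return (a in PURINES and b in PYRIMIDINES) or (a in PYRIMIDINES and b in PURINES)


def alternating_runs(seq: str, min_len: int = 6) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans of purine/pyrimidine alternation.

    min_len is a hypothetical cutoff (use a value of at least 2).
    """
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # Extend the current run while adjacent bases keep alternating.
        if i < len(seq) and _alternates(seq[i - 1], seq[i]):
            continue
        if i - start >= min_len:
            runs.append((start, i))
        start = i
    return runs


seq = "AACGCGCGTT"
for start, end in alternating_runs(seq):
    print(start, end, seq[start:end])
```

Real detectors typically weight dinucleotide steps differently (for example, GC steps versus AT steps) rather than treating every alternation equally.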
Inverted repeats are repetitive elements that can form hairpin and cruciform structures. They are linked to genomic instability; however, they also have various biological functions. Their distribution differs markedly across taxonomic groups in the tree of life, and they exhibit high polymorphism due to their inherent genomic instability. Advances in sequencing technologies and declined costs have enabled the generation of an ever-growing number of complete genomes for organisms across the tree of life. However, a comprehensive database...
Short tandem repeats (STRs) are widespread, dynamic repetitive elements with a number of biological functions and relevance to human diseases. However, their prevalence across taxa remains poorly characterized. Here we examined the impact of STRs in the genomes of 117,253 organisms spanning the tree of life. We find that there are large differences in the frequencies of STRs between organismal genomes and these differences are largely driven by the taxonomic group an organism belongs to. Using simulated genomes, we find that on average there is no enrichment of STRs in bacterial...
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses, to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive...
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs...
Inverted repeats (IRs) can form alternative DNA secondary structures called hairpins and cruciforms, which have a multitude of functional roles and have been associated with genomic instability. However, their prevalence across diverse organismal genomes remains only partially understood. Here, we examine the prevalence of IRs across 118,065 complete organismal genomes. Our comprehensive analysis across taxonomic subdivisions reveals significant differences in the distribution, frequency, and biophysical properties of perfect IRs among these genomes. We identify...
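A perfect inverted repeat, as discussed above, is an arm followed (after an optional spacer) by the arm's reverse complement. The following is a naive scan under stated assumptions: the function names, the fixed `arm_len`, and the `max_spacer` bound are hypothetical parameters, and the input is assumed to contain only A, C, G, and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, arm_len: int, max_spacer: int) -> list[tuple[int, int, str]]:
    """Return (start, spacer_length, left_arm) for each perfect inverted repeat."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for start in range(n - 2 * arm_len + 1):
        left = seq[start:start + arm_len]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            if right_start + arm_len > n:
                break  # the right arm would run past the end of the sequence
            if seq[right_start:right_start + arm_len] == target:
                hits.append((start, spacer, left))
    return hits


# Left arm ACGGT, spacer AAA, right arm ACCGT (the reverse complement of ACGGT).
print(find_inverted_repeats("ACGGTAAAACCGT", arm_len=5, max_spacer=4))
```

This scan costs O(n × max_spacer × arm_len) and reports fixed-length arms only; genome-scale tools extend arms maximally and index the sequence instead.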
G-quadruplex DNA structures exhibit a profound influence on essential biological processes, including transcription, replication, telomere maintenance, and genomic stability. These structures have demonstrably shaped organismal evolution. However, a comprehensive, organism-wide G-quadruplex map encompassing the diversity of life has remained elusive. Here, we introduce Quadrupia, the most extensive and well-characterized G-quadruplex database to date, facilitating the exploration of G-quadruplex structures across the evolutionary spectrum. Quadrupia has identified G-quadruplex sequences...
Short tandem repeats (STRs) are widespread, repetitive elements, with a number of biological functions, and are among the most rapidly mutating regions in the genome. Their distribution varies significantly between taxonomic groups in the tree of life, and they are highly polymorphic within the human population. Advances in sequencing technologies coupled with decreasing costs have enabled the generation of an ever-growing number of complete genomes. Additionally, the arrival of accurate long reads has facilitated the generation of Telomere-to-Telomere (T2T) assemblies...
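As a rough illustration of what locating STRs involves, the sketch below uses a back-referencing regular expression. The thresholds (motifs of 1 to 6 bp, at least 3 consecutive copies) and the function name `find_strs` are assumptions chosen for the example, not the criteria used in the work described above.

```python
import re

# Shortest motif of 1-6 bases, followed by at least two more copies of itself.
_STR_PATTERN = re.compile(r"([ACGT]{1,6}?)\1{2,}")


def find_strs(seq: str) -> list[tuple[int, str, int]]:
    """Return (start, motif, copy_number) for each perfect tandem repeat."""
    results = []
    for match in _STR_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        results.append((match.start(), motif, copies))
    return results


print(find_strs("GGCACACACATT"))
```

The pattern only finds perfect repeats and non-overlapping matches; tolerating interruptions or mismatches needs a dedicated repeat finder.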
Early detection of human disease is associated with improved clinical outcomes. However, many diseases are often detected at an advanced, symptomatic stage where patients are past efficacious treatment periods and can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce “frequentmers”; short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility...
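The frequentmer definition above (short sequences recurrently seen in patients or in controls, but not both) can be sketched as a set comparison over per-sample k-mer sets. The function names, the `min_samples` recurrence threshold, and the toy reads are hypothetical; the abstract does not state the thresholds actually used.

```python
from collections import Counter


def sample_kmers(seq: str, k: int) -> set[str]:
    """Distinct k-mers present in one sample's sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequentmers(patients: list[str], controls: list[str], k: int,
                 min_samples: int) -> tuple[set[str], set[str]]:
    """Return (patient_specific, control_specific) recurrent k-mers."""
    # Count in how many samples of each group a k-mer occurs.
    patient_counts = Counter(km for s in patients for km in sample_kmers(s, k))
    control_counts = Counter(km for s in controls for km in sample_kmers(s, k))

    # Recurrent in one group and never observed in the other.
    patient_specific = {km for km, n in patient_counts.items()
                        if n >= min_samples and km not in control_counts}
    control_specific = {km for km, n in control_counts.items()
                        if n >= min_samples and km not in patient_counts}
    return patient_specific, control_specific


# Toy sequences standing in for per-sample data.
print(frequentmers(["ACGTT", "GACGT"], ["TTTTA", "CTTTT"], k=3, min_samples=2))
```

Counting samples rather than raw occurrences keeps a single repeat-rich sample from dominating the recurrence estimate.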
The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. Here we examined 45,785 reference genomes and 21,871 reference proteomes, spanning archaea, bacteria, viruses and eukaryotes, to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the Anti-Kardashian index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate...
The rapid decline in sequencing cost has enabled the generation of reference genomes and proteomes for a growing number of organisms. However, at the present time, there is no established repository that provides information about organism-specific genomic and proteomic sequences of certain lengths, also known as kmers, that are either present or absent in each genome and proteome. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer based information from genomic and proteomic sequences in a systematic way. kmerDB currently...
Repetitive DNA sequences can form non-canonical structures such as H-DNA, which is an intramolecular triplex DNA structure. The new Telomere-to-Telomere (T2T) genome assembly for the human genome has eliminated gaps, enabling the examination of highly repetitive regions, including centromeric and pericentromeric repeats and ribosomal DNA arrays. This gapless assembly allows the examination of the H-DNA distribution in parts of the genome that were not previously annotated. We find that H-DNA appears once every 30,000 bps in the human genome. Its distribution is inhomogeneous, with H-DNA motif hotspots...
Inverted repeats are repetitive elements that can form hairpin and cruciform structures. They are linked to genomic instability; however, they also have various biological functions. Their distribution differs markedly across taxonomic groups in the tree of life, and they exhibit high polymorphism due to their inherent genomic instability. Advances in sequencing technologies and declined costs have enabled the generation of an ever-growing number of complete genomes for organisms across the tree of life. However, a comprehensive database encompassing...