Diogo Pratas

ORCID: 0000-0003-1176-552X
Research Areas
  • Algorithms and Data Compression
  • Genomics and Phylogenetic Studies
  • Fractal and DNA sequence analysis
  • Computability, Logic, AI Algorithms
  • Machine Learning in Bioinformatics
  • Bacteriophages and microbial interactions
  • RNA and protein synthesis mechanisms
  • Gene expression and cancer classification
  • DNA and Biological Computing
  • Chromosomal and Genetic Variations
  • Molecular Biology Techniques and Applications
  • Parvovirus B19 Infection Studies
  • Advanced Data Storage Technologies
  • Viral-associated cancers and disorders
  • Plant and Fungal Interactions Research
  • Plant Virus Research Studies
  • Polyomavirus and related diseases
  • Scientific Computing and Data Management
  • Natural Language Processing Techniques
  • Viral Infections and Outbreaks Research
  • Benford’s Law and Fraud Detection
  • Forensic and Genetic Research
  • Cancer Genomics and Diagnostics
  • Genomics and Chromatin Dynamics
  • Computational Drug Discovery Methods

University of Aveiro
2015-2024

University of Helsinki
2019-2024

Helsinki University Hospital
2021-2024

Institute of Electronics
2023-2024

The ever increasing growth of the production of high-throughput sequencing data poses a serious challenge to the storage, processing and transmission of these data. As frequently stated, it is a data deluge. Compression is essential to address this challenge: it reduces storage space costs, along with speeding up transmission. In this paper, we provide a comprehensive survey of existing compression approaches that are specialized for biological data, including protein and DNA sequences. Also, we devote an important part of the paper...

10.3390/info7040056 article EN cc-by Information 2016-10-14

Abstract Background: Cassava (Manihot esculenta) is an important clonally propagated food crop in tropical and subtropical regions worldwide. Genetic gain by molecular breeding has been limited, partially because cassava is a highly heterozygous crop with a repetitive and difficult-to-assemble genome. Findings: Here we demonstrate that Pacific Biosciences high-fidelity (HiFi) sequencing reads, in combination with the assembler hifiasm, produced genome assemblies at near complete haplotype resolution with higher...

10.1093/gigascience/giac028 article EN cc-by GigaScience 2022-01-01

Abstract: Little is known on the landscape of viruses that reside within our cells, nor on the interplay with the host imperative for their persistence. Yet, a lifetime of interactions conceivably have an imprint on our physiology and immune phenotype. In this work, we revealed the genetic make-up and unique composition of the eukaryotic human DNA virome in nine organs (colon, liver, lung, heart, brain, kidney, skin, blood, hair) of 31 Finnish individuals. By integration of quantitative (qPCR) and qualitative (hybrid-capture sequencing)...

10.1093/nar/gkad199 article EN cc-by Nucleic Acids Research 2023-03-23

Abstract Motivation: The data deluge phenomenon is becoming a serious problem in most genomic centers. To alleviate it, general purpose tools, such as gzip, are used to compress the data. However, although pervasive and easy to use, these tools fall short when the intention is to reduce the data as much as possible, for example, for medium- and long-term storage. A number of algorithms have been proposed for the compression of genomics data, but unfortunately only a few of them have been made available as usable and reliable tools. Results: In this article, we...

10.1093/bioinformatics/btt594 article EN cc-by-nc Bioinformatics 2013-10-16

Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference sequence. It overcomes some drawbacks of the recently proposed GRS,...

10.1093/nar/gkr1124 article EN cc-by-nc Nucleic Acids Research 2011-12-01

The number of genomic sequences is growing substantially. Besides discarding part of the data, the only efficient possibility for coping with this trend is data compression. We present an efficient compressor for genomic sequences, allowing both reference-free and referential compression. This compressor uses a mixture of context models of several orders, according to two model classes: reference and target. A new type of context model, which is capable of tolerating substitution errors, is introduced. For ensuring flexibility regarding hardware specifications, cache-hashes...

10.1109/dcc.2016.60 article EN 2016-03-01

Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available; thus, novel diagnosis tools and druggable targets are needed. Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences, with lengths between 12 and 14. Only three absent sequences of length 12 exist, and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used...

10.1093/bioinformatics/btv189 article EN cc-by-nc Bioinformatics 2015-04-02

Abstract Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine...

10.1093/gigascience/giaa119 article EN cc-by GigaScience 2020-11-01

The increasing availability of viral sequences has led to the emergence of many optimized viral genome reconstruction tools. Given that the number of new tools is steadily increasing, it is complex to identify functional tools that offer an equilibrium between accuracy and computational resources, as well as the features each tool provides. In this paper, we surveyed open-source tools (including pipelines) used for human viral genome reconstruction, identifying the specific characteristics, features, similarities, and dissimilarities between these tools. For quantitative...

10.1101/2025.01.17.633368 preprint EN cc-by-nc bioRxiv (Cold Spring Harbor Laboratory) 2025-01-22

Advances in sequencing technologies have enabled the characterization of multiple microbial and host genomes, opening new frontiers of knowledge while kindling novel applications and research perspectives. Among these is the investigation of the viral communities residing in the human body and their impact on health and disease. To this end, the study of samples from multiple tissues is critical, yet the complexity of such analysis calls for a dedicated pipeline. We provide an automatic and efficient pipeline for the identification, assembly, and analysis of viral genomes that...

10.1093/gigascience/giaa086 article EN cc-by GigaScience 2020-08-01

The ability of finite-context models for compressing DNA sequences has been demonstrated in some recent works. In this paper, we further explore this line, proposing a compression method based on eight finite-context models, with orders from two to sixteen, whose probabilities are averaged using weights calculated through a recursive procedure. The method was tested on a total of 2,338 sequences belonging to bacterial genomes, with sizes ranging from 1,286 to 13,033,779 bases, showing better compression results than the state-of-the-art XM DNA coding algorithm and also faster...

10.1109/ssp.2011.5967637 article EN 2011-06-01

Abstract: Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness...

10.1038/srep10203 article EN cc-by Scientific Reports 2015-05-18

Abstract Summary: The ever-increasing growth of high-throughput sequencing technologies has led to a great acceleration of medical and biological research and discovery. As these platforms advance, the amount of information for diverse genomes increases at unprecedented rates. Confidentiality, integrity and authenticity of such genomic information should be ensured due to its extremely sensitive nature. In this paper, we propose Cryfa, a fast secure encryption tool for genomic data, namely in Fasta, Fastq, VCF, SAM and BAM formats, which is...

10.1093/bioinformatics/bty645 article EN cc-by-nc Bioinformatics 2018-07-18

The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. We present XS,...

10.1186/1756-0500-7-40 article EN cc-by BMC Research Notes 2014-01-01

The implications of inherited chromosomally integrated human herpesvirus 6 (iciHHV-6) in solid organ transplantation remain uncertain. Although this trait has been linked to unfavorable clinical outcomes, an association between viral reactivation and complications has only been conclusively established in a few cases. We used hybrid capture sequencing for in-depth analysis of the viral sequences reconstructed from sequential liver biopsies. Moreover, we investigated viral replication through in situ hybridization...

10.1093/infdis/jiae268 article EN cc-by The Journal of Infectious Diseases 2024-05-17

Authorship attribution is a classical classification problem. We use it here to illustrate the performance of a compression-based measure that relies on the notion of relative compression. Besides comparing with recent approaches that use multiple discriminant analysis and support vector machines, we compare it with the Normalized Conditional Compression Distance (a direct approximation of the Normalized Information Distance) and the popular Normalized Compression Distance. The Normalized Relative Compression (NRC) attained 100% correct classification in the data set used, showing consistency between...

10.1109/dcc.2016.53 article EN 2016-03-01

The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used...

10.3390/e21111074 article EN cc-by Entropy 2019-11-02

Abstract Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents...

10.1093/gigascience/giaa048 article EN cc-by GigaScience 2020-05-01

The long-term impact of viruses residing in the human bone marrow (BM) remains unexplored. However, chronic inflammatory processes driven by single or multiple viruses could significantly alter hematopoiesis and immune function. We performed a systematic analysis of the DNAs of 38 viruses in the BM. We detected, by quantitative PCRs and next-generation sequencing, viral DNA in 88.9% of the samples, up to five viruses in one individual. Included were, among others, several herpesviruses, hepatitis B virus, Merkel cell polyomavirus and, unprecedentedly,...

10.3389/fcimb.2021.657245 article EN cc-by Frontiers in Cellular and Infection Microbiology 2021-04-22

Abstract Background: Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically enlightens viral genomes’ organization, relation, and fundamental characteristics. Results: This work provides a comprehensive landscape of the viral genome’s complexity (or quantity of information), identifying the most redundant and complex groups regarding...

10.1093/gigascience/giac079 article EN cc-by GigaScience 2022-01-01

Abstract: Bacterial biofilms are a source of infectious human diseases and are heavily linked to antibiotic resistance. Pseudomonas aeruginosa is a multidrug-resistant bacterium widely present and implicated in several hospital-acquired infections. Over the last years, the development of new drugs able to inhibit Pseudomonas aeruginosa by interfering with its ability to form biofilms has become a promising strategy in drug discovery. Identifying molecules able to interfere with biofilm formation is difficult, but further developing these molecules by rationally improving their...

10.1007/s10822-023-00505-5 article EN cc-by Journal of Computer-Aided Molecular Design 2023-04-22

The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature...

10.3390/genes9090445 article EN Genes 2018-09-06