- Machine Learning in Bioinformatics
- Genomics and Phylogenetic Studies
- Protein Structure and Dynamics
- Scientific Computing and Data Management
- Computational Drug Discovery Methods
- Machine Learning in Materials Science
- RNA and protein synthesis mechanisms
- Modular Robots and Swarm Intelligence
- Cell Image Analysis Techniques
- SARS-CoV-2 and COVID-19 Research
- Genetics, Bioinformatics, and Biomedical Research
- Image Retrieval and Classification Techniques
- Robotics and Automated Systems
- Advanced Image and Video Retrieval Techniques
- Single-cell and spatial transcriptomics
- Bioinformatics and Genomic Networks
- Microbial Metabolic Engineering and Bioproduction
- CRISPR and Genetic Engineering
- Biomedical Text Mining and Ontologies
- Innovative Microfluidic and Catalytic Techniques Innovation
- Advanced Electron Microscopy Techniques and Applications
- Digital Imaging for Blood Diseases
- Virus-based gene therapy research
- vaccines and immunoinformatics approaches
- Color Science and Applications
Argonne National Laboratory
2022-2024
Pacific Lutheran University
2020-2021
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first...
With new developments in biomedical technology, it is now a viable therapeutic treatment to alter genes with techniques like CRISPR. At the same time, it is increasingly cheaper to perform whole genome sequencing, resulting in rapid advancement in gene therapy and gene editing in precision medicine. Understanding the current industry and academic applications of gene therapy provides an important backdrop to future scientific developments. Additionally, machine learning and artificial intelligence techniques allow for the reduction of time and money spent...
Cryo-electron microscopy (cryo-EM) has become a major experimental technique to determine the structures of large protein complexes and molecular assemblies, as evidenced by the 2017 Nobel Prize. Although cryo-EM has been drastically improved to generate high-resolution three-dimensional (3D) maps that contain detailed structural information about macromolecules, the computational methods for using the data to automatically build structure models are lagging far behind. The traditional cryo-EM model building approach is...
Advances in robotic automation, high-performance computing, and artificial intelligence encourage us to propose large, general-purpose science factories with the scale needed to tackle large discovery problems and to support thousands of scientists.
N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition, and k-spaced nucleic acid pairs. Subsequently, these features were optimized using minimum...
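The k-mer encoding mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain normalized-frequency encoding over the alphabet ACGT; the function name `kmer_features` and the choice `k=2` are hypothetical, and the paper's other encodings and its feature selection step are not shown.

```python
from collections import Counter
from itertools import product


def kmer_features(seq: str, k: int = 2) -> list[float]:
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Illustrative only: the name and defaults are assumptions, not taken
    from the study described above.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, e.g. AA, AC, AG, AT, CA, ... for k = 2.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of width k over the sequence, one base at a time.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = max(len(windows), 1)  # avoid division by zero on short input
    return [counts[kmer] / total for kmer in vocab]


features = kmer_features("ACGTAC", k=2)
print(len(features))  # one entry per possible 2-mer
```

Windows containing characters outside ACGT (for example N) are simply not counted in this sketch, so the vector may sum to less than 1 for such input.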
Advances in robotic automation, high-performance computing (HPC), and artificial intelligence (AI) encourage us to conceive of science factories: large, general-purpose computation- and AI-enabled self-driving laboratories (SDLs) with the generality and scale needed both to tackle large discovery problems and to support thousands of scientists. Science factories require modular hardware and software that can be replicated for scale and (re)configured to support many applications. To this end, we propose a prototype science factory architecture...
Predicting protein function from protein sequence is a main challenge in the computational biology field. Traditional methods that search protein sequences against existing databases may not work well in practice, particularly when little or no homology exists in the database. We introduce the ProLanGO2 method, which utilizes natural language processing and machine learning techniques to tackle the protein function prediction problem with protein sequence as input. Our method has been benchmarked blindly in the latest Critical Assessment of protein Function Annotation algorithms...
The Estimation of Model Accuracy problem is a cornerstone problem in the field of Bioinformatics. As of CASP14, there are 79 global QA methods, and a minority of 39 residue-level QA methods, with very few of them working on protein complexes. Here, we introduce ZoomQA, a novel, single-model method for assessing the accuracy of a tertiary protein structure/complex prediction at residue level, which has many applications such as drug discovery. ZoomQA differs from others by considering the change in chemical and physical features of a fragment...
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science...
Self Driving Labs (SDLs) that combine automation of experimental procedures with autonomous decision making are gaining popularity as a means of increasing the throughput of scientific workflows. The task of identifying quantities of supplied colored pigments that match a target color, the color matching problem, provides a simple and flexible SDL test case, as it requires experiment proposal, sample creation, and sample analysis, three common components in autonomous discovery applications. We present a robotic solution to the color matching problem that allows for...
Deduplication is a major focus for assembling and curating training datasets for large language models (LLMs): detecting and eliminating additional instances of the same content in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH,...
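As background for the MinHash family of methods this abstract builds on, the following is a minimal sketch of plain MinHash similarity estimation between two documents. It is not LSHBloom: the abstract is truncated before the extension is described, so the banding and Bloom-filter parts are not shown. The function names, the word 3-gram shingling, and the 64 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into overlapping word n-grams (hypothetical choice of n)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash of the set under num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "large language models are trained on large collections of documents"
doc_b = "large language models are trained on large collections of web pages"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimated_jaccard(sig_a, sig_b))
```

A deduplication pipeline would typically treat pairs whose estimate exceeds some threshold as near-duplicates; the threshold is a tuning choice not specified here.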
Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLMs) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions....
Large language models (LLMs) trained on vast biological datasets can learn biological motifs and correlations across the evolutionary landscape of natural proteins. LLMs can then be used for de novo design of novel proteins with specific structures, functions, and physicochemical properties. We employ a pre-trained genome-scale language model that uses codons as tokens and integrate it into a workflow for targeted generation of sequences. Our framework suggests new gene sequences that are ranked for downstream evaluation by metrics...
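To make "codons as tokens" concrete, here is a minimal sketch of codon-level tokenization of a gene sequence. The function names and the 64-entry vocabulary in lexicographic ACGT order are hypothetical; the model's actual vocabulary, special tokens, and handling of ambiguous bases are not stated in the abstract.

```python
from itertools import product

# Hypothetical vocabulary: every 3-base codon over ACGT, in lexicographic order.
CODON_TO_ID = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}


def codon_tokens(seq: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping 3-base codons.

    Trailing bases that do not fill a complete codon are dropped
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def encode(seq: str) -> list[int]:
    """Map each codon to its integer id in the hypothetical vocabulary."""
    return [CODON_TO_ID[codon] for codon in codon_tokens(seq)]


print(codon_tokens("ATGGCCTAA"))
print(encode("ATGGCCTAA"))
```

Tokenizing by codon rather than by single nucleotide shortens the sequence by a factor of three and aligns each token with one amino acid or stop signal in a coding region.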
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. It is essential to exploit an effective method to predict secretory proteins of malaria parasites to develop effective cures and treatment. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarized the machine learning-based identification algorithms and compared the construction strategies between different computational methods. Also, we discussed...
As the body of genomic product data increases at a much faster rate than can be annotated, computational analysis of protein function has never been more important. In this research, we introduce a novel protein function prediction method, HMMeta, which is based on the prominent natural language prediction technique Hidden Markov Models (HMM). With a new representation of protein sequence as a language, we trained a unique HMM for each Gene Ontology (GO) term taken from the UniProt database, which in total has 27,451 unique GO IDs, leading to the creation of 27,451 Hidden Markov Models. We...
Motivation: It has been a challenge for biologists to determine the 3D shapes of proteins from a linear chain of amino acids and to understand how proteins carry out life’s tasks. Experimental techniques, such as X-ray crystallography or Nuclear Magnetic Resonance, are time-consuming. This highlights the importance of computational methods for protein structure predictions. In the field of protein structure prediction, ranking the predicted protein decoys and selecting the one closest to the native structure is known as protein model quality assessment (QA), or accuracy estimation...
Motivation: The Estimation of Model Accuracy problem is a cornerstone problem in the field of Bioinformatics. When predictions are made for proteins of which we do not know the native structure, we run into an issue to tell how good a tertiary structure prediction is, especially the protein binding regions, which are useful for drug discovery. Currently, most methods only evaluate the overall quality of a protein decoy, and few can work on the residue level and protein complex. Here we introduce ZoomQA, a novel, single-model method for assessing the accuracy of a tertiary protein structure / complex...