NFDI4DS | UHH-SEMS - Publication Details

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

OPENALEX - Publications

Maxim Zvyagin Alexander Brace Kyle Hippe Yuntian Deng Bin Zhang and 31 more

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first...

10.1177/10943420231201154 article EN The International Journal of High Performance Computing Applications 2023-10-27

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Maxim Zvyagin Alexander Brace Kyle Hippe Yuntian Deng Bin Zhang and 29 more

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first...

10.1101/2022.10.10.511571 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2022-10-11

Recent Progress of Machine Learning in Gene Therapy

Cassandra Hunt Sandra K. Montgomery Joshua William Berkenpas Noel Sigafoos John Christian Oakley and 8 more

With new developments in biomedical technology, it is now a viable therapeutic treatment to alter genes with techniques like CRISPR. At the same time, it is increasingly cheaper to perform whole genome sequencing, resulting in rapid advancement in gene therapy and gene editing in precision medicine. Understanding the current industry and academic applications of gene therapy provides an important backdrop to future scientific developments. Additionally, machine learning and artificial intelligence allow for a reduction of time and money spent...

10.2174/1566523221666210622164133 article EN Current Gene Therapy 2021-06-23

Artificial intelligence advances for de novo molecular structure modeling in cryo‐electron microscopy

Dong Si Andrew Nakamura Runbang Tang Haowen Guan Jie Hou and 4 more

Cryo-electron microscopy (cryo-EM) has become a major experimental technique to determine the structures of large protein complexes and molecular assemblies, as evidenced by the 2017 Nobel Prize. Although cryo-EM has been drastically improved to generate high-resolution three-dimensional (3D) maps that contain detailed structural information about macromolecules, the computational methods for using the data to automatically build structure models are lagging far behind. The traditional model building approach is...

10.1002/wcms.1542 article EN Wiley Interdisciplinary Reviews Computational Molecular Science 2021-05-15

Towards a modular architecture for science factories

Rafael Vescovi Tobias Ginsburg Kyle Hippe Doga Ozgulbas C. R. Stone and 12 more

Advances in robotic automation, high-performance computing, and artificial intelligence encourage us to propose large, general-purpose science factories with the scale needed to tackle large discovery problems and to support thousands of scientists.

10.1039/d3dd00142c article EN cc-by-nc Digital Discovery 2023-01-01

Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method

Hasan Zulfiqar Rida Sarwar Khan Farwa Hassan Kyle Hippe Cassandra Hunt and 3 more

N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, sequences were encoded by k-mer, enhanced nucleic acid composition and k-spaced nucleic acid pairs. Subsequently, these features were optimized using minimum...

10.3934/mbe.2021167 article EN cc-by Mathematical Biosciences & Engineering 2021-01-01

Towards a Modular Architecture for Science Factories

Rafael Vescovi Tobias Ginsburg Kyle Hippe Doga Ozgulbas C. R. Stone and 12 more

Advances in robotic automation, high-performance computing (HPC), and artificial intelligence (AI) encourage us to conceive of science factories: large, general-purpose computation- and AI-enabled self-driving laboratories (SDLs) with the generality and scale needed both to tackle large discovery problems and to support thousands of scientists. Science factories require modular hardware and software that can be replicated for scale and (re)configured to support many applications. To this end, we propose a prototype modular science factory architecture...

10.48550/arxiv.2308.09793 preprint EN cc-by arXiv (Cornell University) 2023-01-01

ProLanGO2

Kyle Hippe Sola Gbenro Renzhi Cao

Predicting protein function from sequence is a main challenge in the computational biology field. Traditional methods that search sequences against existing databases may not work well in practice, particularly when little or no homology exists in the database. We introduce the ProLanGO2 method, which utilizes natural language processing and machine learning techniques to tackle the protein function prediction problem with protein sequence as input. Our method has been benchmarked blindly in the latest Critical Assessment of Function Annotation algorithms...

10.1145/3388440.3414701 article EN 2020-09-21

ZoomQA: residue-level protein model accuracy estimation with machine learning on sequential and 3D structural features

Kyle Hippe Cade Lilley Joshua William Berkenpas Ciri Chandana Pocha Kiyomi Kishaba and 4 more

The Estimation of Model Accuracy problem is a cornerstone problem in the field of Bioinformatics. As of CASP14, there are 79 global QA methods, and a minority of 39 residue-level QA methods, with very few of them working on protein complexes. Here, we introduce ZoomQA, a novel, single-model method for assessing the accuracy of a tertiary protein structure/complex prediction at the residue level, which has many applications such as drug discovery. ZoomQA differs from others by considering the change in chemical and physical features of a fragment...

10.1093/bib/bbab384 article EN Briefings in Bioinformatics 2021-09-08

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

Shuaiwen Leon Song Bonnie Kruft Minjia Zhang Conglong Li Shiyang Chen and 87 more

In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai) which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science...

10.48550/arxiv.2310.04610 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Exploring Benchmarks for Self-Driving Labs using Color Matching

Tobias Ginsburg Kyle Hippe Ryan Lewis Aileen Cleary Doga Ozgulbas and 5 more

Self Driving Labs (SDLs) that combine automation of experimental procedures with autonomous decision making are gaining popularity as a means of increasing the throughput of scientific workflows. The task of identifying quantities of supplied colored pigments that match a target color, the color matching problem, provides a simple and flexible SDL test case, as it requires experiment proposal, sample creation, and sample analysis, three common components in autonomous discovery applications. We present a robotic solution to the color matching problem that allows for...

10.1145/3624062.3624615 article EN 2023-11-10

LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Arham Khan Robert Underwood Carlo Siebenschuh Yadu Babuji Aswathy Ajith and 5 more

Deduplication is a major focus for assembling and curating training datasets for large language models (LLM) -- detecting and eliminating additional instances of the same content in large collections of technical documents. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Contemporary approaches to document-level deduplication are often extremely expensive in both runtime and memory. We propose LSHBloom, an extension to MinhashLSH,...

10.48550/arxiv.2411.04257 preprint EN arXiv (Cornell University) 2024-11-06

BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery

Peter C. St. John Dejun Lin P. Binder Malcolm W. Greaves Varun Shah and 83 more

Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions....

10.48550/arxiv.2411.10548 preprint EN arXiv (Cornell University) 2024-11-15

MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization

Gautham Dharuman Kyle Hippe Alexander Brace Sam Foreman Väinö Hatanpää and 25 more

10.1109/sc41406.2024.00013 article EN 2024-11-17

Protein Generation via Genome-scale Language Models with Bio-physical Scoring

Gautham Dharuman Logan Ward Heng Ma Priyanka V. Setty Ozan Gökdemir and 10 more

Large language models (LLMs) trained on vast biological datasets can learn motifs and correlations across the evolutionary landscape of natural proteins. LLMs can then be used for de novo design of novel proteins with specific structures, functions, and physicochemical properties. We employ a pre-trained genome-scale language model that uses codons as tokens and integrate it into a workflow for targeted generation of sequences. Our framework suggests new gene sequences that are ranked for downstream evaluation by metrics...

10.1145/3624062.3626087 article EN 2023-11-10

The Development of Machine Learning Methods in Discriminating Secretory Proteins of Malaria Parasite

Ting Liu Jiamao Chen Qian Zhang Kyle Hippe Cassandra Hunt and 3 more

Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. It is essential to exploit an effective method to predict secretory proteins of malaria parasites to develop effective cures and treatment. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarized the machine learning-based algorithms and compared the construction strategies between different computational methods. Also, we discussed...

10.2174/0929867328666211005140625 article EN Current Medicinal Chemistry 2021-10-23

HMMeta

Sola Gbenro Kyle Hippe Renzhi Cao

As the body of genomic product data increases at a much faster rate than can be annotated, computational analysis of protein function has never been more important. In this research, we introduce a novel prediction method, HMMeta, which is based on the prominent natural language technique of Hidden Markov Models (HMM). With a new representation of protein sequence as a language, we trained a unique HMM for each Gene Ontology (GO) term taken from the UniProt database, in total 27,451 GO IDs, leading to the creation of 27,451 Hidden Markov Models. We...

10.1145/3388440.3414702 article EN 2020-09-21

SynthQA - Hierarchical Machine Learning-Based Protein Quality Assessment

Mikhail Korovnik Kyle Hippe Jie Hou Dong Si Kiyomi Kishaba and 1 more

Motivation: It has been a challenge for biologists to determine 3D shapes of proteins from a linear chain of amino acids and understand how proteins carry out life’s tasks. Experimental techniques, such as X-ray crystallography or Nuclear Magnetic Resonance, are time-consuming. This highlights the importance of computational methods for protein structure predictions. In the field of protein structure prediction, ranking the predicted protein decoys and selecting the one closest to the native structure is known as protein model quality assessment (QA), or accuracy estimation...

10.1101/2021.01.28.428710 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2021-01-30

ZoomQA: Residue-Level Single-Model QA Support Vector Machine Utilizing Sequential and 3D Structural Features

Kyle Hippe Cade Lilley William Berkenpas Kiyomi Kishaba Renzhi Cao

Motivation: The Estimation of Model Accuracy problem is a cornerstone problem in the field of Bioinformatics. When predictions are made for proteins of which we do not know the native structure, we run into an issue to tell how good a tertiary structure prediction is, especially the protein binding regions, which are useful for drug discovery. Currently, most methods only evaluate the overall quality of a protein decoy, and few can work on the residue level and protein complex. Here we introduce ZoomQA, a novel, single-model method for assessing the accuracy of a tertiary protein structure / complex...

10.1101/2021.01.28.428680 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2021-01-30

Abstract 2521: AI-enabled multiscale modeling of SARS-CoV-2 replication transcription complex

Arvind Ramanathan Anda Trifan Defne G. Ozgulbas Alexander Brace Kyle Hippe and 4 more

10.1016/j.jbc.2023.103443 article EN cc-by Journal of Biological Chemistry 2023-01-01

Exploring Benchmarks for Self-Driving Labs using Color Matching

Tobias Ginsburg Kyle Hippe Ryan Lewis Doga Ozgulbas Aileen Cleary and 4 more

Self Driving Labs (SDLs) that combine automation of experimental procedures with autonomous decision making are gaining popularity as a means of increasing the throughput of scientific workflows. The task of identifying quantities of supplied colored pigments that match a target color, the color matching problem, provides a simple and flexible SDL test case, as it requires experiment proposal, sample creation, and sample analysis, three common components in autonomous discovery applications. We present a robotic solution to the color matching problem that allows for...

10.48550/arxiv.2310.00510 preprint EN other-oa arXiv (Cornell University) 2023-01-01