NFDI4DS | UHH-SEMS - Publication Details

Jacob Morrison

ORCID: 0000-0001-8592-4744

About

Contact & Profiles

A5088629436

Research Areas

Natural Language Processing Techniques
Topic Modeling
Epigenetics and DNA Methylation
RNA modifications and cancer
Cancer-related gene regulation
Single-cell and spatial transcriptomics
Semantic Web and Ontologies
Genomics and Phylogenetic Studies
Domain Adaptation and Few-Shot Learning
Information Systems Education and Curriculum Development
Underwater Vehicles and Communication Systems
Experimental Learning in Engineering
Structural Integrity and Reliability Analysis
Mechanical stress and fatigue analysis
Privacy, Security, and Data Protection
Ethics and Social Impacts of AI
Hate Speech and Cyberbullying Detection
Human Pose and Action Recognition
Maritime Navigation and Safety
Conflict of Laws and Jurisdiction
Freedom of Expression and Defamation
Underwater Acoustics Research
Cancer Genomics and Diagnostics
Speech Recognition and Synthesis
Open Education and E-Learning

Van Andel Institute
2021-2025

University of St Andrews
2023

University of Edinburgh
2023

Allen Institute
2022

University of Washington
1977-2022

Middle Georgia State College
2012

Qinetiq (United Kingdom)
2003

OLMo: Accelerating the Science of Language Models

OPENALEX - Publications

Dirk Groeneveld Iz Beltagy Evan Pete Walsh Akshita Bhagia Rodney Kinney and 38 more

10.18653/v1/2024.acl-long.841 article EN 2024-01-01

BISCUIT: an efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies

OPENALEX - Publications

Wanding Zhou Benjamin K. Johnson Jacob Morrison Ian Beddows James Eapen and 8 more

Data from both bulk and single-cell whole-genome DNA methylation experiments are under-utilized in many ways. This is attributable to inefficient mapping of sequencing reads, routinely discarded genetic information, and neglected read-level epigenetic linkage information. We introduce the BISulfite-seq Command line User Interface Toolkit (BISCUIT) and its companion R/Bioconductor package, biscuiteer, for simultaneous extraction information sequencing. BISCUIT’s performance, flexibility...

10.1093/nar/gkae097 article EN cc-by Nucleic Acids Research 2024-02-27

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

OPENALEX - Publications

Luca Soldaini Rodney Kinney Akshita Bhagia Dustin Schwenk David Atkinson and 31 more

10.18653/v1/2024.acl-long.840 article EN 2024-01-01

Evaluation of whole-genome DNA methylation sequencing library preparation protocols

OPENALEX - Publications

Jacob Morrison Julie Koeman Benjamin K. Johnson Kelly K. Foy Ian Beddows and 6 more

Background: With rapidly dropping sequencing cost, the popularity of whole-genome DNA methylation has been on the rise. Multiple library preparation protocols currently exist. We have performed 22 experiments snap frozen human samples, and extensively benchmarked common for sequencing, including three traditional bisulfite-based a new enzyme-based protocol. In addition, different input quantities were compared two kits compatible with reduced starting quantity. We also present...

10.1186/s13072-021-00401-y article EN cc-by Epigenetics & Chromatin 2021-06-19

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

OPENALEX - Publications

Luca Soldaini Rodney Kinney Akshita Bhagia Dustin Schwenk David Atkinson and 31 more

Language models have become a critical technology to tackling wide range of natural language processing tasks, yet many details about how the best-performing were developed are not reported. In particular, information their pretraining corpora is seldom discussed: commercial rarely provide any data; even open release datasets they trained on, or an exact recipe reproduce them. As result, it challenging conduct certain threads modeling research, such as understanding training data impacts...

10.48550/arxiv.2402.00159 preprint EN arXiv (Cornell University) 2024-01-31

Impact of BRCA mutations, age, surgical indication, and hormone status on the molecular phenotype of the human Fallopian tube

OPENALEX - Publications

Ian Beddows Svetlana Djirackor Dalia K. Omran Euihye Jung Natalie Shih and 15 more

10.1038/s41467-025-58145-2 article EN cc-by-nc-nd Nature Communications 2025-03-26

Transparent Human Evaluation for Image Captioning

OPENALEX - Publications

Jungo Kasai Keisuke Sakaguchi Lavinia Dunagan Jacob Morrison Ronan Le Bras and 2 more

Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah Smith. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.254 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

OLMo: Accelerating the Science of Language Models

OPENALEX - Publications

Dirk Groeneveld Iz Beltagy Pete Walsh Akshita Bhagia Rodney Kinney and 38 more

Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their importance has surged, the most powerful closed off, gated behind proprietary interfaces, with important details of training data, architectures, development undisclosed. Given these scientifically studying models, including biases potential risks, we believe it is essential for community to access powerful, truly open LMs. To this end, technical report first release OLMo, a...

10.48550/arxiv.2402.00838 preprint EN arXiv (Cornell University) 2024-02-01

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

OPENALEX - Publications

David Wadden Kejian Shi Jacob Morrison Aakanksha Naik Shruti Singh and 8 more

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, classification. are notable their long input contexts, detailed task specifications, complex structured outputs. While resources available in specific domains such as clinical medicine chemistry,...

10.48550/arxiv.2406.07835 preprint EN arXiv (Cornell University) 2024-06-10

Dupsifter: a lightweight duplicate marking tool for whole genome bisulfite sequencing

OPENALEX - Publications

Jacob Morrison Wanding Zhou Benjamin K. Johnson Hui Shen

Summary: In whole genome sequencing data, polymerase chain reaction amplification results in duplicate DNA fragments coming from the same location genome. The process of preparing a bisulfite (WGBS) library, on other hand, can create two that should not be considered duplicates. Currently, only one WGBS-aware marking tool exists. However, it works with output single tool, does accept streaming input or output, and requires substantial amount memory relative to size. Dupsifter...

10.1093/bioinformatics/btad729 article EN cc-by Bioinformatics 2023-12-01

Gambit MCM AUV: overview and system performance

OPENALEX - Publications

Jacob Morrison Breanna Evans T.S. James Kevin Allen

The Marine and Acoustics Centre (MAC) at QinetiQ Bincleaves is currently pioneering MCM research development in support of the UK MoD by undertaking a programme work focussed on developing demonstrating surveillance reconnaissance from an AUV. It recognised that operations impose key demands vehicle sensor performance levels, with recent analysis highlighting four main areas essential capability, which form focus programme. These are: ability to correctly detect classify targets; accurately...

10.1109/oceans.2003.178137 article EN Oceans 2003. Celebrating the Past ... Teaming Toward the Future (IEEE Cat. No.03CH37492) 2003-01-01

A Legal Risk Taxonomy for Generative Artificial Intelligence

OPENALEX - Publications

David Atkinson Jacob Morrison

For the first time, this paper presents a taxonomy of legal risks associated with generative AI (GenAI) by breaking down complex concepts to provide common understanding potential challenges for developing and deploying GenAI models. The methodology is based on (1) examining claims that have been filed in existing lawsuits (2) evaluating reasonably foreseeable may be future lawsuits. First, we identified 22 against prominent entities tallied each lawsuit. From there, seven are cited at least...

10.48550/arxiv.2404.09479 preprint EN arXiv (Cornell University) 2024-04-15

Unsettled Law: Time to Generate New Approaches?

OPENALEX - Publications

David Atkinson Jacob Morrison

We identify several important and unsettled legal questions with profound ethical societal implications arising from generative artificial intelligence (GenAI), focusing on its distinguishable characteristics traditional software earlier AI models. Our key contribution is formally identifying the issues that are unique to GenAI so scholars, practitioners, others can conduct more useful investigations discussions. While established frameworks, many originating pre-digital era, currently...

10.48550/arxiv.2407.01968 preprint EN arXiv (Cornell University) 2024-07-02

Intentionally Unintentional: GenAI Exceptionalism and the First Amendment

OPENALEX - Publications

David Atkinson Jena D. Hwang Jacob Morrison

10.2139/ssrn.4964912 preprint EN 2024-01-01

High-coverage allele-resolved single-cell DNA methylation profiling by scDEEP-mC reveals cell lineage, X-inactivation state, and replication dynamics

OPENALEX - Publications

Nathan J. Spix Walid Abi Habib Zhouwei Zhang Emily Eugster Hsiao-yun Milliron and 11 more

DNA methylation is a relatively stable epigenetic mark with important roles in development and disease. Since cell-to-cell variation programming can reflect differences cell state fate, it clear that single-cell methods are essential to understanding this key heterogeneous tissues. Existing whole-genome bisulfite sequencing (scWGBS) have significant shortcomings, including very low CpG coverage or inefficient library generation requiring extremely deep sequencing. These offer limited insight...

10.1101/2024.10.01.616139 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2024-10-03

Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

OPENALEX - Publications

Jacob Morrison Noah A. Smith Hannaneh Hajishirzi Pang Wei Koh Jesse Dodge and 1 more

Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as instruction datasets targeting are created, or can cause the forget older skills. In this work, we investigate effectiveness of adding preexisting by training on in isolation and later merging with general model (e.g. using task vectors). experiments focusing scientific literature understanding, safety, coding, find parallel-train-then-merge procedure, which significantly cheaper...

10.48550/arxiv.2410.12937 preprint EN arXiv (Cornell University) 2024-10-16

Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

OPENALEX - Publications

Jacob Morrison Noah A. Smith Hannaneh Hajishirzi Pang Wei Koh Jesse Dodge and 1 more

10.18653/v1/2024.findings-emnlp.915 article EN 2024-01-01

On the effect of curriculum learning with developmental data for grammar acquisition

OPENALEX - Publications

Mattia Opper Jacob Morrison N. Siddharth

This work explores the degree to which grammar acquisition is driven by language 'simplicity' and source modality (speech vs. text) of data. Using BabyBERTa (Huebner et al., 2021) as a probe, we find that largely exposure speech data, in particular through two BabyLM (Warstadt 2023) training corpora: AO-Childes Open Subtitles. We arrive at this finding examining various ways presenting input data our model. First, assess impact sequence-level complexity based curricula. We then examine learning...

10.18653/v1/2023.conll-babylm.31 article EN cc-by 2023-01-01

Underwater electrical cable and connector seals: Some in-house design options, commercial options, and performance/failure analyses

OPENALEX - Publications

C.J. Sandwith James Paradis Jacob Morrison

Very limited information is currently available on the options to engineers who need select or design an appropriate connector cable seal for long-term underwater use. To help solve this problem, Applied Physics Laboratory has compiled a Reference Manual Interference Seals and Connectors Undersea Electrical Applications that (1) presents factors in theory of seals, (2) provides manufacturers commercially products, (3) gives bibliographies pertinent military specifications technical...

10.1109/oceans.1977.1154459 article EN 1977-01-01

DESIGN AND IMPLEMENTATION OF A NETWORK LAB TO ENHANCE UNDERGRADUATE NETWORKING AND INFORMATION ASSURANCE CURRICULUM IN A BACCALAUREATE DEGREE PROGRAM: A CASE STUDY

OPENALEX - Publications

Johnathan Yerby Kevin Floyd Jacob Morrison

The curriculum of a program in Information technology must be current and competitive to remain relevant valuable. The authors this paper explored the research related rationale supplement higher education theoretical knowledge networking information assurance with opportunities for students programs gains some hands-on experience. The also used widely accepted learning theories active constructivism assist decision build lab environment. An explanation processes, opportunities, challenges,...

10.48009/1_iis_2012_321-330 article EN cc-by-nc-nd Issues in Information Systems 2012-01-01

Evaluation of Whole-Genome DNA Methylation Sequencing Library Preparation Protocols

OPENALEX - Publications

Jacob Morrison Julie Koeman Benjamin K. Johnson Kelly K. Foy Wanding Zhou and 5 more

Background: With rapidly dropping sequencing cost, the popularity of whole-genome DNA methylation has been on the rise. Multiple library preparation protocols exist, but a systematic evaluation and benchmarking their performance against each other is currently lacking. We have performed 22 experiments fresh frozen human samples, extensively benchmarked common for sequencing, including three traditional bisulfite-based new enzyme-based protocol. Additionally, different input quantities...

10.21203/rs.3.rs-249202/v1 preprint EN cc-by Research Square (Research Square) 2021-03-01

On the effect of curriculum learning with developmental data for grammar acquisition

OPENALEX - Publications

Mattia Opper Jacob Morrison N. Siddharth

This work explores the degree to which grammar acquisition is driven by language 'simplicity' and source modality (speech vs. text) of data. Using BabyBERTa as a probe, we find that largely exposure speech data, in particular through two BabyLM training corpora: AO-Childes Open Subtitles. We arrive at this finding examining various ways presenting input data our model. First, assess impact sequence-level complexity based curricula. then examine learning over 'blocks' -- covering spans text...

10.48550/arxiv.2311.00128 preprint EN other-oa arXiv (Cornell University) 2023-01-01
