Jacob Morrison

ORCID: 0000-0001-8592-4744
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Epigenetics and DNA Methylation
  • RNA modifications and cancer
  • Cancer-related gene regulation
  • Single-cell and spatial transcriptomics
  • Semantic Web and Ontologies
  • Genomics and Phylogenetic Studies
  • Domain Adaptation and Few-Shot Learning
  • Information Systems Education and Curriculum Development
  • Underwater Vehicles and Communication Systems
  • Experimental Learning in Engineering
  • Structural Integrity and Reliability Analysis
  • Mechanical stress and fatigue analysis
  • Privacy, Security, and Data Protection
  • Ethics and Social Impacts of AI
  • Hate Speech and Cyberbullying Detection
  • Human Pose and Action Recognition
  • Maritime Navigation and Safety
  • Conflict of Laws and Jurisdiction
  • Freedom of Expression and Defamation
  • Underwater Acoustics Research
  • Cancer Genomics and Diagnostics
  • Speech Recognition and Synthesis
  • Open Education and E-Learning

Van Andel Institute
2021-2025

University of St Andrews
2023

University of Edinburgh
2023

Allen Institute
2022

University of Washington
1977-2022

Middle Georgia State College
2012

QinetiQ (United Kingdom)
2003

Abstract: Data from both bulk and single-cell whole-genome DNA methylation experiments are under-utilized in many ways. This is attributable to inefficient mapping of sequencing reads, routinely discarded genetic information, and neglected read-level epigenetic linkage information. We introduce the BISulfite-seq Command line User Interface Toolkit (BISCUIT) and its companion R/Bioconductor package, biscuiteer, for the simultaneous extraction of genetic and epigenetic information from DNA methylation sequencing. BISCUIT’s performance, flexibility...

10.1093/nar/gkae097 article EN cc-by Nucleic Acids Research 2024-02-27

Abstract Background: With rapidly dropping sequencing cost, the popularity of whole-genome DNA methylation sequencing has been on the rise. Multiple library preparation protocols currently exist. We have performed 22 whole-genome DNA methylation sequencing experiments on snap frozen human samples, and extensively benchmarked common library preparation protocols for whole-genome DNA methylation sequencing, including three traditional bisulfite-based protocols and a new enzyme-based protocol. In addition, different input DNA quantities were compared for two kits compatible with a reduced starting quantity. We also present...

10.1186/s13072-021-00401-y article EN cc-by Epigenetics & Chromatin 2021-06-19

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release the datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts...

10.48550/arxiv.2402.00159 preprint EN arXiv (Cornell University) 2024-01-31

Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah Smith. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.254 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a...

10.48550/arxiv.2402.00838 preprint EN arXiv (Cornell University) 2024-02-01

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry,...

10.48550/arxiv.2406.07835 preprint EN arXiv (Cornell University) 2024-06-10

Abstract Summary: In whole genome sequencing data, polymerase chain reaction amplification results in duplicate DNA fragments coming from the same location in the genome. The process of preparing a whole genome bisulfite sequencing (WGBS) library, on the other hand, can create two DNA fragments from the same location that should not be considered duplicates. Currently, only one WGBS-aware duplicate marking tool exists. However, it only works with the output from a single tool, does not accept streaming input or output, and requires a substantial amount of memory relative to the input size. Dupsifter...

10.1093/bioinformatics/btad729 article EN cc-by Bioinformatics 2023-12-01

The Marine and Acoustics Centre (MAC) at QinetiQ Bincleaves is currently pioneering MCM research and development in support of the UK MoD by undertaking a programme of work focussed on developing and demonstrating MCM surveillance and reconnaissance from an AUV. It is recognised that MCM operations impose key demands on vehicle and sensor performance levels, with recent analysis highlighting four main areas essential to capability, which form the focus of the programme. These are: the ability to correctly detect and classify targets; the ability to accurately...

10.1109/oceans.2003.178137 article EN Oceans 2003. Celebrating the Past ... Teaming Toward the Future (IEEE Cat. No.03CH37492) 2003-01-01

For the first time, this paper presents a taxonomy of legal risks associated with generative AI (GenAI) by breaking down complex legal concepts to provide a common understanding of potential legal challenges for developing and deploying GenAI models. The methodology is based on (1) examining the legal claims that have been filed in existing lawsuits and (2) evaluating the reasonably foreseeable legal claims that may be filed in future lawsuits. First, we identified 22 lawsuits against prominent GenAI entities and tallied the claims of each lawsuit. From there, we identified seven claims that are cited at least...

10.48550/arxiv.2404.09479 preprint EN arXiv (Cornell University) 2024-04-15

We identify several important and unsettled legal questions with profound ethical and societal implications arising from generative artificial intelligence (GenAI), focusing on its distinguishable characteristics from traditional software and earlier AI models. Our key contribution is formally identifying the issues that are unique to GenAI so that scholars, practitioners, and others can conduct more useful investigations and discussions. While established legal frameworks, many originating from the pre-digital era, currently...

10.48550/arxiv.2407.01968 preprint EN arXiv (Cornell University) 2024-07-02

DNA methylation is a relatively stable epigenetic mark with important roles in development and disease. Since cell-to-cell variation in epigenetic programming can reflect important differences in cell state and fate, it is clear that single-cell methods are essential to understanding this key epigenetic mark in heterogeneous tissues. Existing single-cell whole-genome bisulfite sequencing (scWGBS) methods have significant shortcomings, including very low CpG coverage or inefficient library generation requiring extremely deep sequencing. These methods offer limited insight...

10.1101/2024.10.01.616139 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2024-10-03

Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper...

10.48550/arxiv.2410.12937 preprint EN arXiv (Cornell University) 2024-10-16

This work explores the degree to which grammar acquisition is driven by language 'simplicity' and the source modality (speech vs. text) of data. Using BabyBERTa (Huebner et al., 2021) as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular through exposure to two of the BabyLM (Warstadt et al., 2023) training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of sequence-level complexity based curricula. We then examine the impact of learning...

10.18653/v1/2023.conll-babylm.31 article EN cc-by 2023-01-01

Very limited information is currently available on the options available to engineers who need to select or design an appropriate connector or cable seal for long-term underwater use. To help solve this problem, the Applied Physics Laboratory has compiled a Reference Manual on Interference Seals and Connectors for Undersea Electrical Applications that (1) presents the important factors in the theory of seals, (2) provides information on manufacturers of commercially available products, (3) gives bibliographies of pertinent military specifications and technical...

10.1109/oceans.1977.1154459 article EN 1977-01-01

The curriculum of a program in Information technology must be current and competitive to remain relevant and valuable. The authors of this paper explored the research related to the rationale to supplement higher education theoretical knowledge in networking and information assurance with opportunities for students in the programs to gain some hands-on experience. The authors also used the widely accepted learning theories of active learning and constructivism to assist in the decision to build a lab environment. An explanation of the processes, opportunities, challenges,...

10.48009/1_iis_2012_321-330 article EN cc-by-nc-nd Issues in Information Systems 2012-01-01

Abstract Background: With rapidly dropping sequencing cost, the popularity of whole-genome DNA methylation sequencing has been on the rise. Multiple library preparation protocols exist, but a systematic evaluation and benchmarking of their performance against each other is currently lacking. We have performed 22 whole-genome DNA methylation sequencing experiments on fresh frozen human samples, and extensively benchmarked common library preparation protocols for whole-genome DNA methylation sequencing, including three traditional bisulfite-based protocols and a new enzyme-based protocol. Additionally, different input DNA quantities...

10.21203/rs.3.rs-249202/v1 preprint EN cc-by Research Square (Research Square) 2021-03-01

This work explores the degree to which grammar acquisition is driven by language 'simplicity' and the source modality (speech vs. text) of data. Using BabyBERTa as a probe, we find that grammar acquisition is largely driven by exposure to speech data, and in particular through exposure to two of the BabyLM training corpora: AO-Childes and Open Subtitles. We arrive at this finding by examining various ways of presenting input data to our model. First, we assess the impact of sequence-level complexity based curricula. We then examine the impact of learning over 'blocks' -- covering spans of text...

10.48550/arxiv.2311.00128 preprint EN other-oa arXiv (Cornell University) 2023-01-01