- Medical Imaging Techniques and Applications
- Advanced X-ray and CT Imaging
- Data Management and Algorithms
- Advanced Database Systems and Queries
- Topic Modeling
- Natural Language Processing Techniques
- Data Quality and Management
- Medical Image Segmentation Techniques
- Advanced MRI Techniques and Applications
- Data Stream Mining Techniques
- Digital Radiography and Breast Imaging
- Radiation Dose and Imaging
- Bayesian Modeling and Causal Inference
- Graph Theory and Algorithms
- Web Data Mining and Analysis
- Adversarial Robustness in Machine Learning
- Data Visualization and Analytics
- Privacy-Preserving Technologies in Data
- Artificial Intelligence in Healthcare and Education
- Biomedical Text Mining and Ontologies
- Machine Learning and Data Classification
- Computational Physics and Python Applications
- Hydrocarbon Exploration and Reservoir Analysis
- Digital Image Processing Techniques
- Geological Modeling and Analysis
Stanford University
2020-2022
Microsoft (United States)
2019-2021
Salesforce (United States)
2021
University of Washington
2014-2020
Microsoft Research (United Kingdom)
2019
Sandia National Laboratories
2013-2014
Sandia National Laboratories California
2012-2014
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems,...
Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs. We find that FMs achieve SoTA on these tasks, even though they were not trained for them. We identify...
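To make the prompting formulation concrete, here is a minimal sketch of casting entity matching (one classical integration task) as a prompt; the `complete` function and the record fields are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal sketch: entity matching framed as a prompting task.
# `complete` is a placeholder for any LLM completion API.

def complete(prompt: str) -> str:
    """Placeholder LLM call; substitute a real completion function here."""
    return "Yes"  # dummy output so the sketch runs end-to-end

def serialize(record: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def entity_match_prompt(a: dict, b: dict) -> str:
    return (
        "Do the two product records refer to the same real-world entity?\n"
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Answer Yes or No: "
    )

a = {"title": "Apple iPhone 13 128GB", "price": "699"}
b = {"title": "iPhone 13 (128 GB)", "price": "699.00"}
prediction = complete(entity_match_prompt(a, b)).strip().lower().startswith("yes")
print("match" if prediction else "no match")
```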
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English...
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt design, we instead ask whether producing multiple effective, yet imperfect,...
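As a rough illustration of combining several imperfect prompts, the sketch below majority-votes three hypothetical templates for a sentiment task; the paper's actual aggregation is more sophisticated (weak supervision rather than a plain vote), and `complete` is a placeholder for any LLM call:

```python
from collections import Counter

def complete(prompt: str) -> str:
    """Placeholder LLM call; swap in a real completion API."""
    return "positive"

# Several imperfect prompt templates for the same sentiment task.
TEMPLATES = [
    "Is the sentiment of this review positive or negative?\n{review}\nAnswer:",
    "Review: {review}\nThe reviewer's feeling is",
    "{review}\nQuestion: Was the customer satisfied? Answer positive or negative:",
]

def aggregate_prediction(review: str) -> str:
    # Query the model once per template, then take a simple majority vote.
    votes = [complete(t.format(review=review)).strip().lower() for t in TEMPLATES]
    normalized = ["positive" if "pos" in v else "negative" for v in votes]
    return Counter(normalized).most_common(1)[0][0]

print(aggregate_prediction("The battery died after two days."))
```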
With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgrounds. Wu-Madden (PVLDB 2013) and Roy-Suciu (SIGMOD 2014) recently proposed solutions that can explain interesting or unexpected answers to simple aggregate queries in terms of predicates on attributes. In this paper, we propose a generic framework to support much richer, insightful explanations by...
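A toy version of the intervention-style explanations this line of work builds on: score a candidate predicate by how much the aggregate answer changes when the matching tuples are removed. The table and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical sales table; names and values are illustrative only.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120, 80, 400, 390, 60],
})

def intervention_score(df, predicate, agg_col="revenue"):
    """How much does the aggregate drop if tuples matching the predicate are removed?"""
    baseline = df[agg_col].sum()
    without = df.loc[~df.apply(predicate, axis=1), agg_col].sum()
    return baseline - without

# Candidate explanation: rows where product == "A" in the US.
score = intervention_score(df, lambda r: r["region"] == "US" and r["product"] == "A")
print(score)  # contribution of that predicate to total revenue
```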
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system explicitly grounded in reasoning patterns for disambiguation. We define the core reasoning patterns for disambiguation, create a learning procedure to encourage the model to learn the patterns, and show how weak supervision...
Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now extend beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and add no overhead during execution. We study how to construct diPs for complex expressions and how their usefulness varies with the statistics used to construct them (e.g., distributions). Our results show that building diPs using zone-maps...
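A toy sketch of the idea, assuming a simple min/max (zone-map-style) summary: a filter on one table induces a range predicate on the join key that can prune the joining table before the join. The real system applies diPs inside a big-data query optimizer; the tables and columns below are illustrative only:

```python
import pandas as pd

# Toy tables; a filter on `orders` induces a predicate on `lineitem` via the join key.
orders = pd.DataFrame({"o_orderkey": range(1, 11),
                       "o_orderdate": pd.date_range("2023-01-01", periods=10)})
lineitem = pd.DataFrame({"l_orderkey": [1, 2, 2, 5, 9, 9, 10],
                         "l_quantity": [3, 1, 4, 2, 7, 1, 5]})

# Original predicate on orders.
filtered = orders[orders["o_orderdate"] >= "2023-01-06"]

# Data-induced predicate: min/max of the surviving join keys (a zone-map-style summary).
lo, hi = filtered["o_orderkey"].min(), filtered["o_orderkey"].max()

# Apply the induced range predicate to lineitem before the join, skipping irrelevant rows.
pruned = lineitem[(lineitem["l_orderkey"] >= lo) & (lineitem["l_orderkey"] <= hi)]
result = pruned.merge(filtered, left_on="l_orderkey", right_on="o_orderkey")
print(len(lineitem), "->", len(pruned), "lineitem rows after the induced predicate")
```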
Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We...
The industrial machine learning pipeline requires iterating on model features, training and deploying models, and monitoring deployed models at scale. Feature stores were developed to manage and standardize the engineer's workflow in this end-to-end pipeline, focusing on traditional tabular feature data. In recent years, however, model development has shifted towards using self-supervised pretrained embeddings as model features. Managing these embeddings and the downstream systems that use them introduces new challenges with respect...
We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how it can be used to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights...
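As a small worked example of the maximum-entropy idea: if the summary retains only single-attribute marginal counts, the maximum-entropy joint distribution consistent with those marginals is the independent product, which already yields approximate answers to conjunctive count queries. The attribute names and numbers below are illustrative, not from the paper:

```python
# Hedged sketch: approximate query answering from a marginals-only summary.
carriers = {"AA": 500, "DL": 300, "UA": 200}   # marginal: flights per carrier
delayed  = {"yes": 250, "no": 750}             # marginal: delayed vs. on time
total = sum(carriers.values())

def approx_count(carrier, is_delayed):
    """Approximate COUNT(*) WHERE carrier = ? AND delayed = ? from the summary.

    Under maximum entropy with only these marginals, the attributes are treated
    as independent, so the joint count is the product of marginals over the total.
    """
    return carriers[carrier] * delayed[is_delayed] / total

print(approx_count("AA", "yes"))   # ~125 expected delayed AA flights
```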
While much work has been done on applying GPU technology to computed tomography (CT) reconstruction algorithms, many of these implementations focus on smaller datasets that are better suited for medical applications. This paper proposes an irregular approach to the algorithm design which utilizes the hardware's unique cache structure and employs small x-ray image data prefetches from the host to upload to the GPUs while the devices operate on large contiguous sub-volumes of the reconstruction. This will improve overall cache hit-rates and thus...
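For context, the computation being accelerated is backprojection: every projection view is accumulated into every voxel. The NumPy sketch below shows that inner loop for a simple parallel-beam, nearest-neighbor case; it is purely illustrative and is not the paper's irregular, cache-aware GPU design:

```python
import numpy as np

def backproject(sinogram, angles, size):
    """Naive 2-D parallel-beam backprojection (nearest-neighbor sampling)."""
    recon = np.zeros((size, size), dtype=np.float32)
    xs, ys = np.meshgrid(np.arange(size) - size / 2, np.arange(size) - size / 2)
    for proj, theta in zip(sinogram, angles):
        # Detector coordinate of every voxel for this view.
        t = xs * np.cos(theta) + ys * np.sin(theta) + sinogram.shape[1] / 2
        idx = np.clip(t.astype(int), 0, sinogram.shape[1] - 1)
        recon += proj[idx]          # accumulate this view into the volume
    return recon / len(angles)

angles = np.linspace(0, np.pi, 180, endpoint=False)
sino = np.random.rand(180, 256).astype(np.float32)   # stand-in for measured projections
print(backproject(sino, angles, 256).shape)
```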
Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage apriori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype...
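A minimal sketch of the sample-reweighting half of the approach, assuming the known population aggregates are per-group counts: each sampled tuple is weighted by population count over sample count for its group, so aggregates over the reweighted sample approximate population answers. Names and numbers below are made up:

```python
import pandas as pd

# Apriori population aggregate information (true counts per region).
population_counts = {"EU": 6000, "US": 4000}

# A biased sample in which US tuples are heavily over-represented.
sample = pd.DataFrame({"region": ["EU"] * 20 + ["US"] * 80,
                       "spend":  [10] * 20 + [20] * 80})

# Weight each tuple so its group is scaled back to its population share.
sample_counts = sample["region"].value_counts()
sample["weight"] = sample["region"].map(lambda r: population_counts[r] / sample_counts[r])

# Approximate the population total spend from the reweighted sample.
print((sample["spend"] * sample["weight"]).sum())
```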
Estimation of the x-ray attenuation properties of an object with respect to the energy emitted from the source is a challenging task for traditional Bremsstrahlung sources. This exploratory work attempts to estimate the attenuation profile over an energy range given a measured profile. Previous work has shown that calculating a single effective attenuation value for a polychromatic source is not accurate due to non-linearities associated with the image formation process. Instead, we completely characterize the imaging system virtually and utilize an iterative search method/constrained optimization...
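The non-linearity referred to above comes from polychromatic image formation: the measured transmission is a spectrum-weighted sum of Beer-Lambert terms, so a single "effective" attenuation value changes with object thickness. A small illustration with made-up spectrum and attenuation values:

```python
import numpy as np

# Illustrative polychromatic source and material (values are made up).
energies = np.array([40.0, 60.0, 80.0, 100.0])   # keV bins
spectrum = np.array([0.4, 0.3, 0.2, 0.1])         # normalized source weights
mu = np.array([0.8, 0.5, 0.35, 0.25])             # attenuation (1/cm) at each energy

def transmission(thickness_cm):
    """Spectrum-weighted Beer-Lambert transmission through the object."""
    return np.sum(spectrum * np.exp(-mu * thickness_cm))

for t in (1.0, 5.0):
    mu_eff = -np.log(transmission(t)) / t          # "effective" value at this thickness
    print(f"t={t} cm -> effective mu = {mu_eff:.3f} 1/cm")  # differs with thickness
```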
We present the motivation, design, implementation, and preliminary evaluation of a service that enables astronomers to study the growth history of galaxies by following their `merger trees' in large-scale astrophysical simulations. The service uses the Myria parallel data management system as its back-end and the D3 visualization library within its graphical front-end. We demonstrate the service at the workshop on a ~5TB dataset.
This exploratory work investigates the feasibility of extracting linear attenuation functions with respect to energy from a multi-channel radiograph of an object of interest composed of a homogeneous material, by simulating the entire imaging system combined with a digital phantom and leveraging this information along with the acquired image. This synergistic combination allows for improved estimates of not only the effective energy, but also the spectrum that is coincident with the detector elements. Material composition identification from radiographs...
Conventional CPU-based algorithms for Computed Tomography (CT) reconstruction lack the computational efficiency necessary to process large, industrial datasets in a reasonable amount of time. Specifically, a single-pass, trillion volumetric pixel (voxel) reconstruction requires months of processing time on a high performance workstation. An optimized, single-workstation multi-GPU approach has shown performance increases of 2-3 orders of magnitude; however, future-size, trillion-voxel reconstructions can still take an entire day to complete....
Although there has been progress in applying GPU technology to Computed Tomography reconstruction algorithms, much of the work has concentrated on optimizing performance for smaller, medical-scale datasets. Industrial CT datasets can vary widely in size and number of projections. With new advancements in high-resolution cameras, it is entirely possible that the community may soon need to pursue a 100-megapixel detector for such applications. To reconstruct such a massive dataset, simply adding extra GPUs would not be an...
This work presents the utilization of the massively multi-threaded environment of graphics processors (GPUs) to improve the computation time needed to reconstruct large computed tomography (CT) datasets, along with the arising challenges for system implementation. Intelligent algorithm design for the GPU differs greatly from traditional CPU algorithm design. Although a brute-force port of an algorithm to a GPU kernel may yield non-trivial performance gains, further measurable gains can be achieved by designing with consideration given...
Karan Goel, Laurel Orr, Nazneen Fatema Rajani, Jesse Vig, Christopher Ré. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.
This paper investigates energy efficiency for various real-world industrial computed tomography reconstruction algorithms, covering both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction depends on performance and problem size. There are many ways to describe energy efficiency, so this work uses multiple metrics, including performance-per-watt, energy-delay product, and energy consumption. We found that irregular approaches realized tremendous savings in energy consumption when compared to CPU...
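For reference, the metrics named above can be computed directly from runtime and average power; the sketch below uses made-up numbers purely to show the definitions (energy = power × time, performance-per-watt = throughput / power, energy-delay product = energy × time):

```python
# Illustrative runs only; times, powers, and problem sizes are made up.
runs = {
    "cpu_baseline": {"time_s": 3600.0, "avg_power_w": 250.0, "voxels": 1e9},
    "gpu_irregular": {"time_s": 120.0,  "avg_power_w": 600.0, "voxels": 1e9},
}

for name, r in runs.items():
    energy_j = r["avg_power_w"] * r["time_s"]                        # energy consumption (J)
    perf_per_watt = (r["voxels"] / r["time_s"]) / r["avg_power_w"]   # voxels/s per watt
    edp = energy_j * r["time_s"]                                     # energy-delay product
    print(f"{name}: E={energy_j:.0f} J, perf/W={perf_per_watt:.1f}, EDP={edp:.2e}")
```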