- Medical Imaging Techniques and Applications
- Advanced X-ray and CT Imaging
- Data Management and Algorithms
- Advanced Database Systems and Queries
- Topic Modeling
- Natural Language Processing Techniques
- Data Quality and Management
- Medical Image Segmentation Techniques
- Advanced MRI Techniques and Applications
- Data Stream Mining Techniques
- Digital Radiography and Breast Imaging
- Radiation Dose and Imaging
- Bayesian Modeling and Causal Inference
- Graph Theory and Algorithms
- Web Data Mining and Analysis
- Adversarial Robustness in Machine Learning
- Data Visualization and Analytics
- Privacy-Preserving Technologies in Data
- Artificial Intelligence in Healthcare and Education
- Biomedical Text Mining and Ontologies
- Machine Learning and Data Classification
- Computational Physics and Python Applications
- Hydrocarbon Exploration and Reservoir Analysis
- Digital Image Processing Techniques
- Geological Modeling and Analysis
Stanford University
2020-2022
Microsoft (United States)
2019-2021
Salesforce (United States)
2021
University of Washington
2014-2020
Microsoft Research (United Kingdom)
2019
Sandia National Laboratories
2013-2014
Sandia National Laboratories California
2012-2014
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems,...
Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs. We find that FMs achieve SoTA on these tasks, even though they were not trained for them. We identify...
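To make the prompting formulation concrete, here is a minimal sketch of casting entity matching (one classical integration task) as a prompt; the `complete` function and the record fields are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal sketch: entity matching framed as a prompting task.
# `complete` is a placeholder for any LLM completion API.

def complete(prompt: str) -> str:
    """Placeholder LLM call; substitute a real completion function here."""
    return "Yes"  # dummy output so the sketch runs end-to-end

def serialize(record: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def entity_match_prompt(a: dict, b: dict) -> str:
    return (
        "Do the two product records refer to the same real-world entity?\n"
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Answer Yes or No: "
    )

a = {"title": "Apple iPhone 13 128GB", "price": "699"}
b = {"title": "iPhone 13 (128 GB)", "price": "699.00"}
prediction = complete(entity_match_prompt(a, b)).strip().lower().startswith("yes")
print("match" if prediction else "no match")
```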
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e., use cases) and metrics (i.e., desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g., question answering for neglected English...
Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt design, we instead ask whether producing multiple effective, yet imperfect,...
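As a rough illustration of combining several imperfect prompts, the sketch below majority-votes three hypothetical templates for a sentiment task; the paper's actual aggregation is more sophisticated (weak supervision rather than a plain vote), and `complete` is a placeholder for any LLM call:

```python
from collections import Counter

def complete(prompt: str) -> str:
    """Placeholder LLM call; swap in a real completion API."""
    return "positive"

# Several imperfect prompt templates for the same sentiment task.
TEMPLATES = [
    "Is the sentiment of this review positive or negative?\n{review}\nAnswer:",
    "Review: {review}\nThe reviewer's feeling is",
    "{review}\nQuestion: Was the customer satisfied? Answer positive or negative:",
]

def aggregate_prediction(review: str) -> str:
    # Query the model once per template, then take a simple majority vote.
    votes = [complete(t.format(review=review)).strip().lower() for t in TEMPLATES]
    normalized = ["positive" if "pos" in v else "negative" for v in votes]
    return Counter(normalized).most_common(1)[0][0]

print(aggregate_prediction("The battery died after two days."))
```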
With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgrounds. Wu-Madden (PVLDB 2013) and Roy-Suciu (SIGMOD 2014) recently proposed solutions that can explain interesting or unexpected answers to simple aggregate queries in terms of predicates on attributes. In this paper, we propose a generic framework to support much richer, insightful explanations by...
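A toy version of the intervention-style explanations this line of work builds on: score a candidate predicate by how much the aggregate answer changes when the matching tuples are removed. The table and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical sales table; names and values are illustrative only.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [120, 80, 400, 390, 60],
})

def intervention_score(df, predicate, agg_col="revenue"):
    """How much does the aggregate drop if tuples matching the predicate are removed?"""
    baseline = df[agg_col].sum()
    without = df.loc[~df.apply(predicate, axis=1), agg_col].sum()
    return baseline - without

# Candidate explanation: rows where product == "A" in the US.
score = intervention_score(df, lambda r: r["region"] == "US" and r["product"] == "A")
print(score)  # contribution of that predicate to total revenue
```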
A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system explicitly grounded in reasoning patterns for disambiguation. We define the core reasoning patterns for disambiguation, create a learning procedure to encourage the model to learn the patterns, and show how weak supervision...
Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now extend beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and add no overhead during execution. We study how to construct diPs for complex expressions and how their usefulness varies with the statistics used to construct them (e.g., distributions). Our results show that building diPs using zone-maps...
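A toy sketch of the idea, assuming a simple min/max (zone-map-style) summary: a filter on one table induces a range predicate on the join key that can prune the joining table before the join. The real system applies diPs inside a big-data query optimizer; the tables and columns below are illustrative only:

```python
import pandas as pd

# Toy tables; a filter on `orders` induces a predicate on `lineitem` via the join key.
orders = pd.DataFrame({"o_orderkey": range(1, 11),
                       "o_orderdate": pd.date_range("2023-01-01", periods=10)})
lineitem = pd.DataFrame({"l_orderkey": [1, 2, 2, 5, 9, 9, 10],
                         "l_quantity": [3, 1, 4, 2, 7, 1, 5]})

# Original predicate on orders.
filtered = orders[orders["o_orderdate"] >= "2023-01-06"]

# Data-induced predicate: min/max of the surviving join keys (a zone-map-style summary).
lo, hi = filtered["o_orderkey"].min(), filtered["o_orderkey"].max()

# Apply the induced range predicate to lineitem before the join, skipping irrelevant rows.
pruned = lineitem[(lineitem["l_orderkey"] >= lo) & (lineitem["l_orderkey"] <= hi)]
result = pruned.merge(filtered, left_on="l_orderkey", right_on="o_orderkey")
print(len(lineitem), "->", len(pruned), "lineitem rows after the induced predicate")
```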
Named entity disambiguation (NED), which involves mapping textual mentions to structured entities, is particularly challenging in the medical domain due to the presence of rare entities. Existing approaches are limited by the coarse-grained structural resources in biomedical knowledge bases as well as the use of training datasets that provide low coverage over uncommon resources. In this work, we address these issues by proposing a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain. We...
The industrial machine learning pipeline requires iterating on model features, training and deploying models, and monitoring deployed models at scale. Feature stores were developed to manage and standardize the engineer's workflow in this end-to-end pipeline, focusing on traditional tabular feature data. In recent years, however, model development has shifted towards using self-supervised pretrained embeddings as model features. Managing these embeddings and the downstream systems that use them introduces new challenges with respect...
We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how it can be used to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights...
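As a small worked example of the maximum-entropy idea: if the summary retains only single-attribute marginal counts, the maximum-entropy joint distribution consistent with those marginals is the independent product, which already yields approximate answers to conjunctive count queries. The attribute names and numbers below are illustrative, not from the paper:

```python
# Hedged sketch: approximate query answering from a marginals-only summary.
carriers = {"AA": 500, "DL": 300, "UA": 200}   # marginal: flights per carrier
delayed  = {"yes": 250, "no": 750}             # marginal: delayed vs. on time
total = sum(carriers.values())

def approx_count(carrier, is_delayed):
    """Approximate COUNT(*) WHERE carrier = ? AND delayed = ? from the summary.

    Under maximum entropy with only these marginals, the attributes are treated
    as independent, so the joint count is the product of marginals over the total.
    """
    return carriers[carrier] * delayed[is_delayed] / total

print(approx_count("AA", "yes"))   # ~125 expected delayed AA flights
```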
While much work has been done on applying GPU technology to computed tomography (CT) reconstruction algorithms, many of these implementations focus on smaller datasets that are better suited for medical applications. This paper proposes an irregular approach to the algorithm design which utilizes the hardware's unique cache structure and employs small x-ray image data prefetches from the host to upload to the GPUs while the devices operate on large contiguous sub-volumes of the reconstruction. This will improve overall cache hit-rates and thus...
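For context, the computation being accelerated is backprojection: every projection view is accumulated into every voxel. The NumPy sketch below shows that inner loop for a simple parallel-beam, nearest-neighbor case; it is purely illustrative and is not the paper's irregular, cache-aware GPU design:

```python
import numpy as np

def backproject(sinogram, angles, size):
    """Naive 2-D parallel-beam backprojection (nearest-neighbor sampling)."""
    recon = np.zeros((size, size), dtype=np.float32)
    xs, ys = np.meshgrid(np.arange(size) - size / 2, np.arange(size) - size / 2)
    for proj, theta in zip(sinogram, angles):
        # Detector coordinate of every voxel for this view.
        t = xs * np.cos(theta) + ys * np.sin(theta) + sinogram.shape[1] / 2
        idx = np.clip(t.astype(int), 0, sinogram.shape[1] - 1)
        recon += proj[idx]          # accumulate this view into the volume
    return recon / len(angles)

angles = np.linspace(0, np.pi, 180, endpoint=False)
sino = np.random.rand(180, 256).astype(np.float32)   # stand-in for measured projections
print(backproject(sino, angles, 256).shape)
```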
Open world database management systems assume tuples not in the database still exist and are becoming an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage apriori population aggregate information to develop and combine two different approaches for automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype...
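A minimal sketch of the sample-reweighting half of the approach, assuming the known population aggregates are per-group counts: each sampled tuple is weighted by population count over sample count for its group, so aggregates over the reweighted sample approximate population answers. Names and numbers below are made up:

```python
import pandas as pd

# Apriori population aggregate information (true counts per region).
population_counts = {"EU": 6000, "US": 4000}

# A biased sample in which US tuples are heavily over-represented.
sample = pd.DataFrame({"region": ["EU"] * 20 + ["US"] * 80,
                       "spend":  [10] * 20 + [20] * 80})

# Weight each tuple so its group is scaled back to its population share.
sample_counts = sample["region"].value_counts()
sample["weight"] = sample["region"].map(lambda r: population_counts[r] / sample_counts[r])

# Approximate the population total spend from the reweighted sample.
print((sample["spend"] * sample["weight"]).sum())
```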
Estimation of the x-ray attenuation properties of an object with respect to the energy emitted from the source is a challenging task for traditional Bremsstrahlung sources. This exploratory work attempts to estimate the attenuation profile over an energy range given a measured profile. Previous work has shown that calculating a single effective attenuation value for a polychromatic source is not accurate due to non-linearities associated with the image formation process. Instead, we completely characterize the imaging system virtually and utilize an iterative search method/constrained optimization...
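The non-linearity referred to above comes from polychromatic image formation: the measured transmission is a spectrum-weighted sum of Beer-Lambert terms, so a single "effective" attenuation value changes with object thickness. A small illustration with made-up spectrum and attenuation values:

```python
import numpy as np

# Illustrative polychromatic source and material (values are made up).
energies = np.array([40.0, 60.0, 80.0, 100.0])   # keV bins
spectrum = np.array([0.4, 0.3, 0.2, 0.1])         # normalized source weights
mu = np.array([0.8, 0.5, 0.35, 0.25])             # attenuation (1/cm) at each energy

def transmission(thickness_cm):
    """Spectrum-weighted Beer-Lambert transmission through the object."""
    return np.sum(spectrum * np.exp(-mu * thickness_cm))

for t in (1.0, 5.0):
    mu_eff = -np.log(transmission(t)) / t          # "effective" value at this thickness
    print(f"t={t} cm -> effective mu = {mu_eff:.3f} 1/cm")  # differs with thickness
```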
We present the motivation, design, implementation, and preliminary evaluation of a service that enables astronomers to study the growth history of galaxies by following their `merger trees' in large-scale astrophysical simulations. The service uses the Myria parallel data management system as its back-end and the D3 visualization library within its graphical front-end. We demonstrate the service at the workshop on a ~5TB dataset.
This exploratory work investigates the feasibility of extracting linear attenuation functions with respect to energy from a multi-channel radiograph of an object of interest composed of a homogeneous material, by simulating the entire imaging system combined with a digital phantom and leveraging this information along with the acquired image. This synergistic combination allows for improved estimates of not only the effective energy, but also the spectrum that is coincident with the detector elements. Material composition identification from radiographs...
Conventional CPU-based algorithms for Computed Tomography (CT) reconstruction lack the computational efficiency necessary to process large, industrial datasets in a reasonable amount of time. Specifically, a single-pass, trillion volumetric pixel (voxel) reconstruction requires months of processing time on a high performance workstation. An optimized, single-workstation multi-GPU approach has shown performance increases of 2-3 orders of magnitude; however, future-size, trillion-voxel reconstructions can still take an entire day to complete....
Although there has been progress in applying GPU technology to Computed Tomography reconstruction algorithms, much of the work has concentrated on optimizing performance for smaller, medical-scale datasets. Industrial CT datasets can vary widely in size and number of projections. With new advancements in high-resolution cameras, it is entirely possible that the community may soon need to pursue a 100-megapixel detector for such applications. To reconstruct such a massive dataset, simply adding extra GPUs would not be an...
This work presents the utilization of the massively multi-threaded environment of graphics processors (GPUs) to improve the computation time needed to reconstruct large computed tomography (CT) datasets, along with the arising challenges for system implementation. Intelligent algorithm design for the GPU differs greatly from traditional CPU algorithm design. Although a brute-force port of an algorithm to a GPU kernel may yield non-trivial performance gains, further measurable gains can be achieved by designing with consideration given...
Karan Goel, Laurel Orr, Nazneen Fatema Rajani, Jesse Vig, Christopher Ré. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.
This paper investigates energy efficiency for various real-world industrial computed tomography reconstruction algorithms, covering both CPU- and GPU-based implementations. This work shows that the energy required for a given reconstruction depends on performance and problem size. There are many ways to describe energy efficiency, so this work uses multiple metrics, including performance-per-watt, energy-delay product, and energy consumption. We found that irregular approaches realized tremendous savings in energy consumption when compared to CPU...
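For reference, the metrics named above can be computed directly from runtime and average power; the sketch below uses made-up numbers purely to show the definitions (energy = power × time, performance-per-watt = throughput / power, energy-delay product = energy × time):

```python
# Illustrative runs only; times, powers, and problem sizes are made up.
runs = {
    "cpu_baseline": {"time_s": 3600.0, "avg_power_w": 250.0, "voxels": 1e9},
    "gpu_irregular": {"time_s": 120.0,  "avg_power_w": 600.0, "voxels": 1e9},
}

for name, r in runs.items():
    energy_j = r["avg_power_w"] * r["time_s"]                        # energy consumption (J)
    perf_per_watt = (r["voxels"] / r["time_s"]) / r["avg_power_w"]   # voxels/s per watt
    edp = energy_j * r["time_s"]                                     # energy-delay product
    print(f"{name}: E={energy_j:.0f} J, perf/W={perf_per_watt:.1f}, EDP={edp:.2e}")
```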