- Advanced Database Systems and Queries
- Semantic Web and Ontologies
- Data Quality and Management
- Data Management and Algorithms
- Web Data Mining and Analysis
- Topic Modeling
- Natural Language Processing Techniques
- Scientific Computing and Data Management
- Service-Oriented Architecture and Web Services
- Sentiment Analysis and Opinion Mining
- Data Mining Algorithms and Applications
- Multimodal Machine Learning Applications
- Personal Information Management and User Behavior
- Misinformation and Its Impacts
- Spam and Phishing Detection
- Peer-to-Peer Network Technologies
- Big Data and Business Intelligence
- Advanced Text Analysis Techniques
- Big Data Technologies and Applications
- Logic, Reasoning, and Knowledge
- Advanced Data Storage Technologies
- Mental Health via Writing
- Biomedical Text Mining and Ontologies
- Geographic Information Systems Studies
- Web Visibility and Informetrics
Alpha Omega Alpha Medical Honor Society
2022-2024
Menlo School
2020-2024
Amazon (United States)
2024
Stanford University
2023
META Health
2022-2023
Cornell University
2023
Georgia Institute of Technology
2022
Meta (United States)
2020-2022
University of Washington
2000-2021
Meta (Israel)
2021
Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and to address it by harnessing the power of data: if other humans engage in tasks that generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from that data.
The practice of crowdsourcing is transforming the Web and giving rise to a new field.
Ontologies play a prominent role on the Semantic Web. They make possible the widespread publication of machine-understandable data, opening myriad opportunities for automated information processing. However, because of the Web's distributed nature, data on it will inevitably come from many different ontologies. Information processing across ontologies is not possible without knowing the semantic mappings between their elements. Manually finding such mappings is tedious, error-prone, and clearly not possible at Web scale. Hence, the development...
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide mappings for a small set of sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information, either...
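The multi-learner idea in this abstract can be illustrated with a toy sketch: several base "learners" each score how well a source column matches a mediated-schema element, and a meta-learner combines their scores. The learners, weights, and scoring rules below are hypothetical stand-ins, not LSD's actual implementation.

```python
def name_learner(source_col, mediated_elem):
    """Score name similarity via a toy Jaccard over character trigrams."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} or {s}
    a, b = grams(source_col["name"].lower()), grams(mediated_elem.lower())
    return len(a & b) / len(a | b)

def content_learner(source_col, mediated_elem):
    """Score data values with a toy rule: numeric columns suggest 'price'."""
    numeric = all(v.replace(".", "", 1).isdigit() for v in source_col["values"])
    return 1.0 if (numeric and mediated_elem == "price") else 0.0

def combined_score(source_col, mediated_elem, weights=(0.6, 0.4)):
    """Meta-learner: a weighted sum of the base learners' predictions."""
    scores = (name_learner(source_col, mediated_elem),
              content_learner(source_col, mediated_elem))
    return sum(w * s for w, s in zip(weights, scores))

def best_match(source_col, mediated_schema):
    """Map a source column to the highest-scoring mediated-schema element."""
    return max(mediated_schema, key=lambda e: combined_score(source_col, e))
```

For example, a source column named `list-price` with numeric values would score higher against a hypothetical mediated element `price` than against `location`, because both learners agree.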
The development of relational database management systems served to focus the data management community for decades, with spectacular results. In recent years, however, the rapidly expanding demands of "data everywhere" have led to a field comprised of interesting and productive efforts, but without a central focus or coordinated agenda. The most acute information management challenges today stem from organizations (e.g., enterprises, government agencies, libraries, "smart" homes) relying on a large number of diverse, interrelated data sources,...
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion tables from Google's general-purpose web crawl, and used statistical classification techniques to find an estimated 154M that contain high-quality relational data. Because each table has its own "schema" of labeled, typed columns, each such table can be considered a small database. The resulting corpus of databases is larger than any other we are aware of, by at...
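The filtering step described above can be sketched with a toy classifier: score an HTML-extracted table on simple features (consistent row widths, typed columns) and keep it only if the score clears a threshold. The features and threshold here are illustrative assumptions, not those of the actual system.

```python
def is_relational(table, threshold=0.5):
    """table: list of rows, each a list of cell strings."""
    if len(table) < 2 or not table[0]:
        return False  # too small to be a data table
    width = len(table[0])
    # Feature 1: every row has the same number of cells.
    consistent = all(len(row) == width for row in table)
    # Feature 2: fraction of columns whose body cells share a type
    # (all numeric or all text), suggesting labeled, typed columns.
    def col_typed(j):
        body = [row[j] for row in table[1:]]
        numeric = [c.replace(".", "", 1).isdigit() for c in body]
        return all(numeric) or not any(numeric)
    typed_frac = sum(col_typed(j) for j in range(width)) / width if consistent else 0.0
    score = 0.5 * consistent + 0.5 * typed_frac
    return score >= threshold
```

A layout table with ragged rows fails feature 1 and is discarded, while a table with a header row over uniformly typed columns passes.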
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed a single class of references that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on...
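A minimal sketch of the reconciliation task, under simplifying assumptions: references are attribute dictionaries, two references are merged when their shared attributes agree, and merges are grouped with a union-find pass. The similarity rule and threshold are hypothetical, far simpler than the paper's propagation of evidence across related classes.

```python
def similarity(ref_a, ref_b):
    """Fraction of attributes defined by both references whose values agree."""
    shared = set(ref_a) & set(ref_b)
    if not shared:
        return 0.0
    return sum(ref_a[k] == ref_b[k] for k in shared) / len(shared)

def reconcile(refs, threshold=0.9):
    """Greedy union-find grouping of references above a similarity threshold."""
    parent = list(range(len(refs)))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i in range(len(refs)):
        for j in range(i + 1, len(refs)):
            if similarity(refs[i], refs[j]) >= threshold:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(len(refs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Note how sparse the evidence can be: two references sharing only an email address are merged even though one of them carries no name at all.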
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the two features in tandem. Fundamentally, lineage enables a simple and consistent representation of uncertain data, it correlates the uncertainty in query results with the uncertainty in the input data, and processing the two together presents computational benefits over...
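A toy rendering of the idea: an uncertain tuple carries a set of mutually exclusive alternatives, and a derived tuple records lineage pointing at the base alternative it came from, so its uncertainty stays correlated with the input. The structures and the `select_eq` "query" below are illustrative only, not ULDB syntax.

```python
# Base uncertain tuple: a sighting that was either a crow or a raven.
saw = {"id": "s1", "alternatives": [("crow",), ("raven",)]}

def select_eq(xtuple, value):
    """Keep alternatives equal to value, recording a lineage pointer."""
    return [(alt, (xtuple["id"], i))          # (data, lineage pointer)
            for i, alt in enumerate(xtuple["alternatives"])
            if alt == (value,)]

result = select_eq(saw, "crow")
# The result tuple exists exactly when alternative 0 of s1 is the true one;
# the lineage pointer makes that dependency explicit instead of losing it.
```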
Intuitively, data management and data integration tools should be well-suited for exchanging information in a semantically meaningful way. Unfortunately, they suffer from two significant problems: they typically require comprehensive schema design before they can be used to store or share information, and they are difficult to extend because schema evolution is heavyweight and may break backwards compatibility. As a result, many small-scale data sharing tasks are more easily facilitated by non-database-oriented tools that have little support...
Semantic integration has been a long-standing challenge for the database community. It has received steady attention over the past two decades, and has now become a prominent area of research. In this article, we first review applications that require semantic integration and discuss the difficulties underlying the process. We then describe recent progress and identify open research issues. We focus in particular on schema matching, a topic that has received much attention in the community, but we also discuss data matching (for example, tuple deduplication) and issues beyond matching...
Creating semantic matches between disparate data sources is fundamental to numerous data sharing efforts. Manually creating matches is extremely tedious and error-prone. Hence many recent works have focused on automating the matching process. To date, however, virtually all of these works deal only with one-to-one (1-1) matches, such as address = location. They do not consider the important class of more complex matches, such as address = concat(city, state) and room-price = room-rate * (1 + tax-rate). We describe the iMAP system, which semi-automatically...
Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences or matches is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, they sometimes perform rather poorly due to a lack of sufficient evidence in the schemas being matched. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts...
This paper explores an inherent tension in modeling and querying uncertain data: simple, intuitive representations of uncertain data capture many application requirements, but these representations are generally incomplete, in that standard operations over the data may result in unrepresentable types of uncertainty. Complete models are theoretically attractive, but they can be nonintuitive and more complex than necessary for many applications. To address this tension, we propose a two-layer approach to managing uncertain data: an underlying logical model that is complete,...
The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal information management. We have proposed dataspaces as an abstraction for these diverse applications, and DataSpace Support Platforms (DSSPs) as the systems that should be built to provide the required...
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each form and adding the resulting pages into a search-engine index. The results of our surfacing have been incorporated into Google and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses...
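The pre-computation step can be sketched as follows: given a form and candidate values for each of its inputs, enumerate submissions from the cross product of those values, producing URLs that can be crawled and indexed like ordinary pages. The form URL, field names, and value lists below are invented for illustration; a real system must also select informative value combinations rather than enumerating blindly.

```python
from itertools import product
from urllib.parse import urlencode

def surface(action_url, candidate_values, limit=100):
    """Enumerate up to `limit` GET submissions for a form."""
    fields = sorted(candidate_values)
    urls = []
    for combo in product(*(candidate_values[f] for f in fields)):
        if len(urls) >= limit:
            break  # a real system budgets submissions per form
        urls.append(action_url + "?" + urlencode(dict(zip(fields, combo))))
    return urls

urls = surface("http://example.com/search",
               {"make": ["honda", "ford"], "year": ["2008", "2009"]})
```

Each generated URL denotes one filled-in form submission, so the two fields with two candidate values apiece yield four surfaced pages.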
The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases, and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching them with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To annotate tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes has very...
The Semantic Web envisions a World Wide Web in which data is described with rich semantics and applications can pose complex queries. To this point, researchers have defined new languages for specifying the meanings of concepts and developed techniques for reasoning about them, using RDF as the underlying data model. For the Semantic Web to flourish, it needs to be able to accommodate the huge amounts of existing data and the applications operating on them. To achieve this, we are faced with two problems. First, most of the world's available data is not in RDF but in XML; XML and the applications consuming it rely not only on the domain structure...
As XML has developed over the past few years, its role has expanded beyond its original domain as a semantics-preserving markup language for online documents, and it is now also the de facto format for interchanging data between heterogeneous systems. Data sources export "views" of their data, and other systems can directly import or query these views. As a result, there has been great interest in languages and systems for expressing queries over XML data, whether it is stored in a repository or generated as a view over some other storage format.