- Semantic Web and Ontologies
- Data Quality and Management
- Biomedical Text Mining and Ontologies
- Topic Modeling
- Natural Language Processing Techniques
- Advanced Database Systems and Queries
- Advanced Graph Neural Networks
- Service-Oriented Architecture and Web Services
- Advanced Text Analysis Techniques
- Web Data Mining and Analysis
- Computational Drug Discovery Methods
- Multi-Agent Systems and Negotiation
- Wikis in Education and Collaboration
- AI-based Problem Solving and Planning
- Scientific Computing and Data Management
- Data Management and Algorithms
- Graph Theory and Algorithms
- Species Distribution and Climate Change
- Privacy-Preserving Technologies in Data
- Bioinformatics and Genomic Networks
- Bayesian Modeling and Causal Inference
- Text and Document Classification Technologies
- Data Mining Algorithms and Applications
- Pharmacovigilance and Adverse Drug Reactions
- Data Visualization and Analytics
Alliance for Safe Kids
2024
IBM Research - Thomas J. Watson Research Center
2013-2023
IBM (United States)
2012-2023
University of Toronto
2007-2015
Sharif University of Technology
2005
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplication detection or record linkage, is used as part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system, which provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential...
There is an abundance of information about drugs available on the Web. Data sources range from medicinal chemistry results, over the impact of drugs on gene expression, to the outcomes of drugs in clinical trials. These data are typically not connected together, which reduces the ease with which insights can be gained. Linking Open Drug Data (LODD) is a task force within the World Wide Web Consortium's (W3C) Health Care and Life Sciences Interest Group (HCLS IG). LODD has surveyed publicly available data about drugs, created Linked Data representations of the data sets,...
Although potential drug–drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As...
Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity achieved controversially leads to complexity for data management due to the lack of a schema. In this paper, we present a schema management framework for document stores. This framework discovers and persists schemas of JSON records in a repository, and also supports queries and schema summarization. The major technical challenge comes from the varied structures of records caused by the schema-less data model and schema evolution. In the discovery phase, we apply...
In this paper, we study the problem of answering questions of type "Could X cause Y?" where X and Y are general phrases without any constraints. Answering such questions will assist with various decision analysis tasks, such as verifying and extending presumed causal associations used for decision making. Our goal is to analyze the ability of an AI agent built using state-of-the-art unsupervised methods in answering causal questions derived from collections of cause-effect pairs from human experts. We focus only on unsupervised and weakly supervised methods due to the difficulty of creating a large...
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions...
The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges involved in these two steps and present the methodology used in LinkedCT to overcome these challenges. Our approach for semantic link discovery involves using state-of-the-art approximate string matching...
Discovering links between different data items in a single data source or across different data sources is a challenging problem faced by many information systems today. In particular, the recent Linking Open Data (LOD) community project has highlighted the paramount importance of establishing semantic links among web data sources. Currently, LOD sources provide billions of RDF triples, but only millions of links between data sources. Many of these data sources are published using tools that operate over relational data stored in a standard RDBMS. In this paper, we present a framework for discovery...
A basic step in integration is the identification of linkage points, i.e., finding attributes that are shared (or related) between data sources, and that can be used to match records or entities across data sources. This is usually performed using a match operator that associates attributes of one database to another. However, the massive growth in the amount and variety of unstructured and semi-structured data on the Web has created new challenges for this task. Such data sources often do not have a fixed pre-defined schema and contain large numbers of diverse attributes....
In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the web.
Entity resolution (ER) is the task of finding records that refer to the same real-world entities. A common scenario, which we refer to as Clean-Clean ER, is to resolve records across two clean sources (i.e., they are duplicate-free and contain one record per entity). Matching algorithms for Clean-Clean ER yield bipartite graphs, which are further processed by clustering algorithms to produce the end result. In this paper, we perform an extensive empirical evaluation of eight bipartite graph matching algorithms that take as input a bipartite similarity graph and provide as output a set of matched records. We...