- Data Management and Algorithms
- Advanced Database Systems and Queries
- Data Mining Algorithms and Applications
- Logic, Reasoning, and Knowledge
- Logic, programming, and type systems
- Data Stream Mining Techniques
- Semantic Web and Ontologies
- Formal Methods in Verification
- Distributed systems and fault tolerance
- Web Data Mining and Analysis
- Software System Performance and Reliability
- Advanced Clustering Algorithms Research
- Bayesian Modeling and Causal Inference
- Peer-to-Peer Network Technologies
- Cloud Computing and Resource Management
- Service-Oriented Architecture and Web Services
- Advanced Data Storage Technologies
- Multimedia Learning Systems
- Decision Support System Applications
- Algorithms and Data Compression
- Data Visualization and Analytics
- Image Retrieval and Classification Techniques
- Educational Technology Systems
- Data Quality and Management
- Internet Traffic Analysis and Secure E-voting
Microsoft (United States)
2024
Sri Manakula Vinayagar Medical College and Hospital
2023
Tata Consultancy Services (India)
2017-2021
Microsoft Research (United Kingdom)
2021
Guru Gobind Singh Indraprastha University
2017-2020
Yahoo (United States)
2006-2012
Yahoo (United Kingdom)
2008-2012
University of Wisconsin–Madison
1997-2010
Yahoo (Spain)
2007-2008
University of Virginia
1999
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and the minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally...
K-Anonymity has been proposed as a mechanism for protecting privacy in microdata publishing, and numerous recoding "models" have been considered for achieving k-anonymity. This paper proposes a new multidimensional model, which provides an additional degree of flexibility not seen in previous (single-dimensional) approaches. Often this flexibility leads to higher-quality anonymizations, as measured both by general-purpose metrics and by more specific notions of query answerability. Optimal multidimensional anonymization is NP-hard (like optimal...
Clustering is an important data mining problem. Most of the earlier work on clustering focussed on numeric attributes, which have a natural ordering on their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received some attention. However, previous algorithms do not give a formal description of the clusters they discover, and some of them assume that the user post-processes the output of the algorithm to identify the final clusters. In this paper, we introduce a novel formalization of a cluster for categorical attributes by...
We introduce the Iceberg-CUBE problem as a reformulation of the datacube (CUBE) problem. The Iceberg-CUBE problem is to compute only those group-by partitions with an aggregate value (e.g., count) above some minimum support threshold. The result of Iceberg-CUBE can be used (1) to answer group-by queries with a clause such as HAVING COUNT(*) >= X, where X is greater than the threshold, (2) for mining multidimensional association rules, and (3) to complement existing strategies for identifying interesting subsets of the CUBE for precomputation. We present a new algorithm (BUC)...
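To make the problem statement concrete, the following is a minimal Python sketch of the Iceberg-CUBE definition itself: it enumerates every group-by over a set of dimensions and keeps only the partitions whose count meets a minimum support. It is a naive enumeration for illustration and is not the BUC algorithm named in the abstract; the function name `iceberg_cube`, the dictionary-based row format, and the sample data are assumptions.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_support):
    """Return {(dims_subset, values): count} for every group-by partition
    whose COUNT(*) is at least min_support (naive enumeration, not BUC)."""
    result = {}
    # Enumerate every subset of the dimensions, i.e. every group-by in the cube.
    for size in range(len(dims) + 1):
        for subset in combinations(dims, size):
            # Count the rows falling into each partition of this group-by.
            counts = Counter(tuple(row[d] for d in subset) for row in rows)
            for values, count in counts.items():
                if count >= min_support:  # equivalent to HAVING COUNT(*) >= X
                    result[(subset, values)] = count
    return result


# Hypothetical sample data, for illustration only.
rows = [
    {"city": "Madison", "item": "pen"},
    {"city": "Madison", "item": "pen"},
    {"city": "Madison", "item": "ink"},
    {"city": "London", "item": "pen"},
]
cube = iceberg_cube(rows, ("city", "item"), min_support=2)
```

With a support threshold of 2, partitions such as (London) or (Madison, ink) are dropped because each covers a single row. The naive version still counts every partition before filtering, which is the cost a purpose-built algorithm would aim to avoid.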
Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: update-in-place systems have unmatched latency but poor write throughput. In contrast, existing log-structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes.
The networking and distributed systems communities have recently explored a variety of new network architectures, both for application-level overlay networks and as prototypes for a next-generation Internet architecture. In this context, we have investigated declarative networking: the use of a distributed recursive query engine as a powerful vehicle for accelerating innovation in network architectures [23, 24, 33]. Declarative networking represents a significant new application area for database research on recursive query processing. In this paper, we address fundamental issues...
Protecting data privacy is an important problem in microdata distribution. Anonymization algorithms typically aim to protect individual privacy, with minimal impact on the quality of the resulting data. While the bulk of previous work has measured quality through one-size-fits-all measures, we argue that quality is best judged with respect to the workload for which the data will ultimately be used. This paper provides a suite of anonymization algorithms that produce an anonymous view based on a target class of workloads, consisting of one or more data mining tasks, as well...
In this paper we extend LDL, a Logic Based Database Language, to include finite sets and negation. The new language is called LDL1. We define the notion of a model and show that a negation-free program need not have a model, and that it may have more than one minimal model. We impose syntactic restrictions in order to define a deterministic language. These restrictions allow only layered (stratified) programs. We prove that for any program satisfying the layering restrictions, there is a minimal model that can be constructed in a bottom-up fashion. Extensions to the basic grouping mechanism are...
Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of the distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects. We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm...
Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes, and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes...
In this paper we survey recent work on incremental data mining model maintenance and change detection under block evolution. In block evolution, a dataset is updated periodically through insertions and deletions of blocks of records at a time. We describe two techniques: (1) a generic algorithm for model maintenance that takes any traditional incremental model maintenance algorithm and transforms it into an algorithm that allows restrictions on a temporal subset of the database, and (2) a generic framework for change detection, which quantifies the difference between two datasets in terms of the data mining models they induce.
We make the case for developing a web of concepts by starting with the current view of the web (comprised of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. The goal of building and maintaining such a web of concepts presents many challenges, but also offers the promise of enabling many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage...
Several graph-based algorithms have been proposed in the literature to compute the transitive closure of a directed graph. We develop two new algorithms (Basic_TC and Global_DFTC) and compare the performance of their implementations in a disk-based environment with a well-known graph-based algorithm proposed by Schmitz. Our algorithms use depth-first search to traverse a graph and a technique called marking to avoid processing some of the arcs in the graph. They compute the closure by processing nodes in reverse topological order, building descendent sets by adding the descendent sets of children. While the details of these algorithms differ considerably, one important...
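The general approach described here (depth-first traversal, then building descendent sets in reverse topological order) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the paper's Basic_TC or Global_DFTC: the graph is assumed to be acyclic, the function name `transitive_closure` and the adjacency-dictionary format are hypothetical, and the check that skips a child already present in the descendent set is only a simple redundancy test loosely in the spirit of marking.

```python
def transitive_closure(graph):
    """graph maps each node to a list of its children; assumed acyclic.
    Returns a dict mapping each node to the set of its descendents."""
    order = []       # DFS post-order, which is a reverse topological order
    visited = set()

    def dfs(node):
        visited.add(node)
        for child in graph.get(node, ()):
            if child not in visited:
                dfs(child)
        order.append(node)  # appended only after all of its descendents

    for node in graph:
        if node not in visited:
            dfs(node)

    descendents = {}
    for node in order:  # children are always processed before their parents
        desc = set()
        for child in graph.get(node, ()):
            if child in desc:
                # Already reached through another child, so this child's
                # descendents are in desc as well; the arc adds nothing.
                continue
            desc.add(child)
            desc |= descendents[child]
        descendents[node] = desc
    return descendents


# Hypothetical example graph: a diamond with arcs a->b, a->c, b->d, c->d.
closure = transitive_closure({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []})
```

Because every child's descendent set is complete before its parent is processed, each node needs only one pass over its outgoing arcs. A cyclic graph would first need its strongly connected components collapsed, which this sketch does not attempt.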
Important properties of users and objects will move from being tied to individual Web sites to being globally available. The conjunction of a global object model with portable user context will lead to richer content structure and introduce significant shifts in online communities and information discovery.
In this article, we consider whether traditional index structures are effective in processing unstable nearest neighbors workloads. It is known that under broad conditions, nearest neighbors workloads become unstable: distances between data points become indistinguishable from each other. We complement this earlier result by showing that if the workload for an application is unstable, you are not likely to be able to index it efficiently using (almost all known) multidimensional index structures. For a broad class of data distributions, we prove that these index structures will do no...
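The instability mentioned here can be observed numerically. The Python sketch below illustrates only the earlier, known effect (distances becoming hard to tell apart as dimensionality grows), not the article's result about index structures. The function name `relative_contrast`, the uniform data distribution, the point count, and the chosen dimensions are all assumptions made for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points, all drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)     # nearest neighbor is much closer than the rest
high_dim = relative_contrast(500)  # all points are at nearly the same distance
```

A small relative contrast means the nearest and farthest points are almost equally far from the query, so pruning based on distance bounds has little to work with.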