- Advanced Database Systems and Queries
- Data Management and Algorithms
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Advanced Data Storage Technologies
- Data Quality and Management
- Mobile Crowdsensing and Crowdsourcing
- Data Stream Mining Techniques
- Privacy-Preserving Technologies in Data
- Machine Learning and Data Classification
- Distributed and Parallel Computing Systems
- Scientific Computing and Data Management
- Peer-to-Peer Network Technologies
- Semantic Web and Ontologies
- Anomaly Detection Techniques and Applications
- Time Series Analysis and Forecasting
- Caching and Content Delivery
- Graph Theory and Algorithms
- Energy Efficient Wireless Sensor Networks
- Rangeland and Wildlife Management
- Machine Learning and Algorithms
- Algorithms and Data Compression
- Web Data Mining and Analysis
- Big Data and Business Intelligence
- Context-Aware Activity Recognition Systems
Wrightington Hospital
2025
Western Sydney University
2013-2025
University of Chicago
2008-2024
University of Wollongong
2006-2022
University of Illinois Chicago
2019-2020
University of California, Berkeley
2009-2018
University of Toronto
2013-2018
Agency for Toxic Substances and Disease Registry
2018
Global Affairs Canada
2018
Tsinghua University
2017
This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.
We discuss the design of an acquisitional query processor for data collection in sensor networks. Acquisitional issues are those that pertain to where, when, and how often data is physically acquired (sampled) and delivered to query processing operators. By focusing on the locations and costs of acquiring data, we are able to significantly reduce power consumption over traditional passive systems that assume the a priori existence of data. We discuss simple extensions to SQL for controlling data acquisition, and show how acquisitional issues influence query optimization, dissemination,...
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible...
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced rapid growth...
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet...
The development of relational database management systems served to focus the data management community for decades, with spectacular results. In recent years, however, the rapidly-expanding demands of "data everywhere" have led to a field comprised of interesting and productive efforts, but without a central focus or coordinated agenda. The most acute information management challenges today stem from organizations (e.g., enterprises, government agencies, libraries, "smart" homes) relying on a large number of diverse, interrelated data sources,...
Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are...
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation, which are often just as...
Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers, but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which...
If industry visionaries are correct, our lives will soon be full of sensors, connected together in loose conglomerations via wireless networks, each monitoring and collecting data about the environment at large. These sensors behave very differently from traditional database sources: they have intermittent connectivity, are limited by severe power constraints, and typically sample periodically and push immediately, keeping no record of historical information. These limitations make traditional database systems inappropriate for...
We show how the database community's notion of a generic query interface for data aggregation can be applied to ad-hoc networks of sensor devices. As has been noted in the sensor network literature, aggregation is important as a data reduction tool; networking approaches, however, have focused on application specific solutions, whereas our in-network aggregation approach is driven by a general purpose, SQL-style interface that can execute queries over any type of sensor data while providing opportunities for significant optimization. We present a variety of techniques to improve...
To compensate for the inherent unreliability of RFID data streams, most RFID middleware systems employ a "smoothing filter", a sliding-window aggregate that interpolates for lost readings. In this paper, we propose SMURF, the first declarative, adaptive smoothing filter for RFID data cleaning. SMURF models the unreliability of RFID readings by viewing RFID streams as a statistical sample of tags in the physical world, and exploits techniques grounded in sampling theory to drive its cleaning processes. Through the use of tools such as binomial sampling and π-estimators, SMURF continuously adapts the smoothing window...
The increasing ability to interconnect computers through internet-working, wireless networks, high-bandwidth satellite, and cable networks has spawned a new class of information-centered applications based on data dissemination. These applications employ broadcast to deliver data to very large client populations. We have proposed the Broadcast Disks paradigm [Zdon94, Acha95b] for organizing the contents of a data broadcast program and for managing client resources in response to such a program. Our previous work on Broadcast Disks focused exclusively on the "push-based" approach,...
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like...
The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal information management. We have proposed dataspaces as a data management abstraction for these diverse applications and DataSpace Support Platforms (DSSPs) as systems that should be built to provide the required...