- Web Data Mining and Analysis
- Data Quality and Management
- Cloud Computing and Resource Management
- Advanced Database Systems and Queries
- Topic Modeling
- Semantic Web and Ontologies
- IoT and Edge/Fog Computing
- Caching and Content Delivery
- Natural Language Processing Techniques
- Text and Document Classification Technologies
- Spam and Phishing Detection
- Big Data and Business Intelligence
- Multimedia Communication and Technology
- Data Mining Algorithms and Applications
- Blockchain Technology Applications and Security
- Data Management and Algorithms
- Software Engineering Research
- Advanced Text Analysis Techniques
- Distributed and Parallel Computing Systems
- Service-Oriented Architecture and Web Services
- Face and Expression Recognition
- Scientific Computing and Data Management
- Mobile Crowdsensing and Crowdsourcing
- Advanced Neural Network Applications
- Retinal Imaging and Analysis
Microsoft (United States)
2015-2025
Seattle University
2025
University of Washington
2025
University of California, Los Angeles
2025
University of Illinois Urbana-Champaign
2025
Menlo School
2025
Google (United States)
2025
Microsoft Research (United Kingdom)
2017-2023
Universidade Federal do Amazonas
2007-2013
Cloud research to date has lacked data on the characteristics of the production virtual machine (VM) workloads of large cloud providers. A thorough understanding of these characteristics can inform the providers' resource management systems, e.g. VM scheduler, power manager, server health manager. In this paper, we first introduce an extensive characterization of Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption. We then show that certain VM behaviors are fairly consistent over...
Exploring the opportunities to use ML, possible designs, and our experience with Microsoft Azure.
In this paper we propose a knowledge-base approach to help extract the correct components of citations in any given format. Differently from related approaches that rely on manually built knowledge bases (KBs) for recognizing the components of a citation, in our case such a KB is automatically constructed from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). Our approach does not rely on patterns encoding specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning...
In this paper we present a proposal for the implementation and evaluation of a novel method for automatically using data-rich text for filling form-based input interfaces. Our solution takes a data-rich text as input, extracts implicit data values from it, and fills the appropriate fields. For this task, we rely on knowledge obtained from the values of previous submissions for each field, which are freely obtained from the usage of the interfaces. Our approach, called iForm, exploits features related to the content and the style of these values, which are combined through a Bayesian framework. Through extensive...
Cloud platforms remain underutilized despite multiple proposals to improve their utilization (e.g., disaggregation, harvesting, and oversubscription). Our characterization of the resource utilization of virtual machines (VMs) in Azure reveals that, while CPU is the main underutilized resource, we need to provide a solution to manage all resources holistically. We also observe that many VMs exhibit complementary temporal patterns, which can be leveraged to improve the oversubscription of underutilized resources. Based on these insights, we propose Coach: a system...
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other IETS approaches, ONDUX relies on information...
In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records in the form of continuous text (e.g., bibliographic citations, postal addresses, classified ads, etc.) and having no explicit delimiters between them. While in state-of-the-art Information Extraction methods the structure of the data records is manually supplied by the user as a training step, JUDIE is capable of detecting the structure of each individual record being extracted without any...
Cloud providers often have resources that are not being fully utilized, and they may offer them at a lower cost to make up for the reduced availability of these resources. However, customers may be hesitant to use such offerings (such as spot VMs), as making trade-offs between cost and resource availability is not always straightforward. In this work, we propose Snape (Spot On-demand Perfect Mixture), an intelligent framework to optimize the cost and resource availability by dynamically mixing on-demand VMs with spot VMs. Through a detailed characterization based on real...
Today, cloud workloads are essentially opaque to the cloud platform. Typically, the only information the platform receives is the virtual machine (VM) type and possibly a decoration to the type (e.g., the VM is evictable). Similarly, workloads receive little to no information from the platform; generally, workloads might receive telemetry from their VMs or exceptional signals (e.g., shortly before a VM is evicted). The narrow interface between workloads and platforms has several drawbacks: (1) a surge in VM types and decorations in public cloud platforms complicates customer selection; (2) essential workload characteristics (e.g., low...
In this article we present FLUX‐CiM, a novel method for extracting components (e.g., author names, titles, venues, page numbers) from bibliographic citations. Our method does not rely on patterns encoding specific delimiters used in a particular citation style. This feature yields a high degree of automation and flexibility, and allows FLUX‐CiM to extract from citations in any given format. Differently from previous methods that are based on models learned from user‐driven training, our method relies on a knowledge base...
In this poster paper, we present an overview of CienciaBrasil, a research social network involving researchers within the Brazilian INCT program. We describe its architecture and the solutions adopted for data collection, extraction, and deduplication, and for materializing and visualizing the network.
In large enterprises, data discovery is a common problem faced by users who need to find relevant information in relational databases. In this scenario, schema annotation is a useful tool to enrich a database schema with descriptive keywords. In this paper, we demonstrate Barcelos, a system that automatically annotates corporate databases. Unlike existing annotation approaches that use Web-oriented knowledge bases, Barcelos mines enterprise spreadsheets to find candidate annotations. Our experimental evaluation shows that Barcelos produces high-quality annotations; the...
On the web of today, the most prevalent solution for users to interact with data-intensive applications is the use of form-based interfaces composed of several data input fields, such as text boxes, radio buttons, pull-down lists, check boxes, etc. Although these interfaces are popular and effective, in many cases free text interfaces are preferred over form-based ones. In this paper we discuss the proposal and the implementation of a novel IR-based method for using data-rich free text to interact with form-based interfaces. Our solution takes a free text as input, extracts implicit data values from it, and fills the appropriate fields with them. For...
In this article, we present a study about classification methods for large-scale categorization of product offers on e-shopping web sites. We study the performance of previously proposed approaches and deploy a probabilistic approach to model the classification problem. We also study an alternative way of modeling information about the description of product offers and investigate the usage of price and store of product offers as features adopted in the classification process. Our experiments used two collections of over a million product offers categorized by human editors and taxonomies of hundreds of categories from a real e-shopping web site....
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g. postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. We report here partial results from a PhD thesis work in which we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. As other...
Azure Spot Virtual Machines (Spot VMs) utilize unused compute capacity at significant cost savings. They can be evicted when Azure needs the capacity back, and are therefore suitable for workloads that can tolerate interruptions. A good prediction of Spot VM evictions is beneficial for Azure to optimize capacity utilization, and offers users information to better plan Spot VM deployments by selecting clusters to reduce potential evictions. The current in-service cluster-level prediction method ignores the node heterogeneity by aggregating node information. In this paper, we...
We propose a lightweight framework for data exchange that is suitable for non-expert and casual users sharing data on the Web or through peer-to-peer systems. Unlike previous work, we consider a simplistic data model and schema formalism that are suitable for describing typical online data, and algorithms for mapping such schemas as well as for translating the corresponding instances. Our solution requires minimal overhead and setup costs compared to existing data exchange systems, making it very attractive in the Web setting. We report experimental results indicating that our...
With the rapid development of cloud systems, an increasing number of service workloads are deployed in the private and/or public cloud. Although large cloud providers such as Azure and Google have published workload traces in the past, prior work has not focused on analyzing and characterizing the differences between private and public cloud workloads in detail. Based on our experience of working with Azure, one of the most widely used cloud platforms in the world, we find that the characteristics of private and public cloud workloads are different. Specifically, one class of workloads tends to be more homogeneous than the other in both...
Oversubscription is a prevalent practice in cloud services where the system offers more virtual resources, such as virtual cores in virtual machines, to users or applications than its available physical capacity, in order to reduce revenue loss due to unused/redundant capacity. While oversubscription can potentially lead to a significant enhancement in efficient resource utilization, the caveat is that it comes with the risks of overloading and introducing jitter at the level of physical nodes if all the co-located virtual machines have high utilization. Thus...
The use of big data in a business revolves around a monitor-mine-manage (M3) loop: data is monitored in real-time, while mined insights are used to manage the business and derive value. While mining has traditionally been performed offline, recent years have seen an increasing need to perform all phases of M3 in real-time. A stream processing engine (SPE) enables such a seamless M3 loop for applications such as targeted advertising, recommender systems, risk analysis, and call-center analytics. However, these M3 applications require the SPE to maintain...