- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- IoT and Edge/Fog Computing
- Parallel Computing and Optimization Techniques
- Software System Performance and Reliability
- Caching and Content Delivery
- Advanced Data Storage Technologies
- Graph Theory and Algorithms
- Software-Defined Networks and 5G
- Network Security and Intrusion Detection
- Advanced Database Systems and Queries
- Big Data and Business Intelligence
- Service-Oriented Architecture and Web Services
- Technology and Security Systems
- Data Management and Algorithms
- Data Stream Mining Techniques
- Artificial Intelligence in Healthcare
- Advanced Computational Techniques and Applications
- Metallurgy and Material Science
Alibaba Group (United States)
2022
Guangzhou Vocational College of Science and Technology
2022
University of Colorado Colorado Springs
2017-2019
Wayne State University
2019
Shanghai University of Electric Power
2018
University of Science and Technology Beijing
2017
Chinese Academy of Sciences
2014-2015
Beijing Institute of Technology
2015
South China University of Technology
2013
Institute of Software
2012
As a rising application paradigm, cloud computing enables resources to be virtualized and shared among applications. In a typical scenario, customers, Service Providers (SPs), and Platform Providers (PPs) are independent participants, and they have their own objectives with different revenues and costs. From the PPs' viewpoint, much research work has reduced costs by optimizing VM placement and deciding when and how to perform migrations. However, some work ignored the fact that balanced use of multi-dimensional resources can affect overall resource...
Datacenter clusters often run data-intensive jobs in parallel to improve resource utilization and cost efficiency. The performance of these jobs is constrained by the cluster's hard-to-scale network bisection bandwidth. Various solutions have been proposed to address this issue; however, most of them do not consider inter-job data dependencies and schedule jobs independently from one another. In this work, we find that aggregating and co-locating the tasks of dependent jobs offers an extra opportunity for locality improvement that can help...
Executing distributed machine learning (ML) jobs on Spark follows the Bulk Synchronous Parallel (BSP) model, where parallel tasks execute the same iteration at the same time and the generated updates must be synchronized on the parameters when all tasks are finished. However, parallel tasks rarely have the same execution time due to sparse data, so the synchronization has to wait for tasks that finish late. Moreover, running Spark on heterogeneous clusters makes it even worse because of stragglers, where the synchronization is significantly delayed by the slowest task.
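The barrier effect described in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names and the task times below are hypothetical, and it only assumes that under BSP an iteration ends when its slowest task finishes.

```python
def bsp_makespan(task_times_per_iteration):
    """Total time under BSP: each iteration waits for its slowest task."""
    return sum(max(times) for times in task_times_per_iteration)


def balanced_makespan(task_times_per_iteration):
    """Hypothetical lower bound if the work in each iteration were spread evenly."""
    return sum(sum(times) / len(times) for times in task_times_per_iteration)


# Made-up per-task execution times (seconds) for two iterations of four tasks.
# The 4.0 in the first iteration plays the role of a straggler.
iterations = [
    [1.0, 1.0, 1.0, 4.0],
    [1.0, 2.0, 1.0, 1.0],
]

print(bsp_makespan(iterations))       # 6.0
print(balanced_makespan(iterations))  # 3.0
```

In this toy input the synchronization barrier doubles the total time relative to perfectly balanced work, which is the gap that straggler-aware scheduling tries to close.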
Datacenters are evolving to host heterogeneous workloads on shared clusters to reduce the operational cost and achieve higher resource utilization. However, it is challenging to schedule heterogeneous workloads with diverse resource requirements and QoS constraints. On the one hand, latency-critical jobs need to be scheduled as soon as they are submitted to avoid any queuing delays. On the other hand, best-effort long jobs should be allowed to occupy the cluster when there are idle resources to improve cluster utilization. The challenge lies in how to minimize the queuing delays of short jobs while maximizing cluster utilization. In this...
Data analytics workloads are shifting to shorter task execution times, a higher degree of parallelism, and faster hardware. As a result, job scheduling is becoming a bottleneck, which needs to offer extremely low latency, massive throughput, and high scalability. However, few efforts have been focused on systematically understanding the scheduling delay. In this paper, we propose a method and develop a tool, SDchecker, that decomposes the scheduling delay into multiple components and characterizes each component by extensive experiments. SDchecker...
As the scale of cloud systems continues to grow, virtualized networks, which provide connectivity between services within and across data centers, are becoming increasingly important to the performance and reliability of the cloud. Despite many advantages, including fast deployment, ease of management, and programmability, virtualized networks require additional layers of abstraction that complicate the monitoring and diagnosis of performance issues compared to traditional networks on physical hardware. Virtualized networks usually connect components in multiple protection domains, such as...
Data-intensive applications often suffer from significant memory pressure, resulting in excessive garbage collection (GC) and out-of-memory (OOM) errors, harming system performance and reliability. In this paper, we demonstrate how lightweight virtualization via OS containers opens up opportunities to address memory pressure and realize memory elasticity: 1) tasks running in a container can be set to a large heap size to avoid OutOfMemory errors, and 2) tasks that are under memory pressure and incur swapping activities can be temporarily "suspended" by depriving...
Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based...
Modern datacenter schedulers apply a static policy to partition resources among different tasks. The amount of allocated resources does not change during a task's lifetime. However, we found that resource usage at runtime is highly dynamic and reaches full usage only at a few moments. Therefore, the static allocation policy doesn't exploit the dynamic nature of resource usage, leading to low system resource utilization. To address this hard problem, the recently proposed task-consolidation approach packs as many tasks as possible on the same node based...
Spark tuning, with its dozens of parameters, is both a challenging and a time-consuming effort for performance improvement. Current techniques rely on trial and error or best guesses, utilizing expert knowledge that very few possess. Previous works are not compatible and also ignore the underlying problem of resource bottlenecks that cause performance issues, a potential ally if bottleneck awareness is leveraged in directing tuning to be more effective. We propose and develop PETS, a new tuning method that allows associated parameters to be tuned at the same time, using bottleneck awareness to adjust...
Spark has become a very attractive platform for big data analytics in recent years due to its unique advantages such as parallelism, fault tolerance, and complexity associated with cluster setup. On the Spark platform, users can adjust parameter configurations according to different job requirements and specific applications to optimize performance. This leads to a problem that we can't ignore: Spark already has more than 180 parameters, and the huge combination of parameters means that users can hardly rely on manual tuning to grasp the impact of all parameters. In...
Understanding and troubleshooting distributed systems in the cloud is considered a very difficult problem because the execution of a single user request is distributed to multiple machines. Further, the multi-tenancy nature of cloud environments introduces interference that causes performance issues. Most existing troubleshooting tools focus either on log analysis or on intrusive tracing methods, leaving resource usage monitoring unexplored.
Computational skewness is a significant challenge in multi-tenant data-parallel clusters that introduce dynamic heterogeneity of machine capacity in distributed data processing. Previous efforts to address skewness mostly focus on batch jobs, based on the assumption that processing time is linearly dependent on the size of partitioned data. However, they are ill-suited for iterative machine learning (ML) jobs, which (1) exhibit a non-linear relationship between parameters and processing time within each iteration, and (2) show an explicit binding between input...
Cloud computing is a new computing model and technology that leverages the efficient pooling of on-demand, self-managed virtual infrastructure. Virtualization packages applications in the form of Virtual Machines (VMs) and provides significant benefits by reconfiguring VMs dynamically. VM reconfiguration is hard and complicated, and existing work addressed the problem with diverse objectives by answering the questions of when to reconfigure, which VMs should be reconfigured, and where to host the VMs. However, we found that runtime affects the total costs...
Out-of-memory (OOM) errors and excessive garbage collection (GC) activities are common issues in data-intensive parallel programs, which cause not only poor performance but also execution failures. A recent study [1] proposed a new programming model to address the memory pressure in data-parallel programs. iTask proactively reclaims memory to avoid OOM errors and reduce GC time. Although effective, it requires extensive changes to the program. In this paper, we show that lightweight virtualization, such as OS...
Open Source Private Cloud (OSPC) is a full-stack private cloud solution based on Stack that helps users enable and manage a private cloud environment. Intel Intelligent Power Node Manager is a platform-resident technology with power and thermal policies. In this paper, we introduce how to manage power in cloud computing effectively by integrating Intel Intelligent Power Node Manager with OSPC, and define resolution as a management policy. Live migration, supported by Xen, is implemented as a new policy to balance the load. In our experiment, Prime95 is used as a benchmark to verify the effectiveness of the policy. The result shows that...
As a rising application paradigm and technology, cloud computing can leverage the efficient pooling of on-demand, self-managed virtual infrastructure. How to maximize resource utilization and how to reduce the cost of configuration are essential issues in cloud computing. In this paper, the authors propose a framework to achieve these objectives by optimizing VM placement and deciding when to perform reconfigurations. The framework uses a vector arithmetic model with the objective of balancing multiple resources, and an optimization method for static VM placement. Then...
As the scale of cloud systems continues to grow, virtualized networks are becoming increasingly important to the performance and reliability of the cloud. Despite many advantages, virtualized networks introduce additional layers of abstraction, and their performance issues are more difficult to monitor and diagnose compared to traditional networks. Furthermore, it is challenging to reason about dynamic virtualized networks. Therefore, there is a great need for fine-grained, user-customizable, and reconfigurable network tracing. To address the above challenges, we propose vNetTracer, an efficient...
Vast amounts of data are collected into the flow measurement system for the primary purpose of trade settlement, and the fairness and justice of the entire trade settlement depend entirely on the integrity of the integrating system. Therefore, the role of the measurement system is very important. The measurement equipment of the traditional system is expensive, but its information release ability is poor and its operation and maintenance are complex. To solve this problem, this paper designs a flow measurement system based on cloud computing technology. The system uses a data centre with Hadoop to manage massive data, and mathematical models to compute application...
A recent line of works applies machine learning techniques to assist or rebuild cost-based query optimizers in DBMS. While exhibiting superiority in some benchmarks, their deficiencies, e.g., unstable performance, high training cost, and slow model updating, stem from the inherent hardness of predicting the cost or latency of execution plans using machine learning models. In this paper, we introduce a learning-to-rank query optimizer, called Lero, which builds on top of a native query optimizer and continuously learns to improve the optimization...
Big Data processing often suffers from significant memory pressure, resulting in excessive garbage collection (GC) and out-of-memory (OOM) errors, harming system performance and reliability. Therefore, users tend to give an excessive heap size to applications to avoid job failure, causing low cluster utilization.