- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Caching and Content Delivery
- Security and Verification in Computing
- Distributed and Parallel Computing Systems
- Peer-to-Peer Network Technologies
- Distributed Systems and Fault Tolerance
- Radiation Effects in Electronics
- Interconnection Networks and Systems
- IoT and Edge/Fog Computing
- Algorithms and Data Compression
- Software System Performance and Reliability
- Software-Defined Networks and 5G
- Advanced Optical Network Technologies
- Ferroelectric and Negative Capacitance Devices
- Advanced Neural Network Applications
- Advanced Memory and Neural Computing
- Advanced Malware Detection Techniques
- Video Analysis and Summarization
- Multimedia Communication and Technology
- Insect Symbiosis and Bacterial Influences
- Low-Power High-Performance VLSI Design
- Green IT and Sustainability
- Personal Information Management and User Behavior
Intel (United States)
2012-2024
Carnegie Mellon University
2010-2023
IBM Research - Thomas J. Watson Research Center
2020-2021
University of Bologna
2020
Fondazione Bruno Kessler
2020
Hasso Plattner Institute
2020
University of Potsdam
2020
Texas Tech University
2020
KTH Royal Institute of Technology
2020
Intel (United Kingdom)
2003-2017
To better understand the challenges in developing effective cloud-based resource schedulers, we analyze the first publicly available trace data from a sizable multi-purpose cluster. The most notable workload characteristic is heterogeneity: in resource types (e.g., cores:RAM per machine) and their usage (e.g., duration and resources needed). Such heterogeneity reduces the effectiveness of traditional slot- and core-based scheduling. Furthermore, some tasks are constrained as to the kind of machine they can use, increasing the complexity...
Data-intensive applications that operate on large volumes of data have motivated a fresh look at the design of data center networks. The first wave of proposals focused on designing pure packet-switched networks that provide full bisection bandwidth. However, these proposals significantly increase network complexity in terms of the number of links and switches required and the restricted rules to wire them up. On the other hand, optical circuit switching technology holds a very large bandwidth advantage over packet switching technology. This fact motivates us...
Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, or processing-in-memory).
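As a minimal sketch of why such operations matter, consider a bitmap index answering a conjunctive query with a single bulk AND over large bit vectors; the names and data below are illustrative, not from the paper:

```python
# Sketch: a bitmap index answers "rows matching all predicates" with one
# bulk bitwise AND. Python ints stand in for long bit vectors.

def bitmap_and(*bitmaps: int) -> int:
    """AND together per-predicate bitmaps; bit i set => row i matches all."""
    result = ~0
    for bm in bitmaps:
        result &= bm
    return result

# Example: rows matching (age_30_39 AND city_pgh) over 8 rows.
age_30_39 = 0b10110100
city_pgh  = 0b11010101
matches = bitmap_and(age_30_39, city_pgh)
print(bin(matches & 0xFF))  # 0b10010100
```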
Cache compression is a promising technique to increase on-chip cache capacity and to decrease off-chip bandwidth usage. Unfortunately, directly applying well-known compression algorithms (usually implemented in software) leads to high hardware complexity and unacceptable decompression/compression latencies, which in turn can negatively affect performance. Hence, there is a need for a simple yet efficient compression technique that effectively compresses common in-cache data patterns and has minimal effect on cache access latency.
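One simple pattern-based idea of the kind the abstract calls for is base-plus-delta encoding: store one base value and small per-word deltas. This is a hedged illustration of the general approach, not the paper's exact algorithm:

```python
# Sketch: compress a cache line as (base, small deltas) when all words
# are numerically close, e.g., nearby pointers or array indices.

def try_base_delta(words, delta_bytes=1):
    """Return (base, deltas) if every word fits in a small signed delta,
    else None (the line stays uncompressed)."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas  # 8B base + len(words) * delta_bytes total
    return None

line = [0x7000_0000 + i * 4 for i in range(8)]  # pointer-like values
print(try_base_delta(line))  # compresses: 8B + 8x1B vs. 64B raw
```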
Several system-level operations trigger bulk data copy or initialization. Even though these operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform them. As a result, these operations consume high latency, bandwidth, and energy, degrading both system performance and energy efficiency.
Energy costs for data centers continue to rise, already exceeding $15 billion yearly. Sadly, much of this power is wasted. Servers are only busy 10-30% of the time on average, but they are often left on while idle, utilizing 60% or more of peak power in the idle state. We introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centers driven by unpredictable, time-varying load, while meeting response time SLAs. AutoScale scales the data center capacity, adding or removing servers as needed. AutoScale has two key...
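A hedged sketch of one ingredient of such a policy: scale up immediately when load rises, but wait out a timeout before releasing an idle server, so setup costs are not paid for transient dips. The parameter names and values are illustrative assumptions, not the paper's:

```python
import time

class CapacityManager:
    def __init__(self, jobs_per_server=10, idle_timeout_s=120.0):
        self.jobs_per_server = jobs_per_server  # assumed per-server capacity
        self.idle_timeout_s = idle_timeout_s    # wait before powering down
        self.active = 1                         # servers currently on
        self.idle_since = None                  # first time we saw excess

    def update(self, current_load, now=None):
        now = time.monotonic() if now is None else now
        needed = max(1, -(-current_load // self.jobs_per_server))  # ceil div
        if needed > self.active:          # scale up right away
            self.active, self.idle_since = needed, None
        elif needed < self.active:        # scale down only after a wait
            if self.idle_since is None:
                self.idle_since = now
            elif now - self.idle_since >= self.idle_timeout_s:
                self.active -= 1          # release one server at a time
                self.idle_since = now
        else:
            self.idle_since = None
        return self.active
```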
Bitwise operations are an important component of modern-day programming, and are used in a variety of applications such as databases. In this work, we propose a new and simple mechanism to implement bulk bitwise AND and OR operations in DRAM, which is faster and more efficient than existing mechanisms. Our mechanism exploits existing DRAM operation to perform the AND/OR of two rows completely within DRAM. The key idea is to simultaneously connect three cells to a bitline before sense amplification. By controlling the value of one of the cells, the sense amplifier forces...
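The logic behind triple-row activation is that the bitline resolves to the majority of the three connected cell values, so fixing the third row to all zeros yields AND of the other two, and all ones yields OR. A pure-Python bit-vector model of that behavior, for illustration only:

```python
# Sketch: activating three DRAM rows at once makes each sense amplifier
# settle to the majority of the three cell values on its bitline.

def triple_row_activate(row_a: int, row_b: int, row_c: int) -> int:
    """Bitwise majority of three rows, as the bitlines would resolve."""
    return (row_a & row_b) | (row_b & row_c) | (row_a & row_c)

A, B = 0b1100, 0b1010
assert triple_row_activate(A, B, 0b0000) == A & B  # control row = 0 -> AND
assert triple_row_activate(A, B, 0b1111) == A | B  # control row = 1 -> OR
```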
We identify a new capability for mobile computing that mimics the opening and closing of a laptop, but avoids the physical transport of hardware. Through rapid and easy personalization and depersonalization of anonymous hardware, a user is able to suspend work at one machine and resume it at another. Our key insight is that this capability can be achieved by layering virtual machine technology on a distributed file system. We report an initial implementation and describe our plans for improving efficiency, portability, and security.
Power-proportional cluster-based storage is an important component of overall cloud computing infrastructure. With it, substantial subsets of nodes in the storage cluster can be turned off to save power during periods of low utilization. Rabbit is a distributed file system that arranges its data layout to provide ideal power-proportionality down to a very minimum number of powered-up nodes (enough to store a primary replica of available datasets). Rabbit also addresses the node failure rates of large-scale clusters with data layouts that minimize the number of nodes that must be powered up if...
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate...
TetriSched is a scheduler that works in tandem with a calendaring reservation system to continuously re-evaluate the immediate-term scheduling plan for all pending jobs (including those with reservations and best-effort jobs) on each scheduling cycle. TetriSched leverages information supplied by the reservation system about jobs' deadlines and estimated runtimes to plan ahead in deciding whether to wait for a busy preferred resource type (e.g., a machine with a GPU) or fall back to less preferred placement options. Plan-ahead affords significant flexibility in handling mis-estimates of job...
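A hedged sketch of the wait-versus-fallback trade-off the abstract describes: given a deadline and an estimated runtime, compare finishing times under both options. The slowdown factor and field names are illustrative assumptions, not TetriSched's actual model:

```python
# Sketch: should a job wait for its preferred (e.g., GPU) machine, or
# start now on a slower fallback? Compare projected finish times.

def choose_placement(now, deadline, est_runtime_preferred,
                     preferred_free_at, fallback_slowdown=2.0):
    finish_wait = max(now, preferred_free_at) + est_runtime_preferred
    finish_fallback = now + est_runtime_preferred * fallback_slowdown
    if finish_wait <= deadline and finish_wait <= finish_fallback:
        return "wait_for_preferred"
    if finish_fallback <= deadline:
        return "fallback_now"
    return "best_effort"  # neither option meets the deadline

print(choose_placement(now=0, deadline=100,
                       est_runtime_preferred=30, preferred_free_at=20))
# -> "wait_for_preferred": waiting finishes at 50, fallback at 60
```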
Open Cirrus is a cloud computing testbed that, unlike existing alternatives, federates distributed data centers. It aims to spur innovation in systems and applications research and to catalyze the development of an open source service stack for the cloud.
Many data structures (e.g., matrices) are typically accessed with multiple access patterns. Depending on the layout of the data structure in the physical address space, some access patterns result in non-unit strides. In existing systems, which are optimized to store and access cache lines, such strided accesses exhibit low spatial locality. Therefore, they incur high latency and waste memory bandwidth and cache space.
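A small illustration of the locality problem: walking a column of a row-major matrix touches a new cache line per element, while a row walk reuses every word of each line. The sizes below are assumptions:

```python
# Sketch: distinct cache lines touched by unit-stride vs. strided access.

LINE_BYTES, WORD_BYTES = 64, 8
N = 512  # N x N matrix of 8-byte values, row-major

def lines_touched(stride_words, count):
    """Distinct cache lines touched by `count` accesses at a fixed stride."""
    return len({(i * stride_words * WORD_BYTES) // LINE_BYTES
                for i in range(count)})

print("row walk:   ", lines_touched(1, N))  # 64 lines, all 8 words used
print("column walk:", lines_touched(N, N))  # 512 lines, 1 useful word each
```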
Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient.
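One way to keep cache-line lookup trivial, sketched here as an illustration rather than the paper's design, is to give every line in a page the same compressed slot size, so the controller locates a line with one multiply instead of walking per-line metadata; the slot sizes and exception region below are assumptions:

```python
# Sketch: fixed-size compressed slots make line addresses linear in the
# line index; incompressible lines spill to an exception region.

PAGE_LINES, SLOT_BYTES = 64, 16  # 64 lines/page, 16B compressed slots

def compressed_line_addr(page_base, line_index, exceptions):
    """Linear address if the line compressed; else its exception slot."""
    if line_index in exceptions:  # incompressible line, stored raw (64B)
        return page_base + PAGE_LINES * SLOT_BYTES + exceptions[line_index] * 64
    return page_base + line_index * SLOT_BYTES  # one multiply, no walk

print(hex(compressed_line_addr(0x10000, 5, exceptions={9: 0})))  # 0x10050
```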
Off-chip main memory has long been a bottleneck for system performance. With increasing memory pressure due to multiple on-chip cores, effective cache utilization is important. In a system with limited cache space, we would ideally like to prevent 1) cache pollution, i.e., blocks with low reuse evicting blocks with high reuse from the cache, and 2) cache thrashing, i.e., blocks with high reuse evicting each other from the cache.
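A hedged sketch of one way to attack both problems: remember recently evicted addresses, insert a block that returns soon after eviction at high priority (it has proven reuse), and insert first-time blocks near the eviction end. This illustrates the general idea, not the paper's mechanism:

```python
# Sketch: reuse-aware insertion guided by a small set of recently
# evicted block addresses.

from collections import OrderedDict

class ReuseAwareCache:
    def __init__(self, capacity=4, track=8):
        self.capacity, self.track = capacity, track
        self.cache = OrderedDict()    # order: LRU first, MRU last
        self.evicted = OrderedDict()  # addresses evicted recently

    def access(self, addr):
        if addr in self.cache:
            self.cache.move_to_end(addr)          # hit: promote to MRU
            return True
        proven_reuse = self.evicted.pop(addr, None) is not None
        if len(self.cache) >= self.capacity:
            victim, _ = self.cache.popitem(last=False)
            self.evicted[victim] = True
            while len(self.evicted) > self.track:
                self.evicted.popitem(last=False)
        self.cache[addr] = True                   # insert at MRU...
        if not proven_reuse:
            self.cache.move_to_end(addr, last=False)  # ...or demote to LRU
        return False
```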
Instruction-grain program monitoring tools, which check and analyze executing programs at the granularity of individual instructions, are invaluable for quickly detecting bugs and security attacks and then limiting their damage (via containment and/or recovery). Unfortunately, their fine-grain nature implies very high overheads for software-only tools, which are typically based on dynamic binary instrumentation. Previous hardware proposals either focus on mechanisms that target specific monitoring tasks or address only part of the cost. In this paper, we...
The Internet Suspend/Resume model of mobile computing cuts the tight binding between PC state and PC hardware. By layering a virtual machine on distributed storage, ISR lets the VM encapsulate execution and user customization state; distributed storage then transports that state across space and time. This article explores the implications of ISR for an infrastructure-based approach to mobile computing. It reports experiences with three versions of ISR and describes work in progress toward the OpenISR version.
While sleep states have existed for mobile devices and workstations for some time, they have not been incorporated into most of the servers in today's data centers. High setup times make data center administrators fearful of any form of dynamic power management, whereby servers are suspended or shut down when load drops. This general reluctance has stalled research into whether there might be a feasible sleep state (with sufficiently low setup overhead and/or power) that would actually be beneficial. This paper investigates the regime in which sleep states are advantageous...
Meeting service level objectives (SLOs) for tail latency is an important and challenging open problem in cloud computing infrastructures. The challenges are exacerbated by the burstiness of the workloads. This paper describes PriorityMeister, a system that employs a combination of per-workload priorities and rate limits to provide tail latency QoS for shared networked storage, even with bursty workloads. PriorityMeister automatically and proactively configures workload priorities and rate limits across multiple stages (e.g., a storage stage followed by a network stage) to meet...
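A hedged sketch of the two knobs the abstract combines: a strict priority order between workloads, plus a per-workload token-bucket rate limit that bounds how much burst a high-priority workload can impose on the stages behind it. Rates, bucket sizes, and function names are illustrative:

```python
# Sketch: strict priority scheduling gated by per-workload token buckets.

class TokenBucket:
    def __init__(self, rate_per_s, burst):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now, cost=1.0):
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def schedule(pending, now):
    """Pick the highest-priority workload still within its rate limit.
    pending: list of (priority, workload_name, bucket); lower = higher."""
    for _prio, workload, bucket in sorted(pending, key=lambda p: p[0]):
        if bucket.allow(now):
            return workload
    return None  # everyone is over limit; idle this slot
```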
We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both their size and their expected future reuse. Second, we show that, in some cases, compressed block size can be used as an indicator of the reuse behavior of a block. We use...
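A minimal sketch of the first idea: score each compressed block by expected future reuse per byte it occupies and evict the lowest-scoring block. The reuse estimates below are stand-ins for whatever predictor supplies them:

```python
# Sketch: Minimal-Value Eviction style victim selection, where a block's
# value to the cache is its expected reuse divided by its compressed size.

def mve_victim(blocks):
    """blocks: list of (tag, compressed_size_bytes, expected_reuse).
    Returns the tag of the block with the least value to the cache."""
    def value(block):
        _tag, size, reuse = block
        return reuse / size  # large blocks must earn their keep
    return min(blocks, key=value)[0]

blocks = [("A", 8, 0.2),   # tiny, rarely reused: cheap to keep
          ("B", 64, 0.9),  # large but hot: worth its space
          ("C", 64, 0.3)]  # large and lukewarm: evicted first
print(mve_victim(blocks))  # -> "C"
```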