- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Low-power high-performance VLSI design
- Cloud Computing and Resource Management
- Embedded Systems Design Techniques
- Radiation Effects in Electronics
- Distributed systems and fault tolerance
- Advanced Memory and Neural Computing
- Real-Time Systems Scheduling
- Graph Theory and Algorithms
- Software System Performance and Reliability
- Ferroelectric and Negative Capacitance Devices
- Semiconductor materials and devices
- Caching and Content Delivery
- Computer Graphics and Visualization Techniques
- Software-Defined Networks and 5G
- Advancements in Semiconductor Devices and Circuit Design
- Embedded Systems and FPGA Applications
- Eosinophilic Disorders and Syndromes
- Sarcoma Diagnosis and Treatment
- VLSI and FPGA Design Techniques
- Real-time simulation and control systems
- Complexity and Algorithms in Graphs
Intel (United Kingdom)
2017-2023
Intel (United States)
2017-2021
Hewlett-Packard (United States)
2019
Ghent University
2008-2017
Ghent University Hospital
2006-2017
Assessing the performance of multiprogram workloads running on multithreaded hardware is difficult because it involves a balance between single-program performance and overall system performance. This article argues for developing multiprogram performance metrics in a top-down fashion, starting from system-level objectives. The authors propose two metrics: average normalized turnaround time, a user-oriented metric, and system throughput, a system-oriented metric.
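As a hedged sketch, the two metrics named above are commonly defined in terms of each program's execution time when run alone versus when co-run on the shared hardware; the function and variable names below are illustrative, not taken from the article:

```python
def antt(alone_times, shared_times):
    """Average normalized turnaround time (lower is better):
    the mean per-program slowdown under multiprogram execution."""
    return sum(s / a for a, s in zip(alone_times, shared_times)) / len(alone_times)

def stp(alone_times, shared_times):
    """System throughput (higher is better, max = number of programs):
    the sum of per-program progress rates relative to isolated execution."""
    return sum(a / s for a, s in zip(alone_times, shared_times))

# Two programs, each exactly twice as slow when co-run:
# ANTT = 2.0 (each job takes 2x longer), STP = 1.0 (no throughput gain).
print(antt([100, 200], [200, 400]))  # 2.0
print(stp([100, 200], [200, 400]))   # 1.0
```

The user-oriented metric averages slowdowns (how long does my job take?), while the system-oriented metric sums progress rates (how much work does the machine complete?).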
Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare...
A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior within its interval. By considering an interval's type and length (measured in instructions), its duration in cycles can be predicted. Overall execution time is then determined by aggregating over all intervals. The model provides several advantages...
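A minimal numeric sketch of interval-style accounting: absent miss events the core sustains its dispatch width, and each miss event adds a characteristic penalty. The penalty values below are hypothetical, not from the article:

```python
def interval_model_cycles(n_insts, width, events):
    """First-order interval model: the base term assumes the core
    sustains its dispatch width; `events` is a list of
    (event_count, penalty_in_cycles) pairs added on top."""
    base = n_insts / width
    return base + sum(count * penalty for count, penalty in events)

# Hypothetical numbers: 1M instructions on a 4-wide core, with
# 5000 branch mispredictions (15-cycle penalty) and
# 2000 last-level cache misses (200-cycle penalty).
cycles = interval_model_cycles(1_000_000, 4, [(5000, 15), (2000, 200)])
print(cycles)               # 725000.0
print(1_000_000 / cycles)   # effective IPC, roughly 1.38 here
```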
A common way of representing processor performance is to use Cycles per Instruction (CPI) `stacks', which break performance into a baseline CPI plus a number of individual miss-event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of the various overlaps among execution and miss events (cache misses, TLB...
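A toy illustration of how a CPI stack is assembled once non-overlapped stall cycles have been attributed to each miss-event type (on out-of-order cores, that attribution is the hard part the paper addresses); names and numbers here are hypothetical:

```python
def cpi_stack(total_cycles, n_insts, miss_cycles):
    """Break total CPI into a base component plus one component per
    miss-event type; miss_cycles maps event name -> stall cycles
    attributed (non-overlapped) to that event."""
    stack = {name: c / n_insts for name, c in miss_cycles.items()}
    stack["base"] = total_cycles / n_insts - sum(stack.values())
    return stack

stack = cpi_stack(1_500_000, 1_000_000,
                  {"branch": 100_000, "L2": 400_000})
# Components sum to the total CPI of 1.5:
# roughly {'branch': 0.1, 'L2': 0.4, 'base': 1.0}
print(stack)
```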
Limit studies on Dynamic Voltage and Frequency Scaling (DVFS) provide apparently contradictory conclusions. On the one hand, early limit studies report that DVFS is effective at large timescales (on the order of million(s) of cycles) with large scaling overheads (on the order of tens of microseconds), and they conclude that there is no need for fine-grained DVFS at small timescales. Recent work on the other hand—motivated by the surge in on-chip voltage regulator research—explores the potential of fine-grained DVFS and reports substantial energy savings at timescales of hundreds of cycles (while assuming...
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models onto FPGAs; these approaches achieve substantial speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation, which takes a completely different approach: it raises the level of abstraction...
This paper presents a fundamental law for parallel performance: it shows that parallel performance is not only limited by sequential code (as suggested by Amdahl's law) but is also fundamentally limited by synchronization through critical sections. Extending Amdahl's software model to include critical sections, we derive the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part. The sequential part is determined by the probability of entering a critical section and the contention probability (i.e., multiple threads wanting to enter the same critical section). This result reveals at least...
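A hedged sketch of such a model, under the simplifying assumption that the contended fraction of critical-section work behaves as sequential code while the rest scales with the thread count; the parameter names are illustrative, not the paper's notation:

```python
def speedup(n, f_seq, f_cs, p_ctn):
    """Amdahl-style speedup on n threads, extended with critical
    sections: of the parallelizable work, a fraction f_cs is inside
    critical sections, and the contended share of it (f_cs * p_ctn)
    is treated as sequential; everything else scales with n."""
    f_par = 1.0 - f_seq
    serial = f_seq + f_par * f_cs * p_ctn        # effectively sequential
    parallel = f_par * (1.0 - f_cs * p_ctn)      # scales with n
    return 1.0 / (serial + parallel / n)

# With no critical sections this reduces to classic Amdahl's law:
print(speedup(16, 0.05, 0.0, 0.0))   # about 9.14 with 5% sequential code
print(speedup(16, 0.05, 0.2, 0.5))   # noticeably lower: contention serializes
```

The point of the law is visible in the second call: even a modest contention probability turns part of the parallel work into an additional sequential term.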
Analyzing multi-threaded programs is quite challenging, but it is necessary to obtain good multicore performance while saving energy. Due to synchronization, certain threads make others wait, because they hold a lock or have yet to reach a barrier. We call these critical threads, i.e., threads whose performance is determinative of program performance as a whole. Identifying critical threads can reveal numerous optimization opportunities, both for the software developer and for the hardware.
Optimizing processors for (a) specific application(s) can substantially improve energy-efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design space exploration tools to optimize for the targeted application(s). Analytical models can be a good fit as they provide fast performance and power estimates and insight into the interaction between an...
This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times each of the threads would have had if executed alone, while they are running simultaneously on the SMT processor. This is done by attributing each cycle to either a base, miss event, or waiting component during multi-threaded execution. Single-threaded alone execution time is then estimated as the sum of the base and miss event components; the waiting component represents the cycle count lost due to SMT execution. The architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded...
Dynamic voltage and frequency scaling (DVFS) is a well-known and effective technique for reducing power consumption in modern microprocessors. An important concern, though, is to estimate its profitability in terms of performance and energy. Current DVFS profitability estimation approaches, however, lack accuracy or incur runtime performance and/or energy overhead. This paper proposes a hardware counter architecture for online DVFS profitability estimation on superscalar out-of-order processors. The counter architecture teases apart the fraction of execution time that is susceptible to clock frequency scaling versus...
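The split that such a counter architecture aims to measure can be illustrated with a simple first-order timing model: the clock-susceptible part of execution time stretches with the clock period, while the memory-bound part does not. Numbers below are hypothetical:

```python
def time_at_freq(t_scale, t_nonscale, f_base, f_new):
    """Estimate execution time at a new clock frequency:
    t_scale stretches with the clock period, t_nonscale
    (memory-bound time) stays constant."""
    return t_scale * (f_base / f_new) + t_nonscale

# Hypothetical workload: 6 ms of clock-scalable time plus 4 ms of
# memory-bound time at 2 GHz. Halving the frequency doubles only
# the scalable part, so runtime grows to 16 ms rather than 20 ms.
print(time_at_freq(6.0, 4.0, 2.0e9, 1.0e9))  # 16.0
```

This is why accurately separating the two components matters: it determines whether a frequency drop costs little performance (memory-bound) or a lot (compute-bound).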
Symbiotic job scheduling boosts simultaneous multithreading (SMT) processor performance by co-scheduling jobs that have `compatible' demands on the processor's shared resources. Existing approaches, however, require a sampling phase, evaluate only a limited number of possible co-schedules, use heuristics to gauge symbiosis, are rigid in their optimization target, and do not preserve system-level priorities/shares. This paper proposes probabilistic job symbiosis modeling, which predicts whether jobs will...
Multi-threaded workloads typically show sublinear speedup on multi-core hardware, i.e., the achieved speedup is not proportional to the number of cores and threads. Sublinear scaling may have multiple causes, such as poorly scalable synchronization leading to spinning and/or yielding, and interference in shared resources such as the last-level cache (LLC) as well as the main memory subsystem. It is vital for programmers and processor designers to understand these bottlenecks in existing and emerging workloads in order to optimize application performance and design future...
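Sublinear scaling can be quantified with a simple efficiency calculation, comparing achieved speedup against ideal linear scaling (the numbers below are illustrative):

```python
def scaling_efficiency(t1, tn, n):
    """Return (speedup, efficiency) for a run on n threads:
    speedup relative to the single-threaded time t1, and the
    fraction of ideal linear scaling actually achieved."""
    speedup = t1 / tn
    return speedup, speedup / n

# Hypothetical: 100 s single-threaded, 20 s on 8 threads.
s, eff = scaling_efficiency(100.0, 20.0, 8)
print(s, eff)  # 5.0 0.625  -> sublinear: only 62.5% of ideal
```

Efficiency well below 1.0 is the symptom; attributing the gap to synchronization versus shared-resource interference is the analysis problem the abstract describes.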
Understanding and analyzing multi-threaded program performance and scalability is far from trivial, which severely complicates parallel software development and optimization. In this paper, we present bottle graphs, a powerful analysis tool that visualizes multi-threaded performance in regards to both per-thread parallelism and execution time. Each thread is represented as a box, with its height equal to the thread's share of total execution time, its width equal to its parallelism, and its area equal to its total running time. The boxes of all threads are stacked upon each other, leading to a stack...
Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process for these processors. Because of the parallel execution of independent instructions in out-of-order processors, design methodologies which subdivide the search space into individual components are unlikely to be effective in terms of accuracy for designing out-of-order processors. In this paper we propose and evaluate...
Despite years of study, branch mispredictions remain a significant performance impediment in pipelined superscalar processors. In general, the misprediction penalty can be substantially larger than the frontend pipeline length (which is often equated with the penalty). We identify and quantify five contributors to the penalty: (i) the frontend pipeline length, (ii) the number of instructions since the last miss event (branch misprediction, I-cache miss, long D-cache miss) - this is related to the burstiness of miss events, (iii) the inherent ILP of the program,...
Analytical processor performance modeling has received increased interest over the past few years. There are basically two approaches to constructing an analytical model: mechanistic modeling and empirical modeling. Mechanistic modeling builds up a model starting from a basic understanding of the underlying system - a white-box approach - whereas empirical modeling constructs a model through statistical inference and machine learning from training data, e.g., using regression or neural networks - a black-box approach. While an empirical model is typically easier to construct, it...
Weighted speedup is nowadays the most commonly used multiprogram workload performance metric. It is a weighted-IPC metric, i.e., the IPC of each program is first weighted with its isolated IPC. Recently, Michaud questioned the validity of weighted-IPC metrics by arguing that they are inconsistent and favor unfairness [4]. Instead, he advocates using the arithmetic or harmonic mean of the raw IPC values of the programs in the workload. We show that weighted-IPC metrics are not inconsistent, and that they are fair by giving equal importance to each program. We argue that, in contrast to raw-IPC metrics, they have...
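The fairness argument can be illustrated numerically: a weighted-IPC metric normalizes each program's multiprogram IPC by that program's own isolated IPC, so every program carries equal weight, whereas a raw-IPC mean is dominated by inherently high-IPC programs. Toy numbers below:

```python
from statistics import mean

def weighted_speedup(alone_ipc, shared_ipc):
    """Sum of per-program normalized IPCs: each program's multiprogram
    IPC is divided by its isolated IPC, giving equal importance to
    every program regardless of its inherent IPC."""
    return sum(s / a for a, s in zip(alone_ipc, shared_ipc))

alone  = [4.0, 0.5]    # a high-IPC program and a low-IPC program
shared = [2.0, 0.25]   # both slowed down by exactly 2x when co-run

print(weighted_speedup(alone, shared))  # 1.0: both halved, equal weight
print(mean(shared))                     # 1.125: dominated by high-IPC job
```

Under the weighted metric, slowing the low-IPC program down further would hurt the score as much as slowing the high-IPC program; under the raw arithmetic mean it would barely register.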
Optimizing processors for specific application(s) can substantially improve energy-efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design space exploration tools to optimize for the targeted application(s). Analytical models can be a good fit as they provide performance estimations and insight into the interaction between an application's...
In modern processors, prefetching is an essential component for hiding long-latency memory accesses. However, prefetching too aggressively can easily degrade performance by evicting useful data from the cache, or by saturating precious memory bandwidth. Tuning the prefetcher's activity is thus an important problem. Existing techniques tend to focus on detecting negative symptoms of aggressive prefetching, such as unused prefetches being evicted or bandwidth saturation, and throttle the prefetcher in response.
Soft error reliability has become a first-order design criterion for modern microprocessors. Architectural Vulnerability Factor (AVF) modeling is often used to capture the probability that a radiation-induced fault in a hardware structure will manifest as an error at the program output. AVF estimation requires detailed microarchitectural simulations, which are time-consuming and typically present only aggregate metrics. Moreover, it requires a large number of simulations to derive insight into the impact of microarchitectural events on AVF. In this work we...
While multicore processors improve overall chip throughput and hardware utilization, resource sharing among the cores leads to unpredictable performance for the individual threads running on the processor. Unpredictable per-thread performance becomes a problem when considered in the context of scheduling: system software assumes that all threads make equal progress; however, this is not what the hardware provides. This may lead to problems at the system level, such as missed deadlines, reduced quality-of-service, non-satisfied service-level...