- Parallel Computing and Optimization Techniques
- Distributed Systems and Fault Tolerance
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Embedded Systems Design Techniques
- Caching and Content Delivery
- Optimization and Search Problems
- Intelligent Tutoring Systems and Adaptive Learning
- Advanced Neural Network Applications
- Network Packet Processing and Optimization
- Adversarial Robustness in Machine Learning
- Stochastic Gradient Optimization Techniques
- Computability, Logic, AI Algorithms
- Privacy-Preserving Technologies in Data
- Real-Time Systems Scheduling
- Graph Theory and Algorithms
- Explainable Artificial Intelligence (XAI)
University of Toronto
2011-2024
Massachusetts Institute of Technology
2020
Massachusetts Institute of Technology
2016-2018
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation...
The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph...
Multicores are now ubiquitous, but programmers still write sequential code. Speculative parallelization is an enticing approach to parallelize code while retaining the ease of sequential programming, making parallelism pervasive. However, prior speculative parallelizing compilers and architectures achieved limited speedups due to the high costs of recovering from misspeculation and hardware scalability bottlenecks. We present T4, a compiler that successfully leverages recent architectural features for speculative execution, which enable new...
Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a task is created, that denotes the data the task is likely to access. We show it is easy to modify programs to convey locality through hints...
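As an illustration of the spatial-hints idea (a sketch, not the paper's hardware policy), a hint can steer each new task to the per-core queue whose core likely caches the hinted data:

```python
# Each task carries an abstract integer hint denoting the data it will
# likely access. Tasks with the same hint are routed to the same
# per-core queue, so accesses to that data stay in one core's cache.
# The modulo mapping below is a stand-in for the real routing policy.
NUM_CORES = 4

def route(hint, num_cores=NUM_CORES):
    return hint % num_cores  # simple hash of the hint to a core

queues = [[] for _ in range(NUM_CORES)]

def create_task(fn_name, hint):
    queues[route(hint)].append((fn_name, hint))
```

Two tasks created with the same hint land in the same queue, while tasks with different hints can spread across cores.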
Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential.
Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) supports speculative (transactional) and non-speculative (non-transactional) work, but lacks coordination mechanisms between the two and is limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of...
A Bloom filter is a probabilistic bit-array-based set representation that has recently been applied to address-set disambiguation in systems that ease the burden of parallel programming. However, many of these systems intersect bit-arrays to approximate set intersection and decide disjointness. This contrasts with the conventional, well-studied approach of making individual membership queries into the filter. In this paper we present much-needed models for this unconventional application of testing disjointness using Bloom filters...
Online services in modern datacenters use Remote Procedure Calls (RPCs) to communicate between different software layers. Although RPCs execute just a few small functions, inefficient RPC handling can cause delays that propagate across the system and degrade end-to-end performance. Prior work has reduced RPC processing time to less than 1 $\mu$s, which now shifts the bottleneck to the scheduling of RPCs. Existing schedulers suffer from high overheads, an inability to effectively utilize high-core-count CPUs, or do not...
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts more pressure on execution resources. These pathologies squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy...
As reconfigurable computing hardware, and in particular FPGA-based systems-on-chip, comprises an increasing number of processor and accelerator cores, supporting sharing and synchronization in a way that is scalable and easy to program becomes a challenge. Transactional Memory (TM) is a potential solution to this problem, and an FPGA-based system provides the opportunity to support TM in hardware (HTM). Although there are many proposed approaches to HTM for ASICs, these do not necessarily map well to FPGAs. In this work we demonstrate that while signature-based...
Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture...
Many algorithms schedule their work, or tasks, according to a priority order for correctness or faster convergence. While schedulers commonly implement task enqueue and dequeueMin operations, some algorithms need an update operation that alters the scheduling metadata of a task. Prior software and hardware systems that support schedules with updates compromise on parallelism, work-efficiency, or both, leading to missed performance opportunities. Moreover, incorrectly navigating these compromises violates correctness in those algorithms that are not resilient...
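The enqueue/dequeueMin/update interface described above can be sketched in software with a standard lazy-deletion heap idiom (this is an illustrative sequential sketch, not any of the systems the paper studies):

```python
import heapq

class UpdatableScheduler:
    """Priority task queue with enqueue, dequeueMin, and update.

    update() changes a task's scheduling metadata (its priority) by
    pushing a fresh heap entry; the stale entry is skipped lazily when
    popped. This trades heap space for cheap updates.
    """
    def __init__(self):
        self._heap = []     # (priority, task) entries, possibly stale
        self._live = {}     # task -> current priority

    def enqueue(self, task, prio):
        self._live[task] = prio
        heapq.heappush(self._heap, (prio, task))

    def update(self, task, prio):
        self._live[task] = prio
        heapq.heappush(self._heap, (prio, task))  # old entry goes stale

    def dequeue_min(self):
        while self._heap:
            prio, task = heapq.heappop(self._heap)
            if self._live.get(task) == prio:  # skip stale entries
                del self._live[task]
                return task
        return None
```

For example, a Dijkstra-style relaxation would call `update` to lower a node's tentative distance, and `dequeue_min` would then return that node ahead of tasks it previously trailed.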
The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads. This is the first work, to the extent of our knowledge, to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by this analysis, we present a system that can reduce recovery time and maintain the desired...
The economics of Moore's Law are stumbling, so vendors of many-core architectures are transitioning from single-die monolithic designs to multi-chiplet disintegrated systems within a package. Disintegration lowers cost for the same number of cores but bottlenecks the interconnect. Ideally, disintegration should increase performance per dollar: the cost savings should outweigh the slowdown. Although industry has reported cost savings, the performance penalty is not well studied.
Rust aims to combine safety and performance, and claims to provide fearless concurrency. We present a case study to evaluate the extent to which Rust makes parallel programming easier, by porting programs from the C++-based PBBS benchmark suite to Rust. We find that Rust with Rayon provides fearlessness for regular parallelism but not for irregular parallelism. We introduce Rusty-PBBS: a Rust-based port of PBBS covering both...
No abstract available.