- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Advanced Memory and Neural Computing
- Neural Networks and Applications
- Software-Defined Networks and 5G
- Embedded Systems Design Techniques
- Advanced Data Storage Technologies
- Advanced Optical Network Technologies
- Fuzzy and Soft Set Theory
- Machine Learning and Data Classification
- Adversarial Robustness in Machine Learning
- Machine Learning and Algorithms
- Network Traffic and Congestion Control
- Data Mining Algorithms and Applications
- Intuitionistic Fuzzy Systems Applications
- Ferroelectric and Negative Capacitance Devices
- Photonic and Optical Devices
- Model Reduction and Neural Networks
- Digital Rights Management and Security
- Stochastic Gradient Optimization Techniques
- Generative Adversarial Networks and Image Synthesis
Google (United States)
2023
Massachusetts Institute of Technology
2013-2018
System Simulation (United Kingdom)
2014
Moscow Institute of Thermal Technology
2013
Switches today provide a small menu of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify the algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign.
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...
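As a concrete aid to the topology claim above, the sketch below enumerates the six wraparound neighbors of a node in a plain 3D torus; the idea that an OCS can rewire these links (for example into a twisted variant) comes from the abstract, while the function name and parameters are illustrative.

```python
# Illustrative only: nodes in a 3D torus of side k have six neighbors
# with wraparound links; an OCS lets the machine rewire these links,
# e.g., into a twisted torus variant, without physical recabling.
def torus_neighbors(x, y, z, k):
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k)
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1))]

print(torus_neighbors(0, 0, 0, k=4))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```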
In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy coherence relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full chip designs. We present SCORPIO, an ordered mesh...
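To make the scalability argument concrete, here is a back-of-the-envelope sketch (not from the paper) of why full-map directories stop scaling: sharer-tracking state per cache line grows linearly with core count, so its relative overhead explodes. The cache-line size and full-map organization are assumptions.

```python
# Back-of-the-envelope sketch: a full-map directory keeps one presence
# bit per core for every cache line, so directory storage relative to
# the data it tracks grows linearly with core count.
def directory_overhead(cores, line_bytes=64):
    """Full-map sharer bits per line, relative to the line's data bits."""
    sharer_bits = cores            # one presence bit per core
    line_bits = line_bytes * 8
    return sharer_bits / line_bits

for cores in (16, 64, 256, 1024):
    print(f"{cores:5d} cores: {directory_overhead(cores):6.1%} overhead per line")
# ~3% at 16 cores, but 200% at 1024: the directory outgrows the data.
```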
The data plane is in a continuous state of flux. Every few months, researchers publish the design of a new high-performance queueing or scheduling scheme that runs inside the network fabric. Many such schemes have been queen for a day, only to be surpassed soon after as methods and evaluation metrics evolve.
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows...
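A minimal sequential model of the programming interface the abstract describes may help: tasks carry programmer-specified timestamps, and a priority queue always dispatches the earliest one. The function names are invented for illustration; real Swarm executes thousands of such tasks speculatively in hardware, which this sketch does not model.

```python
# Sequential model of timestamp-ordered tasks (names are assumptions):
# tasks may create new tasks; the earliest timestamp always runs first.
import heapq

task_queue = []   # entries: (timestamp, seq, fn, args)
_seq = 0

def enqueue(timestamp, fn, *args):
    global _seq
    heapq.heappush(task_queue, (timestamp, _seq, fn, args))
    _seq += 1

def run():
    while task_queue:
        ts, _, fn, args = heapq.heappop(task_queue)
        fn(ts, *args)

# Example: timestamp-ordered BFS, a classic ordered-irregular workload.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist = {}

def visit(ts, node):
    if node not in dist:           # in hardware this check is speculative
        dist[node] = ts
        for nbr in graph[node]:
            enqueue(ts + 1, visit, nbr)

enqueue(0, visit, 0)
run()
print(dist)   # {0: 0, 1: 1, 2: 1, 3: 2}
```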
Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially for a higher number of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic memory footprints, leading to severe memory-boundedness and limited scalability. This work addresses...
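The quadratic footprint is easy to see with a worked example. The head count and fp16 element size below are assumed, not taken from the paper:

```python
# The n-by-n attention score matrix per head dwarfs the inputs as the
# sequence length n grows; sizes below assume 16 heads and fp16.
def score_matrix_mb(seq_len, heads=16, bytes_per_el=2):
    return heads * seq_len * seq_len * bytes_per_el / 2**20

for n in (512, 4096, 32768):
    print(f"n={n:6d}: {score_matrix_mb(n):10.1f} MiB of scores per layer")
# 512 -> 8 MiB, 4096 -> 512 MiB, 32768 -> 32768 MiB (32 GiB)
```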
As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on a chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose the Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology to applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded...
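A rough latency model shows the intended effect: a baseline mesh pays about one router cycle per hop, while SMART covers multiple hops per cycle over its repeated links, up to some maximum. The per-hop cost and the maximum hops-per-cycle value below are illustrative assumptions.

```python
# Rough latency model (assumed numbers, not the paper's results): a
# baseline mesh pays ~1 router cycle per hop; SMART traverses up to
# hpc_max hops in a single cycle over clockless repeated links.
import math

def baseline_cycles(hops, per_hop=1):
    return hops * per_hop

def smart_cycles(hops, hpc_max=8):
    return math.ceil(hops / hpc_max)

for hops in (4, 8, 15):
    print(f"{hops:2d} hops: baseline {baseline_cycles(hops):2d} cycles,"
          f" SMART {smart_cycles(hops)} cycle(s)")
```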
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training...
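For intuition about what a "pruning schedule" is, here is one common family, a gradual cubic sparsity ramp in the style of Zhu and Gupta (2017); whether this exact schedule is among the 80 studied is not stated in the excerpt, and all parameters are illustrative.

```python
# One point in the design space the abstract alludes to: a gradual
# (cubic) sparsity schedule. Start/end steps and the target sparsity
# are illustrative assumptions.
def sparsity_at(step, final_sparsity=0.75, start=1000, end=9000):
    if step < start:
        return 0.0
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity * (1 - (1 - frac) ** 3)

print([round(sparsity_at(s), 3) for s in (0, 3000, 6000, 9000)])
# [0.0, 0.434, 0.71, 0.75]
```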
Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence...
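The parallel-decoding idea can be sketched as follows; the API is hypothetical and does not describe PASTA's internals, only the general pattern of decoding independent spans concurrently once they have been identified.

```python
# Hypothetical sketch, not PASTA's API: once spans of the response are
# marked semantically independent, each can be decoded concurrently
# instead of strictly left to right.
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(prompt, chunk_plan):
    # stand-in for autoregressive decoding of one independent span
    return f"<decoded:{chunk_plan}>"

def parallel_decode(prompt, chunk_plans):
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda p: decode_chunk(prompt, p), chunk_plans)
    return " ".join(parts)

print(parallel_decode("List three sorting algorithms.",
                      ["item 1", "item 2", "item 3"]))
```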
Packet scheduling in switches is not programmable; operators can only choose among a handful of algorithms implemented by the manufacturer. In contrast, other switch functions such as packet parsing and header processing are becoming programmable [10, 3, 6]. This paper presents a scheduler design that allows operators to program a variety of scheduling algorithms.
The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph...
N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, but they primarily focus on low-sparsity regions (~50%). Nonetheless, the performance of models trained using these approaches tends to decline when confronted with high-sparsity...
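For readers unfamiliar with the notation, N:M sparsity keeps the N largest-magnitude weights in every group of M consecutive weights. The sketch below shows 2:4 (50% sparsity) pruning of a weight vector; it is an illustration, not a training recipe from the paper.

```python
# What N:M structured sparsity means in practice: within every group of
# M consecutive weights, keep only the N largest-magnitude ones.
import numpy as np

def prune_n_m(weights, n=2, m=4):
    w = weights.reshape(-1, m)
    mask = np.zeros_like(w, dtype=bool)
    keep = np.argsort(-np.abs(w), axis=1)[:, :n]   # top-n per group
    np.put_along_axis(mask, keep, True, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, -0.3, 0.6])
print(prune_n_m(w))   # [ 0.9  0.   0.4  0.  -0.7  0.   0.   0.6]
```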
Future scalability for kilo-core architectures requires solutions beyond the capabilities of current protocol and software design. Single-cycle multi-hop asynchronous repeated traversal (SMART) creates virtual single-cycle paths across the shared network between cores, potentially offering significant reductions in runtime latency and energy expenditure.
Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a task is created, that denotes the data the task is likely to access. We show it is easy to modify programs to convey locality through hints...
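A minimal sketch of the mechanism, with an assumed interface: because equal hints map to the same tile deterministically, tasks likely to touch the same data serialize locally and their data stays put.

```python
# Sketch of the spatial-hints idea (interface names are assumptions):
# each task carries an abstract integer hint naming the data it will
# likely touch; the scheduler sends equal hints to the same tile.
NUM_TILES = 64

def tile_for(hint: int) -> int:
    return hash(hint) % NUM_TILES   # same hint -> same tile, every time

def schedule(tasks):
    """tasks: iterable of (hint, fn) pairs -> per-tile run queues."""
    queues = {t: [] for t in range(NUM_TILES)}
    for hint, fn in tasks:
        queues[tile_for(hint)].append(fn)
    return queues

q = schedule([(17, lambda: None), (17, lambda: None), (3, lambda: None)])
# both hint-17 tasks land on the same tile's queue
```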
The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reductions in memory footprint while preserving accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate their interaction and assess whether their combination impacts final accuracy. We mathematically prove that applying...
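A toy example shows why the order of the two methods is a genuine question: with an illustrative uniform quantizer and magnitude pruning (neither taken from the paper), prune-then-quantize and quantize-then-prune produce different tensors.

```python
# Order of compression ops matters: on a toy tensor, the two composition
# orders disagree. Quantizer and pruner below are illustrative.
import numpy as np

def quantize(w, step=0.25):
    return np.round(w / step) * step

def prune(w, sparsity=0.5):
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

w = np.array([0.30, -0.11, 0.52, -0.13])
print(quantize(prune(w)))   # [0.25 0.   0.5  0.  ]
print(prune(quantize(w)))   # [0.   0.   0.5  0.  ]: different survivors
```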
Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) supports speculative (transactional) and non-speculative (non-transactional) work, but lacks coordination mechanisms between the two, and is limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of...
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts more pressure on execution resources. These pathologies squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy...
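One plausible reading of a speculation-aware issue policy is sketched below; the exact heuristic SAM uses is not given in this excerpt, so treat the "earliest timestamp first" rule and the data layout as assumptions.

```python
# Hedged sketch of a speculation-aware issue policy in SAM's spirit:
# among a core's SMT threads, prefer the one running the least
# speculative task (earliest timestamp), so work likely to be wasted
# gets fewer issue slots. The heuristic is an assumption, not SAM's spec.
def pick_thread(threads):
    """threads: list of dicts with a 'ready' flag and task 'timestamp'."""
    ready = [t for t in threads if t["ready"]]
    return min(ready, key=lambda t: t["timestamp"]) if ready else None

threads = [{"id": 0, "ready": True, "timestamp": 42},
           {"id": 1, "ready": True, "timestamp": 7}]
print(pick_thread(threads)["id"])   # 1: the earliest task gets priority
```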
Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and our architecture...
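A toy model of the nesting idea, with invented names: each domain orders only its own tasks, and any task may open a fresh subdomain, so independently written speculative algorithms compose.

```python
# Toy model of FRACTAL-style nesting (class and method names assumed):
# timestamps order tasks only within their own domain.
import heapq

class Domain:
    def __init__(self):
        self._q, self._seq = [], 0

    def enqueue(self, ts, fn, *args):
        heapq.heappush(self._q, (ts, self._seq, fn, args))
        self._seq += 1

    def run(self):
        while self._q:
            _, _, fn, args = heapq.heappop(self._q)
            fn(self, *args)   # a task may enqueue here or open a Domain

def outer_task(dom, label):
    inner = Domain()                 # nested domain with its own order
    inner.enqueue(0, lambda d, x=label: print("inner of", x))
    inner.run()

root = Domain()
root.enqueue(10, outer_task, "B")
root.enqueue(5, outer_task, "A")
root.run()   # prints: inner of A, then inner of B
```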
Switches today provide a small set of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify the algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign. Our design builds on the observation that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, ...
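The mechanism this line of work centers on is the push-in first-out queue (PIFO): packets are inserted according to a programmable rank computed on enqueue and always dequeued from the head. The Python below is a behavioral sketch, not the hardware design; the strict-priority rank function is just one example program.

```python
# Behavioral sketch of a push-in first-out queue (PIFO): a programmable
# rank function decides insertion order; dequeue always takes the head.
import heapq

class PIFO:
    def __init__(self, rank_fn):
        self.rank_fn = rank_fn
        self._q, self._seq = [], 0

    def enqueue(self, pkt):
        heapq.heappush(self._q, (self.rank_fn(pkt), self._seq, pkt))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._q)[2]

# Strict priority is one rank function; a fair-queueing scheme would
# instead compute something like a virtual start time on enqueue.
pifo = PIFO(rank_fn=lambda pkt: pkt["prio"])
for p in ({"prio": 2, "id": "a"}, {"prio": 0, "id": "b"}):
    pifo.enqueue(p)
print(pifo.dequeue()["id"])   # "b": the lowest rank leaves first
```

Swapping the rank function reprograms the scheduling algorithm without touching the queue itself, which is the hardware/software split the abstract describes.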