Suvinay Subramanian

ORCID: 0000-0002-8715-8964
Research Areas
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Distributed systems and fault tolerance
  • Advanced Neural Network Applications
  • Distributed and Parallel Computing Systems
  • Advanced Memory and Neural Computing
  • Neural Networks and Applications
  • Software-Defined Networks and 5G
  • Embedded Systems Design Techniques
  • Advanced Data Storage Technologies
  • Advanced Optical Network Technologies
  • Fuzzy and Soft Set Theory
  • Machine Learning and Data Classification
  • Adversarial Robustness in Machine Learning
  • Machine Learning and Algorithms
  • Network Traffic and Congestion Control
  • Data Mining Algorithms and Applications
  • Intuitionistic Fuzzy Systems Applications
  • Ferroelectric and Negative Capacitance Devices
  • Photonic and Optical Devices
  • Model Reduction and Neural Networks
  • Digital Rights Management and Security
  • Stochastic Gradient Optimization Techniques
  • Generative Adversarial Networks and Image Synthesis

Google (United States)
2023

Massachusetts Institute of Technology
2013-2018

System Simulation (United Kingdom)
2014

Moscow Institute of Thermal Technology
2013

Switches today provide a small menu of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms, potentially algorithms that are unknown today, to be programmed into a switch without requiring hardware redesign.

10.1145/2934872.2934899 article EN 2016-08-01

In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...

10.1145/3579371.3589350 article EN 2023-06-16

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full chip designs. We present SCORPIO, an ordered mesh...

10.1145/2678373.2665680 article EN ACM SIGARCH Computer Architecture News 2014-06-14

The data plane is in a continuous state of flux. Every few months, researchers publish the design of a new high-performance queueing or scheduling scheme that runs inside the network fabric. Many such schemes have been queen for a day, only to be surpassed soon after as methods (or evaluation metrics) evolve.

10.1145/2535771.2535796 article EN 2013-11-21

We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation...

10.1145/2830772.2830777 article EN 2015-12-05

Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher numbers of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic memory footprints, leading to severe memory-boundedness and limited scalability. This work addresses...

10.1145/3575693.3575747 article EN 2023-01-27

As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded...

10.5555/2485288.2485371 article EN Design, Automation, and Test in Europe 2013-03-18

As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on the chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology for SoC applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded...

10.7873/date.2013.080 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015 2013-01-01

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training...

10.48550/arxiv.2501.12486 preprint EN arXiv (Cornell University) 2025-01-21

Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence...

10.48550/arxiv.2502.11517 preprint EN arXiv (Cornell University) 2025-02-17

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full chip designs.

10.1109/isca.2014.6853232 article EN 2014-06-01

Packet scheduling in switches is not programmable; operators only choose among a handful of scheduling algorithms implemented by the manufacturer. In contrast, other switch functions such as packet parsing and header processing are becoming programmable [10, 3, 6]. This paper presents a programmable packet scheduler that allows operators to program a variety of scheduling algorithms.

10.1145/2834050.2834106 article EN 2015-11-09

The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph...

10.1109/mm.2016.12 article EN IEEE Micro 2016-03-18

N:M structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, but they primarily focus on low-sparsity regions (~50%). Nonetheless, the performance of models trained using these approaches tends to decline when confronted with high-sparsity...

10.48550/arxiv.2402.04744 preprint EN arXiv (Cornell University) 2024-02-07

Future scalability for kilo-core architectures requires solutions beyond the capabilities of protocol and software design. Single-cycle multihop asynchronous repeated traversal (SMART) creates virtual single-cycle paths across the shared network between cores, potentially offering significant reductions in runtime latency and energy expenditure.

10.1109/mc.2013.260 article EN Computer 2013-07-19

Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a task is created, that denotes the data that the task is likely to access. We show it is easy to modify programs to convey locality through hints....

10.1109/micro.2016.7783708 article EN 2016-10-01

Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a task is created, that denotes the data that the task is likely to access. We show it is easy to modify programs to convey locality through hints....

10.5555/3195638.3195644 article EN International Symposium on Microarchitecture 2016-10-15

In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...

10.48550/arxiv.2304.01433 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential.

10.1145/3079856.3080218 article EN 2017-06-24

The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate the interaction between these two methods and assess whether their combination impacts final model accuracy. We mathematically prove that applying...

10.48550/arxiv.2405.20935 preprint EN arXiv (Cornell University) 2024-05-31

Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., parallel I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) systems support speculative (transactional) and non-speculative (non-transactional) work, but lack coordination mechanisms between the two, and are limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of...

10.1109/micro.2018.00026 article EN 2018-10-01

This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy...

10.1109/pact.2017.37 article EN 2017-09-01

Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture...

10.1145/3140659.3080218 article EN ACM SIGARCH Computer Architecture News 2017-06-24

Switches today provide a small set of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms, potentially algorithms that are unknown today, to be programmed into a switch without requiring hardware redesign. Our design builds on the observation that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further,...

10.48550/arxiv.1602.06045 preprint EN other-oa arXiv (Cornell University) 2016-01-01