- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Advanced Memory and Neural Computing
- Neural Networks and Applications
- Software-Defined Networks and 5G
- Embedded Systems Design Techniques
- Advanced Data Storage Technologies
- Advanced Optical Network Technologies
- Fuzzy and Soft Set Theory
- Machine Learning and Data Classification
- Adversarial Robustness in Machine Learning
- Machine Learning and Algorithms
- Network Traffic and Congestion Control
- Data Mining Algorithms and Applications
- Intuitionistic Fuzzy Systems Applications
- Ferroelectric and Negative Capacitance Devices
- Photonic and Optical Devices
- Model Reduction and Neural Networks
- Digital Rights Management and Security
- Stochastic Gradient Optimization Techniques
- Generative Adversarial Networks and Image Synthesis
Google (United States)
2023
Massachusetts Institute of Technology
2013-2018
System Simulation (United Kingdom)
2014
Moscow Institute of Thermal Technology
2013
Switches today provide a small menu of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify the algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign.
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...
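As a concrete aid to the topology claim above, the sketch below enumerates the six wraparound neighbors of a node in a plain 3D torus; the idea that an OCS can rewire these links (for example into a twisted variant) comes from the abstract, while the function name and parameters are illustrative.

```python
# Illustrative only: nodes in a 3D torus of side k have six neighbors
# with wraparound links; an OCS lets the machine rewire these links,
# e.g., into a twisted torus variant, without physical recabling.
def torus_neighbors(x, y, z, k):
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k)
            for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1))]

print(torus_neighbors(0, 0, 0, k=4))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```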
In the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy coherence relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full chip designs. We present SCORPIO, an ordered mesh...
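To make the scalability argument concrete, here is a back-of-the-envelope sketch (not from the paper) of why full-map directories stop scaling: sharer-tracking state per cache line grows linearly with core count, so its relative overhead explodes. The cache-line size and full-map organization are assumptions.

```python
# Back-of-the-envelope sketch: a full-map directory keeps one presence
# bit per core for every cache line, so directory storage relative to
# the data it tracks grows linearly with core count.
def directory_overhead(cores, line_bytes=64):
    """Full-map sharer bits per line, relative to the line's data bits."""
    sharer_bits = cores            # one presence bit per core
    line_bits = line_bytes * 8
    return sharer_bits / line_bits

for cores in (16, 64, 256, 1024):
    print(f"{cores:5d} cores: {directory_overhead(cores):6.1%} overhead per line")
# ~3% at 16 cores, but 200% at 1024: the directory outgrows the data.
```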
The data plane is in a continuous state of flux. Every few months, researchers publish the design of a new high-performance queueing or scheduling scheme that runs inside the network fabric. Many such schemes have been queen for a day, only to be surpassed soon after as methods and evaluation metrics evolve.
We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows...
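A minimal sequential model of the programming interface the abstract describes may help: tasks carry programmer-specified timestamps, and a priority queue always dispatches the earliest one. The function names are invented for illustration; real Swarm executes thousands of such tasks speculatively in hardware, which this sketch does not model.

```python
# Sequential model of timestamp-ordered tasks (names are assumptions):
# tasks may create new tasks; the earliest timestamp always runs first.
import heapq

task_queue = []   # entries: (timestamp, seq, fn, args)
_seq = 0

def enqueue(timestamp, fn, *args):
    global _seq
    heapq.heappush(task_queue, (timestamp, _seq, fn, args))
    _seq += 1

def run():
    while task_queue:
        ts, _, fn, args = heapq.heappop(task_queue)
        fn(ts, *args)

# Example: timestamp-ordered BFS, a classic ordered-irregular workload.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist = {}

def visit(ts, node):
    if node not in dist:           # in hardware this check is speculative
        dist[node] = ts
        for nbr in graph[node]:
            enqueue(ts + 1, visit, nbr)

enqueue(0, visit, 0)
run()
print(dist)   # {0: 0, 1: 1, 2: 1, 3: 2}
```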
Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially for a higher number of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic memory footprints, leading to severe memory-boundedness and limited scalability. This work addresses...
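The quadratic footprint is easy to see with a worked example. The head count and fp16 element size below are assumed, not taken from the paper:

```python
# The n-by-n attention score matrix per head dwarfs the inputs as the
# sequence length n grows; sizes below assume 16 heads and fp16.
def score_matrix_mb(seq_len, heads=16, bytes_per_el=2):
    return heads * seq_len * seq_len * bytes_per_el / 2**20

for n in (512, 4096, 32768):
    print(f"n={n:6d}: {score_matrix_mb(n):10.1f} MiB of scores per layer")
# 512 -> 8 MiB, 4096 -> 512 MiB, 32768 -> 32768 MiB (32 GiB)
```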
As technology scales, SoCs are increasing in core counts, leading to the need for scalable NoCs to interconnect the multiple cores on a chip. Given aggressive SoC design targets, NoCs have to deliver low latency and high bandwidth at low power and area overheads. In this paper, we propose the Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC, a NoC that reconfigures and tailors a generic mesh topology to applications at runtime. The heart of our SMART NoC is a novel low-swing clockless repeated link circuit embedded...
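A rough latency model shows the intended effect: a baseline mesh pays about one router cycle per hop, while SMART covers multiple hops per cycle over its repeated links, up to some maximum. The per-hop cost and the maximum hops-per-cycle value below are illustrative assumptions.

```python
# Rough latency model (assumed numbers, not the paper's results): a
# baseline mesh pays ~1 router cycle per hop; SMART traverses up to
# hpc_max hops in a single cycle over clockless repeated links.
import math

def baseline_cycles(hops, per_hop=1):
    return hops * per_hop

def smart_cycles(hops, hpc_max=8):
    return math.ceil(hops / hpc_max)

for hops in (4, 8, 15):
    print(f"{hops:2d} hops: baseline {baseline_cycles(hops):2d} cycles,"
          f" SMART {smart_cycles(hops)} cycle(s)")
```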
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training...
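For intuition about what a "pruning schedule" is, here is one common family, a gradual cubic sparsity ramp in the style of Zhu and Gupta (2017); whether this exact schedule is among the 80 studied is not stated in the excerpt, and all parameters are illustrative.

```python
# One point in the design space the abstract alludes to: a gradual
# (cubic) sparsity schedule. Start/end steps and the target sparsity
# are illustrative assumptions.
def sparsity_at(step, final_sparsity=0.75, start=1000, end=9000):
    if step < start:
        return 0.0
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity * (1 - (1 - frac) ** 3)

print([round(sparsity_at(s), 3) for s in (0, 3000, 6000, 9000)])
# [0.0, 0.434, 0.71, 0.75]
```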
Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence...
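The parallel-decoding idea can be sketched as follows; the API is hypothetical and does not describe PASTA's internals, only the general pattern of decoding independent spans concurrently once they have been identified.

```python
# Hypothetical sketch, not PASTA's API: once spans of the response are
# marked semantically independent, each can be decoded concurrently
# instead of strictly left to right.
from concurrent.futures import ThreadPoolExecutor

def decode_chunk(prompt, chunk_plan):
    # stand-in for autoregressive decoding of one independent span
    return f"<decoded:{chunk_plan}>"

def parallel_decode(prompt, chunk_plans):
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda p: decode_chunk(prompt, p), chunk_plans)
    return " ".join(parts)

print(parallel_decode("List three sorting algorithms.",
                      ["item 1", "item 2", "item 3"]))
```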
Packet scheduling in switches is not programmable; operators can only choose among a handful of algorithms implemented by the manufacturer. In contrast, other switch functions such as packet parsing and header processing are becoming programmable [10, 3, 6]. This paper presents a scheduler design that allows operators to program a variety of scheduling algorithms.
The authors present Swarm, a parallel architecture that exploits ordered parallelism, which is abundant but hard to mine with current software and hardware techniques. Swarm programs consist of short tasks, as small as tens of instructions each, with programmer-specified order constraints. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover enough parallelism. Several techniques allow Swarm to scale to large core counts and speculation windows. The authors evaluate Swarm on graph...
N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity, but they primarily focus on low-sparsity regions (~50%). Nonetheless, the performance of models trained using these approaches tends to decline when confronted with high-sparsity...
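For readers unfamiliar with the notation, N:M sparsity keeps the N largest-magnitude weights in every group of M consecutive weights. The sketch below shows 2:4 (50% sparsity) pruning of a weight vector; it is an illustration, not a training recipe from the paper.

```python
# What N:M structured sparsity means in practice: within every group of
# M consecutive weights, keep only the N largest-magnitude ones.
import numpy as np

def prune_n_m(weights, n=2, m=4):
    w = weights.reshape(-1, m)
    mask = np.zeros_like(w, dtype=bool)
    keep = np.argsort(-np.abs(w), axis=1)[:, :n]   # top-n per group
    np.put_along_axis(mask, keep, True, axis=1)
    return (w * mask).reshape(weights.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, -0.3, 0.6])
print(prune_n_m(w))   # [ 0.9  0.   0.4  0.  -0.7  0.   0.   0.6]
```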
Future scalability for kilo-core architectures requires solutions beyond the capabilities of current protocol and software design. Single-cycle multi-hop asynchronous repeated traversal (SMART) creates virtual single-cycle paths across the shared network between cores, potentially offering significant reductions in runtime latency and energy expenditure.
Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement. While locality-aware parallelism is well studied in non-speculative systems, it has received little attention in speculative systems (e.g., HTM or TLS), which hinders their scalability. We present spatial hints, a technique that leverages program knowledge to reveal and exploit locality in speculative parallel programs. A hint is an abstract integer, given when a task is created, that denotes the data the task is likely to access. We show it is easy to modify programs to convey locality through hints...
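A minimal sketch of the mechanism, with an assumed interface: because equal hints map to the same tile deterministically, tasks likely to touch the same data serialize locally and their data stays put.

```python
# Sketch of the spatial-hints idea (interface names are assumptions):
# each task carries an abstract integer hint naming the data it will
# likely touch; the scheduler sends equal hints to the same tile.
NUM_TILES = 64

def tile_for(hint: int) -> int:
    return hash(hint) % NUM_TILES   # same hint -> same tile, every time

def schedule(tasks):
    """tasks: iterable of (hint, fn) pairs -> per-tile run queues."""
    queues = {t: [] for t in range(NUM_TILES)}
    for hint, fn in tasks:
        queues[tile_for(hint)].append(fn)
    return queues

q = schedule([(17, lambda: None), (17, lambda: None), (3, lambda: None)])
# both hint-17 tasks land on the same tile's queue
```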
The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reductions in memory footprint while preserving accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate their interaction and assess whether their combination impacts final accuracy. We mathematically prove that applying...
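A toy example shows why the order of the two methods is a genuine question: with an illustrative uniform quantizer and magnitude pruning (neither taken from the paper), prune-then-quantize and quantize-then-prune produce different tensors.

```python
# Order of compression ops matters: on a toy tensor, the two composition
# orders disagree. Quantizer and pruner below are illustrative.
import numpy as np

def quantize(w, step=0.25):
    return np.round(w / step) * step

def prune(w, sparsity=0.5):
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k - 1]
    return np.where(np.abs(w) > thresh, w, 0.0)

w = np.array([0.30, -0.11, 0.52, -0.13])
print(quantize(prune(w)))   # [0.25 0.   0.5  0.  ]
print(prune(quantize(w)))   # [0.   0.   0.5  0.  ]: different survivors
```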
Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) supports speculative (transactional) and non-speculative (non-transactional) work, but lacks coordination mechanisms between the two, and is limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of...
This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts more pressure on execution resources. These pathologies squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy...
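One plausible reading of a speculation-aware issue policy is sketched below; the exact heuristic SAM uses is not given in this excerpt, so treat the "earliest timestamp first" rule and the data layout as assumptions.

```python
# Hedged sketch of a speculation-aware issue policy in SAM's spirit:
# among a core's SMT threads, prefer the one running the least
# speculative task (earliest timestamp), so work likely to be wasted
# gets fewer issue slots. The heuristic is an assumption, not SAM's spec.
def pick_thread(threads):
    """threads: list of dicts with a 'ready' flag and task 'timestamp'."""
    ready = [t for t in threads if t["ready"]]
    return min(ready, key=lambda t: t["timestamp"]) if ready else None

threads = [{"id": 0, "ready": True, "timestamp": 42},
           {"id": 1, "ready": True, "timestamp": 7}]
print(pick_thread(threads)["id"])   # 1: the earliest task gets priority
```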
Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and our architecture...
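A toy model of the nesting idea, with invented names: each domain orders only its own tasks, and any task may open a fresh subdomain, so independently written speculative algorithms compose.

```python
# Toy model of FRACTAL-style nesting (class and method names assumed):
# timestamps order tasks only within their own domain.
import heapq

class Domain:
    def __init__(self):
        self._q, self._seq = [], 0

    def enqueue(self, ts, fn, *args):
        heapq.heappush(self._q, (ts, self._seq, fn, args))
        self._seq += 1

    def run(self):
        while self._q:
            _, _, fn, args = heapq.heappop(self._q)
            fn(self, *args)   # a task may enqueue here or open a Domain

def outer_task(dom, label):
    inner = Domain()                 # nested domain with its own order
    inner.enqueue(0, lambda d, x=label: print("inner of", x))
    inner.run()

root = Domain()
root.enqueue(10, outer_task, "B")
root.enqueue(5, outer_task, "A")
root.run()   # prints: inner of A, then inner of B
```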
Switches today provide a small set of scheduling algorithms. While we can tweak scheduling parameters, we cannot modify the algorithmic logic, or add a completely new algorithm, after the switch has been designed. This paper presents a design for a programmable packet scheduler, which allows scheduling algorithms---potentially algorithms that are unknown today---to be programmed into a switch without requiring hardware redesign. Our design builds on the observation that scheduling algorithms make two decisions: in what order to schedule packets and when to schedule them. Further, ...
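The mechanism this line of work centers on is the push-in first-out queue (PIFO): packets are inserted according to a programmable rank computed on enqueue and always dequeued from the head. The Python below is a behavioral sketch, not the hardware design; the strict-priority rank function is just one example program.

```python
# Behavioral sketch of a push-in first-out queue (PIFO): a programmable
# rank function decides insertion order; dequeue always takes the head.
import heapq

class PIFO:
    def __init__(self, rank_fn):
        self.rank_fn = rank_fn
        self._q, self._seq = [], 0

    def enqueue(self, pkt):
        heapq.heappush(self._q, (self.rank_fn(pkt), self._seq, pkt))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._q)[2]

# Strict priority is one rank function; a fair-queueing scheme would
# instead compute something like a virtual start time on enqueue.
pifo = PIFO(rank_fn=lambda pkt: pkt["prio"])
for p in ({"prio": 2, "id": "a"}, {"prio": 0, "id": "b"}):
    pifo.enqueue(p)
print(pifo.dequeue()["id"])   # "b": the lowest rank leaves first
```

Swapping the rank function reprograms the scheduling algorithm without touching the queue itself, which is the hardware/software split the abstract describes.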