NFDI4DS | UHH-SEMS - Publication Details

DRILL

OPENALEX - Publications

Soudeh Ghorbani, Zibin Yang, P. Brighten Godfrey, Yashar Ganjali, Amin Firoozshahian

The trend towards simple datacenter network fabric strips most network functionality, including load balancing, out of the network core and pushes it to the edge. This slows reaction to microbursts, the main culprit of packet loss in datacenters. We investigate the opposite direction: could slightly smarter fabric significantly improve load balancing? This paper presents DRILL, a datacenter fabric for Clos networks which performs micro load balancing to distribute load as evenly as possible on microsecond timescales. DRILL employs per-packet decisions at each switch based...

10.1145/3098822.3098839 article EN 2017-08-04

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

OPENALEX - Publications

Liu Ke, Udit Gupta, Benjamin Youngjae Cho, David Brooks, Vikas Chandra and 16 more

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator...

10.1109/isca45697.2020.00070 article EN 2020-05-01
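
To make the "memory-bound sparse embedding operations" in the abstract above concrete, here is a minimal Python sketch of one common form of such an operation: gathering rows from an embedding table and summing them per request. The function name `pooled_embedding`, its parameters, and the toy table are hypothetical and are not taken from the paper; the sketch only shows why the access pattern is irregular (rows are selected by data-dependent indices).

```python
def pooled_embedding(table, indices, lengths):
    """Gather rows of `table` by index and sum them per request.

    Hypothetical illustration: `indices` is a flat list of row ids and
    `lengths[i]` says how many consecutive ids belong to request i.
    """
    dim = len(table[0])
    pooled = []
    pos = 0
    for count in lengths:
        acc = [0.0] * dim
        # Each lookup touches an arbitrary row, so accesses are irregular.
        for row_id in indices[pos:pos + count]:
            row = table[row_id]
            for j in range(dim):
                acc[j] += row[j]
        pooled.append(acc)
        pos += count
    return pooled


# Toy usage: two requests, the first pools rows 0 and 2, the second row 1.
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = pooled_embedding(table, indices=[0, 2, 1], lengths=[2, 1])
```

Almost all of the work here is reading table rows, with only one addition per element read, which is why operations of this kind tend to be limited by memory bandwidth rather than compute.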

Comparing memory systems for chip multiprocessors

OPENALEX - Publications

Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz and 1 more

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally...

10.1145/1250662.1250707 article EN 2007-06-09

Micro Load Balancing in Data Centers with DRILL

OPENALEX - Publications

Soudeh Ghorbani, Brighten Godfrey, Yashar Ganjali, Amin Firoozshahian

The trend towards simple data center network fabric strips most network functionality, including load balancing capabilities, out of the network core and pushes them to the edge. We investigate a different direction of incorporating minimal intelligence into the network fabric and show that this slightly smarter fabric significantly enhances performance. We provide a very simple in-network scheduling algorithm called DRILL which is purely local to each switch. DRILL leverages load sensing and randomization concepts to distribute load among multiple paths. Through simulation, we...

10.1145/2834050.2834107 article EN 2015-11-09
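
As a rough illustration of combining load sensing with randomization, as the abstract above describes, the following Python sketch chooses an output queue per packet by sampling a few queues at random and taking the least occupied one. This is a generic "sample and pick the shortest" scheme under stated assumptions, not the paper's exact algorithm; the names `pick_queue` and `simulate`, the sample size `d`, and the no-draining queue model are all hypothetical.

```python
import random


def pick_queue(queue_lengths, d=2, rng=random):
    """Return the index of the shortest queue among d randomly sampled ones.

    Hypothetical illustration: `queue_lengths` holds occupancies that are
    local to one switch, and `d` is an assumed sample size.
    """
    d = min(d, len(queue_lengths))
    candidates = rng.sample(range(len(queue_lengths)), d)
    return min(candidates, key=lambda i: queue_lengths[i])


def simulate(num_queues=8, num_packets=1000, d=2, seed=0):
    """Enqueue packets one at a time (no draining) and return occupancies."""
    rng = random.Random(seed)
    queues = [0] * num_queues
    for _ in range(num_packets):
        queues[pick_queue(queues, d, rng)] += 1
    return queues


# Sampling every queue (d == num_queues) always picks a least-loaded queue.
balanced = simulate(d=8)
```

The decision uses only state available at the switch making it, which matches the "purely local" property the abstract emphasizes; how many queues to sample is a tuning choice in this sketch.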

MTIA: First Generation Silicon Targeting Meta's Recommendation Systems

OPENALEX - Publications

Amin Firoozshahian, Joel Coburn, Roman Levenstein, Rakesh Nattoji, Ashwin Kamath and 51 more

Meta has traditionally relied on using CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards specialized solutions such as GPUs or other hardware accelerators. This paper describes the company's effort in constructing its first silicon designed for recommendation systems; it describes the accelerator architecture and platform design, and the software stack for enabling and optimizing...

10.1145/3579371.3589348 article EN 2023-06-16

SI-TM

OPENALEX - Publications

Heiner Litz, David R. Cheriton, Amin Firoozshahian, Omid Azizi, John P. Stevenson

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2541940.2541952 article EN 2014-02-24

HICAMP

OPENALEX - Publications

David R. Cheriton, Amin Firoozshahian, Alex Solomatnikov, John P. Stevenson, Omid Azizi

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2150976.2151007 article EN 2012-03-03

Comparing memory systems for chip multiprocessors

OPENALEX - Publications

Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz and 1 more

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally...

10.1145/1273440.1250707 article EN ACM SIGARCH Computer Architecture News 2007-06-09

Verification of chip multiprocessor memory systems using a relaxed scoreboard

OPENALEX - Publications

Ofer Shacham, Megan Wachs, Alex Solomatnikov, Amin Firoozshahian, Stephen Richardson and 1 more

Verification of chip multiprocessor memory systems remains challenging. While formal methods have been used to validate protocols, simulation is still the dominant method used to validate memory system implementation. Having a memory scoreboard, a high-level model of the memory, greatly aids simulation based validation, but accurate scoreboards are complex to create since often they depend not only on the memory and consistency model but also on its specific implementation. This paper describes a methodology of using a relaxed scoreboard, which reduces the complexity of creating these memory models. The scoreboard...

10.1109/micro.2008.4771799 article EN 2008-11-01

A memory system design framework

OPENALEX - Publications

Amin Firoozshahian, Alex Solomatnikov, Ofer Shacham, Zain Asgar, Stephen Richardson and 2 more

As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines: request-tracking, state-manipulation, and data...

10.1145/1555754.1555805 article EN 2009-06-20

Comparative evaluation of memory models for chip multiprocessors

OPENALEX - Publications

Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz and 1 more

There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16...

10.1145/1455650.1455651 article EN ACM Transactions on Architecture and Code Optimization 2008-11-01

Chip multi-processor generator

OPENALEX - Publications

Alex Solomatnikov, Amin Firoozshahian, Wajahat Qadeer, Ofer Shacham, Kyle Kelley and 4 more

The drive for low-power, high performance computation coupled with the extremely high design costs for ASIC designs, has driven a number of designers to try to create a flexible, universal computing platform that will supersede the microprocessor. We argue that these general chips are trying to accomplish more than is commercially needed. Since NRE costs are an order of magnitude larger than fabrication costs, a two-step system seems attractive. First, the users configure/program a flexible framework to run their application with the desired performance....

10.1145/1278480.1278544 article EN Proceedings - ACM IEEE Design Automation Conference 2007-01-01

Using a configurable processor generator for computer architecture prototyping

OPENALEX - Publications

Alex Solomatnikov, Amin Firoozshahian, Ofer Shacham, Zain Asgar, Megan Wachs and 3 more

Building hardware prototypes for computer architecture research is challenging. Unfortunately, development of the required software tools (compilers, debuggers, runtime) is even more challenging, which means these systems rarely run real applications. To overcome this issue, when developing our prototype platform, we used the Tensilica processor generator to produce a customized processor and corresponding software tools and libraries. While this base processor was very different from the streamlined custom processor we initially imagined, it allowed us...

10.1145/1669112.1669159 article EN 2009-12-12

Sparse matrix-vector multiply on the HICAMP architecture

OPENALEX - Publications

John P. Stevenson, Amin Firoozshahian, Alex Solomatnikov, Mark Horowitz, David R. Cheriton

Sparse matrix-vector multiply (SpMV) is a critical task in the inner loop of modern iterative linear system solvers and exhibits very little data reuse. This low reuse means that its performance is bounded by main-memory bandwidth. Moreover, the random patterns of indirection make it difficult to achieve this bound. We present sparse matrix storage formats based on deduplicated memory. These formats reduce memory traffic during SpMV and thus show significantly improved performance bounds: 90x better in the best case. Additionally,...

10.1145/2304576.2304603 article EN 2012-06-25
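
For readers unfamiliar with SpMV, the sketch below shows the operation over the conventional compressed sparse row (CSR) layout in Python. It uses plain CSR purely as a baseline illustration of the indirection and low data reuse the abstract above mentions; it does not implement the deduplicated-memory formats the paper proposes. The function name `spmv_csr` and the toy matrix are hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row r owns nonzeros row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # x is read through col_idx: an indirect, data-dependent access.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Toy 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times x = [1, 2, 3].
y = spmv_csr(
    values=[1.0, 2.0, 3.0, 4.0, 5.0],
    col_idx=[0, 2, 1, 0, 2],
    row_ptr=[0, 2, 3, 5],
    x=[1.0, 2.0, 3.0],
)
```

Each nonzero value and column index is read exactly once and used for a single multiply-add, so the loop does little arithmetic per byte fetched, which is the low-reuse behaviour that ties performance to memory bandwidth.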

Efficient, Fully Local Algorithms for CIOQ Switches

OPENALEX - Publications

Amin Firoozshahian, Vahideh Manshadi, Ashish Goel, Balaji Prabhakar

A number of algorithms have been proposed in the literature for scheduling CIOQ switches. The algorithms which have been proven to provide strict performance guarantees on delay (via emulation of an output-queued switch) are too complicated to implement because they require the exchange of a large amount of information between inputs and outputs. With implementation as our primary focus, we consider algorithms that are "fully local." This means that inputs and outputs must be able to make decisions regarding matchings using only local information (except requests, grants...

10.1109/infcom.2007.307 article EN 2007-01-01

Bringing up a chip on the cheap

OPENALEX - Publications

Megan Wachs, Ofer Shacham, Zain Asgar, Amin Firoozshahian, Stephen Richardson and 1 more

Booting and debugging the functionality of silicon samples are known to be challenging and time-consuming tasks, even more so in cost-constrained environments. The authors describe their creative solutions used to bring up Stanford Smart Memories (SSM), a 55-million transistor research chip.

10.1109/mdt.2011.2179849 article EN IEEE Design & Test of Computers 2011-12-15

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

OPENALEX - Publications

Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Youngjae Cho, Mark Hempstead and 16 more

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator-...

10.48550/arxiv.1912.12953 preprint EN other-oa arXiv (Cornell University) 2019-01-01

HICAMP

OPENALEX - Publications

David R. Cheriton, Amin Firoozshahian, Alex Solomatnikov, John P. Stevenson, Omid Azizi

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2248487.2151007 article EN ACM SIGPLAN Notices 2012-03-03

SI-TM

OPENALEX - Publications

Heiner Litz, David R. Cheriton, Amin Firoozshahian, Omid Azizi, John P. Stevenson

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2654822.2541952 article EN ACM SIGARCH Computer Architecture News 2014-02-24

SI-TM

OPENALEX - Publications

Heiner Litz, David R. Cheriton, Amin Firoozshahian, Omid Azizi, John P. Stevenson

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2644865.2541952 article EN ACM SIGPLAN Notices 2014-02-24

A memory system design framework

OPENALEX - Publications

Amin Firoozshahian, Alex Solomatnikov, Ofer Shacham, Zain Asgar, Stephen Richardson and 2 more

As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines: request-tracking, state-manipulation, and data...

10.1145/1555815.1555805 article EN ACM SIGARCH Computer Architecture News 2009-06-15

HICAMP

OPENALEX - Publications

David R. Cheriton, Amin Firoozshahian, Alex Solomatnikov, John P. Stevenson, Omid Azizi

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2189750.2151007 article EN ACM SIGARCH Computer Architecture News 2012-03-03

Chip Multi-Processor Generator

OPENALEX - Publications

Alex Solomatnikov, Amin Firoozshahian, Wajahat Qadeer, Ofer Shacham, Kyle Kelley and 4 more

10.1109/dac.2007.375164 article EN cc-by Proceedings - ACM IEEE Design Automation Conference 2007-06-01