Amin Firoozshahian

ORCID: 0009-0009-0128-298X
Research Areas
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • Advanced Data Storage Technologies
  • Distributed systems and fault tolerance
  • Embedded Systems Design Techniques
  • Low-power high-performance VLSI design
  • Stochastic Gradient Optimization Techniques
  • Recommender Systems and Techniques
  • Cloud Computing and Resource Management
  • Caching and Content Delivery
  • Software-Defined Networks and 5G
  • Advancements in Battery Materials
  • Network Packet Processing and Optimization
  • Semiconductor materials and devices
  • Formal Methods in Verification
  • Integrated Circuits and Semiconductor Failure Analysis
  • VLSI and Analog Circuit Testing

Meta (United States)
2023

Meta (Israel)
2020

Intel (United Kingdom)
2015-2017

Menlo School
2012-2014

Stanford University
2007-2011

The trend towards simple datacenter network fabric strips most network functionality, including load balancing, out of the network core and pushes it to the edge. This slows reaction to microbursts, the main culprit of packet loss in datacenters. We investigate the opposite direction: could slightly smarter fabric significantly improve load balancing? This paper presents DRILL, a datacenter fabric for Clos networks which performs micro load balancing to distribute load as evenly as possible on microsecond timescales. DRILL employs per-packet decisions at each switch based...

10.1145/3098822.3098839 article EN 2017-08-04

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows high model-, operator...

10.1109/isca45697.2020.00070 article EN 2020-05-01

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally...

10.1145/1250662.1250707 article EN 2007-06-09

The trend towards simple data center network fabric strips most network functionality, including load balancing capabilities, out of the network core and pushes them to the edge. We investigate a different direction of incorporating minimal load balancing intelligence into the fabric and show that this slightly smarter fabric significantly enhances performance. We provide a very simple in-network scheduling algorithm called DRILL which is purely local to each switch. DRILL leverages local load sensing and randomization concepts to distribute load among multiple paths. Through simulation, we...

10.1145/2834050.2834107 article EN 2015-11-09

Meta has traditionally relied on using CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards specialized solutions such as GPUs or other hardware accelerators. This paper describes the company's effort in constructing its first silicon designed for recommendation systems; it describes the accelerator architecture and platform design, and the software stack for enabling and optimizing...

10.1145/3579371.3589348 article EN 2023-06-16

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2541940.2541952 article EN 2014-02-24

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2150976.2151007 article EN 2012-03-03

There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally...

10.1145/1273440.1250707 article EN ACM SIGARCH Computer Architecture News 2007-06-09

Verification of chip multiprocessor memory systems remains challenging. While formal methods have been used to validate protocols, simulation is still the dominant method used to validate memory system implementation. Having a memory scoreboard, a high-level model of the memory, greatly aids simulation based validation, but accurate scoreboards are complex to create since they often depend not only on the memory and consistency model but also on its specific implementation. This paper describes a methodology of using a relaxed scoreboard, which reduces the complexity of creating these memory models. The relaxed scoreboard...

10.1109/micro.2008.4771799 article EN 2008-11-01

As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines: request-tracking, state-manipulation, data...

10.1145/1555754.1555805 article EN 2009-06-20

There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16...

10.1145/1455650.1455651 article EN ACM Transactions on Architecture and Code Optimization 2008-11-01

The drive for low-power, high performance computation, coupled with the extremely high design costs of ASIC designs, has driven a number of designers to try to create a flexible, universal computing platform that will supersede the microprocessor. We argue that these flexible, general computing chips are trying to accomplish more than is commercially needed. Since design NRE costs are an order of magnitude larger than fabrication costs, a two-step design system seems attractive. First, the users configure/program a flexible computing framework to run their application with the desired performance....

10.1145/1278480.1278544 article EN Proceedings - ACM IEEE Design Automation Conference 2007-01-01

Building hardware prototypes for computer architecture research is challenging. Unfortunately, development of the required software tools (compilers, debuggers, runtime) is even more challenging, which means these systems rarely run real applications. To overcome this issue, when developing our prototype platform, we used the Tensilica processor generator to produce a customized processor and corresponding software tools and libraries. While this base processor was very different from the streamlined custom processor we initially imagined, it allowed us...

10.1145/1669112.1669159 article EN 2009-12-12

Sparse matrix-vector multiply (SpMV) is a critical task in the inner loop of modern iterative linear system solvers and exhibits very little data reuse. This low reuse means that its performance is bounded by main-memory bandwidth. Moreover, the random patterns of indirection make it difficult to achieve this bound. We present sparse matrix storage formats based on deduplicated memory. These formats reduce memory traffic during SpMV and thus show significantly improved performance bounds: 90x better in the best case. Additionally,...

10.1145/2304576.2304603 article EN 2012-06-25
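
The indirection that the SpMV abstract above refers to can be illustrated with a minimal sketch in the standard compressed sparse row (CSR) layout. This is not the deduplicated-memory format the paper proposes; the function name `spmv_csr` and the sample matrix are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def spmv_csr(
    row_ptr: Sequence[int],
    col_idx: Sequence[int],
    values: Sequence[float],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx and values. (Illustrative helper, not taken from the paper.)
    """
    y: List[float] = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Each nonzero is read exactly once (little data reuse), and
            # x is reached through col_idx[k], an indirect, data-dependent
            # access that can touch x in an irregular order.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)  # [5.0, 6.0, 19.0]
```

Every multiply-add needs one matrix value, one column index, and one element of `x`, so the loop streams the whole matrix from memory per multiplication, which is consistent with the abstract's point that performance is bounded by main-memory bandwidth.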

A number of algorithms have been proposed in the literature for scheduling CIOQ switches. The algorithms which have been proven to provide strict performance guarantees on delay (via emulation of an output-queued switch) are too complicated to implement because they require the exchange of a large amount of information between inputs and outputs. With implementation as our primary focus, we consider algorithms that are "fully local." This means that the inputs and outputs must be able to make decisions regarding matchings using only local information (except requests, grants...

10.1109/infcom.2007.307 article EN 2007-01-01

Booting and debugging the functionality of silicon samples are known to be challenging and time-consuming tasks, even more so in cost-constrained environments. The authors describe the creative solutions they used to bring up Stanford Smart Memories (SSM), a 55-million transistor research chip.

10.1109/mdt.2011.2179849 article EN IEEE Design & Test of Computers 2011-12-15

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows high model-, operator-...

10.48550/arxiv.1912.12953 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2248487.2151007 article EN ACM SIGPLAN Notices 2012-03-03

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2654822.2541952 article EN ACM SIGARCH Computer Architecture News 2014-02-24

Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable...

10.1145/2644865.2541952 article EN ACM SIGPLAN Notices 2014-02-24

As CPU cores become building blocks, we see a great expansion in the types of on-chip memory systems proposed for CMPs. Unfortunately, designing the cache and protocol controllers to support these memory systems is complex, and their concurrency and latency characteristics significantly affect the performance of any CMP. To address this problem, this paper presents a microarchitecture framework for cache and protocol controllers, which can aid in generating the RTL for new memory systems. The framework consists of three pipelined engines: request-tracking, state-manipulation, data...

10.1145/1555815.1555805 article EN ACM SIGARCH Computer Architecture News 2009-06-15

Programming language and operating system support for efficient concurrency-safe access to shared data is a key concern for the effective use of multi-core processors. Most research has focused on the software model of multiple threads accessing this data within a single address space. However, many real applications are actually structured as separate processes for fault isolation and simplified synchronization. In this paper, we describe the HICAMP architecture and its innovative memory system, which supports concurrency safe...

10.1145/2189750.2151007 article EN ACM SIGARCH Computer Architecture News 2012-03-03