- Parallel Computing and Optimization Techniques
- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Ferroelectric and Negative Capacitance Devices
- CCD and CMOS Imaging Sensors
- Neural Networks and Applications
- Advancements in Semiconductor Devices and Circuit Design
- Embedded Systems Design Techniques
- Analog and Mixed-Signal Circuit Design
- Speech Recognition and Synthesis
- Interconnection Networks and Systems
- Speech and Audio Processing
- Low-power high-performance VLSI design
- Advanced Data Storage Technologies
- Semiconductor materials and devices
- Advanced Wireless Communication Techniques
- Advanced Data Compression Techniques
- Electronic Packaging and Soldering Technologies
- Blind Source Separation Techniques
- PAPR reduction in OFDM
- 3D IC and TSV technologies
- Energy Harvesting in Wireless Networks
- VLSI and Analog Circuit Testing
ETH Zurich
2018-2024
University of Bologna
2023
Innovation Cluster (Canada)
2023
National University of Singapore
2023
Emerging Artificial Intelligence-enabled Internet-of-Things (AI-IoT) SoCs [1–4] for augmented reality, personalized healthcare and nano-robotics need to run a large variety of tasks within a power envelope of a few tens of mW: compute-intensive but bit-precision-tolerant Deep Neural Networks (DNNs), as well as signal processing and control requiring high-precision floating-point. Performance and energy constraints vary greatly between different applications and even stages of the same application. We present Marsellus...
Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition. On-device computation of RNNs on low-power mobile and wearable devices would be key to applications such as zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a small (<1 mm²) hardware accelerator for Long-Short Term Memory in UMC 65 nm technology, capable of operating at a measured peak efficiency of up to 3.08 Gop/s/mW at 1.24 mW power. To implement big RNN models without...
Emerging artificial intelligence-enabled Internet-of-Things (AI-IoT) systems-on-chip (SoCs) for augmented reality, personalized healthcare, and nanorobotics need to run many diverse tasks within a power envelope of a few tens of mW over a wide range of operating conditions: compute-intensive but strongly quantized deep neural network (DNN) inference, as well as signal processing and control requiring high-precision floating point. We present MARSELLUS, an all-digital heterogeneous SoC for AI-IoT end-nodes...
Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for inference and successfully pushed to extreme ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision formats such as 8-bit FP and mixed-precision techniques have only...
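To make the integer-quantization side of this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the kind of narrow-integer representation the abstract refers to. The function names and toy values are illustrative, not from the paper.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~ scale * q, with q an int8 in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from the integer codes."""
    return [scale * v for v in q]

w = [0.6, -2.0, 0.5]
q, s = quantize_int8(w)      # q = [38, -127, 32]
w_hat = dequantize(q, s)     # close to w, up to rounding error
```

Pushing the same idea to the extreme ternary case means restricting `q` to {-1, 0, +1}, trading accuracy for a drastically smaller memory footprint.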
Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets inference on embedded...
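The "high arithmetic intensity" claim can be checked with a back-of-the-envelope calculation: FLOPs divided by memory traffic for a matmul. The sketch below is a simplified model assuming each operand crosses the memory interface exactly once; the function name and example sizes are illustrative, not from ITA.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of an (m,k) @ (k,n) matmul,
    assuming each operand and the result move between memory and compute once."""
    flops = 2 * m * n * k                               # one mul + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # A, B, and C matrices
    return flops / traffic

# e.g. a 512x512 attention-score matmul with head dimension 64, FP16 operands
ai = matmul_intensity(512, 512, 64)   # ~51 FLOPs per byte
```

Intensities in this range put attention kernels well into the compute-bound region of a typical roofline, which is why dedicated datapaths pay off.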
Radio resource management (RRM) is critical in 5G mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with their tight requirements, combined with the dense and massive base station deployment, call for an on-the-edge acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce...
Recurrent neural networks such as Long Short-Term Memories (LSTMs) learn temporal dependencies by keeping an internal state, making them ideal for time-series problems such as speech recognition. However, the output-to-input feedback creates distinctive memory bandwidth and scalability challenges in designing accelerators for RNNs. We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy efficiency of 3.25 TOP/s/W and a performance of 30.53 GOP/s in UMC 65 nm technology....
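The output-to-input feedback that the abstract identifies as the core accelerator challenge is easy to see in a single-unit LSTM step. This is a generic textbook LSTM cell, written here as a toy sketch (scalar gates, made-up weights), not the Muntaniala datapath.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM. W maps each gate to (w_x, w_h, b).
    h_prev enters every gate: this output-to-input feedback serializes
    time steps and dominates an accelerator's memory traffic."""
    i = sigmoid(W['i'][0] * x + W['i'][1] * h_prev + W['i'][2])    # input gate
    f = sigmoid(W['f'][0] * x + W['f'][1] * h_prev + W['f'][2])    # forget gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h_prev + W['o'][2])    # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h_prev + W['g'][2])  # candidate
    c = f * c_prev + i * g        # internal state: the long-term "memory"
    h = o * math.tanh(c)          # output, fed back at the next step
    return h, c

W = {k: (0.5, 0.5, 0.0) for k in 'ifog'}  # toy weights
h = c = 0.0
for x in [1.0, -1.0, 1.0]:                # a 3-step input sequence
    h, c = lstm_step(x, h, c, W)
```

Because step t cannot start before h from step t-1 is available, an accelerator cannot simply batch across time; it must instead hide the weight-fetch bandwidth behind this serial dependency.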
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, and 8-bit) SIMD FP data. Occamy features 48 clusters of cores with custom extensions, two 64-bit host cores, a latency-tolerant multi-chiplet interconnect, and a memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.
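Why utilization on sparse kernels is hard to push high is visible in even the simplest sparse routine. The sketch below is a generic CSR matrix-vector product, not Occamy's implementation; the irregular, data-dependent access to `x` is what defeats straightforward SIMD execution.

```python
def csr_matvec(vals, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR form. Only nonzeros are stored and
    touched; the indirect access x[col_idx[j]] makes high utilization hard."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[r], row_ptr[r + 1]):  # nonzeros of row r
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y

# A = [[2, 0, 1],
#      [0, 0, 3]]
y = csr_matvec([2.0, 1.0, 3.0], [0, 2, 2], [0, 2, 3], [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

Hardware support such as indexed loads and latency-tolerant interconnects targets exactly this gather step, which is why measured sparse utilization (42-49 %) trails the dense-stencil figure (83 %).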
As a contribution to projects like the European Processor Initiative (EPI) as well as the Stencil and Tensor Accelerator (STX), Fraunhofer IZM has further developed its advanced packaging portfolio with a special focus on wafer-level packaging of high-performance computing (HPC) modules. This includes the scaling of the well-established multi-layer copper redistribution technology to enable 4 μm line / space routing (8 μm pitch) over multiple layers with a 6 μm thick polymer interlayer dielectric and micro vias of 8 μm diameter. The redistribution layer (RDL)...
Modern high-performance computing architectures (multicore, GPU, manycore) are based on tightly-coupled clusters of processing elements, physically implemented as rectangular tiles. Their size and aspect ratio strongly impact the achievable operating frequency and energy efficiency, but they should be as flexible as possible to achieve high utilization in the top-level die floorplan. In this paper, we explore the flexibility range of a cluster of RISC-V cores with shared L1 memory used to build scalable accelerators,...
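The geometric side of the aspect-ratio exploration reduces to simple arithmetic: for a fixed tile area, the floorplanner sweeps the width-to-height ratio and checks which shapes still close timing. A hypothetical helper (not from the paper) for that sweep:

```python
def tile_dimensions(area_mm2, aspect_ratio):
    """Width and height of a rectangular tile of a given area and
    aspect ratio (width / height), as swept during floorplanning."""
    height = (area_mm2 / aspect_ratio) ** 0.5
    return aspect_ratio * height, height

w, h = tile_dimensions(1.0, 4.0)  # a 1 mm^2 tile stretched to 4:1
```

A wide, flat tile and a near-square tile of the same area place very different demands on the cluster's internal interconnect, which is what ultimately bounds the usable aspect-ratio range.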
With the rise of deep learning (DL), our world braces for artificial intelligence (AI) in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high-throughput, reliable, and secure AI processing at ultra-low power (ULP), with a very short time to market. With its strong legacy in embedded solutions and open platforms, the EU is well-positioned to become a leader in this domain. However, this requires processing that is at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with...