- Parallel Computing and Optimization Techniques
- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Ferroelectric and Negative Capacitance Devices
- CCD and CMOS Imaging Sensors
- Neural Networks and Applications
- Advancements in Semiconductor Devices and Circuit Design
- Embedded Systems Design Techniques
- Analog and Mixed-Signal Circuit Design
- Speech Recognition and Synthesis
- Interconnection Networks and Systems
- Speech and Audio Processing
- Low-power high-performance VLSI design
- Advanced Data Storage Technologies
- Semiconductor materials and devices
- Advanced Wireless Communication Techniques
- Advanced Data Compression Techniques
- Electronic Packaging and Soldering Technologies
- Blind Source Separation Techniques
- PAPR reduction in OFDM
- 3D IC and TSV technologies
- Energy Harvesting in Wireless Networks
- VLSI and Analog Circuit Testing
ETH Zurich
2018-2024
University of Bologna
2023
Innovation Cluster (Canada)
2023
National University of Singapore
2023
Emerging Artificial Intelligence-enabled Internet-of-Things (AI-IoT) SoCs [1–4] for augmented reality, personalized healthcare and nano-robotics need to run a large variety of tasks within a power envelope of a few tens of mW: compute-intensive but bit-precision-tolerant Deep Neural Networks (DNNs), as well as signal processing and control requiring high-precision floating-point. Performance and energy constraints vary greatly between different applications and even stages of the same application. We present Marsellus...
Recurrent neural networks (RNNs) are state-of-the-art in voice awareness/understanding and speech recognition. On-device computation of RNNs on low-power mobile and wearable devices would be key to applications such as zero-latency voice-based human-machine interfaces. Here we present Chipmunk, a small (<1 mm²) hardware accelerator for Long-Short Term Memory in UMC 65 nm technology, capable of operating at a measured peak efficiency of up to 3.08 Gop/s/mW at 1.24 mW power. To implement big RNN models without...
Emerging artificial intelligence-enabled Internet-of-Things (AI-IoT) systems-on-chip (SoCs) for augmented reality, personalized healthcare, and nanorobotics need to run many diverse tasks within a power envelope of a few tens of mW over a wide range of operating conditions: compute-intensive but strongly quantized deep neural network (DNN) inference, as well as signal processing and control requiring high-precision floating point. We present MARSELLUS, an all-digital heterogeneous SoC for AI-IoT end-nodes...
Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been vastly investigated for inference and successfully pushed to extreme ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision formats such as 8-bit FP and mixed-precision techniques have only...
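To make the integer-quantization side of this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the kind of narrow-integer representation the abstract refers to. The function names and toy values are illustrative, not from the paper.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~ scale * q, with q an int8 in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP values from the integer codes."""
    return [scale * v for v in q]

w = [0.6, -2.0, 0.5]
q, s = quantize_int8(w)      # q = [38, -127, 32]
w_hat = dequantize(q, s)     # close to w, up to rounding error
```

Pushing the same idea to the extreme ternary case means restricting `q` to {-1, 0, +1}, trading accuracy for a drastically smaller memory footprint.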
Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets inference on embedded...
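The "high arithmetic intensity" claim can be checked with a back-of-the-envelope calculation: FLOPs divided by memory traffic for a matmul. The sketch below is a simplified model assuming each operand crosses the memory interface exactly once; the function name and example sizes are illustrative, not from ITA.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of an (m,k) @ (k,n) matmul,
    assuming each operand and the result move between memory and compute once."""
    flops = 2 * m * n * k                               # one mul + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # A, B, and C matrices
    return flops / traffic

# e.g. a 512x512 attention-score matmul with head dimension 64, FP16 operands
ai = matmul_intensity(512, 512, 64)   # ~51 FLOPs per byte
```

Intensities in this range put attention kernels well into the compute-bound region of a typical roofline, which is why dedicated datapaths pay off.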
Radio resource management (RRM) is critical in 5G mobile communications due to its ubiquity on every radio device and its low latency constraints. The rapidly evolving RRM algorithms with their tight requirements, combined with the dense and massive base station deployment, call for an on-the-edge acceleration system with a tradeoff between flexibility, efficiency, and cost, making application-specific instruction-set processors (ASIPs) an optimal choice. In this work, we start from a baseline, simple RISC-V core and introduce...
Recurrent neural networks such as Long Short-Term Memories (LSTMs) learn temporal dependencies by keeping an internal state, making them ideal for time-series problems such as speech recognition. However, the output-to-input feedback creates distinctive memory bandwidth and scalability challenges in designing accelerators for RNNs. We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy efficiency of 3.25 TOP/s/W and a performance of 30.53 GOP/s in UMC 65 nm technology....
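The output-to-input feedback that the abstract identifies as the core accelerator challenge is easy to see in a single-unit LSTM step. This is a generic textbook LSTM cell, written here as a toy sketch (scalar gates, made-up weights), not the Muntaniala datapath.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM. W maps each gate to (w_x, w_h, b).
    h_prev enters every gate: this output-to-input feedback serializes
    time steps and dominates an accelerator's memory traffic."""
    i = sigmoid(W['i'][0] * x + W['i'][1] * h_prev + W['i'][2])    # input gate
    f = sigmoid(W['f'][0] * x + W['f'][1] * h_prev + W['f'][2])    # forget gate
    o = sigmoid(W['o'][0] * x + W['o'][1] * h_prev + W['o'][2])    # output gate
    g = math.tanh(W['g'][0] * x + W['g'][1] * h_prev + W['g'][2])  # candidate
    c = f * c_prev + i * g        # internal state: the long-term "memory"
    h = o * math.tanh(c)          # output, fed back at the next step
    return h, c

W = {k: (0.5, 0.5, 0.0) for k in 'ifog'}  # toy weights
h = c = 0.0
for x in [1.0, -1.0, 1.0]:                # a 3-step input sequence
    h, c = lstm_step(x, h, c, W)
```

Because step t cannot start before h from step t-1 is available, an accelerator cannot simply batch across time; it must instead hide the weight-fetch bandwidth behind this serial dependency.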
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, and 8-bit) SIMD FP data. Occamy features 48 clusters of cores with custom extensions, two 64-bit host cores, a latency-tolerant multi-chiplet interconnect, and a memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.
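Why utilization on sparse kernels is hard to push high is visible in even the simplest sparse routine. The sketch below is a generic CSR matrix-vector product, not Occamy's implementation; the irregular, data-dependent access to `x` is what defeats straightforward SIMD execution.

```python
def csr_matvec(vals, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR form. Only nonzeros are stored and
    touched; the indirect access x[col_idx[j]] makes high utilization hard."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[r], row_ptr[r + 1]):  # nonzeros of row r
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y

# A = [[2, 0, 1],
#      [0, 0, 3]]
y = csr_matvec([2.0, 1.0, 3.0], [0, 2, 2], [0, 2, 3], [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

Hardware support such as indexed loads and latency-tolerant interconnects targets exactly this gather step, which is why measured sparse utilization (42-49 %) trails the dense-stencil figure (83 %).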
As a contribution to projects like the European Processor Initiative (EPI) as well as the Stencil and Tensor Accelerator (STX), Fraunhofer IZM has further developed its advanced packaging portfolio with a special focus on wafer-level packaging of high-performance computing (HPC) modules. This includes the scaling of the well-established multi-layer copper redistribution technology to enable 4 μm line / space routing (8 μm pitch) over multiple layers with a 6 μm thick polymer interlayer dielectric and micro vias of 8 μm diameter. The redistribution layer (RDL)...
Modern high-performance computing architectures (multicore, GPU, manycore) are based on tightly-coupled clusters of processing elements, physically implemented as rectangular tiles. Their size and aspect ratio strongly impact the achievable operating frequency and energy efficiency, but they should be as flexible as possible to achieve high utilization in the top-level die floorplan. In this paper, we explore the flexibility range of a cluster of RISC-V cores with shared L1 memory used to build scalable accelerators,...
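The geometric side of the aspect-ratio exploration reduces to simple arithmetic: for a fixed tile area, the floorplanner sweeps the width-to-height ratio and checks which shapes still close timing. A hypothetical helper (not from the paper) for that sweep:

```python
def tile_dimensions(area_mm2, aspect_ratio):
    """Width and height of a rectangular tile of a given area and
    aspect ratio (width / height), as swept during floorplanning."""
    height = (area_mm2 / aspect_ratio) ** 0.5
    return aspect_ratio * height, height

w, h = tile_dimensions(1.0, 4.0)  # a 1 mm^2 tile stretched to 4:1
```

A wide, flat tile and a near-square tile of the same area place very different demands on the cluster's internal interconnect, which is what ultimately bounds the usable aspect-ratio range.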
With the rise of deep learning (DL), our world braces for artificial intelligence (AI) in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high-throughput, reliable, and secure AI processing at ultra-low power (ULP), with a very short time to market. With its strong legacy in embedded solutions and open platforms, the EU is well-positioned to become a leader in this domain. However, this requires processing that is at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with...