- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Advanced Data Storage Technologies
- Semiconductor Materials and Devices
- Parallel Computing and Optimization Techniques
- Speech Recognition and Synthesis
- Speech and Audio Processing
- Ferroelectric and Negative Capacitance Devices
- Topic Modeling
- Cellular Automata and Applications
- Advanced Image and Video Retrieval Techniques
- Numerical Methods and Algorithms
- Robotics and Sensor-Based Localization
- Music and Audio Processing
- Blind Source Separation Techniques
- Interconnection Networks and Systems
- Adversarial Robustness in Machine Learning
- Distributed and Parallel Computing Systems
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Embedded Systems Design Techniques
- Cloud Computing and Resource Management
- CCD and CMOS Imaging Sensors
- Advanced Vision and Imaging
- Robotic Path Planning Algorithms
Harvard University Press
2020-2024
Harvard University
2022
National Tsing Hua University
2001-2018
Powerchip (Taiwan)
2005
Foxnum Technology (Taiwan)
2005
Taiwan Semiconductor Manufacturing Company (Taiwan)
2003
Many artificial intelligence (AI) edge devices use nonvolatile memory (NVM) to store the weights of neural networks (trained off-line on an AI server) and require low-energy, fast I/O accesses. The deep neural networks (DNNs) used by processors [1,2] commonly comprise p layers of a convolutional network (CNN) and q layers of a fully-connected network (FCN). Current DNNs on conventional (von Neumann) architectures are limited by high access latencies, energy consumption, and hardware costs. Large working data sets result in heavy memory accesses...
For deep-neural-network (DNN) processors [1-4], the product-sum (PS) operation dominates the computational workload for both convolution (CNVL) and fully-connected (FCNL) neural-network (NN) layers. This hinders the adoption of DNNs in edge artificial-intelligence (AI) devices, which require low-power, low-cost, fast inference. Binary DNNs [5-6] are used to reduce the computation and hardware costs of AI devices; however, a memory bottleneck still remains. In Fig. 31.5.1, conventional PE arrays exploit...
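As a minimal sketch (illustrative only, not the chip's implementation), the product-sum operation the abstract refers to is the dot product underlying both fully-connected layers and im2col-unrolled convolutions; the binarized variant of [5-6] reduces each multiply to a sign comparison (an XNOR in hardware):

```python
def product_sum(weights, inputs):
    """One output neuron: the product-sum (dot product) of a weight row
    with the input vector."""
    return sum(w * x for w, x in zip(weights, inputs))

def fc_layer(weight_matrix, inputs):
    """A fully-connected layer is one product-sum per output neuron."""
    return [product_sum(row, inputs) for row in weight_matrix]

def binary_product_sum(weights, inputs):
    """Binarized product-sum: weights and activations in {-1, +1},
    so each multiply collapses to +1 when signs match, -1 otherwise."""
    return sum(1 if w == x else -1 for w, x in zip(weights, inputs))
```

A convolution layer reduces to the same primitive once each receptive field is unrolled into a vector, which is why PS throughput dominates the workload for both layer types.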
Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements.
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision: their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point-inspired number format, AdaptivFloat, which dynamically maximizes and optimally clips its available range at a layer granularity in order to create a faithful...
Automatic speech recognition (ASR) using deep learning is essential for user interfaces on IoT devices. However, previously published ASR chips [4-7] do not consider realistic operating conditions, which are typically noisy and may include more than one speaker. Furthermore, several of these works have implemented only small-vocabulary tasks, such as keyword spotting (KWS), where context-blind deep neural network (DNN) algorithms are adequate. For large-vocabulary tasks (e.g., >100k words), the complex...
Modern heterogeneous SoCs feature a mix of many hardware accelerators and general-purpose cores that run applications in parallel. This brings challenges in managing how they access shared resources, e.g., the memory hierarchy, communication channels, and on-chip power. We address these challenges through flexible orchestration of data on a 74Tbps network-on-chip (NoC) for dynamic management of resources under contention and a distributed power-management (DHPM) scheme. Developing and testing these ideas requires a comprehensive evaluation platform....
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes: their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point-inspired number representation format for deep learning that dynamically maximizes and optimally clips its available range at a layer granularity in order to create a faithful encoding of neural network...
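The per-layer range adaptation can be sketched as follows. This is a simplified illustration of the idea, not the published AdaptivFloat algorithm: the exponent bias is chosen per tensor (here by a hypothetical rule) so that the small float format's representable range covers the tensor's largest magnitude, and each value is then rounded to the nearest representable point:

```python
import math

def adaptive_float_quantize(values, exp_bits=3, man_bits=4):
    """Quantize to a small float-like format whose exponent bias is
    picked per layer so the format's range covers max(|values|)."""
    max_abs = max(abs(v) for v in values)
    exp_max_unbiased = 2 ** exp_bits - 1
    # Hypothetical bias choice: align the format's top exponent with
    # the tensor's largest magnitude (the per-layer adaptation idea).
    bias = exp_max_unbiased - math.floor(math.log2(max_abs))
    out = []
    for v in values:
        if v == 0:
            out.append(0.0)
            continue
        sign = -1.0 if v < 0 else 1.0
        e = math.floor(math.log2(abs(v)))
        # Clip the exponent into the representable (biased) range.
        e = max(min(e, exp_max_unbiased - bias), -bias)
        frac = abs(v) / 2 ** e
        # Round the mantissa to man_bits fractional bits.
        frac_q = round(frac * 2 ** man_bits) / 2 ** man_bits
        out.append(sign * frac_q * 2 ** e)
    return out
```

Because the bias tracks each layer's own distribution, small-magnitude layers keep fine resolution while large-magnitude layers are not clipped, which is the property fixed-point formats lose at low word sizes.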
The proliferation of personal artificial intelligence (AI) assistant technologies with speech-based conversational AI interfaces is driving exponential growth in the consumer Internet of Things (IoT) market. As these are being applied to keyword spotting (KWS), automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications, it is of paramount importance that they provide uncompromising performance for context learning over long sequences, which is a key benefit...
Autonomous machines (e.g., vehicles, mobile robots, drones) require sophisticated 3D mapping to perceive the dynamic environment. However, maintaining a real-time map is expensive in terms of both compute and memory requirements, especially for resource-constrained edge machines. Probabilistic OctoMap is a reliable, memory-efficient dense model for representing the full environment, with voxel node pruning and expansion capacity. It is widely used but limited by its single-thread design. This paper presents the first...
The design of heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, application domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist this process and expose every possible level of parallelism, we...
Deep neural networks (DNNs) have become ubiquitous and dominant in various application domains due to their state-of-the-art learning capabilities. To run compute- and memory-intensive DNN models, designing specialized hardware accelerators has become the common choice. However, this performance improvement comes with limitations on programmability, which has become crucial given the rapid evolution of models. In this work, we first conduct a workload analysis on a diverse set of models including CNN, LSTM, Transformer, GCN...
An AND-type split-gate Flash memory cell with a trench select gate and buried n+ source is proposed. This cell, programmed by ballistic side injection (BSSI), can provide high programming efficiency with a cell size of 5F². Furthermore, both the read speed and the read current are enhanced by the shared-source configuration.
The design of heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, application domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist this process and expose every possible parallelism, we...
In this paper, a recently proposed bidirectional tunneling program/erase (P/E) NOR-type (BiNOR) flash memory is extensively investigated. With the designated localized p-well structure, uniform Fowler-Nordheim (FN) tunneling is fulfilled for the first time for both program and erase operations in a NOR array architecture, facilitating low-power applications. The BiNOR cell guarantees excellent tunnel oxide reliability while providing fast random-access capability. Furthermore, a three-dimensional (3D) current path, in addition...
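As background for the FN-tunneling P/E mechanism above (a textbook relation, not taken from this abstract), the Fowler-Nordheim current density through the tunnel oxide is commonly modeled as

```latex
J_{FN}(E_{ox}) = A\,E_{ox}^{2}\,\exp\!\left(-\frac{B}{E_{ox}}\right)
```

where $E_{ox}$ is the oxide electric field and $A$, $B$ are constants set by the tunneling barrier height and carrier effective mass. Because $J_{FN}$ depends only on the field across the oxide, dividing the P/E voltage between two terminals (as bidirectional schemes do) lowers the voltage each terminal must supply while keeping $E_{ox}$ high enough for tunneling.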
A novel 3D flash memory, BiNOR, with a localized shallow P-well is proposed for high-speed, low-power, and high-reliability applications. Low-power bi-directional tunneling program/erase is realized in a NOR array, which guarantees better tunnel oxide reliability and which previously could only be performed in NAND arrays. Moreover, read performance is improved by a more than 15% conduction-current enhancement due to the cell structure.
Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization of multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic...
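The entropy-based early-exit idea can be sketched as follows. This is a simplified stand-in, not EdgeBERT's implementation: each transformer layer has an attached classifier, and inference stops at the first layer whose softmax output is confident enough (low entropy), skipping the remaining layers:

```python
import math

def entropy(probs):
    """Shannon entropy of a softmax distribution (low = confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit(layer_outputs, threshold=0.5):
    """Walk the per-layer classifier outputs in order; exit as soon as
    the output entropy drops below the threshold.
    `layer_outputs` stands in for the softmax produced after each layer."""
    for depth, probs in enumerate(layer_outputs, start=1):
        if entropy(probs) < threshold:
            return depth, probs  # confident: skip the remaining layers
    return len(layer_outputs), layer_outputs[-1]
```

Easy inputs exit after a few layers while hard inputs use the full stack, which is what makes the average latency (and energy) input-dependent rather than fixed.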
Autonomous machines (e.g., vehicles, mobile robots, drones) require sophisticated 3D mapping to perceive the dynamic environment. However, maintaining a real-time map is expensive in terms of both compute and memory requirements, especially for resource-constrained edge machines. Probabilistic OctoMap is a reliable, memory-efficient dense model for representing the full environment, with voxel node pruning and expansion capacity. This paper presents the first efficient accelerator solution, i.e., OMU, to enable...
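The probabilistic update and pruning the abstract mentions can be sketched as follows, assuming OctoMap's usual log-odds occupancy model with clamping (parameter values below are typical defaults, not taken from this paper):

```python
import math

def logodds(p):
    """Convert a probability to log-odds, OctoMap's per-voxel state."""
    return math.log(p / (1 - p))

L_HIT, L_MISS = logodds(0.7), logodds(0.4)   # per-measurement updates
L_MIN, L_MAX = logodds(0.12), logodds(0.88)  # clamping thresholds

def update_voxel(l, hit):
    """Bayesian log-odds update with clamping; clamping lets stable
    voxels reach identical extreme values, enabling pruning."""
    l += L_HIT if hit else L_MISS
    return max(L_MIN, min(L_MAX, l))

def can_prune(children):
    """A parent octree node can replace its 8 children when they all
    hold the same clamped (fully free or fully occupied) value."""
    return all(c == children[0] and c in (L_MIN, L_MAX) for c in children)
```

Pruning is what keeps the dense map memory-efficient, but each update touches a root-to-leaf path of the octree, which is why the single-threaded CPU implementation becomes the bottleneck that an accelerator can address.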
For the first time, a new flash cell, called buried bit-line AND (BiAND), is proposed. The buried bit-line can achieve low-voltage programming/erase. The major difference of the current cell from the conventional one lies in the specially designed contact. With the use of the buried bit-line, the high voltage required for FN-tunneling program/erase can be divided between the word-line and the bit-line, such that lower-voltage operation is feasible. Further, a comparison of the reliability of the different operating schemes, i.e., the high-voltage F-N (HV F-N) and Bi operating schemes, has been studied. Results show that the BiAND scheme gives much better...
In this work, we present SM6, an SoC architecture for real-time denoised speech and NLP pipelines, featuring (1) MSSE: an unsupervised probabilistic sound source separation accelerator, (2) FlexNLP: a programmable inference accelerator for attention-based seq2seq DNNs using adaptive floating-point datatypes for wide-dynamic-range computations, and (3) a dual-core Arm Cortex-A53 CPU cluster, which provides on-demand SIMD FFT processing and operating system support. In adverse acoustic conditions, MSSE allows...
This paper presents a novel bi-directional channel FN tunneling program/erase NOR-type (BiNOR) flash memory cell for reliable, high-speed, and low-power operation. With a localized shallow p-well at the bit-line, BiNOR realizes channel FN tunneling P/E in a NOR-type array architecture, which previously could only be done in a NAND architecture. Furthermore, the read current is greatly enhanced by a 3-D conduction effect due to the designated p-well.