An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Keywords: Pooling; Hardware acceleration; Interleaving
DOI: 10.3390/electronics8040371 Publication Date: 2019-03-29T07:50:21Z
ABSTRACT
Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, as well as many big-data analysis tasks. However, their large model size and intensive computation hinder their deployment on hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput of CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the parallel streaming architecture. The accelerator is implemented in TSMC 40 nm technology with a core area of 0.17 mm². It can achieve 7.03 TOPS/W energy efficiency and 4.14 TOPS/mm² area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
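For illustration only, below is a minimal software sketch of the fused activation-quantification-max-pooling idea behind the AQP unit. It is a hypothetical NumPy example assuming a ReLU activation, a simple uniform quantizer, and non-overlapping 2x2 max-pooling; the paper's actual three-stage hardware pipeline, staged blocking strategy, and specific quantification algorithm are not reproduced here.

import numpy as np

def act_quant_pool(fmap, bits=2, pool=2):
    # Hypothetical sketch of the fused pass: ReLU activation, uniform
    # quantization to 2^bits levels, then non-overlapping max-pooling.
    # Activation stage: clip negative values (ReLU).
    act = np.maximum(fmap, 0.0)
    # Quantification stage: uniform quantization over [0, max] (assumed scheme).
    levels = (1 << bits) - 1
    scale = act.max() / levels if act.max() > 0 else 1.0
    q = np.round(act / scale).astype(np.int32)
    # Pooling stage: max over non-overlapping pool x pool windows.
    h, w = q.shape
    h2, w2 = h // pool, w // pool
    q = q[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
    return q.max(axis=(1, 3))

# Example: a 4x4 feature map reduced to 2x2 after 2-bit quantization.
fm = np.array([[0.1, -0.3, 0.7, 0.2],
               [0.5,  0.4, 0.0, 0.9],
               [-0.2, 0.6, 0.3, 0.1],
               [0.8,  0.2, 0.5, 0.4]])
print(act_quant_pool(fm, bits=2, pool=2))

In the accelerator, the three stages operate on streamed data simultaneously rather than sequentially as in this software sketch, which is what allows activation, quantification, and max-pooling to overlap in one reconfigurable unit.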