- Digital Image Processing Techniques
- CCD and CMOS Imaging Sensors
- Medical Image Segmentation Techniques
- Advanced Memory and Neural Computing
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Advanced Data Compression Techniques
- Advanced Machining Processes and Optimization
- Embedded Systems Design Techniques
- Ferroelectric and Negative Capacitance Devices
- Interconnection Networks and Systems
- Image and Signal Denoising Methods
- Manufacturing Process and Optimization
- Advanced Surface Polishing Techniques
- Digital Filter Design and Implementation
- Advanced Image Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Image Processing Techniques and Applications
- Advanced Machining and Optimization Techniques
- Machine Learning and Data Classification
- Graph Theory and Algorithms
- Low-Power High-Performance VLSI Design
- Laser Material Processing Techniques
- Microfluidic and Bio-sensing Technologies
- Flexible and Reconfigurable Manufacturing Systems
- Baden-Wuerttemberg Cooperative State University, 2021-2022
- Robert Bosch (Germany), 2019-2021
- IBM Research - Tokyo, 2020
- Stuttgart University of Applied Sciences, 2019
- IBM (United States), 2018
- IBM (Germany), 2018
- University of Stuttgart, 2012-2016
- Schunk (Germany), 2016
A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems ranging from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy as well as 1b/2b (binary/ternary) integer for aggressive performance. At 1.5 GHz, the prototype...
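The 1b/2b integer precisions mentioned above correspond to binarized and ternarized weights. As a rough illustration of what such quantization does to a weight tensor, the sketch below uses a per-tensor scale from the mean absolute value and a 0.7-of-mean threshold for the ternary case; these heuristics are common in the literature and are an assumption here, not details from the abstract.

```python
# Sketch of the two aggressive integer precisions: binarization (1b, sign only)
# and ternarization (2b, {-1, 0, +1} with a magnitude threshold).
# Scaling choices are common heuristics, not taken from the accelerator paper.

def binarize(weights):
    """Map weights to {-scale, +scale} with a per-tensor scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

def ternarize(weights):
    """Map weights to {-scale, 0, +scale} using a magnitude threshold."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    thr = 0.7 * mean_abs                     # assumed threshold heuristic
    kept = [abs(w) for w in weights if abs(w) > thr]
    scale = sum(kept) / len(kept) if kept else 0.0
    return [0.0 if abs(w) <= thr else (scale if w > 0 else -scale) for w in weights]

w = [0.31, -0.12, 0.05, -0.44, 0.27, -0.02]
print(binarize(w))
print(ternarize(w))
```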
Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost efficiency...
A resource-efficient hardware architecture for connected component analysis (CCA) of streamed video data is presented, which reduces the required resources, especially for larger image widths. On-chip memory requirements increase with image width and dominate the resources of state-of-the-art single-pass CCA architectures. A reduction in on-chip memory is essential to meet the ever-increasing image sizes of high-definition (HD) and ultra-HD standards. The proposed architecture is resource-efficient due to several innovations. An improved label recycling...
This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A 16b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed to preserve model accuracy in inference...
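Note that this 1-6-9 layout differs from IEEE half precision (1-5-10). The following is a minimal encoder/decoder sketch of such a format; the exponent bias of 31, the round-to-nearest behaviour and the omission of subnormals, infinities and NaNs are simplifying assumptions, not taken from the letter.

```python
# Minimal sketch of a 16-bit float with 1 sign, 6 exponent and 9 mantissa bits.
# Assumed bias = 2**(6-1) - 1 = 31; no subnormal/Inf/NaN handling.
import math

EXP_BITS, MAN_BITS = 6, 9
BIAS = (1 << (EXP_BITS - 1)) - 1

def encode_fp16_169(x: float) -> int:
    """Pack a Python float into a 16-bit word (sign | exponent | mantissa)."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    x = abs(x)
    if x == 0.0:
        return sign << (EXP_BITS + MAN_BITS)
    exp = math.floor(math.log2(x))
    frac = x / (2.0 ** exp) - 1.0              # normalized fraction in [0, 1)
    man = round(frac * (1 << MAN_BITS))
    if man == (1 << MAN_BITS):                 # mantissa rounded up to 2.0: bump exponent
        man, exp = 0, exp + 1
    exp_field = max(0, min((1 << EXP_BITS) - 1, exp + BIAS))
    return (sign << (EXP_BITS + MAN_BITS)) | (exp_field << MAN_BITS) | man

def decode_fp16_169(word: int) -> float:
    """Unpack a 16-bit word back to a Python float."""
    sign = (word >> (EXP_BITS + MAN_BITS)) & 0x1
    exp_field = (word >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man_field = word & ((1 << MAN_BITS) - 1)
    if exp_field == 0 and man_field == 0:
        return -0.0 if sign else 0.0
    value = (1.0 + man_field / (1 << MAN_BITS)) * 2.0 ** (exp_field - BIAS)
    return -value if sign else value

print(decode_fp16_169(encode_fp16_169(3.14159)))   # ~3.14 with 9 mantissa bits
```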
In this paper, an adaptive architecture for the dynamic management and allocation of on-chip FPGA Block Random Access Memory (BRAM) resources is presented. This facilitates the sharing of valuable and scarce memory among several processing elements (PEs) according to their run-time requirements. Real-time applications are becoming increasingly dynamic, which leads to unexpected and variable memory footprints, where static worst-case provisioning would result in costly overheads and inefficient utilization. The proposed...
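A purely behavioural way to picture the sharing scheme is a pool of memory blocks that PEs claim and return at run time, as sketched below. The block granularity and the request/release API are hypothetical; the paper describes a hardware mechanism, not software.

```python
# Behavioural sketch of run-time BRAM sharing: PEs request and release blocks from a
# common pool instead of reserving their static worst case. API and sizes are assumed.

class BramPool:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))   # indices of unused physical blocks
        self.owned = {}                         # pe_id -> list of granted block indices

    def request(self, pe_id, n_blocks):
        """Grant up to n_blocks free blocks to a PE; returns the granted indices."""
        grant = [self.free.pop() for _ in range(min(n_blocks, len(self.free)))]
        self.owned.setdefault(pe_id, []).extend(grant)
        return grant

    def release(self, pe_id):
        """PE returns all of its blocks to the shared pool."""
        self.free.extend(self.owned.pop(pe_id, []))

pool = BramPool(total_blocks=8)
print(pool.request("PE0", 3))   # PE0 gets 3 blocks
print(pool.request("PE1", 6))   # PE1 gets only the 5 remaining blocks
pool.release("PE0")             # blocks flow back for later requests
print(len(pool.free))           # 3
```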
In classical connected component labeling algorithms the image has to be scanned two times. The amount of memory required for these algorithms is at least as high as storing a full image. By using single-pass algorithms, this requirement can be reduced by one order of magnitude to only one image row. This reduction, which avoids the bandwidth of external memory, is essential to obtain a hardware-efficient implementation on FPGAs. These algorithms are mapped one-to-one to the resources of FPGAs and process one pixel per clock cycle in the best case. To enhance performance, a scalable parallel...
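The core idea of the single-pass approach, keeping only one row of labels plus a small merge table, can be modelled in a few lines of software. The sketch below uses 4-connectivity and accumulates component areas; it illustrates the principle only and is not the hardware architecture or feature set of the paper.

```python
# Minimal software model of single-pass connected component labeling (4-connectivity),
# keeping only one row of labels plus a small merge table (union-find).
# Illustrative sketch only; the extracted feature (area) and naming are assumptions.

def single_pass_ccl_areas(image):
    """image: list of rows of 0/1 pixels. Returns {component_label: area}."""
    width = len(image[0])
    prev_row = [0] * width              # labels of the previous image row (0 = background)
    parent = {}                         # merge table
    area = {}
    next_label = 1

    def find(lbl):
        while parent[lbl] != lbl:
            parent[lbl] = parent[parent[lbl]]
            lbl = parent[lbl]
        return lbl

    for row in image:
        cur_row = [0] * width
        for x, px in enumerate(row):
            if not px:
                continue
            left = cur_row[x - 1] if x > 0 else 0
            up = prev_row[x]
            if not left and not up:             # new component starts here
                label = next_label
                next_label += 1
                parent[label] = label
                area[label] = 0
            elif left and up:                   # both neighbours labelled: maybe merge
                la, lb = find(left), find(up)
                label = min(la, lb)
                if la != lb:
                    parent[max(la, lb)] = label
                    area[label] = area[la] + area[lb]
                    area.pop(max(la, lb))
            else:                               # extend the single labelled neighbour
                label = find(left or up)
            cur_row[x] = label
            area[label] += 1
        prev_row = cur_row
    return area                                 # only root labels remain as keys

img = [[0, 1, 1, 0, 1],
       [0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1]]
print(single_pass_ccl_areas(img))               # {1: 3, 2: 3, 3: 1}
```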
Single-pass connected components analysis (CCA) algorithms suffer from a time overhead to resolve labels at the end of each image row. This work demonstrates how this overhead can be eliminated by replacing the conventional raster scan with a zig-zag scan. This enables label chains to be correctly resolved while processing the next row. The effect is faster processing in the worst case with no end-of-row overheads. CCA hardware architectures using the novel algorithm proposed in this paper are, therefore, able to process images with higher throughput than other state-of-the-art...
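For reference, the zig-zag (boustrophedon) scan simply reverses the traversal direction on every other row, so processing of a new row starts where the previous row ended. A minimal generator of that visit order, with an assumed left-to-right start, is shown below.

```python
# Pixel visit order for a zig-zag (boustrophedon) scan, as opposed to a raster scan:
# even rows left-to-right, odd rows right-to-left. Direction convention is assumed.

def zigzag_scan(height, width):
    """Yield (y, x) coordinates in zig-zag order."""
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        for x in xs:
            yield (y, x)

print(list(zigzag_scan(3, 4)))
# [(0,0), (0,1), (0,2), (0,3), (1,3), (1,2), (1,1), (1,0), (2,0), (2,1), (2,2), (2,3)]
```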
A memory-efficient architecture for single-pass connected components analysis suited to high-throughput embedded image processing systems is proposed, which achieves this by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows labels associated with objects to be reused. This reduces the amount of on-chip memory by a factor of more than 5 compared to previous work. This is significant, since on-chip memory is a critical resource on FPGAs.
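The slicing idea can be sketched in software by labelling each vertical slice independently and then merging labels that touch across slice boundaries. The flood-fill labelling, the slice count and the 4-connectivity below are illustrative assumptions and do not reflect the paper's streaming architecture or its label reuse mechanism.

```python
# Sketch of vertical-slice partitioning: label each slice independently, then merge
# components that touch across slice boundaries with a union-find.
from collections import deque

def label_slice(pixels):
    """Flood-fill labeling of one slice; returns a label map (0 = background)."""
    h, w = len(pixels), len(pixels[0])
    labels = [[0] * w for _ in range(h)]
    nxt = 1
    for y in range(h):
        for x in range(w):
            if pixels[y][x] and not labels[y][x]:
                q = deque([(y, x)])
                labels[y][x] = nxt
                while q:
                    cy, cx = q.popleft()
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and pixels[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = nxt
                            q.append((ny, nx))
                nxt += 1
    return labels

def count_components_sliced(image, n_slices=2):
    h, w = len(image), len(image[0])
    bounds = [w * i // n_slices for i in range(n_slices + 1)]
    slice_labels, offsets, total = [], [], 0
    for i in range(n_slices):
        lab = label_slice([row[bounds[i]:bounds[i + 1]] for row in image])
        slice_labels.append(lab)
        offsets.append(total)
        total += max((max(r) for r in lab), default=0)
    parent = list(range(total + 1))          # union-find over globally offset labels
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n_slices - 1):            # merge across each slice boundary
        left, right = slice_labels[i], slice_labels[i + 1]
        for y in range(h):
            la, lb = left[y][-1], right[y][0]
            if la and lb:
                ra, rb = find(la + offsets[i]), find(lb + offsets[i + 1])
                if ra != rb:
                    parent[rb] = ra
    roots = {find(l + offsets[i])
             for i in range(n_slices)
             for row in slice_labels[i] for l in row if l}
    return len(roots)

img = [[1, 1, 0, 1],
       [0, 1, 1, 1],
       [1, 0, 0, 0]]
print(count_components_sliced(img, n_slices=2))   # 2 connected components
```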
Gigantic rates of data production in the era of Big Data, the Internet of Things (IoT), and Smart Cyber-Physical Systems (CPS) pose incessantly escalating demands for massive processing, storage, and transmission while continuously interacting with the physical world using edge sensors and actuators. For IoT systems, there is now a strong trend to move intelligence from the cloud to the edge or the extreme edge (known as TinyML). Yet, this shift towards edge AI systems requires the design of powerful machine learning under very strict resource constraints....
The quality of machined surfaces is significantly influenced by machine vibrations caused by the cutting process. Whereas most publications ignore the influence of the tool holder, this paper considers the dynamic behaviour of the whole system consisting of spindle, tool holder and workpiece. Therefore, modal and operational vibration analyses were performed to describe the damping characteristics of two competing tool holder technologies, namely heat-shrink (HS) and hydraulic expansion (HE). It is shown that HE has higher damping rates than HS. Therefore, HE showed...
The calculation of mean, variance and standard deviation is often required for segmentation or feature extraction. In image processing, an integer approximation is adequate. Conventional methods require division and square root operations, which are expensive to realize in hardware in terms of both the amount of resources and the latency. A new class of iterative algorithms is developed based on integer arithmetic. An implementation as an architecture on a Field-Programmable Gate Array (FPGA) is compared with architectures using conventional...
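As a generic illustration of division- and square-root-free integer computation, the sketch below uses a restoring (shift-and-subtract) divider and a bitwise integer square root; these are textbook stand-ins and not necessarily the iterative algorithms proposed in the paper.

```python
# Integer mean, variance and standard deviation using only additions, subtractions,
# comparisons and shifts; generic hardware-friendly stand-ins, not the paper's method.

def int_div(num, den):
    """Restoring (shift-and-subtract) integer division: returns num // den."""
    q, rem = 0, 0
    for bit in range(num.bit_length() - 1, -1, -1):
        rem = (rem << 1) | ((num >> bit) & 1)
        if rem >= den:
            rem -= den
            q |= 1 << bit
    return q

def int_sqrt(n):
    """Bitwise integer square root: largest r with r*r <= n."""
    r, bit = 0, 1 << (n.bit_length() & ~1)
    while bit:
        if n >= r + bit:
            n -= r + bit
            r = (r >> 1) + bit
        else:
            r >>= 1
        bit >>= 2
    return r

def int_stats(pixels):
    n = len(pixels)
    s = sum(pixels)
    s2 = sum(p * p for p in pixels)
    mean = int_div(s, n)
    var = int_div(n * s2 - s * s, n * n)      # integer approximation of the variance
    return mean, var, int_sqrt(var)

print(int_stats([12, 15, 11, 20, 17, 13, 19, 14]))   # (15, 9, 3)
```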
High-performance image analytics is an important challenge for big data processing, as image and video data make up a huge portion of the data generated, e.g., by the tremendous number of sensors worldwide. This paper presents a case study, namely parallel connected component labeling (CCL), which is one of the first steps in image analytics in general. It is shown that a high-performance CCL implementation can be obtained on a heterogeneous platform if parts of the algorithm are processed in a fine-grained manner on a field programmable gate array (FPGA) and a multi-core processor simultaneously. The...
JPEG-LS has a large number of different and independent context sets that provide an opportunity for parallelism. Like JPEG-LS, many lossless image compression standards have "adaptive" error modeling as a core part. This, however, leads to data dependency loops in the coding scheme, such that parallel processing of neighboring pixels is not possible. In this paper, a hardware architecture is proposed in order to achieve parallelism in compression. In the adaptive part of the algorithm, the update of a pixel belonging to a context depends on previous pixels having the same context number. On...
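A simplified model of JPEG-LS context formation makes the dependency visible: each pixel's context is derived from quantized gradients of its causal neighbours, and adaptive state is shared only among pixels with the same context. The thresholds below are the common JPEG-LS defaults (T1=3, T2=7, T3=21); the border handling and the context numbering are simplifications, not the standard's exact 365-context mapping.

```python
# Simplified JPEG-LS-style context computation: pixels sharing a context id share
# (and serially update) the same adaptive state; different ids are independent.

T1, T2, T3 = 3, 7, 21

def quantize_gradient(d):
    """Map a local gradient to one of 9 signed regions, as in JPEG-LS."""
    s = -1 if d < 0 else 1
    d = abs(d)
    if d == 0:
        q = 0
    elif d < T1:
        q = 1
    elif d < T2:
        q = 2
    elif d < T3:
        q = 3
    else:
        q = 4
    return s * q

def context_map(img):
    """Return, per pixel, a context id derived from the causal neighbours a, b, c, d."""
    h, w = len(img), len(img[0])
    ctx = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a = img[y][x - 1] if x > 0 else 0                      # left
            b = img[y - 1][x] if y > 0 else 0                      # above
            c = img[y - 1][x - 1] if y > 0 and x > 0 else 0        # above-left
            d = img[y - 1][x + 1] if y > 0 and x + 1 < w else 0    # above-right
            q1 = quantize_gradient(d - b)
            q2 = quantize_gradient(b - c)
            q3 = quantize_gradient(c - a)
            ctx[y][x] = (q1 * 9 + q2) * 9 + q3                     # simple unique numbering
    return ctx

img = [[52, 55, 61, 66],
       [63, 59, 55, 90],
       [62, 59, 68, 113]]
for row in context_map(img):
    print(row)
```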
End-to-end performance estimation and measurement of deep neural network (DNN) systems become more important with the increasing complexity of DNN systems consisting of hardware and software components. The methodology proposed in this paper aims at a reduced turn-around time for evaluating different design choices of the components of such systems. This reduction is achieved by moving the evaluation from the implementation phase to the concept phase, employing virtual models instead of gathering results from physical prototypes. Deep learning compilers...
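At the concept phase, even a coarse analytical model can rank design choices. The sketch below estimates the latency of a single layer as the maximum of compute time and memory-transfer time (a roofline-style bound); the accelerator figures and the example layer are hypothetical and not taken from the paper.

```python
# Roofline-style pre-implementation latency estimate for one DNN layer:
# latency ~ max(compute time, memory-transfer time), assuming perfect overlap.

def layer_latency_s(macs, bytes_moved, peak_macs_per_s, mem_bw_bytes_per_s):
    compute_time = macs / peak_macs_per_s
    memory_time = bytes_moved / mem_bw_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical 3x3 convolution: 56x56x64 -> 56x56x64, fp16 activations and weights.
macs = 56 * 56 * 64 * 64 * 3 * 3
act_bytes = 2 * (56 * 56 * 64) * 2          # input + output activations, 2 bytes each
wgt_bytes = (3 * 3 * 64 * 64) * 2           # weights, 2 bytes each
t = layer_latency_s(macs, act_bytes + wgt_bytes,
                    peak_macs_per_s=8e12,   # assumed accelerator peak
                    mem_bw_bytes_per_s=50e9)  # assumed memory bandwidth
print(f"estimated latency: {t * 1e6:.1f} us")
```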
A significant challenge in laser drilling is the optimization of process parameters and strategies to achieve high-quality holes. This is further complicated by the fact that quality assessment is a manual, time-consuming task. This paper presents a methodology designed to significantly reduce the effort required for optimizing single-pulse drilling of 0.3 mm thick stainless steel. The objective is to precisely drill holes with an entry diameter of 70 μm and an exit diameter of 20 μm, achieving high roundness. The features of the drilled holes were extracted automatically...
A spatial-domain perceptual image codec based on subsampling and quantization (SSPQ) guided by the just-noticeable distortion (JND) profile is proposed. SSPQ integrates coding and progressive transmission in one framework. The input image is first subsampled by a factor of two in both dimensions and compressed without loss. This provides the basis for predicting the remaining pixels by interpolation and for estimating the JND value of each pixel. Residual thresholds are set to the estimated JND values for perceptually tuned compression. Quantized residuals are progressively...
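A toy version of the SSPQ pipeline can be written in three steps: subsample by two, predict the full-resolution image by interpolation, and quantize the residuals with a dead zone set by a threshold. The constant threshold used below stands in for the per-pixel JND profile that the paper estimates.

```python
# Toy sketch of the SSPQ idea on a grayscale image (nested lists):
# 1) subsample by 2 in both dimensions (lossless base layer),
# 2) predict the full image by bilinear interpolation,
# 3) dead-zone quantize the residuals. The constant threshold is a placeholder
#    for the per-pixel JND values derived in the paper.

def subsample2(img):
    return [row[::2] for row in img[::2]]

def predict_bilinear(small, h, w):
    """Upscale `small` back to h x w with simple bilinear interpolation."""
    sh, sw = len(small), len(small[0])
    pred = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fy, fx = min(y / 2, sh - 1), min(x / 2, sw - 1)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, sh - 1), min(x0 + 1, sw - 1)
            dy, dx = fy - y0, fx - x0
            top = small[y0][x0] * (1 - dx) + small[y0][x1] * dx
            bot = small[y1][x0] * (1 - dx) + small[y1][x1] * dx
            pred[y][x] = top * (1 - dy) + bot * dy
    return pred

def quantize_residuals(img, pred, jnd=4):
    """Dead-zone quantization: residuals below the (assumed) threshold are dropped."""
    out = []
    for row, prow in zip(img, pred):
        out.append([0 if abs(p - q) < jnd else round((p - q) / (2 * jnd))
                    for p, q in zip(row, prow)])
    return out

img = [[10, 12, 14, 16],
       [12, 30, 16, 18],
       [14, 16, 18, 20],
       [16, 18, 20, 22]]
small = subsample2(img)                  # lossless base layer
pred = predict_bilinear(small, 4, 4)     # interpolation-based prediction
print(quantize_residuals(img, pred))     # sparse, perceptually tuned residuals
```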
The combination of growth in compute capabilities and the availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become the state-of-the-art for a variety of machine learning tasks spanning domains across vision, speech, and translation. Deep Learning (DL) achieves high accuracy on these tasks at the expense of hundreds of ExaOps of computation, posing significant challenges for efficient large-scale deployment in both resource-constrained environments and data centers.
The Union-Retire CCA (UR-CCA) algorithm started a new paradigm for connected components analysis. Instead of using directed tree structures, UR-CCA focuses on connectivity. This algorithmic change leads to a reduction in the required memory, with no end-of-row processing overhead. In this paper we describe a hardware architecture based on UR-CCA and its realisation on an FPGA. The resulting memory bandwidth and pipelining challenges are analysed and resolved. It is shown that up to 36% of hardware resources can be saved using the proposed architecture....
A key issue in system design is the lack of communication between hardware, software and domain experts. Recent research shows progress on automatic HW/SW co-design flows for neural accelerators that seems to make this kind of communication obsolete. Most real-world systems, however, are a composition of multiple processing units, networks and memories. The co-design process for (reconfigurable) accelerators is, therefore, an important sub-problem on the way towards a common methodology. The ultimate challenge is to define constraints for the design space exploration...
Recently, automated frameworks have been proposed that map neural networks from a high-level description onto embedded devices, most of them in an end-to-end manner. This paper aims to give an overview of their main characteristics and achievements. A special focus lies on the internal predictions made during design space exploration (DSE) regarding hardware targets (performance, area or power consumption), enabling fast traversal of the individually defined search spaces, especially in early stages. Additionally,...