- Digital Image Processing Techniques
- CCD and CMOS Imaging Sensors
- Medical Image Segmentation Techniques
- Advanced Memory and Neural Computing
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Advanced Data Compression Techniques
- Advanced Machining Processes and Optimization
- Embedded Systems Design Techniques
- Ferroelectric and Negative Capacitance Devices
- Interconnection Networks and Systems
- Image and Signal Denoising Methods
- Manufacturing Process and Optimization
- Advanced Surface Polishing Techniques
- Digital Filter Design and Implementation
- Advanced Image Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Image Processing Techniques and Applications
- Advanced Machining and Optimization Techniques
- Machine Learning and Data Classification
- Graph Theory and Algorithms
- Low-Power High-Performance VLSI Design
- Laser Material Processing Techniques
- Microfluidic and Bio-sensing Technologies
- Flexible and Reconfigurable Manufacturing Systems
- Baden-Wuerttemberg Cooperative State University, 2021-2022
- Robert Bosch (Germany), 2019-2021
- IBM Research - Tokyo, 2020
- Stuttgart University of Applied Sciences, 2019
- IBM (United States), 2018
- IBM (Germany), 2018
- University of Stuttgart, 2012-2016
- Schunk (Germany), 2016
A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems ranging from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy as well as 1b/2b (binary/ternary) integer for aggressive performance. At 1.5 GHz, the prototype...
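The 1b/2b integer precisions mentioned above correspond to binarized and ternarized weights. As a rough illustration of what such quantization does to a weight tensor, the sketch below uses a per-tensor scale from the mean absolute value and a 0.7-of-mean threshold for the ternary case; these heuristics are common in the literature and are an assumption here, not details from the abstract.

```python
# Sketch of the two aggressive integer precisions: binarization (1b, sign only)
# and ternarization (2b, {-1, 0, +1} with a magnitude threshold).
# Scaling choices are common heuristics, not taken from the accelerator paper.

def binarize(weights):
    """Map weights to {-scale, +scale} with a per-tensor scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [scale if w >= 0 else -scale for w in weights]

def ternarize(weights):
    """Map weights to {-scale, 0, +scale} using a magnitude threshold."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    thr = 0.7 * mean_abs                     # assumed threshold heuristic
    kept = [abs(w) for w in weights if abs(w) > thr]
    scale = sum(kept) / len(kept) if kept else 0.0
    return [0.0 if abs(w) <= thr else (scale if w > 0 else -scale) for w in weights]

w = [0.31, -0.12, 0.05, -0.44, 0.27, -0.02]
print(binarize(w))
print(ternarize(w))
```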
Advances in deep neural networks (DNNs) and the availability of massive real-world data have enabled superhuman levels of accuracy on many AI tasks and ushered in the explosive growth of AI workloads across the spectrum of computing devices. However, their superior accuracy comes at a high computational cost, which necessitates approaches beyond traditional paradigms to improve their operational efficiency. Leveraging the application-level insight of error resilience, we demonstrate how approximate computing (AxC) can significantly boost efficiency...
A resource-efficient hardware architecture for connected component analysis (CCA) of streamed video data is presented, which reduces the required resources, especially for larger image widths. On-chip memory requirements increase with image width and dominate the resources of state-of-the-art single-pass CCA architectures. A reduction in on-chip memory is essential to meet the ever-increasing image sizes of high-definition (HD) and ultra-HD standards. The proposed architecture is resource-efficient due to several innovations. An improved label recycling...
This letter presents a multi-TOPS AI accelerator core for deep learning training and inference. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across a range of neural network topologies by employing a dataflow architecture to provide high throughput and an on-chip scratchpad hierarchy to meet the bandwidth demands of the compute units. A 16b floating point (fp16) representation with 1 sign bit, 6 exponent bits, and 9 mantissa bits has also been developed to preserve model accuracy in inference...
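Note that this 1-6-9 layout differs from IEEE half precision (1-5-10). The following is a minimal encoder/decoder sketch of such a format; the exponent bias of 31, the round-to-nearest behaviour and the omission of subnormals, infinities and NaNs are simplifying assumptions, not taken from the letter.

```python
# Minimal sketch of a 16-bit float with 1 sign, 6 exponent and 9 mantissa bits.
# Assumed bias = 2**(6-1) - 1 = 31; no subnormal/Inf/NaN handling.
import math

EXP_BITS, MAN_BITS = 6, 9
BIAS = (1 << (EXP_BITS - 1)) - 1

def encode_fp16_169(x: float) -> int:
    """Pack a Python float into a 16-bit word (sign | exponent | mantissa)."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    x = abs(x)
    if x == 0.0:
        return sign << (EXP_BITS + MAN_BITS)
    exp = math.floor(math.log2(x))
    frac = x / (2.0 ** exp) - 1.0              # normalized fraction in [0, 1)
    man = round(frac * (1 << MAN_BITS))
    if man == (1 << MAN_BITS):                 # mantissa rounded up to 2.0: bump exponent
        man, exp = 0, exp + 1
    exp_field = max(0, min((1 << EXP_BITS) - 1, exp + BIAS))
    return (sign << (EXP_BITS + MAN_BITS)) | (exp_field << MAN_BITS) | man

def decode_fp16_169(word: int) -> float:
    """Unpack a 16-bit word back to a Python float."""
    sign = (word >> (EXP_BITS + MAN_BITS)) & 0x1
    exp_field = (word >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    man_field = word & ((1 << MAN_BITS) - 1)
    if exp_field == 0 and man_field == 0:
        return -0.0 if sign else 0.0
    value = (1.0 + man_field / (1 << MAN_BITS)) * 2.0 ** (exp_field - BIAS)
    return -value if sign else value

print(decode_fp16_169(encode_fp16_169(3.14159)))   # ~3.14 with 9 mantissa bits
```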
In this paper, an adaptive architecture for the dynamic management and allocation of on-chip FPGA Block Random Access Memory (BRAM) resources is presented. This facilitates the sharing of valuable and scarce memory among several processing elements (PEs) according to their run-time requirements. Real-time applications are becoming increasingly dynamic, which leads to unexpected and variable memory footprints, where static worst-case provisioning would result in costly overheads and inefficient utilization. The proposed...
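A purely behavioural way to picture the sharing scheme is a pool of memory blocks that PEs claim and return at run time, as sketched below. The block granularity and the request/release API are hypothetical; the paper describes a hardware mechanism, not software.

```python
# Behavioural sketch of run-time BRAM sharing: PEs request and release blocks from a
# common pool instead of reserving their static worst case. API and sizes are assumed.

class BramPool:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))   # indices of unused physical blocks
        self.owned = {}                         # pe_id -> list of granted block indices

    def request(self, pe_id, n_blocks):
        """Grant up to n_blocks free blocks to a PE; returns the granted indices."""
        grant = [self.free.pop() for _ in range(min(n_blocks, len(self.free)))]
        self.owned.setdefault(pe_id, []).extend(grant)
        return grant

    def release(self, pe_id):
        """PE returns all of its blocks to the shared pool."""
        self.free.extend(self.owned.pop(pe_id, []))

pool = BramPool(total_blocks=8)
print(pool.request("PE0", 3))   # PE0 gets 3 blocks
print(pool.request("PE1", 6))   # PE1 gets only the 5 remaining blocks
pool.release("PE0")             # blocks flow back for later requests
print(len(pool.free))           # 3
```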
In classical connected component labeling algorithms the image has to be scanned two times. The amount of memory required for these algorithms is at least as high as storing a full image. By using single-pass algorithms, this requirement can be reduced by one order of magnitude to only one image row. This reduction, which avoids the bandwidth of external memory, is essential to obtain a hardware-efficient implementation on FPGAs. These algorithms are mapped one-to-one to the resources of FPGAs and process one pixel per clock cycle in the best case. To enhance performance, a scalable parallel...
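The core idea of the single-pass approach, keeping only one row of labels plus a small merge table, can be modelled in a few lines of software. The sketch below uses 4-connectivity and accumulates component areas; it illustrates the principle only and is not the hardware architecture or feature set of the paper.

```python
# Minimal software model of single-pass connected component labeling (4-connectivity),
# keeping only one row of labels plus a small merge table (union-find).
# Illustrative sketch only; the extracted feature (area) and naming are assumptions.

def single_pass_ccl_areas(image):
    """image: list of rows of 0/1 pixels. Returns {component_label: area}."""
    width = len(image[0])
    prev_row = [0] * width              # labels of the previous image row (0 = background)
    parent = {}                         # merge table
    area = {}
    next_label = 1

    def find(lbl):
        while parent[lbl] != lbl:
            parent[lbl] = parent[parent[lbl]]
            lbl = parent[lbl]
        return lbl

    for row in image:
        cur_row = [0] * width
        for x, px in enumerate(row):
            if not px:
                continue
            left = cur_row[x - 1] if x > 0 else 0
            up = prev_row[x]
            if not left and not up:             # new component starts here
                label = next_label
                next_label += 1
                parent[label] = label
                area[label] = 0
            elif left and up:                   # both neighbours labelled: maybe merge
                la, lb = find(left), find(up)
                label = min(la, lb)
                if la != lb:
                    parent[max(la, lb)] = label
                    area[label] = area[la] + area[lb]
                    area.pop(max(la, lb))
            else:                               # extend the single labelled neighbour
                label = find(left or up)
            cur_row[x] = label
            area[label] += 1
        prev_row = cur_row
    return area                                 # only root labels remain as keys

img = [[0, 1, 1, 0, 1],
       [0, 0, 1, 0, 1],
       [1, 0, 0, 0, 1]]
print(single_pass_ccl_areas(img))               # {1: 3, 2: 3, 3: 1}
```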
Single-pass connected components analysis (CCA) algorithms suffer from a time overhead to resolve labels at the end of each image row. This work demonstrates how this overhead can be eliminated by replacing the conventional raster scan with a zig-zag scan. This enables label chains to be correctly resolved while processing the next row. The effect is faster processing in the worst case with no end-of-row overheads. CCA hardware architectures using the novel algorithm proposed in this paper are, therefore, able to process images with higher throughput than other state-of-the-art...
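For reference, the zig-zag (boustrophedon) scan simply reverses the traversal direction on every other row, so processing of a new row starts where the previous row ended. A minimal generator of that visit order, with an assumed left-to-right start, is shown below.

```python
# Pixel visit order for a zig-zag (boustrophedon) scan, as opposed to a raster scan:
# even rows left-to-right, odd rows right-to-left. Direction convention is assumed.

def zigzag_scan(height, width):
    """Yield (y, x) coordinates in zig-zag order."""
    for y in range(height):
        xs = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        for x in xs:
            yield (y, x)

print(list(zigzag_scan(3, 4)))
# [(0,0), (0,1), (0,2), (0,3), (1,3), (1,2), (1,1), (1,0), (2,0), (2,1), (2,2), (2,3)]
```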
A memory-efficient architecture for single-pass connected components analysis suited to high-throughput embedded image processing systems is proposed, which achieves this by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows labels associated with objects to be reused. This reduces the amount of on-chip memory by a factor of more than 5 compared to previous work. This is significant, since on-chip memory is a critical resource on FPGAs.
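The slicing idea can be sketched in software by labelling each vertical slice independently and then merging labels that touch across slice boundaries. The flood-fill labelling, the slice count and the 4-connectivity below are illustrative assumptions and do not reflect the paper's streaming architecture or its label reuse mechanism.

```python
# Sketch of vertical-slice partitioning: label each slice independently, then merge
# components that touch across slice boundaries with a union-find.
from collections import deque

def label_slice(pixels):
    """Flood-fill labeling of one slice; returns a label map (0 = background)."""
    h, w = len(pixels), len(pixels[0])
    labels = [[0] * w for _ in range(h)]
    nxt = 1
    for y in range(h):
        for x in range(w):
            if pixels[y][x] and not labels[y][x]:
                q = deque([(y, x)])
                labels[y][x] = nxt
                while q:
                    cy, cx = q.popleft()
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and pixels[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = nxt
                            q.append((ny, nx))
                nxt += 1
    return labels

def count_components_sliced(image, n_slices=2):
    h, w = len(image), len(image[0])
    bounds = [w * i // n_slices for i in range(n_slices + 1)]
    slice_labels, offsets, total = [], [], 0
    for i in range(n_slices):
        lab = label_slice([row[bounds[i]:bounds[i + 1]] for row in image])
        slice_labels.append(lab)
        offsets.append(total)
        total += max((max(r) for r in lab), default=0)
    parent = list(range(total + 1))          # union-find over globally offset labels
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n_slices - 1):            # merge across each slice boundary
        left, right = slice_labels[i], slice_labels[i + 1]
        for y in range(h):
            la, lb = left[y][-1], right[y][0]
            if la and lb:
                ra, rb = find(la + offsets[i]), find(lb + offsets[i + 1])
                if ra != rb:
                    parent[rb] = ra
    roots = {find(l + offsets[i])
             for i in range(n_slices)
             for row in slice_labels[i] for l in row if l}
    return len(roots)

img = [[1, 1, 0, 1],
       [0, 1, 1, 1],
       [1, 0, 0, 0]]
print(count_components_sliced(img, n_slices=2))   # 2 connected components
```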
Gigantic rates of data production in the era of Big Data, the Internet of Things (IoT), and Smart Cyber-Physical Systems (CPS) pose incessantly escalating demands for massive processing, storage, and transmission while continuously interacting with the physical world using edge sensors and actuators. For IoT systems, there is now a strong trend to move intelligence from the cloud to the edge or the extreme edge (known as TinyML). Yet, this shift towards edge AI systems requires the design of powerful machine learning under very strict resource constraints....
The quality of machined surfaces is significantly influenced by machine vibrations caused by the cutting process. Whereas most publications ignore the influence of the tool holder, this paper considers the dynamic behaviour of the whole system consisting of spindle, tool holder and workpiece. Therefore, modal and operational vibration analyses were performed to describe the damping characteristics of two competing tool holder technologies, namely heat-shrink (HS) and hydraulic expansion (HE). It is shown that HE has higher damping rates than HS. Therefore, HE showed...
The calculation of mean, variance and standard deviation is often required for segmentation or feature extraction. In image processing, an integer approximation is adequate. Conventional methods require division and square root operations, which are expensive to realize in hardware in terms of both the amount of resources and the latency. A new class of iterative algorithms is developed based on integer arithmetic. An implementation as an architecture on a Field-Programmable Gate Array (FPGA) is compared with architectures using conventional...
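As a generic illustration of division- and square-root-free integer computation, the sketch below uses a restoring (shift-and-subtract) divider and a bitwise integer square root; these are textbook stand-ins and not necessarily the iterative algorithms proposed in the paper.

```python
# Integer mean, variance and standard deviation using only additions, subtractions,
# comparisons and shifts; generic hardware-friendly stand-ins, not the paper's method.

def int_div(num, den):
    """Restoring (shift-and-subtract) integer division: returns num // den."""
    q, rem = 0, 0
    for bit in range(num.bit_length() - 1, -1, -1):
        rem = (rem << 1) | ((num >> bit) & 1)
        if rem >= den:
            rem -= den
            q |= 1 << bit
    return q

def int_sqrt(n):
    """Bitwise integer square root: largest r with r*r <= n."""
    r, bit = 0, 1 << (n.bit_length() & ~1)
    while bit:
        if n >= r + bit:
            n -= r + bit
            r = (r >> 1) + bit
        else:
            r >>= 1
        bit >>= 2
    return r

def int_stats(pixels):
    n = len(pixels)
    s = sum(pixels)
    s2 = sum(p * p for p in pixels)
    mean = int_div(s, n)
    var = int_div(n * s2 - s * s, n * n)      # integer approximation of the variance
    return mean, var, int_sqrt(var)

print(int_stats([12, 15, 11, 20, 17, 13, 19, 14]))   # (15, 9, 3)
```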
High-performance image analytics is an important challenge for big data processing, as image and video data make up a huge portion of the data generated, e.g., by the tremendous number of sensors worldwide. This paper presents a case study, namely parallel connected component labeling (CCL), which is one of the first steps in image analytics in general. It is shown that a high-performance CCL implementation can be obtained on a heterogeneous platform if parts of the algorithm are processed in a fine-grained manner on a field programmable gate array (FPGA) and a multi-core processor simultaneously. The...
JPEG-LS has a large number of different and independent context sets that provide an opportunity for parallelism. Like JPEG-LS, many lossless image compression standards have "adaptive" error modeling as a core part. This, however, leads to data dependency loops in the coding scheme, such that parallel processing of neighboring pixels is not possible. In this paper, a hardware architecture is proposed in order to achieve parallelism in compression. In the adaptive part of the algorithm, the update of a pixel belonging to a context depends on previous pixels having the same context number. On...
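A simplified model of JPEG-LS context formation makes the dependency visible: each pixel's context is derived from quantized gradients of its causal neighbours, and adaptive state is shared only among pixels with the same context. The thresholds below are the common JPEG-LS defaults (T1=3, T2=7, T3=21); the border handling and the context numbering are simplifications, not the standard's exact 365-context mapping.

```python
# Simplified JPEG-LS-style context computation: pixels sharing a context id share
# (and serially update) the same adaptive state; different ids are independent.

T1, T2, T3 = 3, 7, 21

def quantize_gradient(d):
    """Map a local gradient to one of 9 signed regions, as in JPEG-LS."""
    s = -1 if d < 0 else 1
    d = abs(d)
    if d == 0:
        q = 0
    elif d < T1:
        q = 1
    elif d < T2:
        q = 2
    elif d < T3:
        q = 3
    else:
        q = 4
    return s * q

def context_map(img):
    """Return, per pixel, a context id derived from the causal neighbours a, b, c, d."""
    h, w = len(img), len(img[0])
    ctx = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a = img[y][x - 1] if x > 0 else 0                      # left
            b = img[y - 1][x] if y > 0 else 0                      # above
            c = img[y - 1][x - 1] if y > 0 and x > 0 else 0        # above-left
            d = img[y - 1][x + 1] if y > 0 and x + 1 < w else 0    # above-right
            q1 = quantize_gradient(d - b)
            q2 = quantize_gradient(b - c)
            q3 = quantize_gradient(c - a)
            ctx[y][x] = (q1 * 9 + q2) * 9 + q3                     # simple unique numbering
    return ctx

img = [[52, 55, 61, 66],
       [63, 59, 55, 90],
       [62, 59, 68, 113]]
for row in context_map(img):
    print(row)
```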
End-to-end performance estimation and measurement of deep neural network (DNN) systems become more important with the increasing complexity of DNN systems consisting of hardware and software components. The methodology proposed in this paper aims at a reduced turn-around time for evaluating different design choices of the components of such systems. This reduction is achieved by moving the evaluation from the implementation phase to the concept phase, employing virtual models instead of gathering results from physical prototypes. Deep learning compilers...
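At the concept phase, even a coarse analytical model can rank design choices. The sketch below estimates the latency of a single layer as the maximum of compute time and memory-transfer time (a roofline-style bound); the accelerator figures and the example layer are hypothetical and not taken from the paper.

```python
# Roofline-style pre-implementation latency estimate for one DNN layer:
# latency ~ max(compute time, memory-transfer time), assuming perfect overlap.

def layer_latency_s(macs, bytes_moved, peak_macs_per_s, mem_bw_bytes_per_s):
    compute_time = macs / peak_macs_per_s
    memory_time = bytes_moved / mem_bw_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical 3x3 convolution: 56x56x64 -> 56x56x64, fp16 activations and weights.
macs = 56 * 56 * 64 * 64 * 3 * 3
act_bytes = 2 * (56 * 56 * 64) * 2          # input + output activations, 2 bytes each
wgt_bytes = (3 * 3 * 64 * 64) * 2           # weights, 2 bytes each
t = layer_latency_s(macs, act_bytes + wgt_bytes,
                    peak_macs_per_s=8e12,   # assumed accelerator peak
                    mem_bw_bytes_per_s=50e9)  # assumed memory bandwidth
print(f"estimated latency: {t * 1e6:.1f} us")
```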
A significant challenge in laser drilling is the optimization of process parameters and strategies to achieve high-quality holes. This is further complicated by the fact that quality assessment is a manual, time-consuming task. This paper presents a methodology designed to significantly reduce the effort required for optimizing single-pulse drilling of 0.3 mm thick stainless steel. The objective is to precisely drill holes with an entry diameter of 70 μm and an exit diameter of 20 μm, achieving high roundness. The features of the drilled holes were extracted automatically...
A spatial-domain perceptual image codec based on subsampling and quantization (SSPQ) guided by the just-noticeable distortion (JND) profile is proposed. SSPQ integrates coding and progressive transmission in one framework. The input image is first subsampled by a factor of two in both dimensions and compressed without loss. This provides the basis for predicting the remaining pixels by interpolation and for estimating the JND value of each pixel. Residual thresholds are set to the estimated JND values for perceptually tuned compression. Quantized residuals are progressively...
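A toy version of the SSPQ pipeline can be written in three steps: subsample by two, predict the full-resolution image by interpolation, and quantize the residuals with a dead zone set by a threshold. The constant threshold used below stands in for the per-pixel JND profile that the paper estimates.

```python
# Toy sketch of the SSPQ idea on a grayscale image (nested lists):
# 1) subsample by 2 in both dimensions (lossless base layer),
# 2) predict the full image by bilinear interpolation,
# 3) dead-zone quantize the residuals. The constant threshold is a placeholder
#    for the per-pixel JND values derived in the paper.

def subsample2(img):
    return [row[::2] for row in img[::2]]

def predict_bilinear(small, h, w):
    """Upscale `small` back to h x w with simple bilinear interpolation."""
    sh, sw = len(small), len(small[0])
    pred = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            fy, fx = min(y / 2, sh - 1), min(x / 2, sw - 1)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, sh - 1), min(x0 + 1, sw - 1)
            dy, dx = fy - y0, fx - x0
            top = small[y0][x0] * (1 - dx) + small[y0][x1] * dx
            bot = small[y1][x0] * (1 - dx) + small[y1][x1] * dx
            pred[y][x] = top * (1 - dy) + bot * dy
    return pred

def quantize_residuals(img, pred, jnd=4):
    """Dead-zone quantization: residuals below the (assumed) threshold are dropped."""
    out = []
    for row, prow in zip(img, pred):
        out.append([0 if abs(p - q) < jnd else round((p - q) / (2 * jnd))
                    for p, q in zip(row, prow)])
    return out

img = [[10, 12, 14, 16],
       [12, 30, 16, 18],
       [14, 16, 18, 20],
       [16, 18, 20, 22]]
small = subsample2(img)                  # lossless base layer
pred = predict_bilinear(small, 4, 4)     # interpolation-based prediction
print(quantize_residuals(img, pred))     # sparse, perceptually tuned residuals
```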
The combination of growth in compute capabilities and the availability of large datasets has led to a re-birth of deep learning. Deep Neural Networks (DNNs) have become the state-of-the-art for a variety of machine learning tasks spanning domains across vision, speech, and translation. Deep Learning (DL) achieves high accuracy on these tasks at the expense of hundreds of ExaOps of computation, posing significant challenges for efficient large-scale deployment in both resource-constrained environments and data centers.
The Union-Retire CCA (UR-CCA) algorithm started a new paradigm for connected components analysis. Instead of using directed tree structures, UR-CCA focuses on connectivity. This algorithmic change leads to a reduction in the required memory, with no end-of-row processing overhead. In this paper we describe a hardware architecture based on UR-CCA and its realisation on an FPGA. The resulting memory bandwidth and pipelining challenges are analysed and resolved. It is shown that up to 36% of hardware resources can be saved using the proposed architecture....
A key issue in system design is the lack of communication between hardware, software and domain experts. Recent research shows progress on automatic HW/SW co-design flows for neural accelerators that seems to make this kind of communication obsolete. Most real-world systems, however, are a composition of multiple processing units, networks and memories. The co-design process for (reconfigurable) accelerators is, therefore, an important sub-problem on the way towards a common methodology. The ultimate challenge is to define constraints for the design space exploration...
Recently, automated frameworks have been proposed that map neural networks from a high-level description onto embedded devices, most of them in an end-to-end manner. This paper aims to give an overview of their main characteristics and achievements. A special focus lies on the internal predictions made during design space exploration (DSE) regarding hardware targets (performance, area or power consumption), enabling fast traversal of the individually defined search spaces, especially in early stages. Additionally,...