- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Numerical Methods and Algorithms
- Advanced Neural Network Applications
- Low-power high-performance VLSI design
- Digital Holography and Microscopy
- Cell Image Analysis Techniques
- Digital Filter Design and Implementation
- Image Processing Techniques and Applications
- VLSI and FPGA Design Techniques
- Advanced Memory and Neural Computing
- Advanced Fluorescence Microscopy Techniques
- CCD and CMOS Imaging Sensors
- Cryptography and Residue Arithmetic
- Neural Networks and Applications
- Optical measurement and interference techniques
- Analog and Mixed-Signal Circuit Design
- Advanced Vision and Imaging
- Ultrasound Imaging and Elastography
- Adversarial Robustness in Machine Learning
- Graph Theory and Algorithms
- Distributed and Parallel Computing Systems
- Machine Learning and ELM
- Advancements in PLL and VCO Technologies
- University of Hong Kong, 2016-2025
- University of California, Berkeley, 1979-2024
- Hong Kong Baptist University, 2024
- Bridge University, 2024
- Hong Kong Science and Technology Parks Corporation, 2022
- City University of Hong Kong, 2017-2020
- Chinese University of Hong Kong, 2012-2020
- Princeton University, 2017
- Stanford University, 2017
- Xidian University, 2017
We consider the problem of high-dimensional light field reconstruction and develop a learning-based framework for spatial and angular super-resolution. Many current approaches either require disparity clues or restore the spatial and angular details separately. Such methods have difficulties with non-Lambertian surfaces or occlusions. In contrast, we formulate light field super-resolution (LFSR) as tensor restoration and develop a learning framework based on a two-stage restoration with 4-dimensional (4D) convolution. This allows our model to learn the features capturing the geometry...
New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity.
Recent advances in ultra-high-throughput microscopy have enabled a new generation of cell classification methodologies using image-based cell phenotypes alone. In contrast to current single-cell analysis techniques that rely solely on slow and costly genetic/epigenetic analysis, these image-based analyses allow morphological profiling and screening of thousands or even millions of single cells at a fraction of the cost, and have been proven to demonstrate the statistical significance required for understanding the role of cell heterogeneity in diverse...
Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods...
This paper proposes an open-source hardware Posit Arithmetic Core Generator (PACoGen) for the recently developed universal number posit system, along with a set of pipelined architectures. The posit number system is composed of a run-time varying exponent component, which is defined by the composition of a varying-length "regime-bit" field and an "exponent-bit" field (with a maximum size of ES bits, the exponent size). This in effect also makes the fraction part vary in size and position. These variations pose an interesting hardware design challenge for posit arithmetic. Being at an infant stage of its...
Cellular biophysical properties are the effective label‐free phenotypes indicative of differences in cell types, states, and functions. However, current biophysical phenotyping methods largely lack the throughput and specificity required in the majority of cell‐based assays that involve large‐scale single‐cell characterization for inquiring the inherently complex heterogeneity in many biological systems. Further confounded by the lack of reported robust reproducibility and quality control, widespread adoption of biophysical cytometry in mainstream cytometry...
This paper explores the design and implementation of BORPH, an operating system designed for FPGA-based reconfigurable computers. Hardware designs execute as normal UNIX processes under BORPH, having access to standard OS services, such as file system support. Hardware and software components of user designs may, therefore, run as communicating processes within BORPH's runtime environment. The familiar, language-independent UNIX kernel interface facilitates easy design reuse and rapid application development. To develop hardware designs, a Simulink-based design flow...
Medical ultrasound imaging stands out from other modalities in providing real-time diagnostic capability at an affordable price while being physically portable. This article explores the suitability of using GPUs as the primary signal and image processors for future medical ultrasound imaging systems. A case study on synthetic aperture (SA) imaging illustrates the promise of using high-performance GPUs in such...
Digital holographic imaging is a powerful technique that can provide wavefront information of a three-dimensional object for biological and industrial applications. However, due to the constraint of the cost of sensors, the acquired digital hologram is limited in terms of pixel count, thus affecting the resolution of the reconstruction. To overcome this constraint, in this paper we propose a deep learning-based method to super-resolve holograms and improve the quality of low-resolution holograms by training a convolutional neural network with large-scale data...
Time-stretch imaging has been regarded as an attractive technique for high-throughput imaging flow cytometry primarily owing to its real-time, continuous ultrafast operation. Nevertheless, two key challenges remain: (1) sufficiently high time-stretch image resolution and contrast is needed for visualizing the sub-cellular complexity of single cells, and (2) the ability to unravel the heterogeneity and complexity of the highly diverse population of cells - a central problem of single-cell analysis in life sciences - is required. We here demonstrate...
The posit number system format includes a run-time varying exponent component, defined by a combination of regime-bit (with run-time varying length) and exponent-bit (with size of up to ES bits, the exponent size). This also leads to a run-time variation in its mantissa field size and position. This run-time variation in the posit format poses a hardware design challenge. Being a recent development, posit lacks adequate hardware arithmetic architectures. Thus, this paper is aimed towards the algorithmic development of posit arithmetic and their generic hardware generator. It is focused on basic posit arithmetic (floating-point to posit conversion, posit to floating point...
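The run-time varying regime, exponent, and fraction fields described in the posit abstracts can be illustrated with a small software decoder. This is a minimal sketch under the usual posit definition (sign bit, regime run, up to ES exponent bits, remaining fraction bits); the function name `decode_posit` and its parameters are hypothetical and are not taken from PACoGen or the papers listed here.

```python
from fractions import Fraction


def decode_posit(bits: int, n: int = 8, es: int = 0) -> Fraction:
    """Decode an n-bit posit, given as an unsigned integer, into an exact Fraction."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        raise ValueError("NaR (not a real) has no numeric value")

    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask  # negative posits are stored in two's complement

    rem = n - 1                          # bits after the sign bit
    body = bits & ((1 << rem) - 1)

    # Regime: a run of identical bits, ended by the opposite bit or the word end.
    first = (body >> (rem - 1)) & 1
    run = 0
    i = rem - 1
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                               # skip the terminating bit, if present

    remaining = max(i + 1, 0)            # bits left for exponent and fraction
    tail = body & ((1 << remaining) - 1)
    exp_bits = min(es, remaining)        # the exponent field may be cut short
    frac_bits = remaining - exp_bits
    exponent = (tail >> frac_bits) << (es - exp_bits)  # missing low bits are 0
    fraction = tail & ((1 << frac_bits) - 1)

    scale = k * (1 << es) + exponent
    value = (1 + Fraction(fraction, 1 << frac_bits)) * Fraction(2) ** scale
    return -value if negative else value
```

Because the regime length is only known after scanning the word, the exponent and fraction boundaries move from value to value, which is the variation the abstracts identify as the hardware design challenge. For example, `decode_posit(0b01001000, 8, 0)` decodes a one-bit regime followed by a five-bit fraction.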
The association of the intrinsic optical and biophysical properties of cells to homeostasis and pathogenesis has long been acknowledged. Defining these label-free cellular features obviates the need for costly and time-consuming labelling protocols that perturb the living cells. However, wide-ranging applicability of such label-free cell-based assays requires sufficient throughput, statistical power and sensitivity that are unattainable with current technologies. To close this gap, we present a large-scale, integrative imaging flow...
A capsule network, as an advanced technique in deep learning, is designed to overcome information loss in the pooling operation and internal data representation of a convolutional neural network (CNN). It has shown promising results in several applications, such as digit recognition and image segmentation. In this work, we investigate for the first time the use of a capsule network in digital holographic reconstruction. The proposed residual encoder-decoder capsule network, which we call RedCap, uses a novel windowed spatial dynamic routing algorithm and residual capsule block,...
Low bitwidth integer arithmetic has been widely adopted in hardware implementations of deep neural network inference applications. However, despite the promised energy-efficiency improvements in demanding edge applications, the use of low bitwidth integer arithmetic for neural network training remains limited. Unlike inference, training demands high dynamic range and numerical accuracy for high quality results, making the use of low-bitwidth integer arithmetic particularly challenging. To address this challenge, we present a novel neural network training framework called NITI that exclusively utilizes low bitwidth integer arithmetic....
Laser speckle imaging (LSI) is a powerful tool for motion analysis owing to the high sensitivity of laser speckles. Traditional LSI techniques rely on identifying changes from the sequential intensity speckle patterns, where each pixel performs synchronous measurements. However, a lot of redundant data of the static speckles without motion information in the scene will also be recorded, resulting in considerable resources consumption for data processing and storage. Moreover, the motion cues are inevitably lost during the "blind" time interval between...
Visual sensors, including 3D light detection and ranging, neuromorphic dynamic vision sensor, and conventional frame cameras, are increasingly integrated into edge-side intelligent machines. However, their data are heterogeneous, causing complexity in system development. Moreover, conventional digital hardware is constrained by the von Neumann bottleneck and the physical limit of transistor scaling. The computational demands of training ever-growing models further exacerbate these challenges. We propose a hardware-software...
Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast amount of linear operations needed due to their sizes, modern transformer models are increasingly reliant on precise non-linear computations that make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address this need to accelerate both linear and non-linear operations in a unified and programmable...
This paper presents a hw/sw codesign methodology based on BORPH, an operating system designed for FPGA-based reconfigurable computers (RCs). By providing native kernel support for FPGA hardware, BORPH offers a homogeneous UNIX interface for both software and hardware processes. Hardware processes inherit the same level of service from the kernel, such as file system support, as typical UNIX software processes. Hardware and software components of a design therefore run as hardware and software processes within BORPH's run-time environment. The familiar and language-independent UNIX kernel interface facilitates easy design reuse...
The success of future intelligent power delivery and transmission systems across the globe relies critically on the availability of a fast, scalable, and most importantly secure communication infrastructure between the energy producers and consumers. One major obstacle to ensuring secure communication among the various parties in a smart grid network hinges on the technical and implementation difficulties associated with key distribution in such a large-scale network with often-time disinterested consumers. This paper proposes the use of an identity-based signcryption (IBS) system...
Ternary content-addressable memories (TCAMs) are high speed memories; however, compared to static random-access memories (SRAMs), TCAMs suffer from low storage density, relatively slow access time, poor scalability, complexity in circuitry, and higher cost. To access the benefits of SRAM, several SRAM-based TCAMs, specifically on field-programmable gate array (FPGA) platforms, were proposed. To further improve the performance of SRAM-based TCAMs, this paper presents UE-TCAM, which reduces memory requirement, latency, power...
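For readers unfamiliar with ternary matching, the lookup behavior that any TCAM provides can be modeled in a few lines of software. This is a minimal sketch of generic TCAM semantics only, assuming each entry is a (value, care-mask) pair and that the lowest matching index wins; it does not represent the UE-TCAM architecture, and `tcam_lookup` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple

# Each entry is (value, care_mask): a 1 in care_mask means "compare this bit",
# a 0 means "don't care" (the ternary X state).
Entry = Tuple[int, int]


def tcam_lookup(table: Sequence[Entry], key: int) -> Optional[int]:
    """Return the index of the first entry matching key, or None if none match."""
    for index, (value, care) in enumerate(table):
        # Bits that differ between key and value must all be don't-care bits.
        if (key ^ value) & care == 0:
            return index
    return None
```

A hardware TCAM compares the key against all entries in parallel; the sequential loop here only reproduces the result of that comparison with a priority encoder, not its timing.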
The design and implementation of the k-means clustering algorithm on an FPGA-accelerated computer cluster is presented. The implementation followed the Map-Reduce programming model, with both the map and reduce functions executing autonomously to the CPU on multiple FPGAs. A hardware/software framework was developed to manage gateware execution on multiple FPGAs across the cluster. Using this k-means implementation as an example, a system-level tradeoff study between computation and I/O performance in the target multi-FPGA execution environment was performed. When compared to a similar software implementation executing over...
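The Map-Reduce decomposition of k-means mentioned in this abstract can be sketched in plain software. This is an illustrative sketch only, assuming the data is split into shards, each map step emits per-centroid partial sums, and a single reduce step merges them; the names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical and say nothing about the gateware described in the paper.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[List[float], int]]  # centroid index -> (sums, count)


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def map_phase(points: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Map: assign each point in one shard and accumulate partial sums."""
    partial: Partial = {}
    for point in points:
        c = nearest(point, centroids)
        sums, count = partial.get(c, ([0.0] * len(point), 0))
        partial[c] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def reduce_phase(partials: Sequence[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Reduce: merge partial sums from all shards into new centroids."""
    totals: Partial = {}
    for partial in partials:
        for c, (sums, count) in partial.items():
            acc, n = totals.get(c, ([0.0] * len(sums), 0))
            totals[c] = ([a + s for a, s in zip(acc, sums)], n + count)
    new_centroids = list(centroids)  # centroids with no points stay unchanged
    for c, (sums, count) in totals.items():
        new_centroids[c] = tuple(s / count for s in sums)
    return new_centroids


def kmeans(shards: Sequence[Sequence[Point]], centroids: Sequence[Point],
           iterations: int = 10) -> List[Point]:
    """Run a fixed number of map/reduce rounds over the sharded data."""
    current = list(centroids)
    for _ in range(iterations):
        partials = [map_phase(shard, current) for shard in shards]
        current = reduce_phase(partials, current)
    return current
```

Only the small per-centroid sums cross from map to reduce, so the traffic between stages is independent of shard size; this is one reason the pattern lends itself to a study of computation versus I/O tradeoffs.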
This paper is aimed towards the hardware architecture aspect of a recently proposed posit number system under type-3 unum (universal number system). Here, an algorithmic flow for the posit addition/subtraction arithmetic is developed and its hardware architecture is designed. Compared to floating point, posit provides better dynamic range and accuracy over the same word size, along with more accurate and exact arithmetic support. Posit format includes a run-time varying exponent component, provided by a combination of regime-bits (of run-time varying length) and exponent-bits (of size up to ES...
The use of FPGAs as compute accelerators has been demonstrated by numerous researchers as an effective solution to meet the performance requirement across many application domains. However, the design productivity of developing FPGA accelerators remains much lower compared to a typical software development flow. Although the use of high-level tools may partly alleviate this shortcoming, the lengthy low-level FPGA implementation process including synthesis, placing and routing still dramatically limits the number of compile-debug-edit cycles...