- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Advanced Image Processing Techniques
- Video Surveillance and Tracking Methods
- Visual Attention and Saliency Detection
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Image and Signal Denoising Methods
- Face Recognition and Analysis
- Computer Graphics and Visualization Techniques
- Anomaly Detection Techniques and Applications
- Video Analysis and Summarization
- Advanced Image Fusion Techniques
- Medical Image Segmentation Techniques
- Image Processing Techniques and Applications
- Face and Expression Recognition
- Digital Media Forensic Detection
- Robotics and Sensor-Based Localization
- Face Recognition and Perception
- Natural Language Processing Techniques
- Biometric Identification and Security
Singapore Management University
2022-2025
South China University of Technology
2016-2024
Hong Kong University of Science and Technology
2021
University of Hong Kong
2021
City University of Hong Kong
2012-2016
Macau University of Science and Technology
2011
This paper presents a novel locality sensitive histogram algorithm for visual tracking. Unlike the conventional image histogram that counts the frequency of occurrences of each intensity value by adding ones to the corresponding bin, a locality sensitive histogram is computed at each pixel location, and a floating-point value is added to the corresponding bin for each occurrence of an intensity value. The floating-point value declines exponentially with respect to the distance to the pixel location where the histogram is computed; thus every pixel is considered, but those that are far away can be neglected due to the very small weights assigned. An efficient algorithm is proposed that enables the histograms to be computed in...
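The weighting described in the abstract above can be illustrated with a minimal one-dimensional sketch. It assumes pixel values are already quantized to bin indices and uses a decay parameter `alpha` in (0, 1); the function names, the 1D setting, and the two-pass recursion are assumptions made for illustration, since the excerpt is truncated before it describes the efficient algorithm.

```python
def locality_sensitive_histogram(bins, num_bins, alpha):
    """One exponentially weighted histogram per pixel of a 1D signal.

    bins[p] is the (hypothetical) pre-quantized bin index of pixel p.
    Pixel q contributes alpha ** |p - q| to the histogram at pixel p.
    """
    n = len(bins)
    left = [[0.0] * num_bins for _ in range(n)]
    right = [[0.0] * num_bins for _ in range(n)]

    # Left-to-right pass: reuse the previous histogram, decayed by alpha.
    for p in range(n):
        if p > 0:
            left[p] = [alpha * v for v in left[p - 1]]
        left[p][bins[p]] += 1.0

    # Right-to-left pass, same recursion in the other direction.
    for p in range(n - 1, -1, -1):
        if p < n - 1:
            right[p] = [alpha * v for v in right[p + 1]]
        right[p][bins[p]] += 1.0

    # Pixel p is counted in both passes, so subtract it once.
    hist = []
    for p in range(n):
        row = [l + r for l, r in zip(left[p], right[p])]
        row[bins[p]] -= 1.0
        hist.append(row)
    return hist


def brute_force_histogram(bins, num_bins, alpha):
    """Reference version that sums the weights directly (quadratic time)."""
    n = len(bins)
    hist = [[0.0] * num_bins for _ in range(n)]
    for p in range(n):
        for q in range(n):
            hist[p][bins[q]] += alpha ** abs(p - q)
    return hist
```

The two-pass version visits each pixel twice and does work proportional to the number of bins per visit, whereas the brute-force version compares every pair of pixels; both return the same weights.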
In this paper, we present a real-time salient object detection system based on the minimum spanning tree. Due to the fact that background regions are typically connected to the image boundaries, salient objects can be extracted by computing the distances to the boundaries. However, measuring the image boundary connectivity efficiently is a challenging problem. Existing methods either rely on superpixel representation to reduce the processing units or approximate the distance transform. Instead, we propose an exact and iteration free solution on a minimum spanning tree. The...
Vision-based vehicle detection approaches have achieved incredible success in recent years with the development of deep convolutional neural networks (CNNs). However, existing CNN-based algorithms suffer from the problem that the convolutional features are scale-sensitive in the object detection task, while it is common that traffic images and videos contain vehicles with a large variance of scales. In this paper, we delve into the source of scale sensitivity, and reveal two key issues: 1) RoI pooling destroys the structure of small scale objects, 2) the large intra-class distance for...
Shadow removal is a challenging task as it requires the detection/annotation of shadows as well as semantic understanding of the scene. In this paper, we propose an automatic and end-to-end deep neural network (DeshadowNet) to tackle these problems in a unified manner. DeshadowNet is designed with a multi-context architecture, where the output shadow matte is predicted by embedding information from three different perspectives. The first global network extracts shadow features from a global view. Two levels of features are derived from the global network and transferred to two parallel...
We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a frame-by-frame basis, which are not applicable to many video analytic tasks where spatio-temporal features are prevailing. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal...
Numerous efforts have been made to design different low level saliency cues for RGBD saliency detection, such as color or depth contrast features, and background and color compactness priors. However, how these saliency cues interact with each other and how to incorporate them effectively to generate a master saliency map remain a challenging problem. In this paper, we design a new convolutional neural network (CNN) to fuse different low level saliency cues into hierarchical features for automatically detecting salient objects in RGBD images. In contrast to the existing works that directly feed raw image pixels to the CNN,...
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually designate similar receptive fields for each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer in capturing multi-scale features, thereby leading to performance degradation in handling images with multiple...
Considering the ill-posed nature, contrastive regularization has been developed for single image dehazing, introducing the information from negative images as a lower bound. However, the contrastive samples are non-consensual, as the negatives are usually represented distantly from the clear (i.e., positive) image, leaving the solution space still under-constricted. Moreover, the interpretability of deep dehazing models is underexplored towards the physics of the hazing process. In this paper, we propose a novel curricular contrastive regularization targeted at a consensual...
Due to the lack of paired data, the training of image reflection removal relies heavily on synthesizing reflection images. However, existing methods model reflection as a linear combination model, which cannot fully simulate real-world scenarios. In this paper, we inject non-linearity into reflection removal from two aspects. First, instead of synthesizing reflection with a fixed combination factor or kernel, we propose to synthesize reflection images by predicting a non-linear alpha blending mask. This enables a free combination of different blurry kernels, leading to a controllable and diverse reflection synthesis. Second,...
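The contrast drawn in the abstract above, between a fixed combination factor and a per-pixel alpha blending mask, can be sketched as follows. This is only an illustration under stated assumptions: images are small nested lists of floats, the function names are hypothetical, and the mask is written by hand here, whereas the abstract describes predicting it.

```python
def blend_linear(transmission, reflection, factor):
    """Linear model: one fixed scalar factor mixes the two layers everywhere."""
    return [
        [factor * t + (1.0 - factor) * r for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(transmission, reflection)
    ]


def blend_with_mask(transmission, reflection, mask):
    """Alpha blending: every pixel has its own mixing weight in [0, 1]."""
    return [
        [m * t + (1.0 - m) * r for t, r, m in zip(t_row, r_row, m_row)]
        for t_row, r_row, m_row in zip(transmission, reflection, mask)
    ]


# Hypothetical 1x2 images: the mask keeps the first pixel reflection-free
# and mixes the second pixel half and half.
transmission = [[1.0, 1.0]]
reflection = [[0.0, 0.5]]
mask = [[1.0, 0.5]]
synthesized = blend_with_mask(transmission, reflection, mask)
```

With a constant mask the second function reduces to the linear model, which is why a spatially varying mask is the more general of the two.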
We propose a two-stage method for face hallucination. First, we generate facial components of the input image using CNNs. These components represent the basic facial structures. Second, we synthesize fine-grained facial structures from high resolution training images. The details of these structures are transferred into the facial components for enhancement. Therefore, we generate facial components to approximate the ground truth global appearance in the first stage and enhance them through recovering details in the second stage. The experiments demonstrate that our method performs favorably against state-of-the-art methods.
Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass, and the content within the glass region is typically similar to that behind it. In this paper, we propose an important problem of detecting glass from a single RGB image. To address this problem, we construct a large-scale glass detection dataset (GDD) and design...
Unsupervised video object segmentation (UVOS) aims at segmenting the primary objects in videos without any human intervention. Due to the lack of prior knowledge about the primary objects, identifying them from videos is the major challenge of UVOS. Previous methods often regard the moving objects as primary ones and rely on optical flow to capture the motion cues in videos, but the flow information alone is insufficient to distinguish the primary objects from the background objects that move together. This is because, when the noisy motion features are combined with the appearance features, the localization of the primary objects is misguided. To...
HD map reconstruction is crucial for autonomous driving. LiDAR-based methods are limited due to the deployed expensive sensors and time-consuming computation. Camera-based methods usually need to separately perform road segmentation and view transformation, which often causes distortion and the absence of content. To push the limits of the technology, we present a novel framework that enables reconstructing a local map formed by road layout and vehicle occupancy in the bird's-eye view given a front-view monocular image only. In particular, we propose...
We present a novel high-resolution face swapping method using the inherent prior knowledge of a pre-trained GAN model. Although previous research can leverage generative priors to produce high-resolution results, their quality can suffer from the entangled semantics of the latent space. We explicitly disentangle the latent semantics by utilizing the progressive nature of the generator, deriving structure attributes from the shallow layers and appearance attributes from the deeper ones. Identity and pose information within the structure attributes are further separated by introducing a landmark-driven...
Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart, speech recognition, one potential scheme is to distill knowledge from a teacher pretrained by audio signals. However, the latent domain gap between the cross-modal data could lead to a learning ambiguity and thus limit the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading, and two aspects of issues are considered: 1) the teacher should understand...
The presence of non-homogeneous haze can cause scene blurring, color distortion, low contrast, and other degradations that obscure texture details. Existing homogeneous dehazing methods struggle to handle the non-uniform distribution of haze in a robust manner. The crucial challenge of non-homogeneous dehazing is to effectively extract the non-uniform distribution features and reconstruct the details of hazy areas with high quality. In this paper, we propose a novel self-paced semi-curricular attention network, called SCANet, for non-homogeneous image dehazing that focuses on enhancing haze-occluded...
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully...
The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model to be deep enough to have a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer...
RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. A key challenge lies in bridging the inherent disparities between RGB and Thermal modalities for effective saliency map prediction. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not have adequately considered the robustness against noise originating from defective modalities, thereby leading to suboptimal...
In this paper, we propose a deep CNN to tackle the image restoration problem by learning the structured residual. Previous deep learning based methods directly learn the mapping from corrupted images to clean images, and may suffer from the gradient exploding/vanishing problems of deep neural networks. We address the image restoration problem by learning the structured details and recovering the latent clean image together, from the shared information between the corrupted image and the latent image. In addition, instead of learning the pure difference (corruption), we add a "residual formatting layer" to format the residual information, which allows the network to converge faster...
Subitizing (i.e., instant judgement on the number) and detection of salient objects are human inborn abilities. These two tasks influence each other in the human visual system. In this paper, we delve into the complementarity of these two tasks. We propose a multi-task deep neural network with weight prediction for salient object detection, where the parameters of an adaptive weight layer are dynamically determined by an auxiliary subitizing network. The numerical representation of salient objects is therefore embedded into the spatial representation. The proposed joint...
Crowd counting is challenging due to unconstrained imaging factors, e.g., background clutters, non-uniform distribution of people, large scale and perspective variations. Dealing with these problems using deep neural networks requires rich prior knowledge and multi-scale contextual representations. In this paper, we propose a Cross-stage Refinement Network (CRNet) that can refine predicted density maps progressively based on hierarchical multi-level density priors. In particular, CRNet is composed of several...