- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Visual Attention and Saliency Detection
- Anomaly Detection Techniques and Applications
- Video Surveillance and Tracking Methods
- Machine Learning and Data Classification
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Medical Image Segmentation Techniques
- Music and Audio Processing
- Image Retrieval and Classification Techniques
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Face Recognition and Analysis
- Speech and Audio Processing
- Face Recognition and Perception
- Speech and Dialogue Systems
- Water Systems and Optimization
- Image and Video Quality Assessment
- Brain Tumor Detection and Classification
- Target Tracking and Data Fusion in Sensor Networks
- Adversarial Robustness in Machine Learning
- Radiology Practices and Education
- Olfactory and Sensory Function Studies
Hong Kong University of Science and Technology
2025
University of Hong Kong
2025
Singapore Management University
2023-2024
South China University of Technology
2020-2023
Microsoft Research Asia (China)
2023
National University of Singapore
2022-2023
ShangHai JiAi Genetics & IVF Institute
2021-2022
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually designate similar receptive fields for each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with multiple...
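The multi-scale limitation described above can be made concrete with a toy sketch in which each attention "head" attends over keys/values average-pooled at a different rate, so different heads get different effective receptive fields within one layer. This is a minimal NumPy illustration of the general idea only, not the model proposed in the paper; `multiscale_attention` and the pooling scheme are our own illustrative choices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_attention(tokens, pool_rates):
    """Toy multi-scale self-attention: each rate r pools keys/values over
    groups of r tokens, so the corresponding head sees a coarser scale."""
    n, d = tokens.shape
    heads = []
    for r in pool_rates:
        m = n // r
        kv = tokens[: m * r].reshape(m, r, d).mean(axis=1)  # pooled keys/values
        attn = softmax(tokens @ kv.T / np.sqrt(d))          # (n, m) attention map
        heads.append(attn @ kv)                             # per-scale output
    return np.concatenate(heads, axis=-1)                   # concatenate scales
```

With `pool_rates=[1, 2, 4]`, fine and coarse scales coexist in a single layer instead of every token sharing one fixed receptive field.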
Unsupervised video object segmentation (UVOS) aims at segmenting the primary objects in videos without any human intervention. Due to the lack of prior knowledge about the objects, identifying them from videos is a major challenge in UVOS. Previous methods often regard moving objects as the primary ones and rely on optical flow to capture the motion cues in videos, but motion information alone is insufficient to distinguish the primary objects from background objects that move together with them. This is because, when the noisy motion features are combined with appearance features, the localization of the primary objects is misguided. To...
In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations. A potential issue of this idea is the existence of completely collapsed solutions (i.e., constant features), which are typically avoided implicitly by carefully chosen implementation details. In this work, we study a relatively concise framework containing the components from recent approaches. We verify the existence of complete collapse and discover another reachable...
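Complete collapse (constant features) is easy to detect in practice: after L2-normalizing the features, the per-dimension standard deviation across the batch drops to zero. A small diagnostic sketch (our own illustration, not the framework studied in the paper):

```python
import numpy as np

def collapse_metric(features, eps=1e-8):
    """Mean per-dimension std of L2-normalized features across a batch.
    A value near zero means every sample maps to (almost) the same
    vector, i.e., the representation has completely collapsed."""
    z = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    return float(z.std(axis=0).mean())
```

Constant features score near zero; healthy, spread-out features score well above it.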
Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart speech recognition, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data could lead to learning ambiguity and thus limits the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading, in which two aspects of the issue are considered: 1) the teacher should understand...
Vision Transformer has demonstrated impressive success across various vision tasks. However, its heavy computation cost, which grows quadratically with respect to the token sequence length, largely limits its power in handling large feature maps. To alleviate the computation cost, previous works rely on either fine-grained self-attention restricted to small local regions, or global self-attention with a shortened sequence length, resulting in coarse granularity. In this paper, we propose a novel model, termed Self-guided Transformer (SG-Former), towards...
The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model to be deep enough to obtain a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, into salient object detection to ensure the globalization of representations from shallow layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in the final saliency maps. Besides, as each layer...
The inductive bias of vision transformers is more relaxed, so they cannot work well with insufficient data. Knowledge distillation is thus introduced to assist the training of transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, in this paper, we delve into the influence of models with different inductive biases on the distilled knowledge (e.g., convolution and involution). Our key observation is that teacher accuracy is not the dominant reason for student accuracy, but the teacher's inductive bias is important. We demonstrate that lightweight teachers with different...
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of MIM-based pre-trained large models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc.,...
Data mixing (e.g., Mixup, Cutmix, ResizeMix) is an essential component for advancing recognition models. In this paper, we focus on studying its effectiveness in the self-supervised setting. By noticing that mixed images that share the same source images are intrinsically related to each other, we hereby propose SDMP, short for Simple Data Mixing Prior, to capture this straightforward yet essential prior, and position such mixed images as additional positive pairs to facilitate representation learning. Our experiments verify that the proposed SDMP enables data...
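The core mechanism, treating mixed images that share a source as additional positive pairs, can be sketched as follows. This is a hedged illustration of the mixing step only (the function name is ours, and the real SDMP operates inside a full contrastive pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_pairs(batch, alpha=1.0):
    """Mix each image with a randomly chosen partner.  Returns the mixed
    batch plus the partner indices, so any two mixed samples that share
    a source image can be wired up as extra positive pairs downstream."""
    lam = rng.beta(alpha, alpha)            # mixing coefficient
    perm = rng.permutation(len(batch))      # partner assignment
    mixed = lam * batch + (1 - lam) * batch[perm]
    return mixed, perm, lam
```

The returned `perm` is what makes the prior usable: sample `i` and sample `perm[i]` share a source image, so their mixed views are related rather than independent negatives.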
Vision Transformer shows great superiority in medical image segmentation due to its ability to learn long-range dependency. For segmenting 3-D data, such as computed tomography (CT), existing methods can be broadly classified into 2-D-based and 3-D-based methods. One key limitation is that the intra-slice information is ignored by the former, while the latter suffers from high computation cost and memory consumption, resulting in a limited feature representation for inner-slice information. During clinical examination, radiologists primarily use the axial...
Deep supervision, which involves extra supervisions to the intermediate features of a neural network, was widely used in image classification in the early deep learning era, since it significantly reduces the training difficulty and eases the optimization, e.g., by avoiding gradient vanishing compared with vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM) that pre-trains a Vision Transformer (ViT)...
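In its classic form, deep supervision simply adds down-weighted losses on predictions read out from intermediate layers, so shallow layers receive a direct gradient signal. A minimal sketch with mean-squared error (the loss choice and `aux_weight` value are illustrative assumptions, not the paper's recipe):

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def deep_supervised_loss(intermediate_preds, final_pred, target, aux_weight=0.3):
    """Final-layer loss plus down-weighted auxiliary losses computed on
    predictions read out from intermediate layers."""
    loss = mse(final_pred, target)
    for p in intermediate_preds:
        loss += aux_weight * mse(p, target)   # direct signal to shallow layers
    return loss
```

Each auxiliary term shortens the gradient path to its layer, which is what made this trick useful before normalization and residual connections became standard.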
Crowd image is arguably one of the most laborious data to annotate. In this paper, we aim to reduce the massive demand for densely labeled crowd data, and propose a novel weakly-supervised setting, in which we leverage the binary ranking of two images with high-contrast crowd counts as training guidance. To enable training under this new setting, we convert the crowd count regression problem into a ranking potential prediction problem. In particular, we tailor a Siamese Ranking Network that predicts the potential scores of two images, indicating the ordering of their counts. Hence, the ultimate goal is to assign...
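The ranking formulation needs no exact counts: the network only has to score the higher-count image above the lower-count one, which a standard margin (hinge) ranking loss enforces. A one-function sketch (the margin value is an illustrative assumption):

```python
def ranking_loss(score_hi, score_lo, margin=1.0):
    """Hinge loss: zero once the image known to contain more people is
    scored at least `margin` above its lower-count partner."""
    return max(0.0, margin - (score_hi - score_lo))
```

For a correctly ordered pair, `ranking_loss(3.0, 1.0)` returns `0.0`, while a reversed pair `ranking_loss(1.0, 3.0)` is penalized with `3.0`.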
Due to domain shift, a large performance drop is usually observed when a trained crowd counting model is deployed in the wild. While existing domain-adaptive crowd counting methods achieve promising results, they typically regard each crowd image as a whole and reduce domain discrepancies in a holistic manner, thus limiting further improvement of the adaptation performance. To this end, we propose to untangle the domain-invariant crowd and the domain-specific background from crowd images, and design a fine-grained domain adaption method for crowd counting. Specifically,...
The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and multimodal data poses an interesting problem: how to transfer a pre-trained unimodal network to perform the same task with extra multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework to effectively utilize multimodal data without requiring labels. Opposite to traditional knowledge distillation,...
This paper tackles the task of Few-Shot Video Object Segmentation (FSVOS), i.e., segmenting objects in query videos whose class is specified by a few labeled support images. The key is to model the relationship between the query videos and the support images for propagating the object information. This many-to-many problem often relies on full-rank attention, which is computationally intensive. In this paper, we propose a novel Domain Agent Network (DAN), breaking down the full-rank attention into two smaller ones. We consider one single frame of the query video...
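Breaking one large attention into two smaller ones can be sketched by routing through a small set of agent tokens: the agents first gather from the support set, then the queries read from the agents, replacing one n_q-by-n_s map with an n_a-by-n_s map plus an n_q-by-n_a map. The sketch below is our own toy version of this decomposition, not DAN itself:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(query, support, agents):
    """Two-stage attention through a few agent tokens: cost drops from
    O(n_q * n_s) to O(n_a * n_s + n_q * n_a) when n_a is small."""
    d = query.shape[-1]
    gathered = softmax(agents @ support.T / np.sqrt(d)) @ support  # agents <- support
    return softmax(query @ agents.T / np.sqrt(d)) @ gathered       # queries <- agents
```

With, say, 3 agent tokens, 10 query tokens, and 6 support tokens, the two attention maps are 3x6 and 10x3 instead of a single 10x6 full-rank map; the saving grows with sequence length.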
Labeling is onerous for crowd counting, as it requires annotating each individual in crowd images. Recently, several semi-supervised methods have been proposed to reduce the labeling effort. Given a limited labeling budget, they typically select a few crowd images and densely label all individuals in each of them. Despite promising results, we argue this None-or-All labeling strategy is suboptimal, as the individuals in a densely labeled image usually appear similar, while the massive unlabeled images may contain entirely diverse individuals. To this end, we propose to break the chain of the previous...
Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications. To achieve knowledge transfer across modalities, a pretrained network from one modality is adopted as the teacher to provide supervision signals to a student network learning from another modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin with two case studies and demonstrate that...
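The supervision signal in (crossmodal) KD is typically a KL divergence between temperature-softened teacher and student distributions; only the teacher's modality changes, not the loss itself. A minimal sketch (the temperature value is an illustrative assumption):

```python
import numpy as np

def soften(logits, temperature):
    """Temperature-scaled softmax over a 1-D logit vector."""
    e = np.exp((logits - logits.max()) / temperature)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions; in
    crossmodal KD the teacher logits come from a network trained on a
    different modality than the student's input."""
    p = soften(teacher_logits, temperature)
    q = soften(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

A higher temperature exposes more of the teacher's "dark knowledge" in the non-argmax classes, which is the part of the signal distillation exploits.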
Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper, we explore the effect various design choices have on applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged...
Integrating low-level edge features has been proven to be effective in preserving clear boundaries of salient objects. However, the locality of edge features makes it difficult to capture globally salient edges, leading to distraction in the final predictions. To address this problem, we propose to produce distraction-free edge features by incorporating cross-scale holistic interdependencies between high-level features. In particular, we first formulate our edge feature extraction process as a boundary-filling problem. In this way, we enforce the edge features to focus on closed boundaries instead of those...