Sucheng Ren

ORCID: 0000-0003-4730-8435
Research Areas
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image and Video Retrieval Techniques
  • Visual Attention and Saliency Detection
  • Anomaly Detection Techniques and Applications
  • Video Surveillance and Tracking Methods
  • Machine Learning and Data Classification
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Medical Image Segmentation Techniques
  • Music and Audio Processing
  • Image Retrieval and Classification Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Image Enhancement Techniques
  • Face recognition and analysis
  • Speech and Audio Processing
  • Face Recognition and Perception
  • Speech and dialogue systems
  • Water Systems and Optimization
  • Image and Video Quality Assessment
  • Brain Tumor Detection and Classification
  • Target Tracking and Data Fusion in Sensor Networks
  • Adversarial Robustness in Machine Learning
  • Radiology practices and education
  • Olfactory and Sensory Function Studies

Hong Kong University of Science and Technology
2025

University of Hong Kong
2025

Singapore Management University
2023-2024

South China University of Technology
2020-2023

Microsoft Research Asia (China)
2023

National University of Singapore
2022-2023

ShangHai JiAi Genetics & IVF Institute
2021-2022

Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually designate similar receptive fields for each token feature within a layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with multiple...

10.1109/cvpr52688.2022.01058 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
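The long-range token mixing the abstract refers to can be illustrated with a minimal single-head scaled dot-product self-attention pass. This is a generic sketch, not the paper's multi-scale variant; the random matrices stand in for learned projections.

```python
import numpy as np

def self_attention(tokens, rng=None):
    """Single-head scaled dot-product self-attention over image tokens.
    Every token attends to every other token, which is what gives the
    layer its global (long-range) receptive field."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = tokens.shape
    # Random stand-ins for the learned query/key/value projections.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V                               # (n, d) mixed token features

tokens = np.random.default_rng(1).standard_normal((16, 8))  # 16 patch tokens, dim 8
out = self_attention(tokens)
print(out.shape)  # (16, 8)
```

The fixed per-layer attention pattern is exactly the constraint the paper targets: every token aggregates over the same token set at the same granularity.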

Unsupervised video object segmentation (UVOS) aims at segmenting the primary objects in videos without any human intervention. Due to the lack of prior knowledge about the objects, identifying them from videos is a major challenge of UVOS. Previous methods often regard moving objects as primary ones and rely on optical flow to capture motion cues in videos, but flow information alone is insufficient to distinguish the primary objects from the background objects that move together with them. This is because, when noisy motion features are combined with appearance features, the localization of primary objects is misguided. To...

10.1109/cvpr46437.2021.01520 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

In self-supervised representation learning, a common idea behind most of the state-of-the-art approaches is to enforce the robustness of the representations to predefined augmentations. A potential issue of this idea is the existence of completely collapsed solutions (i.e., constant features), which are typically avoided implicitly by carefully chosen implementation details. In this work, we study a relatively concise framework containing the components from recent approaches. We verify the existence of complete collapse and discover another reachable...

10.1109/iccv48922.2021.00946 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

Lip reading aims to predict the spoken sentences from silent lip videos. Due to the fact that such a vision task usually performs worse than its counterpart speech recognition, one potential scheme is to distill knowledge from a teacher pretrained on audio signals. However, the latent domain gap between the cross-modal data could lead to learning ambiguity and thus limits the performance of lip reading. In this paper, we propose a novel collaborative framework for lip reading, in which two aspects of issues are considered: 1) the teacher should understand...

10.1109/cvpr46437.2021.01312 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Vision Transformer has demonstrated impressive success across various vision tasks. However, its heavy computation cost, which grows quadratically with respect to the token sequence length, largely limits its power in handling large feature maps. To alleviate this cost, previous works rely on either fine-grained self-attentions restricted to local small regions, or global self-attentions that shorten the sequence length, resulting in coarse granularity. In this paper, we propose a novel model, termed Self-guided Transformer (SG-Former), towards...

10.1109/iccv51070.2023.00552 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNNs requires the model to be deep enough to have a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, to ensure the globalization of representations from shallow layers. With the global view in very shallow layers, the transformer encoder preserves more local details to recover spatial details in the final saliency maps. Besides, as each layer...

10.1109/tetci.2024.3380442 article EN IEEE Transactions on Emerging Topics in Computational Intelligence 2024-04-02

The inductive bias of vision transformers is more relaxed and thus they cannot work well with insufficient data. Knowledge distillation is thus introduced to assist the training of transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, in this paper, we delve into the influence of models' inductive biases in knowledge distillation (e.g., convolution and involution). Our key observation is that the teacher accuracy is not the dominant reason for the student accuracy, but the teacher's inductive bias is more important. We demonstrate that lightweight teachers with different...

10.1109/cvpr52688.2022.01627 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
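Several entries above and below build on knowledge distillation; a minimal sketch of the standard soft-label objective (Hinton-style, with a temperature) shows the teacher-to-student transfer they all assume. This is a generic formulation, not the exact objective of any paper listed here.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T)

t = np.array([[4.0, 1.0, 0.0]])
zero = distillation_loss(t, t)            # identical logits -> zero loss
gap = distillation_loss(np.array([[0.0, 1.0, 4.0]]), t)  # mismatch -> positive
```

The softened targets expose the teacher's inter-class similarity structure, which is the "dark knowledge" the student imitates.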

Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc.,...

10.1109/cvpr52729.2023.00359 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Data mixing (e.g., Mixup, Cutmix, ResizeMix) is an essential component for advancing recognition models. In this paper, we focus on studying its effectiveness in the self-supervised setting. By noticing that mixed images sharing the same source image are intrinsically related to each other, we hereby propose SDMP, short for Simple Data Mixing Prior, to capture this straightforward yet essential prior, and position such mixed images as additional positive pairs to facilitate representation learning. Our experiments verify that the proposed SDMP enables data...

10.1109/cvpr52688.2022.01419 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
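The data-mixing prior above can be made concrete with pixel-space Mixup: two mixed views that share a source image are treated as an extra positive pair. A minimal sketch, with the contrastive loss itself omitted as it is not specified here:

```python
import numpy as np

def mixup(x1, x2, lam):
    """Pixel-space Mixup: convex combination of two source images."""
    return lam * x1 + (1.0 - lam) * x2

rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal((4, 4)) for _ in range(3))  # toy "images"

# Both mixed views contain source image `a`; under the data-mixing prior
# they are intrinsically related and serve as an additional positive pair.
view1 = mixup(a, b, lam=0.7)
view2 = mixup(a, c, lam=0.6)
print(view1.shape, view2.shape)  # (4, 4) (4, 4)
```

At `lam=1.0` the mix degenerates to the first source image, which is the limiting case of the shared-source relation.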

Vision Transformers show great superiority in medical image segmentation due to their ability to learn long-range dependency. For segmentation of 3-D data, such as computed tomography (CT), existing methods can be broadly classified into 2-D-based and 3-D-based methods. Each family has a key limitation: intraslice information is ignored by one, while the other suffers from high computation cost and memory consumption, resulting in a limited feature representation for inner-slice information. During clinical examination, radiologists primarily use the axial...

10.1109/tnnls.2024.3519634 article EN IEEE Transactions on Neural Networks and Learning Systems 2025-01-01

10.1109/wacv61041.2025.00095 article DA 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025-02-26

Deep supervision, which involves adding extra supervision to the intermediate features of a neural network, was widely used in image classification in the early deep learning era, since it significantly reduces training difficulty and eases optimization, e.g., by avoiding gradient vanishing compared with vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM) that pre-trains a Vision Transformer (ViT)...

10.48550/arxiv.2303.08817 preprint EN other-oa arXiv (Cornell University) 2023-01-01
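Deep supervision as described above amounts to summing a main loss with weighted auxiliary losses on intermediate features. A minimal sketch, assuming hypothetical auxiliary heads have already mapped the features to the target's shape:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def deeply_supervised_loss(intermediate_preds, final_pred, target, aux_weight=0.3):
    """Total loss = main loss on the final prediction plus weighted
    auxiliary losses attached to intermediate-layer predictions.
    Generic deep supervision; not the paper's MIM-specific variant."""
    main = mse(final_pred, target)
    aux = sum(mse(p, target) for p in intermediate_preds)
    return main + aux_weight * aux

target = np.zeros(5)
aux_preds = [np.full(5, 0.1), np.full(5, 0.2)]  # stand-ins for aux-head outputs
loss = deeply_supervised_loss(aux_preds, np.full(5, 0.05), target)
print(loss)  # 0.0175
```

The auxiliary terms inject gradient directly into shallow layers, which is what eases optimization in very deep networks.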

Crowd images are arguably among the most laborious data to annotate. In this paper, we aim to reduce the massive demand for densely labeled crowd data, and propose a novel weakly-supervised setting, in which we leverage the binary ranking of two images with high-contrast crowd counts as training guidance. To enable training under this new setting, we convert the count regression problem into a potential-score prediction problem. In particular, we tailor a Siamese Ranking Network that predicts potential scores indicating the ordering of the counts. Hence, the ultimate goal is to assign...

10.1109/wacv57701.2024.00041 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03
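The binary-ranking supervision above reduces to a pairwise margin ranking loss over the two predicted potential scores. A generic sketch of that objective, not the paper's exact network or loss:

```python
def ranking_loss(score_hi, score_lo, margin=1.0):
    """Margin ranking loss: the image weakly labeled as having the higher
    crowd count should receive the higher potential score, by at least
    `margin`; otherwise the violation is penalized linearly."""
    return max(0.0, margin - (score_hi - score_lo))

# Weak supervision: we only know image A visibly contains more people than B.
print(ranking_loss(score_hi=3.0, score_lo=0.5))  # 0.0  (ordering satisfied)
print(ranking_loss(score_hi=0.2, score_lo=0.5))  # 1.3  (ordering violated)
```

Only the relative ordering of counts is supervised, which is exactly why such binary comparisons are much cheaper to collect than dense point annotations.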

Due to domain shift, a large performance drop is usually observed when a trained crowd counting model is deployed in the wild. While existing domain-adaptive methods achieve promising results, they typically regard each image as a whole and reduce domain discrepancies in a holistic manner, thus limiting further improvement of adaptation performance. To this end, we propose to untangle the domain-invariant crowd and the domain-specific background from crowd images, and design a fine-grained domain adaptation method for crowd counting. Specifically,...

10.1109/icme55011.2023.00403 article EN 2023 IEEE International Conference on Multimedia and Expo (ICME) 2023-07-01

The popularity of multimodal sensors and the accessibility of the Internet have brought us a massive amount of unlabeled multimodal data. Since existing datasets and well-trained models are primarily unimodal, the modality gap between a unimodal network and multimodal data poses an interesting problem: how to transfer a pre-trained unimodal network to perform the same task with extra multimodal data? In this work, we propose multimodal knowledge expansion (MKE), a knowledge distillation-based framework to effectively utilize multimodal data without requiring labels. Opposite to traditional distillation,...

10.1109/iccv48922.2021.00089 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

This paper tackles the task of Few-Shot Video Object Segmentation (FSVOS), i.e., segmenting objects in query videos whose class is specified by a few labeled support images. The key is to model the relationship between the query videos and the support images for propagating object information. This is a many-to-many problem and often relies on full-rank attention, which is computationally intensive. In this paper, we propose a novel Domain Agent Network (DAN), breaking down the full-rank attention into two smaller ones. We consider one single frame of the query video...

10.1109/cvpr46437.2021.01382 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNNs requires the model to be deep enough to have a global receptive field, and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, the vision transformer, to ensure the globalization of representations from shallow layers. With the global view in very shallow layers, the transformer encoder preserves more local details to recover spatial details in the final saliency maps. Besides, as each layer...

10.48550/arxiv.2108.02759 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Labeling is onerous for crowd counting, as it requires annotating each individual in crowd images. Recently, several semi-supervised methods have been proposed to reduce the labeling effort. Given a limited labeling budget, they typically select a few images and densely label all individuals in them. Despite promising results, we argue this None-or-All labeling strategy is suboptimal, as the densely labeled individuals in each image usually appear similar, while the massive unlabeled images may contain entirely diverse individuals. To this end, we propose to break the chain of previous...

10.1109/tpami.2022.3232712 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-12-28

Crossmodal knowledge distillation (KD) extends traditional knowledge distillation to the area of multimodal learning and demonstrates great success in various applications. To achieve knowledge transfer across modalities, a pretrained network from one modality is adopted as the teacher to provide supervision signals to a student network learning from another modality. In contrast to the empirical success reported in prior works, the working mechanism of crossmodal KD remains a mystery. In this paper, we present a thorough understanding of crossmodal KD. We begin with two case studies and demonstrate that...

10.48550/arxiv.2206.06487 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper we explore the effect various design choices have on applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy that we call Random Segments with Autoregressive Coding (RandSAC). In RandSAC, we group patch representations (image tokens) into hierarchically arranged...

10.48550/arxiv.2203.12054 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Integrating low-level edge features has been proven effective in preserving clear boundaries of salient objects. However, their locality makes it difficult to capture globally salient edges, leading to distraction in the final predictions. To address this problem, we propose to produce distraction-free edge features by incorporating cross-scale holistic interdependencies between high-level features. In particular, we first formulate our edge feature extraction process as a boundary-filling problem. In this way, we enforce the features to focus on closed boundaries instead of those...

10.1109/mmul.2023.3235936 article EN IEEE Multimedia 2023-01-10