- Music and Audio Processing
- Advanced Neural Network Applications
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Image and Video Quality Assessment
- Advanced Image Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Advanced Image Fusion Techniques
- Machine Learning and Data Classification
- Music Technology and Sound Studies
- Privacy-Preserving Technologies in Data
- Machine Learning and ELM
- Visual Attention and Saliency Detection
- Stochastic Gradient Optimization Techniques
- Image and Signal Denoising Methods
- Advanced Vision and Imaging
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Generative Adversarial Networks and Image Synthesis
- Robotics and Sensor-Based Localization
- Multimodal Machine Learning Applications
- Brain Tumor Detection and Classification
- Image Enhancement Techniques
City University of Hong Kong
2019-2024
TCL (China)
2021-2024
Trinity College London
2022
TFI Digital Media Limited (China)
2018-2019
Deep convolutional neural networks (CNNs) have been successfully applied to no-reference image quality assessment (NR-IQA) with respect to human perception. Most of these methods deal with small patches and use the average patch score to predict the quality of the whole image. We discovered that scores from homogeneous regions are unreliable for both network training and final estimation. In addition, patches with complex structures have much higher chances of achieving better prediction. Based on these findings, we enhanced conventional CNN-based NR-IQA...
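A minimal sketch of the idea described above, not the authors' implementation: patch scores are aggregated with weights proportional to local variance, so homogeneous (low-variance) patches contribute less to the whole-image estimate. `predict_patch_quality` is a hypothetical stand-in for a trained CNN scorer.

```python
def variance(patch):
    """Sample variance of a flat list of pixel intensities."""
    n = len(patch)
    mean = sum(patch) / n
    return sum((p - mean) ** 2 for p in patch) / n

def weighted_image_score(patches, predict_patch_quality, eps=1e-6):
    """Aggregate per-patch quality scores, weighting each patch by its
    local variance so homogeneous regions are down-weighted."""
    weights = [variance(p) for p in patches]
    total = sum(weights) + eps  # eps guards against an all-flat image
    return sum(w * predict_patch_quality(p)
               for w, p in zip(weights, patches)) / total
```

With plain averaging, a flat patch and a textured patch would count equally; here the flat patch's zero variance removes it from the estimate.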
Visual Attention Networks (VAN) with Large Kernel Attention (LKA) modules have been shown to provide remarkable performance, surpassing Vision Transformers (ViTs), on a range of vision-based tasks. However, the depth-wise convolutional layer in these LKA modules incurs a quadratic increase in computational and memory footprints with increasing kernel size. To mitigate these problems and to enable the use of extremely large kernels in the attention modules of VAN, we propose a family of Large Separable Kernel Attention modules, termed LSKA. LSKA decomposes the 2D convolutional kernel of the depth-wise convolutional layer into cascaded...
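The decomposition rests on a standard linear-algebra fact that can be checked directly (an illustrative sketch, not the paper's code): a rank-1 k x k kernel is the outer product of a k x 1 and a 1 x k kernel, so one 2D convolution can be replaced by two cascaded 1D convolutions, cutting parameters from k*k to 2k.

```python
def conv2d_valid(img, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[u][v] * img[i + u][j + v]
                           for u in range(kh) for v in range(kw)))
        out.append(row)
    return out

def outer(col, row):
    """Outer product: the separable 2D kernel built from two 1D kernels."""
    return [[c * r for r in row] for c in col]
```

Convolving with `outer(col, row)` gives the same result as convolving with the vertical kernel `col` and then the horizontal kernel `row`, which is exactly the cascade used to replace the large 2D depth-wise kernel.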
Deep-learning-based image hashing methods learn hash codes by using powerful feature extractors and nonlinear transformations to achieve highly efficient retrieval. For most end-to-end deep hashing methods, the supervised learning process relies on pair-wise or triplet-wise information to provide an internal relationship of similarity data. However, the use of the triplet loss function is limited not only by expensive training costs but also by quantization errors. In this paper, we propose a novel semantic hashing method for retrieval...
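For readers unfamiliar with hashing-based retrieval, here is a minimal sketch of the general pipeline such methods build on (illustrative only; the paper's network and loss are not reproduced): continuous features are binarized into hash codes, and retrieval ranks items by Hamming distance.

```python
def binarize(features):
    """Map a real-valued feature vector to a +/-1 hash code via sign()."""
    return [1 if f >= 0 else -1 for f in features]

def hamming(a, b):
    """Number of positions where two hash codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def retrieve(query_feat, database_feats, top_k=2):
    """Return indices of the top_k database items closest in Hamming space."""
    q = binarize(query_feat)
    codes = [binarize(f) for f in database_feats]
    order = sorted(range(len(codes)), key=lambda i: hamming(q, codes[i]))
    return order[:top_k]
```

The gap between the continuous features and their sign-binarized codes is the quantization error the abstract refers to.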
Self-supervised learning (SSL) aims to learn feature representations without human-annotated data. Existing methods approach this goal by encouraging the representations to be invariant under a set of task-irrelevant transformations and distortions defined a priori. However, multiple studies have shown that such an assumption often limits the expressive power of the model, which would perform poorly when downstream tasks violate the assumption. For example, being invariant to rotations prevents features from retaining enough...
Deep-learning-based image quality assessment (IQA) has been shown to greatly improve the score prediction accuracy for images with a single distortion. However, because these models lack generalizability and multi-distortion data is relatively scarce, designing reliable IQA systems is still an open issue. In this paper, we propose to introduce long-range dependencies between local artifacts and high-order spatial pooling into a convolutional neural network (CNN) model to improve the performance of full-reference...
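As a hedged sketch of what "high-order spatial pooling" can mean in this context (illustrative; the paper's exact pooling operator is not reproduced): second-order (bilinear) pooling averages outer products of per-location feature vectors over all spatial positions, capturing feature co-occurrences rather than just means or maxima.

```python
def bilinear_pool(feature_map):
    """Second-order pooling: feature_map is a list of per-location
    feature vectors (each of length C). Returns the C x C matrix of
    spatially averaged outer products vec * vec^T."""
    n = len(feature_map)
    c = len(feature_map[0])
    pooled = [[0.0] * c for _ in range(c)]
    for vec in feature_map:
        for i in range(c):
            for j in range(c):
                pooled[i][j] += vec[i] * vec[j] / n
    return pooled
```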
Due to unreliable geometric matching and content misalignment, most conventional pose transfer algorithms fail to generate fine-grained person images. In this paper, we propose a novel framework, Spatial Content Alignment GAN (SCA-GAN), which aims to enhance the consistency of garment textures and the details of human characteristics. We first alleviate the spatial misalignment by transferring the edge content of the target in advance. Secondly, we introduce a new Content-Style DeBlk that can progressively synthesize photo-realistic images...
This report presents the technical details of our submission to the 2023 EPIC-Kitchens EPIC-SOUNDS Audio-Based Interaction Recognition Challenge. The task is to learn a mapping from audio samples to their corresponding action labels. To achieve this goal, we propose a simple yet effective single-stream CNN-based architecture called AudioInceptionNeXt that operates on time-frequency log-mel-spectrogram samples. Motivated by the design of InceptionNeXt, parallel multi-scale depthwise separable convolutional...
The integration of Federated Learning (FL) and Self-Supervised Learning (SSL) offers a unique synergistic combination to exploit audio data for general-purpose audio understanding, without compromising user privacy. However, rare efforts have been made to investigate SSL models in the FL regime, especially when the training data is generated by large-scale heterogeneous sources. In this paper, we evaluate the performance of feature-matching and predictive audio-SSL techniques when integrated into FL settings simulated with...
Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block breaks down the parallel multi-branch depth-wise...
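The "Rep" in such architectures usually refers to structural re-parameterization. The following is an assumed, illustrative sketch of that idea in 1D (not the authors' code): because convolution is linear, parallel depth-wise branches with kernel sizes 3 and 5 can be merged at inference time into a single size-5 kernel by zero-padding the smaller kernel and summing, removing the multi-branch overhead.

```python
def conv1d_same(signal, kernel):
    """'Same'-padded 1D cross-correlation on plain lists (odd kernel size)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[u] * padded[i + u] for u in range(k))
            for i in range(len(signal))]

def merge_kernels(small, large):
    """Zero-pad the smaller kernel to the larger size and add the two,
    producing one kernel equivalent to the sum of both branch outputs."""
    pad = (len(large) - len(small)) // 2
    padded_small = [0.0] * pad + list(small) + [0.0] * pad
    return [a + b for a, b in zip(padded_small, large)]
```

Summing the two branch outputs and convolving once with the merged kernel give identical results, which is what makes the inference-time simplification lossless.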
Federated Learning (FL) has emerged as a privacy-preserving method for training machine learning models in a distributed manner on edge devices. However, on-device training faces inherent computational power and memory limitations, potentially resulting in constrained gradient updates. As the model's size increases, the frequency of gradient updates on these devices decreases, ultimately leading to suboptimal training outcomes during any particular FL round. This limits the feasibility of deploying advanced large-scale models on edge devices, hindering...
This paper is the report of the first Under-Display Camera (UDC) image restoration challenge, held in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly collected Under-Display Camera database. The tracks correspond to two types of display: 4k Transparent OLED (T-OLED) and phone Pentile OLED (P-OLED). Of the about 150 teams registered for the challenge, eight and nine teams submitted results during the testing phase for each track, respectively. The results demonstrate state-of-the-art performance in UDC image restoration. Datasets are available at https://yzhouas.github.io/projects/UDC/udc.html.
Uncertainty estimation aims to evaluate the confidence of a trained deep neural network. However, existing uncertainty estimation approaches rely on low-dimensional distributional assumptions and thus suffer from the high dimensionality of latent features. Existing approaches tend to focus on discrete classification probabilities, which leads to poor generalizability to other tasks. Moreover, most of the literature requires seeing out-of-distribution (OOD) data during training to achieve better uncertainty estimation, which limits performance in practice because OOD...