- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Text and Document Classification Technologies
- Face Recognition and Analysis
- Advanced Image and Video Retrieval Techniques
- Video Analysis and Summarization
- Advanced Image Processing Techniques
- Advanced Neural Network Applications
- Privacy-Preserving Technologies in Data
- Video Surveillance and Tracking Methods
- Fire Detection and Safety Systems
- Adversarial Robustness in Machine Learning
- Recommender Systems and Techniques
- Advanced Vision and Imaging
- COVID-19 Diagnosis Using AI
- Explainable Artificial Intelligence (XAI)
- Image Retrieval and Classification Techniques
- Image and Signal Denoising Methods
- Educational Technology and Assessment
- Remote-Sensing Image Classification
- Educational Tools and Methods
- Handwritten Text Recognition Techniques
- Music and Audio Processing
National Taiwan University
2018-2024
Nvidia (United States)
2024
Asus (Taiwan)
2020-2021
Person re-identification (Re-ID) aims at recognizing the same person from images taken across different cameras. To address this task, one typically requires a large amount of labeled data for training an effective Re-ID model, which might not be practical for real-world applications. To alleviate this limitation, we choose to exploit a sufficient amount of pre-existing labeled data from a different (auxiliary) dataset. By jointly considering such an auxiliary dataset and the dataset of interest (but without label information), our proposed adaptation network...
Video summarization remains a challenging task. Due to the sufficient video data available on the Internet, this task draws significant attention in the vision community and benefits a wide range of applications, e.g., video retrieval, search, etc. To effectively perform video summarization by deriving the keyframes which represent the given input video, we propose a novel framework named Hierarchical Multi-Attention Network (H-MAN), which comprises a shot-level reconstruction model and a multi-head attention model. While our designed model is a two-stage hierarchical...
Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over the precise movements of objects and the camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods choose to fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such a strategy suffers from content leakage, and they...
When translating text inputs into layouts or images, existing works typically require explicit descriptions of each object in a scene, including their spatial information or the associated relationships. To better exploit the text input, so that implicit objects or relationships can be properly inferred during layout generation, we propose a LayoutTransformer Network (LT-Net) in this paper. Given a scene-graph input, our LT-Net uniquely encodes the semantic features for exploiting their co-occurrences and implicit relationships. This allows one to...
Federated learning (FL) emerges as a decentralized learning framework which trains models from multiple distributed clients without sharing their data, to preserve privacy. Recently, large-scale pre-trained models (e.g., Vision Transformer) have shown a strong capability of deriving robust representations. However, the data heterogeneity among clients, the limited computation resources, and the communication bandwidth restrict the deployment of large-scale models in FL frameworks. To leverage robust representations from large-scale models while enabling efficient model...
Human face reenactment aims at transferring motion patterns from one face (from a source-domain video) to another (in the target domain with the identity of interest). While recent works report impressive results, they are not able to handle multiple identities in a unified model. In this paper, we propose a unique network of CrossID-GAN to perform multi-ID face reenactment. Given a source-domain video with extracted facial landmarks and a target-domain image, our CrossID-GAN learns identity-invariant motion patterns via the extracted facial landmarks and uses such information to produce videos whose ID matches...
Few-shot semantic segmentation addresses the learning task in which only a few images with ground truth pixel-level labels are available for the novel classes of interest. One is typically required to collect a large amount of data (i.e., base classes) with such ground truth information, followed by meta-learning strategies to address the above learning task. When only image-level semantic labels can be observed during both training and testing, it is considered as an even more challenging task of weakly supervised few-shot semantic segmentation. To address this problem, we propose...
Learning interpretable data representation has been an active research topic in deep learning and computer vision. While representation disentanglement is an effective technique for addressing this task, existing works cannot easily handle the problems in which manipulating and recognizing data across multiple domains are desirable. In this paper, we present a unified network architecture of Multi-domain Multi-modal Representation Disentangler (M2RD), with the goal of learning domain-invariant content representation with the associated domain-specific representation observed. By...
Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe...
Large-scale vision-language models (VLMs) have shown a strong zero-shot generalization capability on unseen-domain data. However, when adapting pre-trained VLMs to a sequence of downstream tasks, they are prone to forgetting previously learned knowledge and degrade their zero-shot classification capability. To tackle this problem, we propose a unique Selective Dual-Teacher Knowledge Transfer framework that leverages the most recent fine-tuned and the original pre-trained VLMs as dual teachers to preserve the previously learned knowledge and zero-shot capabilities, respectively....
Recent developments in All-in-One (AiO) RGB image restoration and prompt learning have enabled the representation of distinct degradations through prompts, allowing degraded images to be effectively addressed by a single restoration model. However, this paradigm faces significant challenges when transferring to hyperspectral image (HSI) restoration tasks due to: 1) the domain gap between RGB and HSI features and the difference in their structures, 2) information loss in visual prompts under severe composite degradations, and 3) difficulties in capturing...
While self-supervised learning has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at patch or pixel levels. Moreover, existing SSL methods might not sufficiently describe and associate the above representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via proper...
To understand how deep neural networks perform classification predictions, recent research attention has been focusing on developing techniques to offer desirable explanations. However, most existing methods cannot be easily applied for semantic segmentation; moreover, they are not designed to offer interpretability under the multi-annotator setting. Instead of viewing ground-truth pixel-level labels as annotated by a single annotator with a consistent labeling tendency, we aim at providing interpretable...
Generating videos with content and motion variations is a challenging task in computer vision. While the recent development of GAN allows video generation from latent representations, it is not easy to produce videos with particular content or motion patterns of interest. In this paper, we propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations. Our Dual-MTGAN is able to perform deterministic motion transfer and stochastic motion generation. Based on a given image, the former preserves the input...
Few-shot classification aims to carry out classification given only a few labeled examples for the categories of interest. Though several approaches have been proposed, most existing few-shot learning (FSL) models assume that base and novel classes are drawn from the same data domain. When it comes to recognizing novel-class data in an unseen domain, this becomes an even more challenging task of domain generalized few-shot classification. In this paper, we present a unique learning framework for domain-generalized few-shot classification, where homogeneous...
Learning interpretable and interpolatable latent representations has been an emerging research direction, allowing researchers to understand and utilize the derived latent space for further applications such as visual synthesis or recognition. While most existing approaches derive an interpolatable latent space that induces a smooth transition in image appearance, it is still not clear how to observe desirable representations which would contain semantic information of interest. In this paper, we aim to learn meaningful representations and simultaneously perform semantic-oriented...