- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Image Processing Techniques
- Topic Modeling
- Human Pose and Action Recognition
- Visual Attention and Saliency Detection
- Video Analysis and Summarization
- COVID-19 Diagnosis Using AI
- Brain Tumor Detection and Classification
- Natural Language Processing Techniques
- Autonomous Vehicle Technology and Safety
- Speech and Dialogue Systems
- Medical Imaging and Analysis
- Face Recognition and Analysis
- Human-Automation Interaction and Safety
- Image Processing and 3D Reconstruction
- Traffic and Road Safety
- Image Retrieval and Classification Techniques
- Speech and Audio Processing
- Machine Learning and ELM
- Anomaly Detection Techniques and Applications
University of Hong Kong (2021-2024)
Hong Kong University of Science and Technology (2023)
Chinese University of Hong Kong (2021)
Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A common following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage: each model needs an independent and complete finetuning process to adapt to different tasks, which limits its transferability to different domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can adapt the pre-trained ViTs to many different tasks...
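The parameter-efficient adaptation idea behind approaches like AdaptFormer can be illustrated with a parallel bottleneck adapter: the pre-trained MLP block stays frozen while a tiny down-project/up-project branch is trained. The following numpy sketch is illustrative only (dimensions, scaling factor, and zero-initialization are assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden, d_bottleneck = 64, 256, 8
x = rng.standard_normal((10, d_model))          # 10 tokens

# Frozen pre-trained MLP block weights (not updated during adaptation).
W1 = rng.standard_normal((d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02

# Lightweight trainable adapter: down-project, nonlinearity, up-project.
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
W_up = np.zeros((d_bottleneck, d_model))        # zero-init: adapter starts as a no-op
scale = 0.1

def relu(z):
    return np.maximum(z, 0.0)

frozen = relu(x @ W1) @ W2                      # original (frozen) MLP branch
adapter = relu(x @ W_down) @ W_up * scale       # parallel bottleneck branch
out = x + frozen + adapter                      # residual sum of both branches

# Trainable parameters are a tiny fraction of the frozen ones.
n_frozen = W1.size + W2.size
n_adapter = W_down.size + W_up.size
print(n_adapter / n_frozen)                     # ~3% of the MLP block's weights
```

With the up-projection zero-initialized, the adapted block reproduces the frozen model exactly at the start of training, which is why such adapters can be bolted onto a pre-trained backbone safely.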
Image virtual try-on aims to fit a garment image (target clothes) onto a person image. Prior methods are heavily based on human parsing; however, slightly-wrong segmentation results would lead to unrealistic try-on images with large artifacts. A recent pioneering work employed knowledge distillation to reduce the dependency on parsing: the images produced by a parser-based method are used as supervision to train a "student" network that does not rely on segmentation, making the student mimic the try-on ability of the parser-based model. The student's image quality is therefore bounded by the parser-based model. To address...
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark covering diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully fair estimate of various methods....
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these tokens brings redundant computations, since not all tokens are attentive in MHSA; for example, tokens containing semantically meaningless or distractive backgrounds do not positively contribute to ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into training. For each forward inference, we identify the attentive image tokens between...
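The token-redundancy observation above suggests keeping only the most attentive patch tokens. A generic numpy sketch of top-k token pruning by class-token attention follows; the single-head attention, the keep ratio, and the selection rule are illustrative assumptions, not the paper's exact reorganization scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d = 197, 64                 # 1 class token + 196 patch tokens
x = rng.standard_normal((n_tokens, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Single-head attention scores of the class token over all tokens.
Wq = rng.standard_normal((d, d)) * 0.05
Wk = rng.standard_normal((d, d)) * 0.05
q_cls = x[0] @ Wq                                  # class-token query
k = x @ Wk
attn_cls = softmax(q_cls @ k.T / np.sqrt(d))       # shape (n_tokens,)

# Keep the class token plus the top-k most attended patch tokens.
k_keep = 98                                        # ~50% keep ratio (assumed)
patch_scores = attn_cls[1:]
top_idx = np.argsort(patch_scores)[::-1][:k_keep] + 1
kept = np.concatenate(([0], np.sort(top_idx)))
x_pruned = x[kept]
print(x_pruned.shape)                              # roughly half the tokens remain
```

Because MHSA cost grows quadratically with token count, halving the tokens in later layers cuts attention FLOPs by roughly 4x in those layers.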
This article presents a simple yet effective multilayer perceptron (MLP) architecture, namely CycleMLP, which is a versatile neural backbone network capable of solving various dense visual prediction tasks such as object detection, segmentation, and human pose estimation. Compared to recent advanced MLP architectures such as MLP-Mixer (Tolstikhin et al. 2021), ResMLP (Touvron et al. 2021), and gMLP (Liu et al. 2021), whose architectures are sensitive to image size and thus infeasible for dense prediction tasks, CycleMLP has two appealing advantages: 1) it can cope...
Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator, containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident includes 57K annotated frames and 285K annotated samples, approximately 7 times more than nuScenes with 40K annotated samples. In addition, we propose a new task, end-to-end motion prediction,...
This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. Compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated with image size and thus infeasible for object detection and segmentation, CycleMLP has two advantages over these approaches: (1) it can cope with various image sizes; (2) it achieves linear computational complexity in image size by using local windows. In contrast, previous MLPs have $O(N^2)$...
Image virtual try-on replaces the clothes on a person image with a desired in-shop clothes image. It is challenging because the person and the in-shop clothes are unpaired. Existing methods formulate virtual try-on as either in-painting or cycle consistency; both formulations encourage the generation networks to reconstruct the input image in a self-supervised manner. However, existing methods do not differentiate clothing from non-clothing regions, and such straightforward generation impedes try-on quality because of the heavily coupled image contents. In this paper, we propose a Disentangled...
We propose an end-to-end pipeline, named Watch Only Once (WOO), for video action detection. Current methods either decouple the video action detection task into separate stages of actor localization and action classification or train two separate models within one stage. In contrast, our approach solves both tasks simultaneously in a unified network. The whole pipeline is significantly simplified by unifying the backbone network and eliminating many hand-crafted components. WOO takes a unified video backbone to extract features for both actor localization and action classification. In addition,...
Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performance, thus compromising safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments, involving overall six sensor corruptions and two sensor-missing situations. Signals from multiple sensors are first processed by modal-specific encoders. Subsequently,...
Visual attention advances object detection by attending neural networks to object representations. While existing methods incorporate empirical modules to empower network attention, we rethink attentive feature learning from the learning perspective in this work. We propose a NEural Attention Learning approach (NEAL), which consists of two parts. During back-propagation in each training iteration, we first calculate the partial derivatives (a.k.a. accumulated gradients) of the classification output with respect to the input features. We then refine...
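The "partial derivatives of the classification output with respect to the input features" step can be illustrated with a toy gradient-weighted attention map in numpy (essentially a Grad-CAM-style computation; the linear classifier, global average pooling, and normalization here are illustrative assumptions, not NEAL's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 7, 7
feats = rng.standard_normal((C, H, W))          # conv feature maps
w_cls = rng.standard_normal(C)                  # linear classifier weights

# Classification score via global average pooling + linear head.
gap = feats.mean(axis=(1, 2))
score = w_cls @ gap

# Partial derivative of the score w.r.t. each feature activation:
# d(score)/d(feats[c, i, j]) = w_cls[c] / (H * W), constant per channel.
grads = np.broadcast_to(w_cls[:, None, None] / (H * W), feats.shape)

# Gradient-weighted attention map over spatial positions.
attn = np.maximum((grads * feats).sum(axis=0), 0.0)
attn = attn / (attn.max() + 1e-8)               # normalize to [0, 1]
print(attn.shape)                               # one attention value per location
```

Positions whose features push the class score up receive high attention, which is the signal such methods use to refine the features in the next forward pass.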
Contrastive learning methods train visual encoders by comparing views from one instance to others. Typically, the views created from one instance are set as positive, while views from other instances are negative. This binary instance discrimination has been studied extensively to improve feature representations in self-supervised learning. In this paper, we rethink the instance discrimination framework and find binary labeling insufficient to measure correlations between different samples. As an intuitive example, given a random image instance, there may exist other images in a mini-batch...
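The binary instance discrimination being rethought above is usually implemented as an InfoNCE-style loss: each anchor's own augmented view is the single positive, every other instance in the batch is negative. A minimal numpy sketch (temperature and noise scale are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.1):
    """Binary instance discrimination: for each anchor, its own view is
    the single positive; every other instance acts as a negative."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature          # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()       # diagonal = matching pairs

N, d = 16, 32
z = rng.standard_normal((N, d))
good = info_nce(z, z + 0.01 * rng.standard_normal((N, d)))   # aligned views
bad = info_nce(z, rng.standard_normal((N, d)))               # random pairing
print(good < bad)
```

Note how the loss treats every off-diagonal pair as equally negative, regardless of semantic similarity; that hard binary labeling is exactly the limitation the abstract points at.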
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting...
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often requires large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands, especially for single-stage models. To address these challenges, we propose a novel...
Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as a Query-Key-Value computation. However, the attention map generated from Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational...
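The Query-Key-Value computation mentioned above can be sketched in a few lines of numpy. This is the standard single-granularity MHSA that the abstract argues is insufficient, not the paper's proposed grouped mechanism; all dimensions and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, n_heads):
    """Standard multi-head self-attention over a token sequence."""
    n, d = x.shape
    dh = d // n_heads
    # Project to Q, K, V and split into heads: (n_heads, n, dh).
    q = (x @ Wq).reshape(n, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, n_heads, dh).transpose(1, 0, 2)
    # Token-to-token attention map per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ Wo, attn

n, d, h = 10, 64, 8
x = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) * 0.05 for _ in range(4)]
y, attn = mhsa(x, *W, n_heads=h)
print(y.shape, attn.shape)
```

Each of the `attn[h]` maps relates individual tokens only; a grouped mechanism would additionally score queries against aggregates of adjacent tokens.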
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution...
Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to those trained via supervised learning, their network attention is under-explored for further improvement. Motivated by transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE consists of a CNN stream...
We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and to different combinations of multiple objects. Each instruction and its corresponding bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions,...
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX....
We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language annotation dataset built on WOMD, with a focus on describing and reasoning about interactions and intentions in driving scenarios. Previous datasets primarily captured interactions caused by close distances. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are not yet sufficiently covered, despite being very common and more challenging for prediction or planning models to understand. Therefore, our WOMD-Reasoning...
In this paper, we introduce PixArt-\Sigma, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-\Sigma represents a significant advancement over its predecessor, PixArt-\alpha, offering markedly higher image fidelity and improved alignment with text prompts. A key feature of PixArt-\Sigma is its training efficiency: leveraging the foundational pre-training of PixArt-\alpha, it evolves from the `weaker' baseline to a `stronger' model via incorporating higher-quality data, a process we term "weak-to-strong...