- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Image Processing Techniques
- Topic Modeling
- Human Pose and Action Recognition
- Visual Attention and Saliency Detection
- Video Analysis and Summarization
- COVID-19 Diagnosis Using AI
- Brain Tumor Detection and Classification
- Natural Language Processing Techniques
- Autonomous Vehicle Technology and Safety
- Speech and Dialogue Systems
- Medical Imaging and Analysis
- Face Recognition and Analysis
- Human-Automation Interaction and Safety
- Image Processing and 3D Reconstruction
- Traffic and Road Safety
- Image Retrieval and Classification Techniques
- Speech and Audio Processing
- Machine Learning and ELM
- Anomaly Detection Techniques and Applications
University of Hong Kong (2021-2024)
Hong Kong University of Science and Technology (2023)
Chinese University of Hong Kong (2021)
Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A common following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage: each model needs an independent and complete finetuning process to adapt to different tasks, which limits its transferability to different domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can adapt the pre-trained ViTs to many different tasks...
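The parameter-efficient adaptation idea behind approaches like AdaptFormer can be illustrated with a parallel bottleneck adapter: the pre-trained MLP block stays frozen while a tiny down-project/up-project branch is trained. The following numpy sketch is illustrative only (dimensions, scaling factor, and zero-initialization are assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden, d_bottleneck = 64, 256, 8
x = rng.standard_normal((10, d_model))          # 10 tokens

# Frozen pre-trained MLP block weights (not updated during adaptation).
W1 = rng.standard_normal((d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02

# Lightweight trainable adapter: down-project, nonlinearity, up-project.
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
W_up = np.zeros((d_bottleneck, d_model))        # zero-init: adapter starts as a no-op
scale = 0.1

def relu(z):
    return np.maximum(z, 0.0)

frozen = relu(x @ W1) @ W2                      # original (frozen) MLP branch
adapter = relu(x @ W_down) @ W_up * scale       # parallel bottleneck branch
out = x + frozen + adapter                      # residual sum of both branches

# Trainable parameters are a tiny fraction of the frozen ones.
n_frozen = W1.size + W2.size
n_adapter = W_down.size + W_up.size
print(n_adapter / n_frozen)                     # ~3% of the MLP block's weights
```

With the up-projection zero-initialized, the adapted block reproduces the frozen model exactly at the start of training, which is why such adapters can be bolted onto a pre-trained backbone safely.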
Image virtual try-on aims to fit a garment image (target clothes) onto a person image. Prior methods are heavily based on human parsing; however, slightly-wrong segmentation results would lead to unrealistic try-on images with large artifacts. A recent pioneering work employed knowledge distillation to reduce the dependency on parsing: the images produced by a parser-based method are used as supervision to train a "student" network that does not rely on segmentation, making the student mimic the try-on ability of the parser-based model. The student's image quality is therefore bounded by the parser-based model. To address...
Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark covering diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully fair estimate of various methods....
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these tokens brings redundant computations, since not all tokens are attentive in MHSA; for example, tokens containing semantically meaningless or distractive backgrounds do not positively contribute to ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into training. For each forward inference, we identify the attentive image tokens between...
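The token-redundancy observation above suggests keeping only the most attentive patch tokens. A generic numpy sketch of top-k token pruning by class-token attention follows; the single-head attention, the keep ratio, and the selection rule are illustrative assumptions, not the paper's exact reorganization scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d = 197, 64                 # 1 class token + 196 patch tokens
x = rng.standard_normal((n_tokens, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Single-head attention scores of the class token over all tokens.
Wq = rng.standard_normal((d, d)) * 0.05
Wk = rng.standard_normal((d, d)) * 0.05
q_cls = x[0] @ Wq                                  # class-token query
k = x @ Wk
attn_cls = softmax(q_cls @ k.T / np.sqrt(d))       # shape (n_tokens,)

# Keep the class token plus the top-k most attended patch tokens.
k_keep = 98                                        # ~50% keep ratio (assumed)
patch_scores = attn_cls[1:]
top_idx = np.argsort(patch_scores)[::-1][:k_keep] + 1
kept = np.concatenate(([0], np.sort(top_idx)))
x_pruned = x[kept]
print(x_pruned.shape)                              # roughly half the tokens remain
```

Because MHSA cost grows quadratically with token count, halving the tokens in later layers cuts attention FLOPs by roughly 4x in those layers.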
This article presents a simple yet effective multilayer perceptron (MLP) architecture, namely CycleMLP, which is a versatile neural backbone network capable of solving various dense visual prediction tasks such as object detection, segmentation, and human pose estimation. Compared to recent advanced MLP architectures such as MLP-Mixer (Tolstikhin et al. 2021), ResMLP (Touvron et al. 2021), and gMLP (Liu et al. 2021), whose architectures are sensitive to image size and thus infeasible for dense prediction tasks, CycleMLP has two appealing advantages: 1) it can cope...
Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator, containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident includes 57K annotated frames and 285K annotated samples, approximately 7 times more than nuScenes with 40K annotated samples. In addition, we propose a new task, end-to-end motion prediction,...
This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. Compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are correlated with image size and thus infeasible for object detection and segmentation, CycleMLP has two advantages over these approaches: (1) it can cope with various image sizes; (2) it achieves linear computational complexity in image size by using local windows. In contrast, previous MLPs have $O(N^2)$...
Image virtual try-on replaces the clothes on a person image with a desired in-shop clothes image. It is challenging because the person and the in-shop clothes are unpaired. Existing methods formulate virtual try-on as either in-painting or cycle consistency; both formulations encourage the generation networks to reconstruct the input image in a self-supervised manner. However, existing methods do not differentiate clothing from non-clothing regions, and such straightforward generation impedes try-on quality because of the heavily coupled image contents. In this paper, we propose a Disentangled...
We propose an end-to-end pipeline, named Watch Only Once (WOO), for video action detection. Current methods either decouple the video action detection task into separate stages of actor localization and action classification or train two separate models within one stage. In contrast, our approach solves both tasks simultaneously in a unified network. The whole pipeline is significantly simplified by unifying the backbone network and eliminating many hand-crafted components. WOO takes a unified video backbone to extract features for both actor localization and action classification. In addition,...
Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performance, thus compromising safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments, involving overall six sensor corruptions and two sensor-missing situations. Signals from multiple sensors are first processed by modal-specific encoders. Subsequently,...
Visual attention advances object detection by attending neural networks to object representations. While existing methods incorporate empirical modules to empower network attention, we rethink attentive feature learning from the learning perspective in this work. We propose a NEural Attention Learning approach (NEAL), which consists of two parts. During back-propagation in each training iteration, we first calculate the partial derivatives (a.k.a. accumulated gradients) of the classification output with respect to the input features. We then refine...
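The "partial derivatives of the classification output with respect to the input features" step can be illustrated with a toy gradient-weighted attention map in numpy (essentially a Grad-CAM-style computation; the linear classifier, global average pooling, and normalization here are illustrative assumptions, not NEAL's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 7, 7
feats = rng.standard_normal((C, H, W))          # conv feature maps
w_cls = rng.standard_normal(C)                  # linear classifier weights

# Classification score via global average pooling + linear head.
gap = feats.mean(axis=(1, 2))
score = w_cls @ gap

# Partial derivative of the score w.r.t. each feature activation:
# d(score)/d(feats[c, i, j]) = w_cls[c] / (H * W), constant per channel.
grads = np.broadcast_to(w_cls[:, None, None] / (H * W), feats.shape)

# Gradient-weighted attention map over spatial positions.
attn = np.maximum((grads * feats).sum(axis=0), 0.0)
attn = attn / (attn.max() + 1e-8)               # normalize to [0, 1]
print(attn.shape)                               # one attention value per location
```

Positions whose features push the class score up receive high attention, which is the signal such methods use to refine the features in the next forward pass.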
Contrastive learning methods train visual encoders by comparing views from one instance to others. Typically, the views created from one instance are set as positive, while views from other instances are negative. This binary instance discrimination has been studied extensively to improve feature representations in self-supervised learning. In this paper, we rethink the instance discrimination framework and find binary labeling insufficient to measure correlations between different samples. As an intuitive example, given a random image instance, there may exist other images in a mini-batch...
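The binary instance discrimination being rethought above is usually implemented as an InfoNCE-style loss: each anchor's own augmented view is the single positive, every other instance in the batch is negative. A minimal numpy sketch (temperature and noise scale are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.1):
    """Binary instance discrimination: for each anchor, its own view is
    the single positive; every other instance acts as a negative."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature          # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()       # diagonal = matching pairs

N, d = 16, 32
z = rng.standard_normal((N, d))
good = info_nce(z, z + 0.01 * rng.standard_normal((N, d)))   # aligned views
bad = info_nce(z, rng.standard_normal((N, d)))               # random pairing
print(good < bad)
```

Note how the loss treats every off-diagonal pair as equally negative, regardless of semantic similarity; that hard binary labeling is exactly the limitation the abstract points at.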
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting...
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often requires large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands, especially for single-stage models. To address these challenges, we propose a novel...
Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as a Query-Key-Value computation. However, the attention map generated from Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational...
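The Query-Key-Value computation mentioned above can be sketched in a few lines of numpy. This is the standard single-granularity MHSA that the abstract argues is insufficient, not the paper's proposed grouped mechanism; all dimensions and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(x, Wq, Wk, Wv, Wo, n_heads):
    """Standard multi-head self-attention over a token sequence."""
    n, d = x.shape
    dh = d // n_heads
    # Project to Q, K, V and split into heads: (n_heads, n, dh).
    q = (x @ Wq).reshape(n, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, n_heads, dh).transpose(1, 0, 2)
    # Token-to-token attention map per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ Wo, attn

n, d, h = 10, 64, 8
x = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) * 0.05 for _ in range(4)]
y, attn = mhsa(x, *W, n_heads=h)
print(y.shape, attn.shape)
```

Each of the `attn[h]` maps relates individual tokens only; a grouped mechanism would additionally score queries against aggregates of adjacent tokens.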
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution...
Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to those trained via supervised learning, their network attention is under-explored for further improvement. Motivated by transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE consists of a CNN stream...
We propose InstructDET, a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions. While deriving from referring expressions (REC), the instructions we leverage are greatly diversified to encompass common user intentions related to object detection. For one image, we produce tremendous instructions that refer to every single object and to different combinations of multiple objects. Each instruction and its corresponding bounding boxes (bbxs) constitute one training data pair. In order to encompass common detection expressions,...
Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX....
We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language annotation dataset built on WOMD, with a focus on describing and reasoning about interactions and intentions in driving scenarios. Previous datasets primarily captured interactions caused by close distances. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are not yet sufficiently covered, despite being very common and more challenging for prediction or planning models to understand. Therefore, our WOMD-Reasoning...
In this paper, we introduce PixArt-\Sigma, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-\Sigma represents a significant advancement over its predecessor, PixArt-\alpha, offering markedly higher image fidelity and improved alignment with text prompts. A key feature of PixArt-\Sigma is its training efficiency: leveraging the foundational pre-training of PixArt-\alpha, it evolves from the `weaker' baseline to a `stronger' model via incorporating higher-quality data, a process we term "weak-to-strong...