- Face recognition and analysis
- Computer Graphics and Visualization Techniques
- Advanced Vision and Imaging
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Speech and Audio Processing
- Domain Adaptation and Few-Shot Learning
- Generative Adversarial Networks and Image Synthesis
- Advanced Image and Video Retrieval Techniques
- 3D Shape Modeling and Analysis
- Human Motion and Animation
- Music and Audio Processing
- Image Processing and 3D Reconstruction
- Autonomous Vehicle Technology and Safety
- Hand Gesture Recognition Systems
- Robotics and Sensor-Based Localization
- Video Surveillance and Tracking Methods
- Natural Language Processing Techniques
- Visual Attention and Saliency Detection
- Biometric Identification and Security
- Stroke Rehabilitation and Recovery
- Image Retrieval and Classification Techniques
- Induction Heating and Inverter Technology
- COVID-19 diagnosis using AI
- Anomaly Detection Techniques and Applications
Baidu (China)
2023-2024
Vision Technology (United States)
2023-2024
Xi'an Jiaotong University
2023
Japan Advanced Institute of Science and Technology
2022
Beihang University
2010-2011
Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. The naive one-to-many assignment does not work for DETR, however, and it remains challenging to apply one-to-many assignment to DETR training. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that introduces a group-wise way for one-to-many assignment. This involves using...
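The group-wise idea in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the queries are split into equal contiguous groups and that one-to-one matching runs independently inside each group, so every ground-truth object receives one matched prediction per group. The function names, the brute-force matcher, and the cost values are hypothetical and chosen only for readability.

```python
from itertools import permutations


def match_one_to_one(cost):
    """Brute-force one-to-one assignment for a small cost matrix.

    cost[q][g] is the cost of matching query q to ground-truth object g.
    Returns {ground-truth index: query index} with minimal total cost.
    Assumes at least as many queries as ground-truth objects.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    # Try every ordered choice of num_gt distinct queries (toy sizes only).
    for perm in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {g: q for g, q in enumerate(best_perm)}


def group_wise_assignment(cost, num_groups):
    """Split queries into equal groups and match each group independently."""
    group_size = len(cost) // num_groups
    matches = []
    for k in range(num_groups):
        start = k * group_size
        local = match_one_to_one(cost[start:start + group_size])
        # Convert group-local query indices back to global indices.
        matches.append({g: start + q for g, q in local.items()})
    return matches


# Toy example: 4 queries (2 groups of 2) and 2 ground-truth objects.
toy_cost = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]
assignments = group_wise_assignment(toy_cost, num_groups=2)
```

Within a group the assignment stays one-to-one, while across groups each object is matched more than once. A practical version would likely replace the brute-force search with a Hungarian solver.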
Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate this field, due to the time-consuming computation mechanism of the transformer. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, our RTFormer leverages GPU-Friendly Attention with...
Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator would sufficiently enable such a charming property in both...
In this paper, we study Text-to-3D content generation leveraging 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress [11] in text-to-3D has shown that employing high-resolution (e.g., 512 × 512) renderings can lead to the production of high-quality 3D models using latent diffusion priors. To enable rendering at even higher resolutions, which has the potential to further augment the quality and detail of the models, we propose a novel approach that combines multiple noise estimation processes with a pretrained 2D diffusion prior....
Previous face Presentation Attack Detection (PAD) methods aim to improve the effectiveness of cross-domain tasks. However, in real-world scenarios, the original training data of the pre-trained model is not available due to privacy or other reasons. Under these constraints, general methods for fine-tuning on a single target domain may lose previously learned knowledge, leading to a catastrophic forgetting problem. To address these issues, we propose a multi-domain incremental learning (MDIL) method for PAD, which only learns...
Existing methods of multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, the global spatio-temporal context among spatial instances cannot be captured. In this paper, we propose a new end-to-end multi-person 3D pose and shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and shape...
Current domain adaptation methods for face anti-spoofing leverage labeled source domain data and unlabeled target domain data to obtain a promising generalizable decision boundary. However, it is usually difficult for these methods to achieve a perfect domain-invariant liveness feature disentanglement, which may degrade the final classification performance through domain differences in illumination, face category, spoof type, etc. In this work, we tackle cross-scenario face anti-spoofing by proposing a novel domain adaptation method called cyclically disentangled feature translation network...
Freezing the pre-trained backbone has become a standard paradigm to avoid overfitting in few-shot segmentation. In this paper, we rethink the paradigm and explore a new regime: fine-tuning a small part of the parameters in the backbone. We present a solution to overcome the overfitting problem, leading to better model generalization on learning novel classes. Our method decomposes backbone parameters into three successive matrices via the Singular Value Decomposition (SVD), then only fine-tunes the singular values and keeps the others frozen. The above design allows...
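The decomposition described in the abstract above can be sketched on a single weight matrix. This is a minimal illustration under stated assumptions, not the authors' code: the random `weight` stands in for one pre-trained layer, the regression target and learning rate are hypothetical, and the update is plain gradient descent on a mean squared error. Only the singular values are updated, while the two orthogonal factors stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 3))  # stand-in for one pre-trained layer

# Decompose W into three successive matrices: W = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(weight, full_matrices=False)


def forward(x, singular_values):
    """Apply the layer rebuilt from frozen U, Vt and the given singular values."""
    return x @ (U * singular_values) @ Vt


# Toy regression data, hypothetical and only for illustration.
x = rng.standard_normal((8, 4))
target = rng.standard_normal((8, 3))

s_trainable = s.copy()
lr = 0.01
loss_before = np.mean((forward(x, s_trainable) - target) ** 2)
for _ in range(100):
    err = forward(x, s_trainable) - target
    # Gradient of the mean squared error with respect to each singular value.
    grad = 2.0 / err.size * np.sum((x @ U) * (err @ Vt.T), axis=0)
    s_trainable -= lr * grad  # U and Vt are never updated
loss_after = np.mean((forward(x, s_trainable) - target) ** 2)
```

In this toy layer only 3 values are trained out of 12 weights, which shows why the trainable subset stays small.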
We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves 64.5 mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard...
In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based...
In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences. Specifically, SkeletonGCL associates...
This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, complexity in processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model. With the joint learning mechanism, the proposed framework can inherently...
Design knowledge and experience are the bases for carrying out aircraft conceptual design tasks, due to the high complexity and integration of the tasks during this phase. When carrying out the same task, different designers may need individual strategies to fulfill their own demands. A knowledge-based and extensible method for building aircraft conceptual design systems is studied considering the above requirements. Based on the theory, a design environment, called the knowledge-based and extensible aircraft conceptual design environment (KEACDE), with an open architecture, is built to enable designers to wrap add-on extensions and make their own aircraft conceptual design systems. The...
Design knowledge and experience are the basis for carrying out aircraft conceptual design tasks, due to the high complexity and integration of the works involved in this phase. Aircraft designers need a computer-aided design package to help them carry out the design easily with their individual strategies. This paper presents a web-based software framework called Aircraft Design Pad (ADP). The architecture is open so that users can wrap add-on extensions and make their own aircraft conceptual design system. The development aspects of ADP are discussed, and a case is presented to demonstrate its usability and effectiveness.
DETR is a novel end-to-end transformer architecture object detector, which significantly outperforms classic detectors when scaling up the model size. In this paper, we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well-studied in classic detectors, there is a lack of research on how to make it work effectively on DETR. We first provide experimental and theoretical analysis to point out that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points refer to the corresponding inputs...
Wide-baseline panoramic images are frequently used in applications like VR and simulations to minimize capturing labor costs and storage needs. However, synthesizing novel views from these panoramic images in real time remains a significant challenge, especially due to panoramic imagery's high resolution and inherent distortions. Although existing 3D Gaussian splatting (3DGS) methods can produce photo-realistic views under narrow baselines, they often overfit the training views when dealing with wide-baseline panoramic images due to the difficulty in learning precise...
This paper presents GEA, a novel method for creating expressive 3D avatars with high-fidelity reconstructions of body and hands based on 3D Gaussians. The key contributions are twofold. First, we design a two-stage pose estimation method to obtain an accurate SMPL-X pose from input images, providing a correct mapping between the pixels of a training image and the SMPL-X model. It uses an attention-aware network and an optimization scheme to align the normal and silhouette between the estimated SMPL-X body and the real body in the image. Second, we propose an iterative re-initialization strategy...
This paper presents TexRO, a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture. The key contributions are twofold. We propose an optimal viewpoint selection strategy that finds the smallest set of viewpoints covering all the faces of a mesh. Our viewpoint selection strategy guarantees the completeness of a generated result. We propose a recursive optimization pipeline that optimizes a UV texture at increasing resolutions, with an adaptive denoising method that re-uses existing textures for new texture generation. Through extensive...
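Choosing a small set of viewpoints that together see every mesh face is an instance of set cover. The sketch below uses the standard greedy heuristic as a stand-in; it is not the selection strategy from the paper, and a greedy choice is not guaranteed to be the smallest possible set. The viewpoint names and visibility sets are hypothetical.

```python
def select_viewpoints(visible_faces, num_faces):
    """Greedy set cover: repeatedly pick the viewpoint seeing the most uncovered faces.

    visible_faces maps a viewpoint name to the set of face indices it sees.
    Returns the chosen viewpoints in selection order.
    """
    uncovered = set(range(num_faces))
    chosen = []
    while uncovered:
        best = max(visible_faces, key=lambda v: len(visible_faces[v] & uncovered))
        gained = visible_faces[best] & uncovered
        if not gained:
            raise ValueError("some faces are not visible from any viewpoint")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy mesh with 8 faces and 4 candidate viewpoints (hypothetical data).
candidate_views = {
    "front": {0, 1, 2, 3},
    "back": {4, 5, 6},
    "top": {2, 3, 4},
    "side": {6, 7},
}
selected = select_viewpoints(candidate_views, num_faces=8)
```

The loop stops only once every face is covered, which mirrors the completeness property the abstract mentions; the error branch handles faces that no candidate viewpoint can see.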
Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level...
This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations. To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency. These features exhibit...
Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (VDG) to boost pose and depth initialization and static-dynamic decomposition. Moreover, VDG can work with only RGB image input and construct dynamic scenes at a faster...
Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing NVS methods relies on sporadic sampling of image frames from the training data, comparing the rendered images with ground truth images using metrics. Unfortunately, this evaluation protocol falls short of meeting the actual requirements...