- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Vision and Imaging
- Computer Graphics and Visualization Techniques
- Robotics and Sensor-Based Localization
- Adversarial Robustness in Machine Learning
- 3D Shape Modeling and Analysis
- Neural Dynamics and Brain Function
- Blind Source Separation Techniques
- Brain Tumor Detection and Classification
- Visual Attention and Saliency Detection
- Advanced Image Processing Techniques
- Generative Adversarial Networks and Image Synthesis
- 3D Surveying and Cultural Heritage
- Multimodal Machine Learning Applications
- Video Surveillance and Tracking Methods
- COVID-19 Diagnosis Using AI
- Functional Brain Connectivity Studies
- Image Processing and 3D Reconstruction
- Sentiment Analysis and Opinion Mining
- Machine Learning and Data Classification
- Tactile and Sensory Interactions
- Adaptive Dynamic Programming Control
- Music and Audio Processing
- Direction-of-Arrival Estimation Techniques
Tsinghua University
2004-2025
Emory University
2018-2024
Soochow University
2024
China Mobile (China)
2023
Shandong University of Science and Technology
2023
Zhejiang University of Science and Technology
2022
PRG S&Tech (South Korea)
2021
Air Force Medical University
2019
Institute of Seismology
2018
University of Electronic Science and Technology of China
2014
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding data of multiple modalities could improve performance, yet the inner-modal attentive weights may be diluted, which may thus greatly undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse modalities, TokenFusion...
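The token-substitution idea behind TokenFusion can be illustrated with a minimal sketch: tokens whose importance score falls below a threshold are replaced by the aligned token from the other modality. The function name, the precomputed scores, and the threshold are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of token substitution between two aligned modalities.
# Tokens of modality A with a low (assumed, precomputed) importance score
# are replaced by their position-wise counterparts from modality B.

def token_fusion(tokens_a, tokens_b, scores_a, threshold=0.5):
    """Replace uninformative tokens of modality A with aligned tokens of B."""
    fused = []
    for tok_a, tok_b, s in zip(tokens_a, tokens_b, scores_a):
        fused.append(tok_a if s >= threshold else tok_b)
    return fused

tokens_a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
tokens_b = [[9.0, 9.0], [8.0, 8.0], [7.0, 7.0]]
scores_a = [0.9, 0.2, 0.7]  # the middle token is deemed uninformative
fused = token_fusion(tokens_a, tokens_b, scores_a)
# only the middle token is substituted by its counterpart from modality B
```

In the actual method the importance scores are learned and the substituted tokens are projected, which this toy version omits.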
Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods including aggregation-based and alignment-based fusion are still inadequate in balancing the trade-off between inter-modal fusion and intra-modal processing, incurring a bottleneck of performance improvement. To this end, this paper proposes Channel-Exchanging-Network (CEN), a parameter-free multimodal fusion framework that dynamically exchanges...
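The channel-exchanging idea can be sketched as follows: channels whose batch-normalization scaling factor is close to zero carry little information and are replaced by the corresponding channel of the other modality. The per-channel scalar representation, the threshold `eps`, and the function name are hypothetical simplifications of the actual mechanism.

```python
# Simplified channel exchanging guided by BN scaling factors (gamma).
# A near-zero gamma marks a redundant channel, which is swapped with
# the same channel from the other modality's feature map.

def channel_exchange(feat_a, feat_b, gamma_a, gamma_b, eps=0.02):
    """Exchange channels whose BN scaling factor is below eps."""
    out_a = [fb if abs(ga) < eps else fa
             for fa, fb, ga in zip(feat_a, feat_b, gamma_a)]
    out_b = [fa if abs(gb) < eps else fb
             for fa, fb, gb in zip(feat_a, feat_b, gamma_b)]
    return out_a, out_b

feat_a = [1.0, 2.0, 3.0]
feat_b = [10.0, 20.0, 30.0]
gamma_a = [0.8, 0.01, 0.5]  # channel 1 of A is redundant
gamma_b = [0.4, 0.3, 0.0]   # channel 2 of B is redundant
out_a, out_b = channel_exchange(feat_a, feat_b, gamma_a, gamma_b)
```

The exchange itself introduces no extra parameters, which matches the "parameter-free" claim; here each "channel" is a single scalar for brevity.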
3D object detection is an important task in autonomous driving to perceive the surroundings. Despite the excellent performance, existing detectors lack robustness to real-world corruptions caused by adverse weathers, sensor noises, etc., provoking concerns about the safety and reliability of autonomous driving systems. To comprehensively and rigorously benchmark the corruption robustness of 3D detectors, in this paper we design 27 types of common corruptions for both LiDAR and camera inputs considering real-world driving scenarios. By synthesizing these corruptions on public datasets,...
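As one hedged example of a LiDAR-side corruption of the kind such a benchmark covers, the snippet below jitters every point with Gaussian positional noise; the function name and parameters are illustrative, not the paper's exact implementation.

```python
import random

def jitter_points(points, sigma=0.02, seed=0):
    """Corrupt a point cloud by adding Gaussian noise to each (x, y, z).

    A seeded RNG keeps the corruption reproducible across runs,
    which is useful when benchmarking detectors on fixed datasets.
    """
    rng = random.Random(seed)
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]

pts = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
noisy = jitter_points(pts)  # same shape, slightly perturbed coordinates
```

Camera-side corruptions (blur, brightness shifts, etc.) would be synthesized analogously on the image inputs.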
This paper presents an algorithm for classifying single-trial electroencephalogram (EEG) during the preparation of self-paced tapping. It combines common spatial subspace decomposition with Fisher discriminant analysis to extract features from multichannel EEG. Three features are obtained based on Bereitschaftspotential and event-related desynchronization. Finally, a perceptron neural network is trained as the classifier. The algorithm was applied to the data set <self-paced 1s> of "BCI Competition 2003", and the classification accuracy...
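The Fisher discriminant step of such a pipeline has a closed form: the projection direction is the inverse within-class scatter matrix applied to the difference of class means. A minimal sketch for 2D features, with toy data and names that are illustrative only:

```python
# Closed-form Fisher discriminant direction w = Sw^{-1} (m1 - m0)
# for two classes of 2D feature vectors (pure-Python 2x2 algebra).

def fisher_direction(class0, class1):
    def mean(xs):
        n = len(xs)
        return [sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n]

    def scatter(xs, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in xs:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

class0 = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
class1 = [[3.0, 0.0], [4.0, 1.0], [4.0, 0.0], [3.0, 1.0]]
w = fisher_direction(class0, class1)  # points along the separating axis
```

Projections of trial features onto `w` would then feed the perceptron classifier; the spatial-decomposition stage that produces the features is omitted here.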
3D object detection is a crucial research topic in computer vision, which usually uses 3D point clouds as input in conventional setups. Recently, there is a trend of leveraging multiple sources of data, such as complementing the point cloud with 2D images that often have richer color and fewer noises. However, due to the heterogeneous geometrics of the two representations, it prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To this end, we propose Bridged Transformer (BrT), an end-to-end architecture...
Face recognition is a prevailing authentication solution in numerous biometric applications. Physical adversarial attacks, as an important surrogate, can identify the weaknesses of face recognition systems and evaluate their robustness before being deployed. However, most existing physical attacks are either readily detectable or ineffective against commercial recognition systems. The goal of this work is to develop a more reliable technique that can carry out an end-to-end evaluation of adversarial robustness for face recognition systems. It requires the technique to simultaneously deceive...
Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune at fixed ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained models at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution...
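For contrast with the fixed-ratio baselines, a plain magnitude-pruning sketch shows how a single importance ordering of the weights can serve every target sparsity ratio at once. This is a simplified stand-in for illustration, not the differential-inclusion dynamics the abstract describes.

```python
# One importance ordering (here: absolute magnitude) yields a binary
# mask for every requested sparsity ratio without re-running pruning.

def pruning_path(weights, ratios):
    """Return {ratio: mask} where mask[i] == 0 means weight i is pruned."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    masks = {}
    for r in ratios:
        k = round(len(weights) * r)  # number of weights to remove
        dropped = set(order[:k])
        masks[r] = [0 if i in dropped else 1 for i in range(len(weights))]
    return masks

weights = [0.1, -2.0, 0.5, 0.05, 1.5]
masks = pruning_path(weights, [0.2, 0.6])
# masks at different ratios are nested: a weight pruned at 20% sparsity
# is also pruned at 60% sparsity
```

The nesting property is what makes a single "solution path" sufficient for all ratios, which is the aspect the paper's mask dynamics generalize.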
The advancement of 4D (i.e., sequential 3D) generation opens up new possibilities for lifelike experiences in various applications, where users can explore dynamic objects or characters from any viewpoint. Meanwhile, video generative models are receiving particular attention given their ability to produce realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, indicating their potential to act as world simulators. In this work, we present Video4DGen, a novel framework...
Unsupervised non-rigid point cloud shape correspondence underpins a multitude of 3D vision tasks, yet itself is non-trivial given the exponential complexity stemming from inter-point degrees of freedom, i.e., pose transformations. Based on the assumption of local rigidity, one solution for reducing the complexity is to decompose the overall shape into independent local regions using Local Reference Frames (LRFs) that are equivariant to SE(3) transformations. However, focusing solely on local structure neglects global geometric contexts, resulting in less...
With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse...
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. Firstly, unlike existing methods that necessitate individual encoders for different modalities, we verify that multimodal features can be learnt within a single shared network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Secondly, we propose a bidirectional multi-layer fusion scheme, where...
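The first scheme, a shared transform combined with modality-specific batch normalization, can be illustrated with a toy one-weight layer. The class name, the single scalar weight, and the per-modality parameter dictionary are assumptions made for brevity.

```python
import math

class SharedEncoderLayer:
    """One shared linear transform with modality-specific BatchNorm params."""

    def __init__(self, weight, modalities):
        self.weight = weight  # shared across all modalities
        # each modality keeps its own affine BN parameters
        self.bn = {m: {"gamma": 1.0, "beta": 0.0} for m in modalities}

    def forward(self, x, modality, eps=1e-5):
        h = [self.weight * v for v in x]            # shared transform
        mean = sum(h) / len(h)
        var = sum((v - mean) ** 2 for v in h) / len(h)
        p = self.bn[modality]                        # modality-specific BN
        return [p["gamma"] * (v - mean) / math.sqrt(var + eps) + p["beta"]
                for v in h]

layer = SharedEncoderLayer(2.0, ["rgb", "depth"])
layer.bn["depth"]["beta"] = 1.0  # only the BN params differ per modality
out_rgb = layer.forward([1.0, 2.0, 3.0], "rgb")
out_depth = layer.forward([1.0, 2.0, 3.0], "depth")
```

Since the transform weights are shared and only the lightweight BN parameters are duplicated, the parameter overhead of supporting an extra modality is negligible, which is the compactness argument in the abstract.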
Multimodal fusion and multitask learning are two vital topics in machine learning. Despite the fruitful progress, existing methods for both problems are still brittle to the same challenge: it remains dilemmatic to integrate the common information across modalities (resp. tasks) while preserving the specific patterns of each modality (resp. task). Besides, although the two problems are actually closely related to each other, multimodal fusion and multitask learning have rarely been explored within the same methodological framework before. In this paper, we propose...
We propose a deep fine-grained multi-level fusion architecture for monocular 3D object detection, with an additionally designed anti-occlusion optimization process. Conventional monocular detection methods usually leverage geometry constraints such as keypoints, shape relationships, and 3D-to-2D optimizations to offset the lack of accurate depth information. However, these methods still struggle to directly extract rich information from depth estimation. To solve this problem, we integrate features derived from pseudo-LiDAR and filter...
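The pseudo-LiDAR representation mentioned above is conventionally obtained by back-projecting an estimated depth map through the pinhole camera model. A minimal sketch, assuming known intrinsics `fx, fy, cx, cy` (the function name and toy depth map are illustrative):

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into a 3D point cloud.

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid or missing depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# 2x2 toy depth map with one invalid pixel; toy unit-focal intrinsics
pts = depth_to_pseudo_lidar([[2.0, 0.0], [4.0, 2.0]], 1.0, 1.0, 0.5, 0.5)
# yields three 3D points, one per valid depth pixel
```

A detector can then treat `pts` like a (noisier) LiDAR sweep, which is what enables fusing depth-derived and image features in architectures of this kind.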
In the low-bit quantization field, training Binarized Neural Networks (BNNs) is the extreme solution to ease the deployment of deep models on resource-constrained devices, having the lowest storage cost and significantly cheaper bit-wise operations compared to 32-bit floating-point counterparts. In this paper, we introduce Sub-bit Neural Networks (SNNs), a new type of binary quantization design tailored to compress and accelerate BNNs. SNNs are inspired by an empirical observation, showing that binary kernels learnt at convolutional layers of a BNN model...
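For background, the standard BNN kernel binarization that sub-bit schemes compress further can be sketched as sign weights with a mean-absolute-value scaling factor (an XNOR-Net-style formulation; the function name is an assumption):

```python
def binarize_kernel(w):
    """Binarize a kernel: sign(w) scaled by the mean absolute value.

    Approximates the real-valued kernel w as alpha * sign(w), so the
    signs can be stored in 1 bit each and convolutions become bit-wise.
    """
    alpha = sum(abs(x) for x in w) / len(w)
    signs = [1.0 if x >= 0 else -1.0 for x in w]
    return alpha, signs

alpha, signs = binarize_kernel([0.5, -1.5, 1.0, -1.0])
# the kernel is stored as one float (alpha) plus one bit per weight
```

Plain BNNs already need only 1 bit per weight; the sub-bit idea in the abstract exploits redundancy among whole binary kernels to go below that, which this baseline sketch does not attempt.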
Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning their behaviors and interests. As existing works neglect the importance of temporal signals when embedding users' clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing users' periodic behaviors and relative temporal signals for expressing the relation between...
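A toy version of the relative-time side of such an attention model: each historical record's attention logit is penalized by its time interval before the softmax, so older clicks contribute less. The linear decay, the scalar values, and all names are illustrative assumptions, not the proposed model.

```python
import math

def time_aware_attention(values, scores, intervals, decay=0.1):
    """Attend over history; older records (larger interval) get a penalty.

    values: per-record representations (scalars here for brevity).
    scores: base attention logits (e.g. query-key similarities).
    intervals: time elapsed since each record (relative temporal signal).
    """
    logits = [s - decay * t for s, t in zip(scores, intervals)]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * v for w, v in zip(weights, values)), weights

# two equally relevant records, the second one 10 time units older
out, weights = time_aware_attention([1.0, 2.0], [0.0, 0.0], [0.0, 10.0])
# the recent record dominates the weighted output
```

Absolute temporal signals (hour of day, day of week) would instead be embedded into the `scores` to capture the periodic behaviors mentioned in the abstract.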
Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. Besides, these models are also observed to exhibit strong 3D consistency, significantly enhancing their potential to act as world simulators. In this work, we present Vidu4D, a novel reconstruction model that excels in accurately reconstructing 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion. This...
Gesture recognition has been paid more and more attention as a new generation of visual input mode for human-computer interaction. In motion sensing games and other applications, gesture is used as the interface. However, because of its inherent features such as diversity, ambiguity, space-time difference and large computational burden, it is difficult to achieve real-time application with software, especially in an embedded system. Therefore, in this paper we propose a hardware-based gesture recognition system as well as an innovative algorithm...
Tactile sensing plays an important role in robotic perception and manipulation tasks. To overcome the real-world limitations of data collection, simulating the tactile response in a virtual environment comes as a desirable direction of research. In this paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation, which is capable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction during contact. Specifically, EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied...
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic results from hundreds of input photos. Despite the great success in dense-view scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong...
The audio-visual navigation task requires an agent to find a sound source in a realistic, unmapped 3D environment by utilizing egocentric audio-visual observations. Existing works assume a clean environment that solely contains the target sound, which, however, would not be suitable for most real-world applications due to unexpected noise or intentional interference. In this work, we design an acoustically complex environment in which, besides the target sound, there exists a sound attacker playing a zero-sum game with the agent. More specifically, the attacker can move and change...
This work focuses on the 3D reconstruction of non-rigid objects based on monocular RGB video sequences. Concretely, we aim at building high-fidelity models for generic object categories and casually captured scenes. To this end, we do not assume known root poses of objects, and do not utilize category-specific templates or dense pose priors. The key idea of our method, Root Pose Decomposition (RPD), is to maintain a per-frame root pose transformation, meanwhile building a dense field of local transformations to rectify the root pose. The optimization...