Yikai Wang

ORCID: 0000-0003-1341-6235
Research Areas
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Vision and Imaging
  • Computer Graphics and Visualization Techniques
  • Robotics and Sensor-Based Localization
  • Adversarial Robustness in Machine Learning
  • 3D Shape Modeling and Analysis
  • Neural dynamics and brain function
  • Blind Source Separation Techniques
  • Brain Tumor Detection and Classification
  • Visual Attention and Saliency Detection
  • Advanced Image Processing Techniques
  • Generative Adversarial Networks and Image Synthesis
  • 3D Surveying and Cultural Heritage
  • Multimodal Machine Learning Applications
  • Video Surveillance and Tracking Methods
  • COVID-19 diagnosis using AI
  • Functional Brain Connectivity Studies
  • Image Processing and 3D Reconstruction
  • Sentiment Analysis and Opinion Mining
  • Machine Learning and Data Classification
  • Tactile and Sensory Interactions
  • Adaptive Dynamic Programming Control
  • Music and Audio Processing
  • Direction-of-Arrival Estimation Techniques

Tsinghua University
2004-2025

Emory University
2018-2024

Soochow University
2024

China Mobile (China)
2023

Shandong University of Science and Technology
2023

Zhejiang University of Science and Technology
2022

PRG S&Tech (South Korea)
2021

Air Force Medical University
2019

Institute of Seismology
2018

University of Electronic Science and Technology of China
2014

Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the inner-modal attentive weights may also be diluted, which could thus greatly undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion...
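
A minimal sketch of the token-substitution idea that the abstract describes, under loose assumptions: tokens of one modality whose learned importance score falls below a threshold are replaced by projected tokens from the other modality. The scoring heads, projections, and threshold here are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    """Illustrative two-modality token fusion: substitute uninformative tokens."""
    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        self.score_a = nn.Linear(dim, 1)    # per-token importance for modality A
        self.score_b = nn.Linear(dim, 1)    # per-token importance for modality B
        self.proj_ab = nn.Linear(dim, dim)  # project A-tokens into B's space
        self.proj_ba = nn.Linear(dim, dim)  # project B-tokens into A's space
        self.threshold = threshold

    def forward(self, tok_a: torch.Tensor, tok_b: torch.Tensor):
        # tok_a, tok_b: (batch, num_tokens, dim), assumed aligned token-wise
        s_a = torch.sigmoid(self.score_a(tok_a))  # (B, N, 1), scores in [0, 1]
        s_b = torch.sigmoid(self.score_b(tok_b))
        # Replace low-scoring tokens with the projection of the other modality.
        fused_a = torch.where(s_a < self.threshold, self.proj_ba(tok_b), tok_a)
        fused_b = torch.where(s_b < self.threshold, self.proj_ab(tok_a), tok_b)
        return fused_a, fused_b

a, b = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
fa, fb = TokenFusionSketch(64)(a, b)
print(fa.shape, fb.shape)  # torch.Size([2, 16, 64]) twice
```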

10.1109/cvpr52688.2022.01187 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

10.1109/cvpr52733.2024.02022 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods, including aggregation-based and alignment-based fusion, are still inadequate in balancing the trade-off between inter-modal fusion and intra-modal processing, incurring a bottleneck of performance improvement. To this end, this paper proposes Channel-Exchanging-Network (CEN), a parameter-free framework that dynamically exchanges...
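
A minimal sketch of the channel-exchanging idea, assuming the exchange criterion is the magnitude of the BatchNorm scaling factor gamma (channels whose |gamma| is near zero carry little information and are swapped for the other modality's channels); the threshold value is an assumption for illustration.

```python
import torch
import torch.nn as nn

def channel_exchange(x_a, x_b, bn_a: nn.BatchNorm2d, bn_b: nn.BatchNorm2d,
                     threshold: float = 1e-2):
    """Sketch of BN-guided channel exchanging between two modalities.

    Channels whose BN scaling factor |gamma| is close to zero are deemed
    uninformative and replaced by the corresponding channels of the other
    modality. Feature shapes: (batch, channels, H, W).
    """
    mask_a = (bn_a.weight.abs() < threshold).view(1, -1, 1, 1)  # weak channels in A
    mask_b = (bn_b.weight.abs() < threshold).view(1, -1, 1, 1)  # weak channels in B
    out_a = torch.where(mask_a, x_b, x_a)
    out_b = torch.where(mask_b, x_a, x_b)
    return out_a, out_b

bn_a, bn_b = nn.BatchNorm2d(8), nn.BatchNorm2d(8)
x_a, x_b = torch.randn(2, 8, 4, 4), torch.randn(2, 8, 4, 4)
y_a, y_b = channel_exchange(bn_a(x_a), bn_b(x_b), bn_a, bn_b)
```

In training, a sparsity penalty on the gamma parameters would drive some of them toward zero, making the exchange meaningful rather than random.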

10.48550/arxiv.2011.05005 preprint EN other-oa arXiv (Cornell University) 2020-01-01

3D object detection is an important task in autonomous driving to perceive the surroundings. Despite the excellent performance, existing detectors lack robustness to real-world corruptions caused by adverse weathers, sensor noises, etc., provoking concerns about the safety and reliability of autonomous driving systems. To comprehensively and rigorously benchmark the corruption robustness of 3D detectors, in this paper we design 27 types of common corruptions for both LiDAR and camera inputs considering real-world scenarios. By synthesizing these corruptions on public datasets,...
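
As a flavor of what one such synthesized corruption might look like, below is a hedged sketch of a Gaussian-noise corruption for LiDAR points with discrete severity levels; the noise scales are placeholder assumptions, not the benchmark's actual parameters.

```python
import numpy as np

def corrupt_gaussian_noise(points: np.ndarray, severity: int = 3) -> np.ndarray:
    """Illustrative LiDAR corruption: jitter xyz coordinates with Gaussian
    noise whose scale grows with the severity level (1-5). `points` has
    shape (N, 3+) with xyz in the first three columns."""
    sigma = [0.02, 0.04, 0.06, 0.08, 0.10][severity - 1]  # assumed scales
    noisy = points.copy()
    noisy[:, :3] += np.random.normal(0.0, sigma, size=(points.shape[0], 3))
    return noisy

cloud = np.random.rand(1024, 4).astype(np.float32)  # x, y, z, intensity
print(corrupt_gaussian_noise(cloud, severity=5).shape)  # (1024, 4)
```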

10.1109/cvpr52729.2023.00105 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

This paper presents an algorithm for classifying single-trial electroencephalogram (EEG) during the preparation of self-paced tapping. It combines common spatial subspace decomposition with Fisher discriminant analysis to extract features from multichannel EEG. Three features are obtained based on the Bereitschaftspotential and event-related desynchronization. Finally, a perceptron neural network is trained as the classifier. The algorithm was applied to the data set <self-paced 1s> of "BCI Competition 2003", and the classification accuracy...
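
For intuition, here is a compact sketch of the two classical ingredients named in the abstract, common-spatial-pattern-style filtering and a Fisher discriminant direction, written in a textbook form; channel counts, trial counts, and the log-variance feature choice are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(cov_a: np.ndarray, cov_b: np.ndarray, n_pairs: int = 2):
    """Spatial filters from two class-average covariance matrices via the
    generalized eigenproblem cov_a w = lambda (cov_a + cov_b) w; keeps the
    filters with the largest and smallest eigenvalues."""
    vals, vecs = eigh(cov_a, cov_a + cov_b)
    order = np.argsort(vals)
    picks = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return vecs[:, picks].T  # (2 * n_pairs, channels)

def fisher_direction(feat_a: np.ndarray, feat_b: np.ndarray):
    """Fisher discriminant direction w = Sw^{-1} (mu_a - mu_b)."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    sw = np.cov(feat_a.T) + np.cov(feat_b.T)  # within-class scatter
    return np.linalg.solve(sw, mu_a - mu_b)

rng = np.random.default_rng(0)
trials_a = rng.standard_normal((30, 28, 500))  # trials x channels x samples
trials_b = rng.standard_normal((30, 28, 500))
cov = lambda t: np.mean([x @ x.T / np.trace(x @ x.T) for x in t], axis=0)
w_csp = csp_filters(cov(trials_a), cov(trials_b))
# Log-variance features of spatially filtered trials, then an LDA direction.
feats = lambda t: np.log(np.var(np.einsum('fc,ncs->nfs', w_csp, t), axis=2))
w_lda = fisher_direction(feats(trials_a), feats(trials_b))
```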

10.1109/tbme.2004.826697 article EN IEEE Transactions on Biomedical Engineering 2004-05-25

3D object detection is a crucial research topic in computer vision, which usually uses 3D point clouds as input in conventional setups. Recently, there is a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images that often have richer color and fewer noises. However, the heterogeneous geometrics of the 2D and 3D representations prevent us from applying off-the-shelf neural networks to achieve multimodal fusion. To this end, we propose Bridged Transformer (BrT), an end-to-end architecture...
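
A loose sketch of the basic ingredient such an architecture needs, joint attention over heterogeneous tokens: point features and image patch features are projected to a common width, concatenated, and processed by a standard transformer layer. The projections, dimensions, and single encoder layer are stand-ins for illustration, not BrT's actual design.

```python
import torch
import torch.nn as nn

dim = 64
proj_pts = nn.Linear(3, dim)     # raw xyz -> token (illustrative)
proj_img = nn.Linear(768, dim)   # e.g., ViT patch features -> token
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

pts = torch.randn(2, 100, 3)       # (batch, points, xyz)
patches = torch.randn(2, 196, 768)  # (batch, patches, feature_dim)
# Concatenate both token sets so self-attention spans 2D and 3D inputs.
tokens = torch.cat([proj_pts(pts), proj_img(patches)], dim=1)  # (2, 296, 64)
fused = layer(tokens)
print(fused.shape)  # torch.Size([2, 296, 64])
```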

10.1109/cvpr52688.2022.01180 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Face recognition is a prevailing authentication solution in numerous biometric applications. Physical adversarial attacks, as an important surrogate, can identify the weaknesses of face recognition systems and evaluate their robustness before deployment. However, most existing physical attacks are either easily detectable or ineffective against commercial systems. The goal of this work is to develop a more reliable technique that can carry out an end-to-end evaluation of adversarial robustness for commercial systems. It requires simultaneously deceiving...

10.1109/cvpr52729.2023.00401 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution...

10.48550/arxiv.2501.03289 preprint EN arXiv (Cornell University) 2025-01-06

The advancement of 4D (i.e., sequential 3D) generation opens up new possibilities for lifelike experiences in various applications, where users can explore dynamic objects or characters from any viewpoint. Meanwhile, video generative models are receiving particular attention given their ability to produce realistic and imaginative frames. These models are also observed to exhibit strong 3D consistency, indicating their potential to act as world simulators. In this work, we present Video4DGen, a novel framework...

10.1109/tpami.2025.3550031 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2025-01-01

Unsupervised non-rigid point cloud shape correspondence underpins a multitude of 3D vision tasks, yet is itself non-trivial given the exponential complexity stemming from inter-point degrees of freedom, i.e., pose transformations. Based on the assumption of local rigidity, one solution for reducing the complexity is to decompose the overall shape into independent local regions using Local Reference Frames (LRFs) that are equivariant to SE(3) transformations. However, focusing solely on local structure neglects global geometric contexts, resulting in less...
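
To ground the LRF notion, below is a hedged sketch of one standard way to build per-point local reference frames, via PCA of each point's k-nearest-neighbor covariance; the paper's exact construction may differ, and k is an assumption.

```python
import torch

def local_reference_frames(points: torch.Tensor, k: int = 16):
    """Per-point local reference frame from the eigenvectors of the
    k-NN covariance matrix (a common PCA-style LRF construction).
    points: (N, 3) -> frames: (N, 3, 3), eigenvectors as columns."""
    dists = torch.cdist(points, points)            # (N, N) pairwise distances
    knn = dists.topk(k, largest=False).indices     # (N, k) neighbor indices
    neigh = points[knn]                            # (N, k, 3)
    centered = neigh - neigh.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / k  # (N, 3, 3)
    _, frames = torch.linalg.eigh(cov)             # orthonormal axes per point
    return frames

pts = torch.randn(256, 3)
print(local_reference_frames(pts).shape)  # torch.Size([256, 3, 3])
```

Because the covariance is built from relative offsets, rotating and translating the cloud rotates the resulting axes accordingly, which is the equivariance property the abstract refers to.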

10.1109/tip.2025.3550006 article EN IEEE Transactions on Image Processing 2025-01-01

With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse...
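
A toy sketch of a feature-bank mechanism in the spirit of the abstract: keep features from the last few frames and fuse them with the current frame's features. The fixed-size deque and simple averaging fusion rule are simplifying assumptions, not FB-4D's actual method.

```python
import torch
from collections import deque

class FeatureBank:
    """Store features of the last `capacity` frames and fuse them with the
    current frame by averaging (illustrative fusion rule)."""
    def __init__(self, capacity: int = 4):
        self.bank = deque(maxlen=capacity)

    def fuse(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (tokens, dim); average current features with stored ones.
        stacked = torch.stack(list(self.bank) + [feat])  # (frames, tokens, dim)
        fused = stacked.mean(dim=0)
        self.bank.append(feat.detach())  # keep for later frames
        return fused

bank = FeatureBank()
for _ in range(6):
    out = bank.fuse(torch.randn(77, 128))
print(out.shape)  # torch.Size([77, 128])
```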

10.48550/arxiv.2503.20784 preprint EN arXiv (Cornell University) 2025-03-26

We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. Firstly, unlike existing methods that necessitate individual encoders for different modalities, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Secondly, we propose a bidirectional multi-layer fusion scheme, where...
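
The first scheme is easy to sketch: share all convolutional weights across modalities while keeping one BatchNorm per modality. The layer sizes below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class SharedEncoderModalityBN(nn.Module):
    """Shared convolution weights with modality-specific BatchNorm layers,
    sketching the first fusion scheme described above."""
    def __init__(self, num_modalities: int = 2, channels: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, 3, padding=1)  # shared across modalities
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(channels) for _ in range(num_modalities)
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        # Same conv for every modality; only BN statistics/affine params differ.
        return self.act(self.bns[modality](self.conv(x)))

enc = SharedEncoderModalityBN()
rgb, depth = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
f_rgb, f_depth = enc(rgb, modality=0), enc(depth, modality=1)
```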

10.1145/3394171.3413621 article EN Proceedings of the 30th ACM International Conference on Multimedia 2020-10-12

Multimodal fusion and multitask learning are two vital topics in machine learning. Despite the fruitful progress, existing methods for both problems are still brittle to the same challenge: it remains dilemmatic to integrate the common information across modalities (resp. tasks) while preserving the specific patterns of each modality (resp. task). Besides, although they are actually closely related to each other, multimodal fusion and multitask learning have rarely been explored within the same methodological framework before. In this paper, we propose...

10.1109/tpami.2022.3211086 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-09-30

We propose a deep fine-grained multi-level fusion architecture for monocular 3D object detection, with an additionally designed anti-occlusion optimization process. Conventional monocular 3D detection methods usually leverage geometry constraints such as keypoints, shape relationships, and 3D-to-2D optimizations to offset the lack of accurate depth information. However, these methods still struggle to directly extract rich information from depth estimation. To solve this problem, we integrate features from pseudo-LiDAR and filter...
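
For context, the standard pseudo-LiDAR step that such pipelines build on converts an estimated depth map into a point cloud via the pinhole camera model; the sketch below shows that conversion, with placeholder intrinsics.

```python
import numpy as np

def depth_to_pseudo_lidar(depth: np.ndarray, fx: float, fy: float,
                          cx: float, cy: float) -> np.ndarray:
    """Back-project a depth map into a pseudo-LiDAR point cloud using the
    pinhole model. depth: (H, W) in meters -> points: (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) * z / fx
    y = (v.reshape(-1) - cy) * z / fy
    return np.stack([x, y, z], axis=1)

depth = np.full((4, 4), 10.0)  # toy 4x4 depth map, all points at 10 m
pts = depth_to_pseudo_lidar(depth, fx=720.0, fy=720.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3)
```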

10.1109/tip.2022.3180210 article EN IEEE Transactions on Image Processing 2022-01-01

In the low-bit quantization field, training Binarized Neural Networks (BNNs) is the extreme solution to ease the deployment of deep models on resource-constrained devices, having the lowest storage cost and significantly cheaper bit-wise operations compared to 32-bit floating-point counterparts. In this paper, we introduce Sub-bit Neural Networks (SNNs), a new type of binary quantization design tailored to compress and accelerate BNNs. SNNs are inspired by an empirical observation showing that binary kernels learnt at convolutional layers of a BNN model...
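
As background, the BNN building block that SNN-style methods refine is sign binarization trained with a straight-through estimator (STE); a minimal sketch of that primitive follows. This shows the generic BNN mechanism, not the SNN-specific sub-bit kernel compression.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)  # weights in {-1, +1}

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Pass gradients through only where |w| <= 1 (clipped STE).
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

w = torch.randn(3, 3, requires_grad=True)
b = BinarizeSTE.apply(w)
b.sum().backward()
print(b.unique(), w.grad.shape)  # values in {-1, 1}; grad same shape as w
```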

10.1109/iccv48922.2021.00531 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests. As existing works neglect the importance of temporal signals when embedding users' clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing users' periodic behaviors and relative temporal signals for expressing the relation between...
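
A hedged sketch of what time-aware attention over a click history could look like: each clicked item's embedding is augmented with an absolute-time embedding (e.g., hour of day, capturing periodicity) and a bucketized relative time gap to the current request, before attention against the candidate item. All bucket sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeAwareAttention(nn.Module):
    """Attention over clicked items with absolute- and relative-time embeddings."""
    def __init__(self, dim: int = 32, n_hours: int = 24, n_gap_buckets: int = 16):
        super().__init__()
        self.abs_emb = nn.Embedding(n_hours, dim)        # absolute time (periodic)
        self.rel_emb = nn.Embedding(n_gap_buckets, dim)  # bucketized time gaps
        self.query = nn.Linear(dim, dim)

    def forward(self, item_emb, hour_ids, gap_buckets, target_emb):
        # item_emb: (B, T, D); hour_ids, gap_buckets: (B, T); target_emb: (B, D)
        keys = item_emb + self.abs_emb(hour_ids) + self.rel_emb(gap_buckets)
        scores = (keys @ self.query(target_emb).unsqueeze(-1)).squeeze(-1)  # (B, T)
        attn = torch.softmax(scores, dim=-1)
        return (attn.unsqueeze(-1) * item_emb).sum(dim=1)  # (B, D) interest vector

m = TimeAwareAttention()
out = m(torch.randn(2, 5, 32), torch.randint(0, 24, (2, 5)),
        torch.randint(0, 16, (2, 5)), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 32])
```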

10.1145/3357384.3357936 preprint EN 2019-11-03

Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. Besides, these models are also observed to exhibit strong 3D consistency, significantly enhancing their potential to act as world simulators. In this work, we present Vidu4D, a novel reconstruction model that excels in accurately reconstructing 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion. This...

10.48550/arxiv.2405.16822 preprint EN arXiv (Cornell University) 2024-05-27

Gesture recognition has received more and more attention as a new generation of visual input mode for human-computer interaction. In motion sensing games and other applications, gesture is used as the interface. However, because of its inherent features such as diversity, ambiguity, space-time differences, and a large computational burden, it is difficult to achieve real-time application in software, especially in an embedded system. Therefore, in this paper we propose a hardware-based system as well as an innovative algorithm...

10.1109/icosp.2014.7015043 article EN 2014-10-01

Tactile sensing plays an important role in robotic perception and manipulation tasks. To overcome the real-world limitations of data collection, simulating the tactile response in a virtual environment becomes a desirable direction of research. In this paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation, which is capable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction during contact. Specifically, EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied...

10.1145/3474085.3475414 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17

Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic results from hundreds of input photos. Despite great success in dense-view scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong...

10.48550/arxiv.2408.16767 preprint EN arXiv (Cornell University) 2024-08-29

The audio-visual navigation task requires an agent to find a sound source in a realistic, unmapped 3D environment by utilizing egocentric audio-visual observations. Existing works assume a clean environment that solely contains the target sound, which, however, would not be suitable for most real-world applications due to unexpected noise or intentional interference. In this work, we design an acoustically complex environment in which, besides the target sound, there exists a sound attacker playing a zero-sum game with the agent. More specifically, the attacker can move and change...

10.48550/arxiv.2202.10910 preprint EN other-oa arXiv (Cornell University) 2022-01-01

This work focuses on the 3D reconstruction of non-rigid objects based on monocular RGB video sequences. Concretely, we aim at building high-fidelity models for generic object categories and casually captured scenes. To this end, we do not assume known root poses of the objects, and do not utilize category-specific templates or dense pose priors. The key idea of our method, Root Pose Decomposition (RPD), is to maintain a per-frame root pose transformation, meanwhile building a dense field with local transformations to rectify the root pose. The optimization...

10.1109/iccv51070.2023.01277 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01