Xiaohan Wang

ORCID: 0000-0001-6206-7911
Research Areas
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Topic Modeling
  • Advanced Graph Neural Networks
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Anomaly Detection Techniques and Applications
  • Video Surveillance and Tracking Methods
  • Combustion and flame dynamics
  • Advanced Neural Network Applications
  • Complex Network Analysis Techniques
  • Video Analysis and Summarization
  • Stock Market Forecasting Methods
  • 3D Shape Modeling and Analysis
  • Advanced Combustion Engine Technologies
  • Visual Attention and Saliency Detection
  • 3D Surveying and Cultural Heritage
  • Advanced Vision and Imaging
  • Hand Gesture Recognition Systems
  • Image Enhancement Techniques
  • Radiative Heat Transfer Studies
  • Semantic Web and Ontologies
  • Traffic Prediction and Management Techniques
  • Internet Traffic Analysis and Secure E-voting

Stanford University
2024-2025

City University of Hong Kong
2025

Nanjing University
2025

Guangzhou Institute of Energy Conversion
2007-2024

Shenyang Ligong University
2024

Zhejiang University
2021-2024

Beijing Institute of Technology
2021-2024

Harbin Engineering University
2024

Chinese Academy of Sciences
2009-2024

Institute of Computing Technology
2024

Text-video retrieval is a challenging task that aims to search for relevant video content based on natural language descriptions. The key to this problem is to measure text-video similarities in a joint embedding space. However, most existing methods only consider the global cross-modal similarity and overlook local details. Some works incorporate local comparisons through cross-modal matching and reasoning, but these complex operations introduce tremendous computation. In this paper, we design an efficient global-local alignment...

10.1109/cvpr46437.2021.00504 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
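
The global cross-modal similarity mentioned in the entry above reduces to a dot product between normalized text and video embeddings in a shared space. A minimal sketch follows; the function name and dimensions are hypothetical and this is not the paper's implementation:

```python
import numpy as np

def global_similarity(text_emb: np.ndarray, video_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between every text and every video embedding.

    text_emb:  (num_texts, d) pooled sentence embeddings
    video_emb: (num_videos, d) pooled video embeddings
    Returns a (num_texts, num_videos) similarity matrix used for ranking.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return t @ v.T  # higher score = more relevant video for the query

# toy usage: rank 4 videos for 2 text queries in a shared 512-d space
sims = global_similarity(np.random.randn(2, 512), np.random.randn(4, 512))
ranking = np.argsort(-sims, axis=1)
```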

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in the bridge between the visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore...

10.1109/cvpr52729.2023.00640 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Signal processing on graphs is attracting more and more attention. For a signal in the low-frequency subspace, missing data associated with unsampled vertices can be reconstructed from the sampled data by exploiting the smoothness of the signal. In this paper, the concept of a local set is introduced, and two local-set-based iterative methods are proposed to reconstruct a bandlimited signal from sampled data. In each iteration, one method reweights the residuals for different vertices, while the other propagates the residuals within their respective local sets. These algorithms are built...

10.1109/tsp.2015.2411217 article EN IEEE Transactions on Signal Processing 2015-03-06
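
For context, a generic iterative scheme for recovering a bandlimited graph signal from samples looks like the sketch below. This is a textbook-style projection iteration under assumed inputs (Laplacian, sample indices, bandwidth), not the paper's local-set-based algorithms:

```python
import numpy as np

def iterative_bandlimited_reconstruction(L, sampled_idx, sampled_vals, k, iters=200):
    """Projection-style recovery of a k-bandlimited graph signal from samples.

    L: (n, n) graph Laplacian; sampled_idx: indices of sampled vertices;
    sampled_vals: observed values at those vertices; k: bandwidth.
    """
    _, eigvec = np.linalg.eigh(L)       # graph Fourier basis
    Uk = eigvec[:, :k]                  # low-frequency subspace
    P = Uk @ Uk.T                       # projector onto that subspace
    x = np.zeros(L.shape[0])
    for _ in range(iters):
        x[sampled_idx] = sampled_vals   # enforce the known samples
        x = P @ x                       # project back to the bandlimited space
    return x
```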

Signal processing on graphs is an emerging research field dealing with signals that live on an irregular domain captured by a graph, and it has been applied to sensor networks, machine learning, climate analysis, etc. Existing works on the sampling and reconstruction of graph signals have mainly studied static bandlimited signals. However, many real-world graph signals are time-varying and evolve smoothly, so instead of the signals themselves being bandlimited or smooth on the graph, it is more reasonable to assume that their temporal differences are smooth on the graph. In this paper, a new batch reconstruction method...

10.1109/jstsp.2017.2726969 article EN IEEE Journal of Selected Topics in Signal Processing 2017-07-13
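
The smoothness assumption on temporal differences can be made concrete with the graph Laplacian quadratic form. A small illustrative sketch with hypothetical variable names, not code from the paper:

```python
import numpy as np

def laplacian_quadratic(L, x):
    """Graph smoothness measure x^T L x: small values mean x varies little
    across connected vertices."""
    return float(x @ L @ x)

def temporal_difference_smoothness(L, X):
    """X: (n_vertices, T) time-varying graph signal. The assumption is that
    the columns of D = X[:, 1:] - X[:, :-1] are smooth on the graph, even if
    the columns of X themselves are not."""
    D = np.diff(X, axis=1)
    return [laplacian_quadratic(L, D[:, t]) for t in range(D.shape[1])]
```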

In this paper, we propose to tackle egocentric action recognition by suppressing background distractors and enhancing action-relevant interactions. Existing approaches usually utilize two independent branches to recognize actions, i.e., a verb branch and a noun branch. However, a mechanism to suppress distracting objects and exploit local human-object correlations is missing. To this end, we introduce two extra sources of information, i.e., the candidate objects' spatial locations and their discriminative features, to enable concentration...

10.1109/tpami.2020.3015894 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2020-08-11

Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, the vision transformer's essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval...

10.1145/3477495.3531950 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022-07-06
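
To illustrate the token-redundancy problem described above (the paper itself addresses it with token clustering rather than the naive filter sketched here), a hypothetical near-duplicate filter over frame tokens might look like this:

```python
import numpy as np

def prune_redundant_tokens(tokens: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Drop visual tokens that are nearly identical to an already-kept token.

    tokens: (num_tokens, d) patch tokens pooled over consecutive video frames.
    A token is kept only if its cosine similarity to every kept token is
    below the threshold; consecutive, similar frames produce many drops.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept = []
    for i, tok in enumerate(normed):
        if not kept or np.max(normed[kept] @ tok) < threshold:
            kept.append(i)
    return tokens[kept]
```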

Text-video retrieval is one of the basic tasks for multimodal research and has been widely harnessed in many real-world systems. Most existing approaches directly compare the global representations of videos and text descriptions and utilize a contrastive loss to train the model. These designs overlook the local alignment and the word-level supervision signal. In this paper, we propose a new framework, called Align and Tell, for text-video retrieval. Compared to previous work, our framework contains additional modules, ...

10.1109/tmm.2022.3204444 article EN IEEE Transactions on Multimedia 2022-09-05
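
The global contrastive objective that the abstract describes as the common baseline is the symmetric InfoNCE loss. A compact sketch under assumed shapes; the paper's word-level alignment modules are not shown:

```python
import numpy as np

def info_nce(logits):
    """Cross-entropy with the matching pair on the diagonal as the target."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Standard symmetric video-text contrastive loss.

    video_emb, text_emb: (batch, d) paired embeddings; row i of each matches.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = (v @ t.T) / temperature
    return 0.5 * (info_nce(sims) + info_nce(sims.T))       # both directions
```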

People live in a 3D world. However, existing works on person re-identification (re-id) mostly consider semantic representation learning in a 2D space, intrinsically limiting the understanding of people. In this work, we address this limitation by exploring the prior knowledge of the 3D body structure. Specifically, we project 2D images to a 3D space and introduce a novel parameter-efficient Omni-scale Graph Network (OG-Net) to learn pedestrian representations directly from 3D point clouds. OG-Net effectively exploits the local information provided...

10.1109/tnnls.2022.3214834 article EN IEEE Transactions on Neural Networks and Learning Systems 2022-11-04

Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics are responsible for different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and use monotonous modeling structures (e.g., RNN or attention-based blocks) to design their networks. However, using a single kind of modeling structure makes it difficult to balance the learning of short-term and long-term temporal correlations, and may bias the network toward one of them,...

10.1109/cvpr52729.2023.00858 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
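
Intra-frame accuracy and inter-frame smoothness are typically quantified with per-joint position error and acceleration error. A rough sketch of these commonly used metrics; exact evaluation protocols vary across papers, so treat the details as assumptions:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (intra-frame accuracy), in the units of
    the inputs. pred, gt: (frames, joints, 3) joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def accel_error(pred, gt):
    """Mean per-joint acceleration difference (inter-frame smoothness).
    Acceleration is approximated by the second temporal difference."""
    accel_pred = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    accel_gt = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(accel_pred - accel_gt, axis=-1).mean()
```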

10.1016/j.compenvurbsys.2024.102153 article EN Computers, Environment and Urban Systems 2024-07-29

This paper delves into the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS). Previous VOS methods decode features with a single positive object, limiting the learning of multi-object representations, as they must match and segment each target separately under multi-object scenarios. Additionally, earlier techniques catered to specific application objectives and lacked the flexibility to fulfill different speed-accuracy requirements. To address these problems, we present...

10.1109/tpami.2024.3383592 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2024-04-02

Sign language provides a way for differently-abled individuals to express their feelings and emotions. However, learning sign language can be challenging and time-consuming. An alternative approach is to animate user photos using sign language videos of specific words, which can be achieved with existing image animation methods. However, the finger motions in the generated videos are often not ideal. To address this issue, we propose a Structure-aware Temporal Consistency Network (STCNet), which jointly optimizes the prior structure of humans with temporal...

10.1145/3648368 article EN ACM Transactions on Multimedia Computing Communications and Applications 2024-02-16

Image sensors with internal computing capabilities fuse sensing and computing to significantly reduce the power consumption and latency of machine vision tasks. Linear photodetectors such as 2D semiconductors with tunable electrical and optical properties enable in-sensor computing for multiple functions. In-sensor computing at the single-photon level is much more appealing but has not yet been achieved. Here, we demonstrate a photon-efficient camera based on a superconducting nanowire array detector with four programmable dimensions including...

10.1038/s41467-025-58501-2 article EN cc-by-nc-nd Nature Communications 2025-04-03

Recently, many researchers have started to challenge a long-standing practice of digital photography, namely oversampling followed by compression, and to pursue more intelligent sparse sampling techniques. In this paper, we propose a practical approach of uniform down-sampling in the image space while making the sampling adaptive by spatially varying, directional low-pass prefiltering. The resulting down-sampled, prefiltered image remains on a conventional square sample grid and, thus, can be compressed and transmitted without any change to current...

10.1109/tip.2008.2010638 article EN IEEE Transactions on Image Processing 2009-02-13
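
The pipeline described above can be summarized as prefilter-then-decimate. The sketch below uses a fixed isotropic Gaussian in place of the paper's spatially varying, directional prefilter, purely to show the structure of the approach:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def prefilter_and_downsample(image: np.ndarray, factor: int = 2, sigma: float = 1.0):
    """Uniform down-sampling after low-pass prefiltering.

    The output still lies on a regular square grid, so any standard image
    codec can compress and transmit it unchanged; the quality of the scheme
    hinges on how the prefilter adapts to local image structure.
    """
    smoothed = gaussian_filter(image.astype(float), sigma=sigma)
    return smoothed[::factor, ::factor]
```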

The rapid development of signal processing on graphs provides a new perspective for processing large-scale data associated with irregular domains. In many practical applications, it is necessary to handle massive data sets through complex networks, in which most nodes have limited computing power. Designing efficient distributed algorithms is critical for this task. This paper focuses on the distributed reconstruction of a time-varying bandlimited graph signal based on observations sampled at a subset of selected nodes. A distributed least square reconstruction (DLSR)...

10.1109/jstsp.2015.2403799 article EN IEEE Journal of Selected Topics in Signal Processing 2015-02-13
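
Setting the distributed aspect aside, the underlying estimation problem is a least-squares fit of low-frequency graph Fourier coefficients to the sampled values. A centralized sketch under assumed inputs; the paper's DLSR solves this iteratively and across nodes:

```python
import numpy as np

def least_square_reconstruction(L, sampled_idx, sampled_vals, k):
    """Centralized least-squares recovery of a k-bandlimited graph signal
    from samples observed at a subset of nodes.

    L: (n, n) graph Laplacian; sampled_idx: indices of sampled nodes;
    sampled_vals: observations at those nodes; k: signal bandwidth.
    """
    _, eigvec = np.linalg.eigh(L)
    Uk = eigvec[:, :k]                        # low-frequency graph Fourier basis
    A = Uk[sampled_idx, :]                    # rows seen by the sampled nodes
    coeffs, *_ = np.linalg.lstsq(A, sampled_vals, rcond=None)
    return Uk @ coeffs                        # signal estimate on all n vertices
```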

Egocentric video recognition is a natural testbed for diverse interaction reasoning. Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification. However, correlation studies between the two branches have been largely ignored. Besides, the two branches fail to exploit local features due to the absence of a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention...

10.1609/aaai.v34i07.6907 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03

Egocentric video recognition is a challenging task that requires identifying both the actor's motion and the active object that the actor interacts with. Recognizing the active object is particularly hard due to the cluttered background with distracting objects, frequent field-of-view changes, severe occlusion, etc. To improve classification, most existing methods use object detectors or human gaze information, which are computationally expensive or require labor-intensive annotations. To avoid these additional costs, we propose an end-to-end...

10.1109/iccv48922.2021.00806 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

In this paper, we present a new large-scale dataset for the video panoptic segmentation task, which aims to assign semantic classes and track identities to all pixels in a video. As the ground truth for this task is difficult to annotate, previous datasets are limited by either small scales or a small number of scenes. In contrast, our VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories. To the best of our knowledge, VIPSeg...

10.1109/cvpr52688.2022.02036 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Vision-language navigation (VLN), which entails an agent navigating 3D environments by following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of views. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode the scene layouts and geometric cues of the indoor environment under the supervision of 3D detection. During...

10.1109/iccv51070.2023.01007 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Point cloud analysis (such as 3D segmentation and detection) is a challenging task, not only because of the irregular geometries of many millions of unordered points, but also because of the great variations caused by depth, viewpoint, occlusion, etc. Current studies put much focus on adapting neural networks to the complex geometries of point clouds, but are blind to a fundamental question: how to learn an appropriate embedding space that is aware of both discriminative semantics and challenging variations? As a response, we propose a clustering based supervised...

10.1109/iccv51070.2023.00761 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to their increasing size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode...

10.1609/aaai.v38i7.28475 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24