Xuanhan Wang

ORCID: 0000-0002-3881-9658
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Video Surveillance and Tracking Methods
  • Anomaly Detection Techniques and Applications
  • Gait Recognition and Analysis
  • Salivary Gland Tumors Diagnosis and Treatment
  • Adversarial Robustness in Machine Learning
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Stroke Rehabilitation and Recovery
  • Oral Health Pathology and Treatment
  • Generative Adversarial Networks and Image Synthesis
  • Topic Modeling
  • Salivary Gland Disorders and Functions
  • Natural Language Processing Techniques
  • Hand Gesture Recognition Systems
  • Video Analysis and Summarization
  • Lipid metabolism and disorders
  • Reproductive System and Pregnancy
  • Human Motion and Animation
  • Advanced Vision and Imaging
  • AI in cancer detection
  • Context-Aware Activity Recognition Systems
  • Cancer-related molecular mechanisms research

Yangzhou University
2022-2024

University of Electronic Science and Technology of China
2016-2023

Human activity recognition in videos with convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Taking videos as a sequence of frames, a new record was recently set on several benchmark datasets by feeding frame-level CNN features to a long short-term memory (LSTM) model for video recognition. This recurrent model-based visual pipeline is a natural choice for perceptual problems with time-varying input or sequential outputs. However, the above-mentioned pipeline relies on an LSTM, which...

10.1109/lsp.2016.2611485 article EN IEEE Signal Processing Letters 2016-09-20

3-D convolutional neural networks (3-D-convNets) have been very recently proposed for action recognition in videos, and promising results are achieved. However, existing 3-D-convNets have two "artificial" requirements that may reduce the quality of video analysis: 1) they require a fixed-size (e.g., 112 × 112) input video; 2) most require fixed-length input (i.e., video shots with a fixed number of frames). To tackle these issues, we propose an end-to-end pipeline named Two-stream 3-D-convNet Fusion,...

10.1109/tmm.2017.2749159 article EN IEEE Transactions on Multimedia 2017-09-04

The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck in common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", resulting in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "blocking" to describe an image, it is easy to misunderstand the scene. We argue that this...

10.1109/iccv48922.2021.01607 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

Video visual question answering (V-VQA) remains challenging at the intersection of vision and language, where it requires joint comprehension of a video and a natural language question. The image-question co-attention mechanism, which aims at generating a spatial map highlighting image regions relevant to the question and vice versa, has obtained impressive results. Despite this success, simply applying it to videos results in unsatisfactory performance due to the complexity and temporal nature of videos. In this paper, we propose a novel architecture,...

10.1145/3343031.3350971 article EN Proceedings of the 27th ACM International Conference on Multimedia 2019-10-15

Skeleton-based action recognition aims to project skeleton sequences to action categories, where skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods that focus on exploring single-form skeletons via Graph Convolutional Networks (GCNs), existing methods tend to improve GCNs by leveraging multi-form skeletons due to their complementary cues. However, these methods (either adapting the structure of GCNs or model ensemble) require the co-existence of all forms of skeletons during both training and inference stages, while a typical...

10.1145/3503161.3547811 article EN Proceedings of the 30th ACM International Conference on Multimedia 2022-10-10

Graph convolutional networks (GCNs), which model human body skeletons as several spatial-temporal graphs, have been widely used and have become a key to representative feature extraction. However, existing methods have limitations in recognizing actions in the wild, where they are captured from real-world scenes with diversified viewpoints, obvious motion blurs, complex interactions, and fast-varying resolutions of the body. In this paper, we propose a Multi-modal Knowledge Embedded Graph Convolutional Network...

10.1109/icme52920.2022.9859787 article EN 2022 IEEE International Conference on Multimedia and Expo (ICME) 2022-07-18

Human densepose estimation, aiming at establishing dense correspondences between 2D pixels of the human body and a 3D template, is a key technique in enabling machines to have an understanding of people in images. It still poses several challenges due to practical scenarios where real-world scenes are complex and only partial annotations are available, leading to incomplete or false estimations. In this work, we present a novel framework to detect the densepose of multiple people in an image. The proposed method, which we refer to as Knowledge Transfer...

10.1109/tcsvt.2022.3181604 article EN IEEE Transactions on Circuits and Systems for Video Technology 2022-06-09

To determine whether age at menarche (AAM), age at first live birth (AFB), and estradiol levels are causally correlated with the development of systemic lupus erythematosus (SLE). A two-sample Mendelian randomization (MR) analysis was performed after data were collected from a dataset of genome-wide association studies (GWASs) related to SLE (as outcome), and from open access databases to find statistics related to AAM, AFB, and estradiol levels (as exposure). In our study, a negative causal correlation between AAM and SLE was confirmed by MR analysis (MR egger: beta = 0.116,...

10.1177/09612033231180358 article EN Lupus 2023-05-29

Despite the recent great progress on multi-person pose estimation, it still remains challenging under the condition of "crowded scenes", where RGB images capture complex real-world scenes with highly overlapped people, severe occlusions, and diverse postures. In this work, we focus on two main problems: 1) how to design an effective pipeline for crowded scenes pose estimation; 2) how to equip this pipeline with the ability of relation modeling for interference resolving. To tackle these problems, we propose a new pipeline named Relation based...

10.1609/aaai.v35i2.16206 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Multiple human parsing (MHP) is typically treated as two sub-tasks, i.e., instance separation and body part segmentation. Existing methods usually tackle the sub-tasks by adopting a two-stage strategy, which regards MHP as an ROI-based (i.e., detect-then-segment) or grouping-based (i.e., segment-then-grouping) paradigm. However, the strong dependence between the two stages limits the potential of the method, since it often requires qualified prior predictions. Besides, the isolated models responsible for the sub-tasks bring significant...

10.1109/tmm.2023.3281070 article EN IEEE Transactions on Multimedia 2023-05-29

In this paper, we address the multi-person densepose estimation problem, which aims at learning dense correspondences between 2D pixels of the human body and a 3D surface. It still poses several challenges due to real-world scenes with scale variations, occlusion, and insufficient annotations. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; 2) how to equip this pipeline with the ability of handling the issues of limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel...

10.1145/3394171.3414014 article EN Proceedings of the 28th ACM International Conference on Multimedia 2020-10-12

Human pose estimation in crowded scenes remains challenging, as it requires joint comprehension of multiple persons and their keypoints in a highly complex scenario. The top-down mechanism, which is a detect-then-estimate pipeline, has become the mainstream solution for general pose estimation and has obtained impressive progress. However, simply applying this mechanism to crowded scenes results in unsatisfactory performance due to several issues, in particular involving missing keypoints in crowds and ambiguous labeling during training. To tackle the above two...

10.1145/3474085.3475233 article EN Proceedings of the 29th ACM International Conference on Multimedia 2021-10-17

Part-level attribute parsing is a fundamental but challenging task, which requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding a regional convolutional neural network (RCNN) with an attribute prediction head to a two-stage detector, in which attributes of body parts are identified from localwise part boxes. However, part boxes with limited clues (i.e., appearance only) lead to unsatisfying results, since attributes are highly dependent on comprehensive...

10.1109/tcyb.2022.3209653 article EN IEEE Transactions on Cybernetics 2022-11-03

High-resolution representation is necessary for human pose estimation to achieve high performance, and the ensuing problem is computational complexity. In particular, predominant methods estimate joints by 2D single-peak heatmaps. Each heatmap can be horizontally and vertically projected to, and reconstructed by, a pair of 1D heat vectors. Inspired by this observation, we introduce a lightweight and powerful alternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise (1 × 1) convolution that is the main bottleneck...

10.1109/icme52920.2022.9859751 article EN 2022 IEEE International Conference on Multimedia and Expo (ICME) 2022-07-18
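
The observation in the abstract above, that a 2D single-peak heatmap can be projected onto and rebuilt from a pair of 1D heat vectors, can be illustrated with a small sketch. This is not the SUSA module itself. It assumes an unnormalized Gaussian heatmap whose peak lies on an integer pixel and uses max-projection along each axis; the function names (gaussian_heatmap, project, reconstruct, peak) are hypothetical and chosen only for illustration.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma):
    """Build a 2D single-peak heatmap: an unnormalized Gaussian centered at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def project(heatmap):
    """Max-project the heatmap onto the horizontal and vertical axes (assumed operator)."""
    vec_x = [max(col) for col in zip(*heatmap)]  # one value per column
    vec_y = [max(row) for row in heatmap]        # one value per row
    return vec_x, vec_y

def reconstruct(vec_x, vec_y):
    """Rebuild a 2D map as the outer product of the two 1D heat vectors."""
    return [[vy * vx for vx in vec_x] for vy in vec_y]

def peak(vec_x, vec_y):
    """Read the joint location (x, y) directly from the 1D vectors."""
    return vec_x.index(max(vec_x)), vec_y.index(max(vec_y))

if __name__ == "__main__":
    hm = gaussian_heatmap(8, 12, cx=5, cy=3, sigma=1.5)
    vx, vy = project(hm)
    rec = reconstruct(vx, vy)
    err = max(abs(rec[y][x] - hm[y][x]) for y in range(8) for x in range(12))
    print("joint at", peak(vx, vy), "max reconstruction error", err)
```

Because a Gaussian is separable, the outer product recovers the heatmap up to floating-point error under these assumptions, and the two vectors store H + W values instead of H × W. Heatmaps that are not separable would only be approximated.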

Learning human 2D-3D correspondences aims to map all 2D pixels to a 3D template, namely densepose estimation, involving surface patch recognition (i.e., Index-to-Patch (I)) and regression of patch-specific UV coordinates. Despite recent progress, it remains challenging, especially under the condition of “in the wild”, where RGB images capture real-world scenes with backgrounds, occlusions, scale variations, and postural diversity. In this paper, we address three vital problems in this task: 1) how to perceive...

10.1109/tmm.2021.3135145 article EN IEEE Transactions on Multimedia 2021-12-14

Building multi-person pose estimation (MPPE) models that can handle complex foreground and uncommon scenes is an important challenge in computer vision. Aside from designing novel models, strengthening training data is a promising direction but remains largely unexploited for the MPPE task. In this article, we systematically identify the key deficiencies of existing datasets that prevent the power of well-designed models from being fully exploited and propose corresponding solutions. Specifically, we find that traditional...

10.1109/tnnls.2023.3244957 article EN IEEE Transactions on Neural Networks and Learning Systems 2023-05-11

Aim: To evaluate the utility of magnetic resonance imaging (MRI) and magnetic resonance sialography (MRS) for the diagnosis of primary Sjögren syndrome (pSS), singly or integrated with the 2016 American College of Rheumatology (ACR)/European League Against Rheumatic Diseases (EULAR) classification criteria. Methods: The diagnostic efficiencies of MRI, MRS, and labial salivary gland biopsy (LSGB) were evaluated. A prediction model was established by multivariate analysis. Finally, the performance of the ACR/EULAR criteria was evaluated after...

10.1111/1756-185x.14528 article EN International Journal of Rheumatic Diseases 2022-12-11

Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based methods can generate proposals with precise boundaries but often suffer from the inferior quality of the confidence scores used for retrieving. In this article, we propose an effective and end-to-end method, named ProposalVLAD with Proposal-Intra Exploring Network (PVPI-Net). We first introduce a ProposalVLAD module to dynamically generate global features of the entire video, then combine the global and local features into the final feature...

10.1145/3571747 article EN ACM Transactions on Multimedia Computing Communications and Applications 2022-11-24

Existing methods of multiple human parsing (MHP) apply deep models to learn instance-level representations for segmenting each person into non-overlapped body parts. However, the learned representations often contain many spurious correlations that degrade model generalization, leading models to be vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causal property integrated parsing model, termed CPI-Parser, which is driven by fundamental principles...

10.1109/tip.2024.3469579 article EN IEEE Transactions on Image Processing 2024-01-01

Targeted adversarial attack, which aims to mislead a model to recognize any image as a target object by imperceptible perturbations, has become a mainstream tool for vulnerability assessment of deep neural networks (DNNs). Since existing targeted attackers only learn to attack known classes, they cannot generalize well to unknown classes. To tackle this issue, we propose Generalized Adversarial attacKER (GAKer), which is able to construct adversarial examples for any target class. The core idea behind GAKer...

10.48550/arxiv.2407.12292 preprint EN arXiv (Cornell University) 2024-07-16

To guarantee the safety and reliability of autonomous vehicle (AV) systems, corner cases play a crucial role in exploring the system's behavior under rare and challenging conditions within simulation environments. However, current approaches often fall short in meeting diverse testing needs and struggle to generalize to novel, high-risk scenarios that closely mirror real-world conditions. To tackle this challenge, we present AutoScenario, a multimodal Large Language Model (LLM)-based framework for realistic corner case...

10.48550/arxiv.2412.00243 preprint EN arXiv (Cornell University) 2024-11-29