Chunyu Wang

ORCID: 0000-0002-9400-9107
Research Areas
  • Human Pose and Action Recognition
  • Video Surveillance and Tracking Methods
  • Hand Gesture Recognition Systems
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Gait Recognition and Analysis
  • Advanced Vision and Imaging
  • Anomaly Detection Techniques and Applications
  • Video Analysis and Summarization
  • Human Motion and Animation
  • Diabetic Foot Ulcer Assessment and Management
  • Generative Adversarial Networks and Image Synthesis
  • Multimodal Machine Learning Applications
  • Visual Attention and Saliency Detection
  • 3D Shape Modeling and Analysis
  • Robotics and Sensor-Based Localization
  • Infrared Target Detection Methodologies
  • Agriculture, Land Use, Rural Development
  • Computer Graphics and Visualization Techniques
  • Image Processing Techniques and Applications
  • Fire Detection and Safety Systems
  • Face recognition and analysis
  • Advanced Computational Techniques and Applications
  • Image Processing and 3D Reconstruction
  • 3D Surveying and Cultural Heritage

Microsoft Research Asia (China)
2018-2025

University of Electronic Science and Technology of China
2018-2024

Liaoning Shihua University
2024

Chaohu University
2024

Tianjin University of Technology
2023

Microsoft Research (United Kingdom)
2023

Zhejiang University of Technology
2023

China Agricultural University
2022

Peking University
2012-2016

We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state-of-the-art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the "best" one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame...

10.1109/cvpr.2013.123 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01

A human pose is naturally represented as a graph where the joints are the nodes and the bones are the edges. So it is natural to apply a Graph Convolutional Network (GCN) to estimate 3D poses from 2D poses. In this work, we propose a generic formulation in which both GCN and Fully Connected Network (FCN) are special cases. From this formulation, we discover that GCN has limited representation power when used for estimating 3D poses. We overcome the limitation by introducing a Locally Connected Network (LCN), which is naturally implemented by this formulation. It notably improves the representation capability over GCN....

10.1109/iccv.2019.00235 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
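
As a rough illustration of the formulation described in the abstract above, the NumPy sketch below treats one layer as y_i = sum_j m_ij W_ij x_j, where m is a joint-to-joint mask and W_ij is a per-pair weight matrix. Under this reading (an assumption, not the paper's code), a GCN ties every W_ij to one shared matrix, an FCN uses an all-ones mask with untied weights, and an LCN keeps the skeleton mask but unties the weights. The function names, the 4-joint chain, and the feature sizes are hypothetical.

```python
import numpy as np


def generic_layer(x, weights, mask):
    """Generic layer: y_i = sum_j mask[i, j] * weights[i, j] @ x[j].

    x:       (J, d_in)            per-joint input features
    weights: (J, J, d_out, d_in)  one matrix per (output joint, input joint) pair
    mask:    (J, J)               which input joints feed which output joint
    """
    return np.einsum("ij,ijok,jk->io", mask, weights, x)


def gcn_weights(shared, num_joints):
    """GCN special case: every joint pair reuses the same weight matrix."""
    return np.broadcast_to(shared, (num_joints, num_joints) + shared.shape)


rng = np.random.default_rng(0)
J, d_in, d_out = 4, 2, 3

# Hypothetical skeleton: a chain 0-1-2-3 with self-connections.
adj = np.eye(J) + np.eye(J, k=1) + np.eye(J, k=-1)

x = rng.normal(size=(J, d_in))
shared = rng.normal(size=(d_out, d_in))          # single shared matrix (GCN)
per_pair = rng.normal(size=(J, J, d_out, d_in))  # untied matrices (FCN, LCN)

y_gcn = generic_layer(x, gcn_weights(shared, J), adj)  # shared weights, local mask
y_lcn = generic_layer(x, per_pair, adj)                # untied weights, local mask
y_fcn = generic_layer(x, per_pair, np.ones((J, J)))    # untied weights, dense mask
```

With shared weights the layer collapses to `adj @ x @ shared.T`, which is one way to see the weight-sharing constraint the abstract refers to; the locally connected variant keeps the same sparsity pattern but lets each joint pair learn its own transform.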

Human pose estimation is a key step to action recognition. We propose a method of estimating 3D human poses from a single image, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information. Moreover, current 2D pose estimators are usually inaccurate, which may cause errors in the 3D estimation. We address the challenges in three ways: (i) We represent a 3D pose as a linear combination of a sparse set of bases learned from 3D human skeletons. (ii) We enforce limb length...

10.1109/cvpr.2014.303 article EN 2014 IEEE Conference on Computer Vision and Pattern Recognition 2014-06-01
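
The first idea in the abstract above, a 3D pose written as a sparse linear combination of basis poses, can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: the three hand-made bases, the 3-joint chain, and the helper names `compose_pose` and `limb_lengths` are hypothetical, and nothing here reproduces how the paper learns the bases or fits the coefficients.

```python
import numpy as np


def compose_pose(coeffs, bases):
    """Return the pose sum_k coeffs[k] * bases[k].

    coeffs: (K,)       mostly-zero combination weights
    bases:  (K, J, 3)  K basis poses, each with J joints in 3D
    """
    return np.tensordot(coeffs, bases, axes=1)  # shape (J, 3)


def limb_lengths(pose, limbs):
    """Euclidean length of each limb, given as (joint_a, joint_b) index pairs."""
    return np.array([np.linalg.norm(pose[a] - pose[b]) for a, b in limbs])


# Hypothetical bases: the same 3-joint chain laid along the x, y, and z axes.
bases = np.array(
    [
        [[0, 0, 0], [1, 0, 0], [2, 0, 0]],
        [[0, 0, 0], [0, 1, 0], [0, 2, 0]],
        [[0, 0, 0], [0, 0, 1], [0, 0, 2]],
    ],
    dtype=float,
)

coeffs = np.array([0.6, 0.0, 0.8])  # sparse: only two of three bases are active
pose = compose_pose(coeffs, bases)

limbs = [(0, 1), (1, 2)]
lengths = limb_lengths(pose, limbs)
```

In this toy setup both limbs come out with unit length because the coefficient vector has unit norm; for general bases the lengths vary with the coefficients, which suggests why a separate limb-length term is useful.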

Robustness and discrimination power are two fundamental requirements in visual object tracking. In most tracking paradigms, we find that the features extracted by the popular Siamese-like networks cannot fully discriminatively model the tracked targets and distractor objects, hindering them from simultaneously meeting these two requirements. While most methods focus on designing robust correlation operations, we propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. In contrast to...

10.1109/cvpr52688.2022.00855 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Compared with object detection in static images, object detection in videos is more challenging due to degraded image qualities. An effective way to address this problem is to exploit temporal contexts by linking the same object across the video to form tubelets and aggregating classification scores in the tubelets. In this paper, we focus on obtaining high quality object linking results for better classification. Unlike previous methods that link objects by checking boxes between neighboring frames, we propose to link in the same frame. To achieve this goal, we extend prior methods in the following aspects:...

10.1109/tpami.2019.2910529 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2019-04-16

We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras which are separated by wide baselines. It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment. In contrast to previous efforts which require establishing cross-view correspondence based on noisy 2D pose estimates, it directly estimates and tracks...

10.1109/tpami.2022.3163709 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2022-04-15

Estimating 3D human pose from a single image suffers from severe ambiguity since multiple 3D joint configurations may have the same 2D projection. The state-of-the-art methods often rely on context modeling methods such as the pictorial structure model (PSM) or graph neural network (GNN) to reduce ambiguity. However, there is no study that rigorously compares them side by side. So we first present a general formula for context modeling in which both PSM and GNN are special cases. By comparing the two methods, we found that the end-to-end...

10.1109/cvpr46437.2021.00617 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, unrealistic pose estimates are admitted due to the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens with each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve...

10.1109/cvpr52729.2023.00071 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

We introduce AiT, a unified output representation for various vision tasks, which is a crucial step towards general-purpose vision task solvers. Despite the challenges posed by the high-dimensional and task-specific outputs, we showcase the potential of using a discrete representation (VQ-VAE) to model the dense outputs of many computer vision tasks as a sequence of discrete tokens. This is inspired by the established ability of VQ-VAE to conserve the structures spanning multiple pixels using few discrete codes. To that end, we present a modified shallower architecture for VQ-VAE that improves efficiency...

10.1109/iccv51070.2023.01822 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
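
The abstract above relies on a VQ-VAE turning continuous features into a short sequence of discrete tokens. The NumPy sketch below shows only the nearest-codebook lookup at the core of vector quantization; it is not AiT's architecture, and the codebook values, feature vectors, and the `quantize` helper are hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) continuous vectors (e.g. flattened encoder outputs)
    codebook: (K, D) code vectors (hand-written here, learned in practice)
    Returns the (N,) token indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]


# Hypothetical 3-entry codebook in 2D and three feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.1, -0.1], [4.8, 5.2], [0.9, 1.2]])

tokens, quantized = quantize(features, codebook)
```

Here each 2D feature is replaced by a single integer index, and a decoder would reconstruct the dense output from the looked-up code vectors.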

Recognizing an action from a sequence of 3D skeletal poses is a challenging task. First, different actors may perform the same action in various styles. Second, the estimated poses are sometimes inaccurate. These challenges can cause large variations between instances of the same class. Third, the datasets are usually small, with only a few actors performing a few repetitions of each action. Hence training complex classifiers risks over-fitting the data. We address this task by mining a set of key-pose-motifs for each action class. A key-pose-motif contains a set of ordered poses,...

10.1109/cvpr.2016.289 article EN 2016-06-01

We present a Siamese-like dual-branch network based solely on Transformers for tracking. Given a template and a search image, we divide them into non-overlapping patches and extract a feature vector for each patch based on its matching results with others within an attention window. For each token, we estimate whether it contains the target object and the corresponding size. The advantage of the approach is that the features are learned from matching, and ultimately, for matching. So the features are aligned with the object tracking task. The method achieves better or comparable results as...

10.1109/iccvw54120.2021.00303 article EN 2021-10-01

We present an approach for 3D human pose estimation from monocular images. The approach consists of two steps: it first estimates a 2D pose from an image and then estimates the corresponding 3D pose. This paper focuses on the second step. Graph convolutional network (GCN) has recently become the de facto standard for human pose related tasks such as action recognition. However, in this work, we show that GCN has critical limitations when it is used for 3D pose estimation due to the inherent weight sharing scheme. The limitations are clearly exposed through a novel reformulation of GCN, in which both GCN and Fully...

10.1109/tpami.2020.3019139 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2020-08-24

Monocular 3D human mesh estimation faces challenges due to depth ambiguity and the complexity of mapping images to complex parameter spaces. Recent methods propose to use 3D poses as a proxy representation, which often lose crucial body shape information, leading to mediocre performance. Conversely, advanced motion capture systems, though accurate, are impractical for markerless wild images. Addressing these limitations, we introduce an innovative intermediate representation as virtual markers, learned...

10.1109/tpami.2025.3535538 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2025-01-01

Multi-camera tracking systems are gaining popularity in applications that demand high-quality tracking results, such as frictionless checkout. In cluttered and crowded environments, monocular multi-object tracking (MOT) systems often fail due to occlusions. Multiple highly overlapped cameras are capable of recovering partial 3D information. When used properly, the 3D data can significantly alleviate the occlusion issue. However, training a multi-camera tracker demands a large-scale dataset with diverse camera settings...

10.1109/wacv56688.2023.00484 article EN 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023-01-01

Semantic image segmentation is an important yet unsolved problem. One of the major challenges is the large variability of the object scales. To tackle this scale problem, we propose a Scale-Adaptive Network (SAN) which consists of multiple branches, with each one taking charge of the segmentation of the objects of a certain range of scales. Given an image, SAN first computes a dense scale map indicating the scale of each pixel, which is automatically determined by the size of the enclosing object. Then the features of different branches are fused according to the scale map to generate the final segmentation map. To ensure that each branch indeed learns the features for...

10.1109/tip.2019.2941644 article EN IEEE Transactions on Image Processing 2019-10-22

Graphic design conveys messages through the combination of text, images and other visual elements. Unstructured designs such as overloaded social media graphics may fail to communicate their intended messages effectively. To address this issue, layout grouping offers a solution by organizing design elements into perceptual groups. While most methods rely on heuristic Gestalt principles, they often lack the context modeling ability needed to handle complex layouts. In this work, we reformulate the layout grouping task as a set prediction...

10.1109/wacv57701.2024.00107 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03