TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation

DOI: 10.48550/arXiv.2110.09554
Publication Date: 2021-01-01
ABSTRACT
Estimating the 2D human poses in each view is typically the first step in calibrated multi-view 3D pose estimation. But the performance of 2D pose detectors suffers from challenging situations such as occlusions and oblique viewing angles. To address these challenges, previous works derive point-to-point correspondences between different views from epipolar geometry and utilize the correspondences to merge prediction heatmaps or feature representations. Instead of post-prediction merge/calibration, here we introduce a transformer framework for multi-view 3D pose estimation, aiming at directly improving individual 2D predictors by integrating information from different views. Inspired by multi-modal transformers, we design a unified transformer architecture, named TransFusion, to fuse cues from both the current view and neighboring views. Moreover, we propose the concept of epipolar field to encode 3D positional information into the model. The 3D position encoding guided by the epipolar field provides an efficient way of encoding the correspondences between pixels of different views. Experiments on Human 3.6M and Ski-Pose show that our method is more efficient and has consistent improvements compared to other fusion methods. Specifically, we achieve 25.8 mm MPJPE on Human 3.6M with only 5M parameters at 256 x 256 resolution.
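
The epipolar field mentioned in the abstract builds on a standard fact of two-view geometry: the fundamental matrix F maps a pixel in the current view to an epipolar line in a neighboring view, so each neighboring-view pixel can be scored by its distance to that line. The sketch below illustrates only this geometric idea under our own assumptions; it is not the authors' released implementation, and the fundamental matrix used here is a toy placeholder.

```python
import numpy as np

# Illustrative sketch of the geometry behind an "epipolar field"
# (our reading of the abstract, not the authors' code): for a pixel p
# in the current view, its correspondence in a neighboring view lies
# on the epipolar line F @ p. Encoding each neighboring-view pixel by
# its distance to that line yields a dense cross-view positional signal.

def epipolar_line(F, p):
    """Epipolar line l = F @ p in the neighboring view, for pixel
    p = (x, y) of the current view, normalized so that |l . q| is the
    point-to-line distance for a homogeneous pixel q = (x', y', 1)."""
    l = F @ np.array([p[0], p[1], 1.0])
    return l / np.linalg.norm(l[:2])

def epipolar_field(F, p, height, width):
    """Distance of every pixel in the neighboring view to the epipolar
    line of pixel p -- one channel of a cross-view position encoding."""
    l = epipolar_line(F, p)
    ys, xs = np.mgrid[0:height, 0:width]  # pixel grid of the other view
    return np.abs(l[0] * xs + l[1] * ys + l[2])

# Toy fundamental matrix (hypothetical; a real F comes from the
# calibrated camera pair).
F = np.array([[0.0,   -1e-4,  0.02],
              [1e-4,   0.0,  -0.03],
              [-0.02,  0.03,  1.0]])

field = epipolar_field(F, p=(128, 96), height=256, width=256)
print(field.shape)  # (256, 256); values shrink near the epipolar line
```

In the paper's setting such a field would guide cross-view attention inside the transformer, biasing each query pixel toward geometrically plausible pixels in the other view; the snippet above only computes the dense distance map itself.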