S^2Former-OR: Single-Stage Bimodal Transformer for Scene Graph Generation in OR

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
Keywords: Transformer; 3D surgical scene understanding; Scene graph generation; Single-stage; Bi-modal
DOI: 10.48550/arxiv.2402.14461 Publication Date: 2024-02-22
ABSTRACT
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning that generates semantic scene graphs dependent on intermediate processes with pose estimation and object detection, which may compromise model efficiency and efficacy and also impose an extra annotation burden. In this study, we introduce a novel single-stage bimodal transformer framework for SGG in the OR, termed S^2Former-OR, which aims to complementarily leverage multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D features into the 3D point cloud features. Moreover, based on the augmented feature, we propose a relation-sensitive decoder that embeds dynamic entity-pair queries and relational trait priors, which enables the direct prediction of relations without intermediate steps. Extensive experiments have validated the superior performance and lower computational cost of S^2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3% Precision increase and a 24.2M reduction in model parameters. We further compare our method with generic SGG methods on broader metrics for a comprehensive evaluation, with consistently better performance achieved. The code will be made available.