SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

DOI: 10.48550/arxiv.2304.09974 Publication Date: 2023-01-01
ABSTRACT
Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, exponentially increasing their use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, visual question answering (VQA) tasks require both vision and language processing, so models with bi-directional attention or fusion techniques are often employed to capture the context of multiple modalities all at once. As GPT does not natively process vision tokens, to exploit the advancements of GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands GPT2 to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention and the model's ability to generate coherent paragraphs, we carefully sequence the word tokens before the vision tokens, mimicking the human thought process of understanding the question before inferring the answer from the image. Quantitatively, we prove that LV-GPT outperforms other state-of-the-art models on two publicly available surgical-VQA datasets (based on the endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021 datasets) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets with question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing, token type and pose embedding in the LV-GPT model.
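
The abstract describes the core design: a vision tokenizer feeding image tokens into GPT2, token-type and pose embeddings, and word tokens sequenced before vision tokens. The sketch below illustrates that idea only; it is a minimal, hypothetical reconstruction, not the authors' implementation. The class name `LVGPTSketch`, the ResNet18 vision tokenizer, the two-way token-type embedding, and the classification head are all illustrative assumptions.

```python
# Minimal sketch of the LV-GPT idea from the abstract (assumptions noted above):
# question word tokens come BEFORE vision tokens, both carry token-type and
# positional ("pose") embeddings, and the joint sequence is fed to GPT-2.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import GPT2Model, GPT2Tokenizer


class LVGPTSketch(nn.Module):
    def __init__(self, num_answers: int = 18):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd  # 768 for GPT-2 base

        # Vision tokenizer: CNN backbone whose spatial features become tokens.
        backbone = resnet18(weights="DEFAULT")
        self.vision_encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.vision_proj = nn.Linear(512, hidden)  # 512-d ResNet18 features -> GPT-2 width

        # Token-type embedding: 0 = word token, 1 = vision token.
        self.token_type_emb = nn.Embedding(2, hidden)
        self.classifier = nn.Linear(hidden, num_answers)  # answer-classification head (assumed)

    def forward(self, input_ids: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Word tokens: GPT-2's own embedding table.
        word_emb = self.gpt2.wte(input_ids)                  # (B, Lw, H)

        # Vision tokens: flatten the 7x7 feature map into 49 tokens.
        feat = self.vision_encoder(image)                    # (B, 512, 7, 7)
        vis_emb = self.vision_proj(feat.flatten(2).transpose(1, 2))  # (B, 49, H)

        # Sequence word tokens BEFORE vision tokens ("understand the question,
        # then infer the answer from the image").
        seq = torch.cat([word_emb, vis_emb], dim=1)
        types = torch.cat(
            [torch.zeros(word_emb.shape[:2], dtype=torch.long, device=seq.device),
             torch.ones(vis_emb.shape[:2], dtype=torch.long, device=seq.device)],
            dim=1,
        )
        # GPT-2 adds its positional ("pose") embedding internally for inputs_embeds.
        seq = seq + self.token_type_emb(types)

        out = self.gpt2(inputs_embeds=seq).last_hidden_state  # (B, Lw+49, H)
        return self.classifier(out[:, -1])                    # predict answer from final token


if __name__ == "__main__":
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    ids = tok("what is the state of bleeding?", return_tensors="pt").input_ids
    img = torch.randn(1, 3, 224, 224)
    print(LVGPTSketch()(ids, img).shape)  # torch.Size([1, 18])
```

Under these assumptions, the unidirectional attention of GPT-2 is why the question-before-image ordering matters: placing word tokens first lets every vision token attend to the full question, whereas the reverse would leave image tokens unable to see the question at all.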