SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery

DOI: 10.48550/arxiv.2304.09974 Publication Date: 2023-01-01
ABSTRACT
Advances in GPT-based large language models (LLMs) are revolutionizing natural language processing, exponentially increasing their use across various domains. Incorporating uni-directional attention, these autoregressive LLMs can generate long and coherent paragraphs. However, visual question answering (VQA) tasks require both vision and language processing, so models with bi-directional attention or fusion techniques are often employed to capture the context of multiple modalities all at once. As GPT does not natively process vision tokens, to exploit the advancements of GPT models for VQA in robotic surgery, we design an end-to-end trainable Language-Vision GPT (LV-GPT) model that expands GPT2 to include vision input (image). The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and vision token embedding (token type and pose). Given the limitations of unidirectional attention and the model's ability to generate coherent paragraphs, we carefully sequence the word tokens before the vision tokens, mimicking the human thought process of understanding the question before inferring the answer from the image. Quantitatively, we prove that LV-GPT outperforms other state-of-the-art models on two publicly available surgical-VQA datasets (based on the endoscopic vision challenge robotic scene segmentation 2018 and CholecTriplet2021 datasets) and on our newly annotated dataset (based on the holistic surgical scene dataset). We further annotate all three datasets with question-type annotations to allow sub-type analysis. Furthermore, we extensively study and present the effects of token sequencing, token type and pose embedding in the LV-GPT model.
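
The abstract describes the core design: a vision tokenizer feeding image tokens into GPT2, token-type and pose embeddings, and word tokens sequenced before vision tokens. The sketch below illustrates that idea only; it is a minimal, hypothetical reconstruction, not the authors' implementation. The class name `LVGPTSketch`, the ResNet18 vision tokenizer, the two-way token-type embedding, and the classification head are all illustrative assumptions.

```python
# Minimal sketch of the LV-GPT idea from the abstract (assumptions noted above):
# question word tokens come BEFORE vision tokens, both carry token-type and
# positional ("pose") embeddings, and the joint sequence is fed to GPT-2.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import GPT2Model, GPT2Tokenizer


class LVGPTSketch(nn.Module):
    def __init__(self, num_answers: int = 18):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd  # 768 for GPT-2 base

        # Vision tokenizer: CNN backbone whose spatial features become tokens.
        backbone = resnet18(weights="DEFAULT")
        self.vision_encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.vision_proj = nn.Linear(512, hidden)  # 512-d ResNet18 features -> GPT-2 width

        # Token-type embedding: 0 = word token, 1 = vision token.
        self.token_type_emb = nn.Embedding(2, hidden)
        self.classifier = nn.Linear(hidden, num_answers)  # answer-classification head (assumed)

    def forward(self, input_ids: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # Word tokens: GPT-2's own embedding table.
        word_emb = self.gpt2.wte(input_ids)                  # (B, Lw, H)

        # Vision tokens: flatten the 7x7 feature map into 49 tokens.
        feat = self.vision_encoder(image)                    # (B, 512, 7, 7)
        vis_emb = self.vision_proj(feat.flatten(2).transpose(1, 2))  # (B, 49, H)

        # Sequence word tokens BEFORE vision tokens ("understand the question,
        # then infer the answer from the image").
        seq = torch.cat([word_emb, vis_emb], dim=1)
        types = torch.cat(
            [torch.zeros(word_emb.shape[:2], dtype=torch.long, device=seq.device),
             torch.ones(vis_emb.shape[:2], dtype=torch.long, device=seq.device)],
            dim=1,
        )
        # GPT-2 adds its positional ("pose") embedding internally for inputs_embeds.
        seq = seq + self.token_type_emb(types)

        out = self.gpt2(inputs_embeds=seq).last_hidden_state  # (B, Lw+49, H)
        return self.classifier(out[:, -1])                    # predict answer from final token


if __name__ == "__main__":
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    ids = tok("what is the state of bleeding?", return_tensors="pt").input_ids
    img = torch.randn(1, 3, 224, 224)
    print(LVGPTSketch()(ids, img).shape)  # torch.Size([1, 18])
```

Under these assumptions, the unidirectional attention of GPT-2 is why the question-before-image ordering matters: placing word tokens first lets every vision token attend to the full question, whereas the reverse would leave image tokens unable to see the question at all.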