Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark

DOI: 10.48550/arxiv.2306.06494 Publication Date: 2023-01-01
ABSTRACT
With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven to be effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on medical VL tasks, we conduct a thorough experimental analysis to study the key factors that may affect performance with a unified Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several insights that can guide future research and establish strong baselines for medical VL tasks.
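
The abstract positions RGC as a benchmark for medical image-text retrieval. Such benchmarks are conventionally scored with Recall@K over a cross-modal similarity matrix. The following is a minimal, hypothetical Python sketch of that metric, not code released with the paper; it assumes one matching caption per image, stored at the same index, and all names are illustrative.

import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Image-to-text Recall@K: fraction of images whose matching
    caption (assumed to sit at the same index) ranks in the top K."""
    # Rank captions for each image by descending similarity.
    ranks = np.argsort(-similarity, axis=1)
    # Ground-truth caption for image i is assumed to be column i.
    targets = np.arange(len(similarity))[:, None]
    hits = (ranks[:, :k] == targets).any(axis=1)
    return float(hits.mean())

# Toy example: 3 images x 3 captions, diagonal = matching pairs.
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.3],
                [0.1, 0.4, 0.7]])
print(recall_at_k(sim, k=1))  # 1.0 on this toy matrix

In practice the similarity matrix would come from the image and text embeddings of the pre-trained VL model, and the metric is reported at several values of K (e.g., 1, 5, 10).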
SUPPLEMENTAL MATERIAL
Coming soon.