Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
DOI: 10.48550/arxiv.2306.06494
Publication Date: 2023-01-01
ABSTRACT
With the availability of large-scale, comprehensive, and general-purpose vision-language (VL) datasets such as MSCOCO, vision-language pre-training (VLP) has become an active area of research and has proven to be effective for various VL tasks such as visual question answering. However, studies on VLP in the medical domain have so far been scanty. To provide a comprehensive perspective on medical VL tasks, we conduct a thorough experimental analysis to study the key factors that may affect performance with a unified vision-language Transformer. To allow making sound and quick pre-training decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs collected from the open-access online database MedPix. RGC can be used as a pre-training dataset or as a new benchmark for medical report generation and medical image-text retrieval. By utilizing RGC and other available datasets for pre-training, we develop several key insights that can guide future research and provide strong baselines for medical VL tasks.
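To make the image-text retrieval benchmark concrete, below is a minimal sketch of how an RGC-style set of image-caption pairs might be loaded and scored with Recall@k under cosine similarity. The JSON-lines layout and the field names "image" and "caption" are assumptions for illustration, not the released RGC format, and the random embeddings stand in for the output of a vision-language encoder.

```python
# Hypothetical sketch: an RGC-style image-caption list plus Recall@k scoring
# for image-to-text retrieval. File layout and field names are assumptions.
import json
import numpy as np

def load_pairs(path):
    """Read image-caption pairs, one JSON object per line,
    e.g. {"image": "case_0001.png", "caption": "Chest radiograph ..."}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def recall_at_k(image_emb, text_emb, k=5):
    """Recall@k for image-to-text retrieval; row i of each matrix is pair i."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T                 # (N, N) similarity matrix
    ranks = np.argsort(-sims, axis=1)             # best-matching captions first
    # A hit if the paired caption (index i for image i) ranks in the top k.
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

# Toy usage: random vectors in place of real encoder embeddings.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(f"Recall@5: {recall_at_k(img, txt, k=5):.2f}")
```

The same matrix transposed gives text-to-image retrieval, so one similarity computation serves both directions of the benchmark.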