X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
DOI:
10.18653/v1/2020.emnlp-main.707
Publication Date:
2020-11-29T14:51:46Z
AUTHORS (5)
ABSTRACT
Mirroring the success of masked language models, vision-and-language counterparts like VILBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks such as visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative of this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the relevant objectives, which enables it to paint. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.
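
The abstract names two concrete refinements that are easy to illustrate: discretizing grid-level visual features against a learned codebook, and masking grid positions with a masking ratio drawn uniformly from a wide range. The Python sketch below is only a hypothetical illustration of those two ideas, not the authors' released code; the function names, array shapes, and codebook size are invented for this example.

import numpy as np

# Hypothetical sketch of two refinements described in the abstract:
# (1) discretize visual features to the index of their nearest codebook centroid,
# (2) mask grid positions with a ratio sampled uniformly from a wide range.
# Shapes and sizes below are illustrative only.

def discretize_features(features, centroids):
    """Map each feature vector (N x D) to the index of its nearest centroid (K x D)."""
    # Squared Euclidean distance between every feature and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # one cluster id per grid position

def sample_uniform_mask(num_positions, low=0.1, high=1.0, rng=None):
    """Mask a fraction of grid positions; the fraction itself is sampled uniformly."""
    rng = rng if rng is not None else np.random.default_rng()
    ratio = rng.uniform(low, high)                 # wide range of masking ratios
    num_masked = max(1, int(round(ratio * num_positions)))
    mask = np.zeros(num_positions, dtype=bool)
    mask[rng.choice(num_positions, size=num_masked, replace=False)] = True
    return mask

# Example: an 8x8 grid of 256-d visual features and a 1024-entry codebook.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 256))
centroids = rng.normal(size=(1024, 256))
codes = discretize_features(features, centroids)   # classification targets
mask = sample_uniform_mask(64, rng=rng)            # positions to mask this step

The discretized cluster indices can then serve as classification targets for the masked grid positions, which is what turns masked visual modeling into a task the model can be trained to solve well enough to generate images.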