- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Generative Adversarial Networks and Image Synthesis
- Topic Modeling
- Computer Graphics and Visualization Techniques
China Electronics Technology Group Corporation
2022-2023
Image–text retrieval is a vital task in computer vision and has received growing attention, since it connects cross-modality data. It comes with the critical challenges of learning unified representations and eliminating the large gap between the visual and textual domains. Over the past few decades, although many works have made significant progress in image–text retrieval, they are still confronted with the challenge of incomplete text descriptions of images, i.e., how to fully learn the correlations between relevant region–word pairs...
Synthesizing vivid images with descriptive texts is gradually emerging as a frontier cross-domain generation task. However, it is obviously inadequate to generate a high-quality image accurately from one single sentence due to the information asymmetry between modalities, which needs external knowledge to balance the process. Moreover, the limited description of entities in the sentence cannot guarantee semantic consistency between the text and the generated image, causing a deficiency of details in the foreground and background. Here, we propose...
News image captioning aims to generate descriptions containing concrete named entities for news images by leveraging relevant articles. However, existing approaches suffer from two shortcomings: 1) a lack of the commonsense knowledge required to understand named entities, and 2) limited multimodal context modeling capabilities. In this paper, we propose to migrate the ability of large-scale pre-trained models to news image captioning. To acquire the factual knowledge for describing named entities, we induce a language model to perform reasoning using context-aware prompts....