- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Video Analysis and Summarization
- Handwritten Text Recognition Techniques
Indian Institute of Technology Jodhpur
2022-2023
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). RETVQA is distinctively different from, and more challenging than, the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our...
In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This problem setup is significantly more challenging than the traditionally-studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes the text appearing in the logos as well as the graphical design of the logos to learn robust contrastive representations. These representations are jointly learned for...
We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL - a principled approach to perform visual text entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal model to jointly reason using textual and visual context obtained using surrounding cues in the image to link the visual text entity to the correct knowledge base entity. (ii) We present KaLMA - a knowledge-aware large multimodal assistant that augments an...
One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) Commonsense, such as interpreting people as customers or tourists and actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual...