Abhirama Subramanyam Penamakuri

ORCID: 0000-0003-3646-8492
Research Areas
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Handwritten Text Recognition Techniques

Indian Institute of Technology Jodhpur
2022-2023

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. For such a setting, a model must first retrieve relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). RETVQA is distinctively different from, and more challenging than, the traditionally-studied Visual Question Answering (VQA), where a question is answered with a single relevant image in context. Towards solving this task, we propose a unified Multi Image BART (MI-BART) that takes the question and retrieved images using our...

10.24963/ijcai.2023/146 article EN 2023-08-01
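
The retrieve-then-answer pipeline this abstract describes can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's MI-BART implementation: the `RelevanceEncoder`, its dimensions, and the fusion step are all hypothetical stand-ins.

```python
# Hypothetical sketch of a two-stage retrieval-based VQA pipeline:
# stage 1 scores every image in the pool against the question and keeps
# the top-k; stage 2 would answer from those retrieved images.
import torch
import torch.nn as nn

class RelevanceEncoder(nn.Module):
    """Scores how relevant each image in the pool is to the question."""
    def __init__(self, dim=256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.i_proj = nn.Linear(dim, dim)

    def forward(self, q_feat, img_feats):
        q = self.q_proj(q_feat)        # (dim,)
        imgs = self.i_proj(img_feats)  # (n_images, dim)
        return imgs @ q                # (n_images,) relevance scores

def retrieve_then_answer(q_feat, img_feats, k=2):
    scores = RelevanceEncoder()(q_feat, img_feats)
    topk = scores.topk(k).indices      # stage 1: retrieve relevant images
    retrieved = img_feats[topk]
    # Stage 2: the paper's MI-BART fuses the question with the retrieved
    # images to generate a free-form answer; mean-pooling stands in here.
    fused = torch.cat([q_feat.unsqueeze(0), retrieved]).mean(0)
    return topk, fused

question = torch.randn(256)
pool = torch.randn(10, 256)            # mix of relevant and irrelevant images
indices, answer_repr = retrieve_then_answer(question, pool)
```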

In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This setup is significantly more challenging than the traditionally-studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes the text appearing on a logo as well as its graphical design to learn robust contrastive representations. These representations are jointly learned for...

10.1145/3571600.3571625 preprint EN 2022-12-08
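
The contrastive objective over the two views (logo text and graphical design) mentioned above might take the form of a standard symmetric InfoNCE loss. The following is a hedged sketch under that assumption; the encoders that would produce `text_emb` and `visual_emb` are omitted, and the loss is not taken from the paper's code.

```python
# Sketch of a two-view contrastive loss: the textual and visual views of
# the same logo are pulled together, while other logos in the batch act
# as negatives. Standard symmetric InfoNCE, assumed for illustration.
import torch
import torch.nn.functional as F

def two_view_info_nce(text_emb, visual_emb, temperature=0.07):
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = t @ v.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(t.size(0))    # matching pairs on the diagonal
    # symmetric: text -> visual and visual -> text
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

text_emb = torch.randn(8, 128)    # e.g. encoding of the text on the logo
visual_emb = torch.randn(8, 128)  # e.g. encoding of the graphical design
loss = two_view_info_nce(text_emb, visual_emb)
```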

We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL - a principled approach to perform visual text entity linking. The proposed module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal model to jointly reason, using textual context obtained from surrounding cues in the image, to link the visual text to the correct knowledge base entity. (ii) We present KaLMA, a knowledge-aware large multimodal assistant that augments an...

10.48550/arxiv.2410.19144 preprint EN arXiv (Cornell University) 2024-10-24
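
Visual text entity linking, as sketched in this abstract, maps the (possibly noisy) text recognized in an image to a knowledge-base entity. The toy snippet below illustrates only the matching step with plain string similarity; VisTEL itself reasons jointly with an LMM over image context, so treat every name here as hypothetical.

```python
# Toy illustration of linking OCR'd visual text to a knowledge-base
# entity via surface-form similarity. A stand-in for the joint
# LMM-based reasoning the paper proposes, not its actual method.
from difflib import SequenceMatcher

def link_entity(ocr_text, kb_entities):
    """Return the KB entity whose name best matches the noisy OCR text."""
    def score(entity):
        return SequenceMatcher(None, ocr_text.lower(), entity.lower()).ratio()
    return max(kb_entities, key=score)

kb = ["Pizza Hut", "Pizza Hot Express", "Hut Bakery"]
print(link_entity("P1zza Hut", kb))  # -> "Pizza Hut"
```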

10.18653/v1/2024.emnlp-main.1151 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) commonsense, such as interpreting people as customers or tourists and actions as waiting to buy or going to see; and (ii) fact or world knowledge associated with named visual...

10.48550/arxiv.2210.08554 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

10.48550/arxiv.2306.16713 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01