NFDI4DS | UHH-SEMS - Publication Details

Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

OPENALEX - Publications

Abhirama Subramanyam Penamakuri Manish Gupta Mithun Das Gupta Anand Mishra

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively different and more challenging than the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our...

10.24963/ijcai.2023/146 article EN 2023-08-01

Contrastive Multi-View Textual-Visual Encoding: Towards One Hundred Thousand-Scale One-Shot Logo Identification

Nakul Sharma Abhirama Subramanyam Penamakuri Anand Mishra

In this paper, we study the problem of identifying logos of business brands in natural scenes in an open-set one-shot setting. This problem setup is significantly more challenging than traditionally-studied 'closed-set' and 'large-scale training samples per category' logo recognition settings. We propose a novel multi-view textual-visual encoding framework that encodes text appearing in the logos as well as the graphical design of the logos to learn robust contrastive representations. These representations are jointly learned for...

10.1145/3571600.3571625 preprint EN 2022-12-08

Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant

Abhirama Subramanyam Penamakuri Anand Mishra

We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL - a principled approach to perform visual text entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal model to jointly reason using textual and visual context obtained using surrounding cues in the image to link the visual text entity to the correct knowledge base entity. (ii) We present KaLMA - a knowledge-aware large multimodal assistant that augments an...

10.48550/arxiv.2410.19144 preprint EN arXiv (Cornell University) 2024-10-24

Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant

Abhirama Subramanyam Penamakuri Anand Mishra

10.18653/v1/2024.emnlp-main.1151 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

COFAR: Commonsense and Factual Reasoning in Image Search

Prajwal Gatti Abhirama Subramanyam Penamakuri Revant Teotia Anand Mishra Shubhashis Sengupta and 1 more

One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries - (i) "a queue of customers patiently waiting to buy ice cream" and (ii) "a queue of tourists going to see a famous Mughal architecture in India." Interpreting these queries requires one to reason with (i) Commonsense such as interpreting people as customers or tourists, actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual...

10.48550/arxiv.2210.08554 preprint EN cc-by-nc-sa arXiv (Cornell University) 2022-01-01

COFAR: Commonsense and Factual Reasoning in Image Search

Prajwal Gatti Abhirama Subramanyam Penamakuri Revant Teotia Anand Mishra Shubhashis Sengupta and 1 more

10.18653/v1/2022.aacl-main.87 article EN 2022-01-01

Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Abhirama Subramanyam Penamakuri Manish Gupta Mithun Das Gupta Anand Mishra

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively different and more challenging than the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our...

10.48550/arxiv.2306.16713 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01