NFDI4DS | UHH-SEMS - Publication Details

Lumos : Empowering Multimodal LLMs with Scene Text Recognition

OPENALEX - Publications

Ashish Shenoy Yichao Lu Srihari Jayakumar Debojeet Chatterjee Mohsen Moslehpour and 9 more

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images, the output of which is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges and discuss the system architecture, design choices,...

10.48550/arxiv.2402.08017 preprint EN arXiv (Cornell University) 2024-02-12

ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

OPENALEX - Publications

Ashish Shenoy Sravan Bodapati Katrin Kirchhoff

Automatic Speech Recognition (ASR) robustness toward slot entities is critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross-utterance contextual cues play an important role in disambiguating domain-specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness, and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses....

10.18653/v1/2021.ecnlp-1.3 preprint EN cc-by 2021-01-01

Adapting Long Context NLM for ASR Rescoring in Conversational Agents

OPENALEX - Publications

Ashish Shenoy Sravan Bodapati Monica Sunkara Srikanth Ronanki Katrin Kirchhoff

Neural Language Models (NLMs), when trained and evaluated with context spanning multiple utterances, have been shown to consistently outperform both conventional n-gram language models and NLMs that use limited context. In this paper, we investigate various techniques to incorporate turn-based context history into both recurrent (LSTM) and Transformer-XL based NLMs. For recurrent NLMs, we explore a context carry-over mechanism and feature-based augmentation, where we incorporate other forms of contextual information such as bot response and system dialogue acts as classified...

10.21437/interspeech.2021-1849 article EN Interspeech 2021 2021-08-27

Now It Sounds Like You: Learning Personalized Vocabulary On Device

OPENALEX - Publications

Ashish Shenoy Sid Wang Pierce Chuang John Nguyen

In recent years, Federated Learning (FL) has shown significant advancements in its ability to perform various natural language processing (NLP) tasks. This work focuses on applying personalized FL for on-device language modeling. Due to limitations of memory and latency, these models cannot support the complexity of sub-word tokenization or beam search decoding, resulting in the decision to deploy a closed-vocabulary language model. However, closed-vocabulary models are unable to handle out-of-vocabulary (OOV) words belonging to specific users. To address...

10.1609/aaaiss.v3i1.31224 article EN Proceedings of the AAAI Symposium Series 2024-05-20

Lumos : Empowering Multimodal LLMs with Scene Text Recognition

OPENALEX - Publications

Ashish Shenoy Yichao Lu Srihari Jayakumar Debojeet Chatterjee Mohsen Moslehpour and 9 more

We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first-person point-of-view images, the output of which is used to augment the input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to STR quality, overall latency, and model inference. In this paper, we delve into those challenges and discuss the system architecture, design choices,...

10.1145/3637528.3671633 article EN cc-by Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2024-08-24

EgoQR: Efficient QR Code Reading in Egocentric Settings

OPENALEX - Publications

Mohsen Moslehpour Yichao Lu Pierce Chuang Ashish Shenoy Debojeet Chatterjee and 5 more

QR codes have become ubiquitous in daily life, enabling rapid information exchange. With the increasing adoption of smart wearable devices, there is a need for efficient and frictionless QR code reading capabilities from egocentric points of view. However, adapting existing phone-based QR code readers to egocentric images poses significant challenges. Code reading from egocentric images brings unique challenges such as wide field-of-view, code distortion, and lack of visual feedback compared to phones, where users can adjust the position and framing....

10.48550/arxiv.2410.05497 preprint EN arXiv (Cornell University) 2024-10-07

Now It Sounds Like You: Learning Personalized Vocabulary On Device

OPENALEX - Publications

Sid Wang Ashish Shenoy Pierce Chuang John Nguyen

In recent years, Federated Learning (FL) has shown significant advancements in its ability to perform various natural language processing (NLP) tasks. This work focuses on applying personalized FL for on-device language modeling. Due to limitations of memory and latency, these models cannot support the complexity of sub-word tokenization or beam search decoding, resulting in the decision to deploy a closed-vocabulary language model. However, closed-vocabulary models are unable to handle out-of-vocabulary (OOV) words belonging to specific users. To address...

10.48550/arxiv.2305.03584 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages

OPENALEX - Publications

Danish Mohammed Ebadulla Rahul Raman Natarajan Sundaram Hridhay Kiran Shetty Ashish Shenoy

Current research in zero-shot translation is plagued by several issues such as high compute requirements, increased training time, and off-target translations. Proposed remedies often come at the cost of additional data or compute requirements. Pivot-based neural machine translation is preferred over a single-encoder model for most settings despite the increased training and evaluation time. In this work, we overcome the shortcomings of zero-shot translation by taking advantage of transliteration and linguistic similarity. We build a single encoder-decoder neural machine translation system...

10.48550/arxiv.2308.05574 preprint EN other-oa arXiv (Cornell University) 2023-01-01