- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Vehicle License Plate Recognition
- Hand Gesture Recognition Systems
- Image Processing and 3D Reconstruction
- Speech Recognition and Synthesis
- Video Analysis and Summarization
- Algorithms and Data Compression
- Web Data Mining and Analysis
- Human Pose and Action Recognition
Indian Institute of Technology Hyderabad
2016-2023
International Institute of Information Technology, Hyderabad
2017-2021
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...
The success of deep learning based models has centered around recent architectures and the availability of large scale annotated data. In this work, we explore these two factors systematically for improving handwritten word recognition on scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain specific...
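To make the CNN-RNN hybrid concrete, here is a minimal sketch of such an architecture trained with CTC loss, in the style of CRNN models commonly used for word recognition. Layer sizes, depth and the vocabulary size are illustrative assumptions, not the exact configuration from the paper.

```python
# Minimal CNN-RNN (CRNN-style) hybrid for word-image transcription.
# Illustrative sketch only: sizes and vocabulary are assumptions.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional feature extractor: collapses height, keeps width
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # pool height only, keep width steps
        )
        feat_h = img_height // 8           # height left after the three pools
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=False)
        self.fc = nn.Linear(512, num_classes)  # per-timestep class scores

    def forward(self, x):                  # x: (B, 1, H, W)
        f = self.cnn(x)                    # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (T, B, C*H')
        out, _ = self.rnn(f)
        return self.fc(out)                # (T, B, num_classes) for CTC loss

# CTC training step on random data (class 0 reserved as the CTC blank)
model = CRNN(num_classes=80)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)
logits = model(images).log_softmax(2)
targets = torch.randint(1, 80, (4, 10))
loss = ctc(logits, targets,
           input_lengths=torch.full((4,), logits.size(0), dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
loss.backward()
```

The key design point is that the convolutional stack collapses image height while preserving width, so each remaining column becomes one timestep for the recurrent transcription layers.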
Images in the medical domain are fundamentally different from general images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models for the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision, and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Vision-Language...
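As an illustration of the masked pretext task referred to above, the sketch below masks a fraction of the text tokens in a joint image-plus-text Transformer input and trains the model to recover them. All module names, dimensions and the 15% masking rate are assumptions for the sketch, not details taken from the paper.

```python
# Sketch of masked-language-modeling pretraining over joint
# image + text inputs. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, MASK_ID, D = 30522, 103, 256

class TinyVLEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.img_proj = nn.Linear(2048, D)   # project CNN image features
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(D, VOCAB)

    def forward(self, img_feats, token_ids):
        x = torch.cat([self.img_proj(img_feats), self.tok_emb(token_ids)], dim=1)
        h = self.encoder(x)
        # Predict vocabulary ids only for the text positions
        return self.mlm_head(h[:, img_feats.size(1):])

# Randomly mask 15% of text tokens and train to recover them
tokens = torch.randint(0, VOCAB, (2, 20))
img_feats = torch.randn(2, 5, 2048)          # e.g. 5 pooled region features
mask = torch.rand(tokens.shape, dtype=torch.float) < 0.15
inputs = tokens.masked_fill(mask, MASK_ID)

model = TinyVLEncoder()
logits = model(img_feats, inputs)
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask])              # loss only on masked positions
loss.backward()
```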
Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics with question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on elementary...
Handwriting recognition (HWR) in Indic scripts, like Devanagari, is very challenging due to the subtleties of the scripts, variations in rendering and the cursive nature of the handwriting. The lack of public handwriting datasets in Indic scripts has long stymied the development of offline handwritten word recognizers and made comparison across different methods a tedious task in the field. In this paper, we release a new dataset for Devanagari, IIIT-HW-Dev, to alleviate some of these issues. We benchmark the dataset using a CNN-RNN hybrid architecture. Furthermore,...
Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art approaches on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of a large quantity of annotated data. We overcome this by synthesizing...
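A common way to realize the synthetic-data strategy mentioned above is to render word images from Unicode fonts. The sketch below shows this with Pillow; the font path and word list are placeholders, and correct Arabic shaping additionally requires a layout-aware Pillow build (e.g. compiled with libraqm).

```python
# Sketch of rendering synthetic word images from Unicode fonts.
# Font path and words are placeholders, not taken from the paper.
import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, height=48):
    font = ImageFont.truetype(font_path, size=40)
    # Measure the rendered word to size the canvas
    left, top, right, bottom = font.getbbox(word)
    img = Image.new("L", (right - left + 10, height), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((5 - left, (height - (bottom - top)) // 2 - top), word,
              fill=random.randint(0, 60), font=font)  # varied ink darkness
    return img

words = ["مرحبا", "كتاب"]  # sample Arabic words (placeholders)
for i, w in enumerate(words):
    img = render_word(w, "/usr/share/fonts/truetype/noto/NotoNaskhArabic-Regular.ttf")
    img.save(f"synth_{i}.png")
```

In practice one would vary fonts, sizes, backgrounds and distortions across millions of such renders to stand in for scarce real annotations.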
Perceiving text is crucial to understanding the semantics of outdoor scenes, and hence it is a critical requirement for building intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State of the art...
In the Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often bilingual, trilingual or even more languages need to be supported. This demands the development of a multilingual OCR which can work seamlessly across Indic scripts. In our approach, the script is identified at the word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We...
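The paper proposes a single end-to-end network covering both steps; purely to illustrate the word-level script identification followed by per-script recognition, here is a sketch of the routing logic, with stub functions standing in for trained classifiers and recognizers.

```python
# Illustrative two-stage routing: identify the script of each word
# image, then dispatch to a script-specific recognizer. All models
# here are placeholder stubs, not the paper's joint architecture.
from typing import Callable, Dict, List, Tuple

def multilingual_ocr(word_images: List,
                     identify_script: Callable,
                     recognizers: Dict[str, Callable]) -> List[Tuple[str, str]]:
    """Route each word image to the recognizer for its predicted script."""
    results = []
    for img in word_images:
        script = identify_script(img)    # e.g. "Devanagari", "Telugu"
        text = recognizers[script](img)  # script-specific recognizer
        results.append((script, text))
    return results

# Hypothetical usage with stub models:
stub = lambda name: (lambda img: name)
out = multilingual_ocr(
    word_images=[object(), object()],
    identify_script=lambda img: "Devanagari",
    recognizers={"Devanagari": stub("<devanagari text>"),
                 "Telugu": stub("<telugu text>")})
print(out)
```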
Inspired by the success of Deep Learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts - Devanagari, Telugu and Malayalam. Synthetic word images rendered from Unicode fonts are used for training the recognition system. The performance is benchmarked on a new IIIT-ILST dataset comprising hundreds of real scene images containing text in the above mentioned scripts. We use a segmentation free, hybrid but end-to-end trainable CNN-RNN deep neural network for transcribing the word images to the corresponding...
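Since such CNN-RNN transcribers are typically trained with CTC (a training sketch appears earlier in this list), decoding their per-timestep outputs is usually done greedily: take the best class at each step, collapse repeats, and drop blanks. A minimal sketch, with a placeholder alphabet:

```python
# Greedy CTC decoding for a CNN-RNN transcription model:
# collapse repeated labels and remove blanks. Alphabet is a placeholder.
import torch

def ctc_greedy_decode(logits, alphabet, blank=0):
    """logits: (T, num_classes) per-timestep scores for one image."""
    best = logits.argmax(dim=1).tolist()
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # collapse repeats, skip blanks
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

alphabet = [""] + list("abcdefghijklmnopqrstuvwxyz")  # index 0 = CTC blank
print(ctc_greedy_decode(torch.randn(12, len(alphabet)), alphabet))
```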
This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs, where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was...
Handwriting recognition (HWR) in Indic scripts is a challenging problem due to the inherent subtleties of the scripts, the cursive nature of the handwriting and similarly shaped characters. The lack of publicly available datasets has affected the development of handwritten word recognizers and made direct comparisons across different methods an impossible task in the field. In this paper, we propose a framework for annotating large scale handwritten word images with ease and speed. We also release a new dataset for Telugu, which is collected and annotated using...
Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. Intricacies of the script and the absence of ample annotated data further act as adversaries to this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR. The proposed solution is not bounded by any language specific lexicon, with the model following a segmentation-free, sequence-to-sequence transcription...
Lecture videos are rich with textual information, and being able to understand the text is quite useful for larger video understanding/analysis applications. Though text recognition from images has been an active research area in computer vision, text in lecture videos has mostly been overlooked. In this work, we investigate the efficacy of state-of-the-art handwritten and scene text recognition methods on text in lecture videos. To this end, a new dataset - LectureVideoDB - compiled from frames of multiple lecture videos is introduced. Our experiments show that the existing methods do not fare well...
This paper presents the results of the Document Visual Question Answering Challenge organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem - Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image. The second task, on the other hand, is set as a retrieval task where a question is posed over a collection of images. For task 1 a new dataset is introduced comprising 50,000 questions-answer(s) pairs defined over 12,767 document images. For task 2 another dataset has been created comprising 20 questions over a collection of 14,362...
Video Question Answering methods focus on common-sense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that this textual information is complementary to the actions and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by...
The word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed by recognizing characters. High word accuracies are critical for many practical applications like content creation and text-to-speech systems. In order to detect and correct the misrecognised words, it is common to employ a post-processor module to improve word accuracy. However, conventional approaches to post-processing, like looking up a dictionary or using a statistical language model (SLM), are still limited. In such...
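As a reference point for the conventional approaches the abstract calls limited, here is a minimal dictionary-lookup post-processor that replaces each OCR word with its nearest dictionary entry by edit distance. The dictionary contents and distance threshold are placeholders.

```python
# Baseline dictionary-lookup OCR post-processor: substitute each
# misrecognised word with its closest dictionary entry, if close enough.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def correct(word, dictionary, max_dist=2):
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

dictionary = {"recognition", "character", "accuracy"}  # placeholder lexicon
print(correct("recogniton", dictionary))   # -> "recognition"
```

Its limitation is exactly the one the abstract points to: the lookup ignores sentence context, so out-of-vocabulary words and real-word errors are left uncorrected.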