Minesh Mathew

ORCID: 0000-0002-0809-2590
Research Areas
  • Handwritten Text Recognition Techniques
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Topic Modeling
  • Advanced Image and Video Retrieval Techniques
  • Vehicle License Plate Recognition
  • Hand Gesture Recognition Systems
  • Image Processing and 3D Reconstruction
  • Speech Recognition and Synthesis
  • Video Analysis and Summarization
  • Algorithms and Data Compression
  • Web Data Mining and Analysis
  • Human Pose and Action Recognition

Indian Institute of Technology Hyderabad
2016-2023

International Institute of Information Technology, Hyderabad
2017-2021

We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...

10.1109/wacv48630.2021.00225 article EN 2021-01-01

The success of deep learning based models has centered around recent architectures and the availability of large-scale annotated data. In this work, we explore these two factors systematically for improving handwritten text recognition on scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain-specific...

10.1109/icfhr-2018.2018.00023 article EN 2018-08-01

Images in the medical domain are fundamentally different from general images. Consequently, it is infeasible to directly employ general Visual Question Answering (VQA) models for the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Vision-Language...
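Masked pretraining of this kind follows the same recipe as masked language modelling: a fraction of input tokens is hidden and the model is trained to reconstruct them. A minimal sketch of the token-masking step (the 15% ratio and `[MASK]` token follow the common BERT convention; this is an assumption for illustration, not the paper's implementation):

```python
import random

MASK_TOKEN = "[MASK]"
MASK_RATIO = 0.15  # BERT-style convention; assumed, not from the paper

def mask_tokens(tokens, ratio=MASK_RATIO, rng=random):
    """Replace a random ~ratio fraction of tokens with [MASK];
    return the corrupted sequence and the reconstruction targets."""
    n_mask = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]   # remember the original token
        corrupted[pos] = MASK_TOKEN
    return corrupted, targets

rng = random.Random(0)  # seeded for reproducibility
tokens = "chest x ray shows mild cardiomegaly".split()  # hypothetical caption
corrupted, targets = mask_tokens(tokens, rng=rng)
print(corrupted, targets)
```

The pretraining objective then scores the model's predictions at the masked positions against `targets`; everything else in the sequence serves as context.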

10.1109/isbi48211.2021.9434063 article EN 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI) 2021-04-13

Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using the Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics with question-answer annotations. The questions require methods that can jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on questions requiring elementary...

10.1109/wacv51458.2022.00264 article EN 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022-01-01

Handwriting recognition (HWR) in Indic scripts, like Devanagari, is very challenging due to the subtleties in the scripts, variations in rendering and the cursive nature of the handwriting. Lack of public handwriting datasets in Indic scripts has long stymied the development of offline handwritten word recognizers and made comparison across different methods a tedious task in the field. In this paper, we release a new handwritten word dataset for Devanagari, IIIT-HW-Dev, to alleviate some of these issues. We benchmark the dataset using a CNN-RNN hybrid architecture. Furthermore,...

10.1109/das.2018.69 article EN 2018-04-01

Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art approaches on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesizing...

10.1109/asar.2017.8067754 article EN 2017-04-01

Perceiving text is crucial to understanding the semantics of outdoor scenes, and hence is a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled with text in mind. This paper introduces a new "RoadText-1K" dataset of text in driving videos. The dataset is 20 times larger than the largest existing dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art...

10.1109/icra40945.2020.9196577 article EN 2020-05-01

In the Indian scenario, a document analysis system has to support multiple languages at the same time. With the emerging multilingualism in urban India, bilingual, trilingual or even more languages often need to be supported. This demands the development of a multilingual OCR which can work seamlessly across Indic scripts. In our approach, the script is identified at the word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We...

10.1109/das.2016.68 article EN 2016-04-01

Inspired by the success of Deep Learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts - Devanagari, Telugu and Malayalam. Synthetic word images rendered from Unicode fonts are used for training the recognition system. The performance is benchmarked on a new dataset, IIIT-ILST, comprising hundreds of real scene images containing text in the above mentioned scripts. We use a segmentation free, hybrid but end-to-end trainable CNN-RNN deep neural network for transcribing the word images to the corresponding...

10.1109/icdar.2017.364 article EN 2017-11-01

This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any VQA system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs, where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was...

10.1109/icdar.2019.00251 article EN 2019-09-01

Handwriting recognition (HWR) in Indic scripts is a challenging problem due to the inherent subtleties in the scripts, the cursive nature of the handwriting and similar shaped characters. Lack of publicly available handwriting datasets has affected the development of handwritten word recognizers and made direct comparisons across different methods an impossible task in the field. In this paper, we propose a framework for annotating large scale handwritten word images with ease and speed. We also release a new handwritten word dataset for Telugu, which was collected and annotated using...

10.1109/icfhr-2018.2018.00015 article EN 2018-08-01

Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. Intricacies of the script and the absence of ample annotated data further act as adversaries to this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR. The solution proposed is not bounded by any language specific lexicon, with the model following a segmentation-free, sequence-to-sequence transcription...

10.1109/acpr.2017.5 article EN 2017-11-01

Lecture videos are rich with textual information, and being able to understand the text is quite useful for larger video understanding/analysis applications. Though text recognition from images has been an active research area in computer vision, text in lecture videos has mostly been overlooked. In this work, we investigate the efficacy of state-of-the-art handwritten and scene text recognition methods on text in lecture videos. To this end, a new dataset - LectureVideoDB, compiled from frames of multiple lecture videos, is introduced. Our experiments show that the existing methods do not fare well...

10.1109/icfhr-2018.2018.00049 article EN 2018-08-01

This paper presents the results of the Document Visual Question Answering Challenge organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem - VQA on document images. It comprised two tasks. The first task concerns asking questions on a single document image. On the other hand, the second task is set in a retrieval setting where a question is posed over a collection of images. For task 1 a new dataset is introduced comprising 50,000 questions-answer(s) pairs defined over 12,767 document images; for task 2 another dataset has been created comprising 20 questions over a collection of 14,362...

10.48550/arxiv.2008.08899 preprint EN other-oa arXiv (Cornell University) 2020-01-01

10.1007/s10032-021-00383-3 article EN International Journal on Document Analysis and Recognition (IJDAR) 2021-08-06

Video Question Answering methods focus on common-sense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos, which require QA systems to comprehend and answer questions about the topics presented by...

10.1109/wacv56688.2023.00442 article EN 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023-01-01

We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...

10.48550/arxiv.2007.00398 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Word error rate (WER) of an OCR is often higher than its character error rate (CER). This is especially true when OCRs are designed by recognizing individual characters. High word accuracies are critical for many practical applications like content creation and text-to-speech based systems. In order to detect and correct the misrecognised words, it is common to employ a post-processor module to improve the word accuracy. However, the conventional approaches to post-processing, like looking up a dictionary or using a statistical language model (SLM), are still limited. In such...
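The gap between the two metrics follows directly from how they are computed: both are edit distances normalized by reference length, but one wrong character makes the entire word count as wrong at the word level. A small self-contained illustration (the example strings are hypothetical, not from the paper):

```python
# Word and character error rates via Levenshtein (edit) distance.
# Shows why WER exceeds CER: one wrong character fails the whole word.
# Example strings are illustrative, not taken from the paper.

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length (chars or words)."""
    return levenshtein(ref_units, hyp_units) / len(ref_units)

ref = "the quick brown fox"
hyp = "the quick brown fax"                 # one single-character error
cer = error_rate(ref, hyp)                  # 1 edit over 19 characters
wer = error_rate(ref.split(), hyp.split())  # 1 wrong word out of 4
print(f"CER={cer:.3f} WER={wer:.3f}")       # -> CER=0.053 WER=0.250
```

A post-processor that corrects "fax" back to "fox" removes a full word error while fixing only a single character, which is why even modest post-processing can lift word accuracy substantially.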

10.1109/icdar.2019.00110 preprint EN 2019-09-01

Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art approaches on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data. We overcome this by synthesising...

10.48550/arxiv.1711.02396 preprint EN other-oa arXiv (Cornell University) 2017-01-01