- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Vehicle License Plate Recognition
- Hand Gesture Recognition Systems
- Image Processing and 3D Reconstruction
- Speech Recognition and Synthesis
- Video Analysis and Summarization
- Algorithms and Data Compression
- Web Data Mining and Analysis
- Human Pose and Action Recognition
Indian Institute of Technology Hyderabad
2016-2023
International Institute of Information Technology, Hyderabad
2017-2021
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...
The success of deep learning based models has centered around recent architectures and the availability of large scale annotated data. In this work, we explore these two factors systematically for improving handwritten word recognition on scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain specific...
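To make the CNN-RNN hybrid concrete, here is a minimal sketch of such an architecture trained with CTC loss, in the style of CRNN models commonly used for word recognition. Layer sizes, depth and the vocabulary size are illustrative assumptions, not the exact configuration from the paper.

```python
# Minimal CNN-RNN (CRNN-style) hybrid for word-image transcription.
# Illustrative sketch only: sizes and vocabulary are assumptions.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional feature extractor: collapses height, keeps width
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # pool height only, keep width steps
        )
        feat_h = img_height // 8           # height left after the three pools
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=False)
        self.fc = nn.Linear(512, num_classes)  # per-timestep class scores

    def forward(self, x):                  # x: (B, 1, H, W)
        f = self.cnn(x)                    # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(3, 0, 1, 2).reshape(w, b, c * h)  # (T, B, C*H')
        out, _ = self.rnn(f)
        return self.fc(out)                # (T, B, num_classes) for CTC loss

# CTC training step on random data (class 0 reserved as the CTC blank)
model = CRNN(num_classes=80)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
images = torch.randn(4, 1, 32, 128)
logits = model(images).log_softmax(2)
targets = torch.randint(1, 80, (4, 10))
loss = ctc(logits, targets,
           input_lengths=torch.full((4,), logits.size(0), dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
loss.backward()
```

The key design point is that the convolutional stack collapses image height while preserving width, so each remaining column becomes one timestep for the recurrent transcription layers.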
Images in the medical domain are fundamentally different from general images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models for the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision, and Language tasks. Our method involves learning richer medical image and text semantic representations using Masked Vision-Language...
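As an illustration of the masked pretext task referred to above, the sketch below masks a fraction of the text tokens in a joint image-plus-text Transformer input and trains the model to recover them. All module names, dimensions and the 15% masking rate are assumptions for the sketch, not details taken from the paper.

```python
# Sketch of masked-language-modeling pretraining over joint
# image + text inputs. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, MASK_ID, D = 30522, 103, 256

class TinyVLEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.img_proj = nn.Linear(2048, D)   # project CNN image features
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(D, VOCAB)

    def forward(self, img_feats, token_ids):
        x = torch.cat([self.img_proj(img_feats), self.tok_emb(token_ids)], dim=1)
        h = self.encoder(x)
        # Predict vocabulary ids only for the text positions
        return self.mlm_head(h[:, img_feats.size(1):])

# Randomly mask 15% of text tokens and train to recover them
tokens = torch.randint(0, VOCAB, (2, 20))
img_feats = torch.randn(2, 5, 2048)          # e.g. 5 pooled region features
mask = torch.rand(tokens.shape, dtype=torch.float) < 0.15
inputs = tokens.masked_fill(mask, MASK_ID)

model = TinyVLEncoder()
logits = model(img_feats, inputs)
loss = nn.functional.cross_entropy(
    logits[mask], tokens[mask])              # loss only on masked positions
loss.backward()
```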
Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images by using a Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics with question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on elementary...
Handwriting recognition (HWR) in Indic scripts, like Devanagari, is very challenging due to the subtleties of the scripts, variations in rendering and the cursive nature of the handwriting. The lack of public handwriting datasets in Indic scripts has long stymied the development of offline handwritten word recognizers and made comparison across different methods a tedious task in the field. In this paper, we release a new dataset for Devanagari, IIIT-HW-Dev, to alleviate some of these issues. We benchmark the dataset using a CNN-RNN hybrid architecture. Furthermore,...
Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art approaches on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of a large quantity of annotated data. We overcome this by synthesizing...
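A common way to realize the synthetic-data strategy mentioned above is to render word images from Unicode fonts. The sketch below shows this with Pillow; the font path and word list are placeholders, and correct Arabic shaping additionally requires a layout-aware Pillow build (e.g. compiled with libraqm).

```python
# Sketch of rendering synthetic word images from Unicode fonts.
# Font path and words are placeholders, not taken from the paper.
import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, height=48):
    font = ImageFont.truetype(font_path, size=40)
    # Measure the rendered word to size the canvas
    left, top, right, bottom = font.getbbox(word)
    img = Image.new("L", (right - left + 10, height), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((5 - left, (height - (bottom - top)) // 2 - top), word,
              fill=random.randint(0, 60), font=font)  # varied ink darkness
    return img

words = ["مرحبا", "كتاب"]  # sample Arabic words (placeholders)
for i, w in enumerate(words):
    img = render_word(w, "/usr/share/fonts/truetype/noto/NotoNaskhArabic-Regular.ttf")
    img.save(f"synth_{i}.png")
```

In practice one would vary fonts, sizes, backgrounds and distortions across millions of such renders to stand in for scarce real annotations.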
Perceiving text is crucial to understanding the semantics of outdoor scenes, and hence it is a critical requirement for building intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State of the art...
In the Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often bilingual, trilingual or even more languages need to be supported. This demands the development of a multilingual OCR which can work seamlessly across Indic scripts. In our approach, the script is identified at the word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We...
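The paper proposes a single end-to-end network covering both steps; purely to illustrate the word-level script identification followed by per-script recognition, here is a sketch of the routing logic, with stub functions standing in for trained classifiers and recognizers.

```python
# Illustrative two-stage routing: identify the script of each word
# image, then dispatch to a script-specific recognizer. All models
# here are placeholder stubs, not the paper's joint architecture.
from typing import Callable, Dict, List, Tuple

def multilingual_ocr(word_images: List,
                     identify_script: Callable,
                     recognizers: Dict[str, Callable]) -> List[Tuple[str, str]]:
    """Route each word image to the recognizer for its predicted script."""
    results = []
    for img in word_images:
        script = identify_script(img)    # e.g. "Devanagari", "Telugu"
        text = recognizers[script](img)  # script-specific recognizer
        results.append((script, text))
    return results

# Hypothetical usage with stub models:
stub = lambda name: (lambda img: name)
out = multilingual_ocr(
    word_images=[object(), object()],
    identify_script=lambda img: "Devanagari",
    recognizers={"Devanagari": stub("<devanagari text>"),
                 "Telugu": stub("<telugu text>")})
print(out)
```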
Inspired by the success of Deep Learning based approaches to English scene text recognition, we pose and benchmark scene text recognition for three Indic scripts - Devanagari, Telugu and Malayalam. Synthetic word images rendered from Unicode fonts are used for training the recognition system. The performance is benchmarked on a new IIIT-ILST dataset comprising hundreds of real scene images containing text in the above mentioned scripts. We use a segmentation free, hybrid but end-to-end trainable CNN-RNN deep neural network for transcribing the word images to the corresponding...
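Since such CNN-RNN transcribers are typically trained with CTC (a training sketch appears earlier in this list), decoding their per-timestep outputs is usually done greedily: take the best class at each step, collapse repeats, and drop blanks. A minimal sketch, with a placeholder alphabet:

```python
# Greedy CTC decoding for a CNN-RNN transcription model:
# collapse repeated labels and remove blanks. Alphabet is a placeholder.
import torch

def ctc_greedy_decode(logits, alphabet, blank=0):
    """logits: (T, num_classes) per-timestep scores for one image."""
    best = logits.argmax(dim=1).tolist()
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # collapse repeats, skip blanks
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)

alphabet = [""] + list("abcdefghijklmnopqrstuvwxyz")  # index 0 = CTC blank
print(ctc_greedy_decode(torch.randn(12, len(alphabet)), alphabet))
```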
This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs, where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was...
Handwriting recognition (HWR) in Indic scripts is a challenging problem due to the inherent subtleties of the scripts, the cursive nature of the handwriting and similarly shaped characters. The lack of publicly available datasets has affected the development of handwritten word recognizers and made direct comparisons across different methods an impossible task in the field. In this paper, we propose a framework for annotating large scale handwritten word images with ease and speed. We also release a new dataset for Telugu, which is collected and annotated using...
Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. Intricacies of the script and the absence of ample annotated data further act as adversaries to this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR. The proposed solution is not bounded by any language specific lexicon, with the model following a segmentation-free, sequence-to-sequence transcription...
Lecture videos are rich with textual information, and being able to understand the text is quite useful for larger video understanding/analysis applications. Though text recognition from images has been an active research area in computer vision, text in lecture videos has mostly been overlooked. In this work, we investigate the efficacy of state-of-the-art handwritten and scene text recognition methods on text in lecture videos. To this end, a new dataset - LectureVideoDB - compiled from frames of multiple lecture videos is introduced. Our experiments show that the existing methods do not fare well...
This paper presents the results of the Document Visual Question Answering Challenge organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem - Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image. The second task, on the other hand, is set as a retrieval task where a question is posed over a collection of images. For task 1 a new dataset is introduced comprising 50,000 questions-answer(s) pairs defined over 12,767 document images. For task 2 another dataset has been created comprising 20 questions over a collection of 14,362...
Video Question Answering methods focus on common-sense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that this textual information is complementary to the actions and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by...
The word error rate of an OCR is often higher than its character error rate. This is especially true when OCRs are designed by recognizing characters. High word accuracies are critical for many practical applications like content creation and text-to-speech systems. In order to detect and correct the misrecognised words, it is common to employ a post-processor module to improve word accuracy. However, conventional approaches to post-processing, like looking up a dictionary or using a statistical language model (SLM), are still limited. In such...
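As a reference point for the conventional approaches the abstract calls limited, here is a minimal dictionary-lookup post-processor that replaces each OCR word with its nearest dictionary entry by edit distance. The dictionary contents and distance threshold are placeholders.

```python
# Baseline dictionary-lookup OCR post-processor: substitute each
# misrecognised word with its closest dictionary entry, if close enough.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def correct(word, dictionary, max_dist=2):
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

dictionary = {"recognition", "character", "accuracy"}  # placeholder lexicon
print(correct("recogniton", dictionary))   # -> "recognition"
```

Its limitation is exactly the one the abstract points to: the lookup ignores sentence context, so out-of-vocabulary words and real-word errors are left uncorrected.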