Anand Mishra

ORCID: 0000-0002-7806-2557
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Multimodal Machine Learning Applications
  • Handwritten Text Recognition Techniques
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Visual Attention and Saliency Detection
  • Face Recognition and Analysis
  • Topic Modeling
  • Image Retrieval and Classification Techniques
  • Vehicle License Plate Recognition
  • Natural Language Processing Techniques
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Image and Signal Denoising Methods
  • Digital Media Forensic Detection
  • 3D Shape Modeling and Analysis
  • Generative Adversarial Networks and Image Synthesis
  • Algorithms and Data Compression
  • Face and Expression Recognition
  • Advanced Image Processing Techniques
  • Management and Marketing Education
  • Information Retrieval and Data Mining
  • Hand Gesture Recognition Systems
  • Intelligent Tutoring Systems and Adaptive Learning
  • Image Enhancement Techniques

Indian Institute of Technology Jodhpur
2019-2024

Office for Students
2023

Sinhgad Dental College and Hospital
2023

Indian Institute of Science Bangalore
2018-2019

Indian Institute of Technology Hyderabad
2011-2017

International Institute of Information Technology, Hyderabad
2016

Center for Visual Communication (United States)
2012

Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections in the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained...

10.1109/cvpr.2012.6247990 preprint EN 2012 IEEE Conference on Computer Vision and Pattern Recognition 2012-06-01
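The joint inference the abstract describes, unary detection scores combined with pairwise interactions between adjacent characters, can be sketched as exact dynamic programming over a chain. The sketch below is illustrative only: the detection costs and bigram preferences are made-up toy numbers, not values from the paper.

```python
# Hedged sketch: joint word inference over per-window character detections,
# in the spirit of a chain CRF with unary (detection) and pairwise (bigram)
# terms. All scores below are synthetic, for illustration only.

def viterbi_word(unaries, bigram_cost):
    """unaries: list of dicts {char: detection_cost}, one per sliding window.
    bigram_cost(a, b): pairwise cost of placing b immediately after a.
    Returns the minimum-cost character sequence (exact for a chain model)."""
    best = {c: (cost, c) for c, cost in unaries[0].items()}  # char -> (cost, word)
    for u in unaries[1:]:
        nxt = {}
        for c, cost in u.items():
            # Pick the cheapest predecessor, accounting for the pairwise term.
            prev_c, (prev_cost, word) = min(
                best.items(), key=lambda kv: kv[1][0] + bigram_cost(kv[0], c))
            nxt[c] = (prev_cost + bigram_cost(prev_c, c) + cost, word + c)
        best = nxt
    return min(best.values())[1]

# Toy example: three windows, each with two candidate character detections.
unaries = [{'O': 0.2, 'D': 0.9}, {'P': 0.4, 'F': 0.5}, {'E': 0.3, 'N': 0.35}]
favoured = {('O', 'P'): 0.0, ('P', 'E'): 0.0}        # preferred bigrams
bigram = lambda a, b: favoured.get((a, b), 1.0)      # others penalized
print(viterbi_word(unaries, bigram))  # -> 'OPE'
```

Exact inference is possible here because the toy model is a chain; the paper's CRF over character detections is richer, but the unary-plus-pairwise structure is the same.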

The problem of answering questions about an image is popularly known as visual question answering (or VQA in short). It is a well-established problem in computer vision. However, none of the current methods utilize the text often present in the image. These "texts in images" provide additional useful cues and facilitate better understanding of the image content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR. We refer to this problem as OCR-VQA. To facilitate a systematic way of studying this new problem, we introduce a large-scale dataset,...

10.1109/icdar.2019.00156 article EN 2019-09-01

Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natural Language Processing and Artificial Intelligence (AI). In conventional VQA, one may ask questions about an image which can be answered purely based on its content. For example, given an image with people in it, a typical VQA question may inquire about the number of people in the image. More recently, there is growing interest in answering questions which require commonsense knowledge involving common nouns (e.g., cats, dogs, microphones) present...

10.1609/aaai.v33i01.33018876 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2019-07-17

Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections, due to large intra-class variations, even more so than characters in scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we represent the scene text image and synthetic images generated...

10.1109/icdar.2013.87 preprint EN 2013-08-01

Inspired by the success of MRF models for solving object segmentation problems, we formulate the binarization problem in this framework. We represent the pixels of a document image as random variables in an MRF, and introduce a new energy (or cost) function on these variables. Each variable takes a foreground or background label, and the quality of the binarization (or the labelling) is determined by the value of the energy function. We minimize the energy function, i.e. find the optimal binarization, using an iterative graph cut scheme. Our model is robust to variations in colours as we use...

10.1109/icdar.2011.12 article EN International Conference on Document Analysis and Recognition 2011-09-01
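The energy described above has the familiar unary-plus-smoothness form. The paper minimizes it with an iterative graph cut; as a hedged stand-in, the sketch below runs simple ICM (iterative conditional modes) sweeps on a toy grayscale patch, with assumed foreground/background intensities and a Potts pairwise term. All numbers are synthetic.

```python
# Hedged sketch: MRF-style binarization of a tiny grayscale "document" patch.
# Unary cost: distance of a pixel to assumed foreground (dark) or background
# (light) intensity; pairwise cost: a Potts smoothness term. ICM sweeps are
# used here as an illustrative stand-in for the paper's iterative graph cut.

def binarize_mrf(img, fg=0.0, bg=1.0, lam=0.3, sweeps=5):
    h, w = len(img), len(img[0])
    # Initial labelling by thresholding (1 = foreground text, 0 = background).
    labels = [[1 if img[y][x] < 0.5 else 0 for x in range(w)] for y in range(h)]
    def unary(y, x, l):
        return abs(img[y][x] - (fg if l else bg))
    def pairwise(y, x, l):
        cost = 0.0
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != l:
                cost += lam  # Potts penalty for disagreeing with a neighbour
        return cost
    for _ in range(sweeps):  # ICM: greedily move each pixel to its cheaper label
        for y in range(h):
            for x in range(w):
                costs = [unary(y, x, l) + pairwise(y, x, l) for l in (0, 1)]
                labels[y][x] = 0 if costs[0] <= costs[1] else 1
    return labels

# A vertical stroke in column 1; the lone noisy pixel at (1, 3) is smoothed away.
img = [[0.90, 0.10, 0.90, 0.92],
       [0.88, 0.15, 0.90, 0.45],
       [0.90, 0.12, 0.85, 0.90]]
print(binarize_mrf(img))  # -> [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
```

ICM only reaches a local minimum; graph cuts, as used in the paper, solve this binary submodular energy exactly, which is why the paper's iterative scheme is preferable on real images.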

We present an approach for the text-to-image retrieval problem based on the textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being built on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach to find approximate locations of the characters of the query, and then impose spatial...

10.1109/iccv.2013.378 preprint EN 2013-12-01
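The query-driven idea, matching the query's characters approximately and enforcing a spatial ordering rather than running a full recognition pipeline, can be sketched as a small dynamic program over per-image character detections. The detections and confidences below are hypothetical stand-ins, not the paper's features.

```python
# Hedged sketch of query-driven retrieval: score an image by laying out the
# query's characters left-to-right among (character, x, confidence) detections,
# without an exact localization-and-recognition pipeline. Data is synthetic.

def score_image(query, detections):
    """Best total confidence of the query's chars at increasing x positions."""
    dets = sorted(detections, key=lambda d: d[1])   # sort detections by x
    best = [0.0] * (len(query) + 1)                 # best[k]: k chars matched
    for ch, _x, conf in dets:
        for k in range(len(query) - 1, -1, -1):     # extend longer prefixes first
            if ch == query[k] and (k == 0 or best[k] > 0):
                best[k + 1] = max(best[k + 1], best[k] + conf)
    return best[len(query)]  # 0.0 if the query cannot be laid out in order

images = {
    'img_a': [('O', 10, 0.9), ('P', 25, 0.8), ('E', 40, 0.7), ('N', 60, 0.9)],
    'img_b': [('N', 5, 0.9), ('O', 20, 0.8), ('P', 35, 0.6)],
}
ranked = sorted(images, key=lambda im: -score_image('OPEN', images[im]))
print(ranked)  # -> ['img_a', 'img_b']
```

Note how 'img_b' scores zero even though it contains three of the query's characters: the spatial constraint rejects it because the characters cannot be ordered left to right as "OPEN".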

Text present in images is not merely a set of strings; it provides useful cues about the image. Despite their utility in better image understanding, scene texts are not used in traditional visual question answering (VQA) models. In this work, we present a VQA model which can read scene texts and perform reasoning on a knowledge graph to arrive at an accurate answer. Our proposed model has three mutually interacting modules: i. a proposal module to get word and visual content proposals from the image, ii. a fusion module to fuse these proposals and the knowledge base to mine relevant facts,...

10.1109/iccv.2019.00470 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

Computer programming textbooks and software documentations often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python codes. To this end, inspired by the recent success of the natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for...

10.48550/arxiv.2501.17441 preprint EN arXiv (Cornell University) 2025-01-29

We present an approach for automatically identifying the script of text localized in scene images. Our approach is inspired by the advancements in mid-level features. We represent the images using mid-level features which are pooled from densely computed local features. Once represented with the proposed feature representation, we use an off-the-shelf classifier to identify the script of the image. Our approach is efficient and requires very little labeled data. We evaluate the performance of our method on the recently introduced CVSI dataset, demonstrating that it can correctly identify the script of 96.70% of the text images. In...

10.1109/das.2016.57 article EN 2016-04-01
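The pipeline shape described above, dense local features pooled into one image-level representation and fed to an off-the-shelf classifier, can be sketched in a few lines. Everything here is a toy stand-in: 2-D "descriptors" instead of real local features, mean pooling, and a nearest-centroid rule instead of the paper's classifier.

```python
# Hedged sketch of the script-identification pipeline shape: pool densely
# computed local descriptors into one image-level vector, then classify with
# an off-the-shelf rule (nearest centroid here). All data is synthetic.

def average_pool(local_descs):
    """Mean-pool a list of equal-length local feature vectors."""
    n = len(local_descs)
    return [sum(d[i] for d in local_descs) / n for i in range(len(local_descs[0]))]

def nearest_centroid(train, query):
    """train: {script_name: [pooled_vector, ...]}. Returns the closest script."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = {
        s: [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]
        for s, vecs in train.items()
    }
    return min(centroids, key=lambda s: dist(centroids[s], query))

# Hypothetical 2-D descriptors: imagine one dimension per stroke statistic.
latin = [average_pool([[0.1, 0.8], [0.2, 0.9]]), average_pool([[0.15, 0.85]])]
devanagari = [average_pool([[0.9, 0.2], [0.8, 0.1]]), average_pool([[0.85, 0.15]])]
query = average_pool([[0.88, 0.18], [0.80, 0.20]])
print(nearest_centroid({'Latin': latin, 'Devanagari': devanagari}, query))
# -> 'Devanagari'
```

The point of the sketch is the two-stage structure: pooling makes the representation fixed-length regardless of how many local features an image yields, which is what lets a simple classifier work with very little labeled data.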

10.1007/s10032-017-0283-9 article EN International Journal on Document Analysis and Recognition (IJDAR) 2017-04-03

We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. For such a setting, a model must first retrieve relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). RETVQA is distinctively different from, and more challenging than, the traditionally-studied Visual Question Answering (VQA), where a question has to be answered with a single relevant image in context. Towards solving this task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our...

10.24963/ijcai.2023/146 article EN 2023-08-01

Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch...

10.1609/aaai.v38i3.27956 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in a target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful detection method, namely DETR (Detection Transformer), and introduce a novel approach, the query-guided transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit spatio-temporal context from the video, which significantly aids in precisely...

10.1609/aaai.v38i3.28063 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

Massive Open Online Courses (MOOCs) enable easy access to many educational materials, particularly lecture slides, on the web. Searching through them based on user queries becomes an essential problem due to the availability of such vast information. To address this, we present the Lecture Slide Deck Search Engine, a model that supports natural language queries and hand-drawn sketches and performs searches over a large collection of slide images on computer science topics. This search engine is trained using a novel semantic...

10.1109/wacv57701.2024.00591 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03

Sparse representation based image restoration techniques have been shown to be successful in solving various inverse problems such as denoising, inpainting, and super-resolution on natural images and videos. In this paper, we explore the use of sparse representation based methods specifically to restore degraded document images. While natural images form a very small subset of all possible images admitting the possibility of sparse representation, document images are significantly more restricted and are expected to be ideally suited for such a representation. However, the binary nature of textual...

10.1109/icdar.2013.146 article EN 2013-08-01
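The core mechanism behind such restoration methods, approximating a degraded patch by a few atoms from a dictionary, can be sketched with greedy matching pursuit. The 4-pixel "patches" and two-atom dictionary below are toy stand-ins for the learned dictionaries such methods use.

```python
# Hedged sketch of sparse-representation restoration: approximate a noisy
# patch by its best K-term combination over a small dictionary of unit-norm
# atoms (greedy matching pursuit). Atoms and patches are synthetic toys.

def matching_pursuit(signal, atoms, k=1):
    """Greedy K-sparse approximation of `signal` over unit-norm `atoms`."""
    residual = list(signal)
    approx = [0.0] * len(signal)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        coeffs = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(coeffs[i]))
        for i in range(len(signal)):
            approx[i] += coeffs[j] * atoms[j][i]
            residual[i] -= coeffs[j] * atoms[j][i]
    return approx

# Two unit-norm atoms standing in for clean stroke patterns.
atoms = [[0.5, -0.5, 0.5, -0.5],   # vertical-edge pattern
         [0.5, 0.5, -0.5, -0.5]]   # horizontal-edge pattern
noisy = [0.6, -0.4, 0.45, -0.55]   # vertical-edge patch plus noise
print([round(v, 3) for v in matching_pursuit(noisy, atoms, k=1)])
# -> [0.5, -0.5, 0.5, -0.5]
```

Because the degraded patch is re-expressed using only clean dictionary atoms, the noise that cannot be represented by any atom is discarded, which is exactly the restoration effect the abstract refers to.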

Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query relationship as a <subject, predicate, object> triplet and a test video, our objective is to localize the subject and object that are connected via the predicate. With modern visio-lingual understanding capabilities, solving this problem is achievable, provided there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly...

10.1109/cvpr52729.2023.00227 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

10.1007/s13735-018-0160-4 article EN International Journal of Multimedia Information Retrieval 2018-11-16

This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied complexities between them. These inter-object relationships provide strong contextual cues towards improving grounding performance compared to a traditional object query-only-based localization task. A scene graph is an efficient and structured way to represent all the objects and their relationships in the image. In an attempt towards bridging...

10.1109/wacv56688.2023.00437 article EN 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023-01-01