- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Handwritten Text Recognition Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Visual Attention and Saliency Detection
- Face Recognition and Analysis
- Topic Modeling
- Image Retrieval and Classification Techniques
- Vehicle License Plate Recognition
- Natural Language Processing Techniques
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Image and Signal Denoising Methods
- Digital Media Forensic Detection
- 3D Shape Modeling and Analysis
- Generative Adversarial Networks and Image Synthesis
- Algorithms and Data Compression
- Face and Expression Recognition
- Advanced Image Processing Techniques
- Management and Marketing Education
- Information Retrieval and Data Mining
- Hand Gesture Recognition Systems
- Intelligent Tutoring Systems and Adaptive Learning
- Image Enhancement Techniques
Indian Institute of Technology Jodhpur
2019-2024
Office for Students
2023
Sinhgad Dental College and Hospital
2023
Indian Institute of Science Bangalore
2018-2019
Indian Institute of Technology Hyderabad
2011-2017
International Institute of Information Technology, Hyderabad
2016
Center for Visual Communication (United States)
2012
Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections in the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained...
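The joint model over character detections can be illustrated as a chain CRF whose exact minimizer is found by dynamic programming. This is a toy sketch, not the paper's implementation: `infer_word`, its unary score dictionaries, and the bigram cost function are hypothetical stand-ins for the learned detection scores and lexicon-based interaction terms.

```python
# Toy chain-CRF inference over per-position character detections.
# unary: list of dicts {char: detection cost}; bigram_cost(a, b): pairwise
# cost of placing character b right after character a. Lower cost is better.

def infer_word(unary, bigram_cost):
    n = len(unary)
    # best[i][c] = (min cost of labelling positions 0..i ending in c, backpointer)
    best = [{c: (cost, None) for c, cost in unary[0].items()}]
    for i in range(1, n):
        layer = {}
        for c, uc in unary[i].items():
            # cheapest predecessor for character c at position i
            prev, pc = min(
                ((p, best[i - 1][p][0] + bigram_cost(p, c)) for p in best[i - 1]),
                key=lambda t: t[1],
            )
            layer[c] = (uc + pc, prev)
        best.append(layer)
    # backtrack from the cheapest final character
    c = min(best[-1], key=lambda k: best[-1][k][0])
    word = [c]
    for i in range(n - 1, 0, -1):
        c = best[i][c][1]
        word.append(c)
    return "".join(reversed(word))
```

With ambiguous detections, a lexicon-style bigram cost can override a slightly weaker unary score, which is the point of the joint formulation.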
The problem of answering questions about an image is popularly known as visual question answering (or VQA in short). It is a well-established problem in computer vision. However, none of the VQA methods currently utilize the text often present in the image. These "texts in images" provide additional useful cues and facilitate better understanding of the image content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR. We refer to this problem as OCR-VQA. To facilitate a systematic way of studying this new problem, we introduce a large-scale dataset,...
Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natural Language Processing and Artificial Intelligence (AI). In conventional VQA, one may ask questions about an image which can be answered purely based on its content. For example, given an image with people in it, a typical VQA question may inquire about the number of people in the image. More recently, there is growing interest in answering questions which require commonsense knowledge involving common nouns (e.g., cats, dogs, microphones) present...
Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections due to large intra-class variations, even more so than characters in scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we represent the scene text image and synthetic images generated...
Inspired by the success of MRF models for solving object segmentation problems, we formulate the binarization problem in this framework. We represent the pixels of a document image as random variables in an MRF, and introduce a new energy (or cost) function on these variables. Each variable takes a foreground or background label, and the quality of the labelling is determined by the value of the energy function. We minimize the energy function, i.e. find the optimal binarization, using an iterative graph cut scheme. Our model is robust to variations in colours. We use...
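The energy formulation can be sketched on a tiny grayscale grid. The paper minimizes its energy with an iterative graph cut; as a much simpler stand-in, the sketch below uses iterated conditional modes (ICM), with a squared-distance unary term and a Potts smoothness term. All names and parameter values here are hypothetical, not the paper's.

```python
# Toy MRF binarization via ICM (a simple stand-in for the paper's graph cut).
# img: 2D list of grayscale values in [0, 1]; label 0 = foreground (dark text),
# label 1 = background. Unary: squared distance to the class mean; pairwise:
# Potts penalty `smooth` for each disagreeing 4-connected neighbour.

def binarize(img, fg=0.0, bg=1.0, smooth=0.5, iters=10):
    h, w = len(img), len(img[0])
    means = (fg, bg)
    # initialize each pixel with the cheaper unary label
    lab = [[0 if abs(img[y][x] - fg) < abs(img[y][x] - bg) else 1
            for x in range(w)] for y in range(h)]
    for _ in range(iters):
        changed = False
        for y in range(h):
            for x in range(w):
                costs = []
                for l in (0, 1):
                    u = (img[y][x] - means[l]) ** 2
                    p = sum(smooth for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0))
                            if 0 <= y + dy < h and 0 <= x + dx < w
                            and lab[y + dy][x + dx] != l)
                    costs.append(u + p)
                new = 0 if costs[0] <= costs[1] else 1
                if new != lab[y][x]:
                    lab[y][x], changed = new, True
        if not changed:
            break
    return lab
```

On a patch of dark text pixels with one bright noise pixel, the smoothness term pulls the outlier back to the foreground label, which a per-pixel threshold would not do.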
We present an approach for the text-to-image retrieval problem based on the textual content present in images. Given recent developments in understanding text in images, an appealing way to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being built on state-of-the-art methods, is insufficient, and propose a method where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of the characters in the text query, and then impose spatial...
Text present in images is not merely a set of strings; it provides useful cues about the image. Despite their utility for better image understanding, scene texts are not used in traditional visual question answering (VQA) models. In this work, we present a VQA model which can read scene texts and perform reasoning over a knowledge graph to arrive at an accurate answer. Our proposed model has three mutually interacting modules: i. a proposal module to get word and visual content proposals from the image, ii. a fusion module to fuse these proposals with the question and a knowledge base, and mine relevant facts,...
Computer programming textbooks and software documentation often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them into executable Python code. To this end, inspired by the recent success of the natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for...
We present an approach for automatically identifying the script of text localized in scene images. Our approach is inspired by recent advancements in mid-level features. We represent the images using features which are pooled from densely computed local features. Once represented with the proposed feature representation, we use an off-the-shelf classifier to identify the script of the image. Our approach is efficient and requires very little labeled data. We evaluate the performance of our method on the recently introduced CVSI dataset, demonstrating that it can correctly identify the script in 96.70% of the cases. In...
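The pool-then-classify pipeline can be sketched in a few lines: max-pool densely computed local descriptors into one image-level vector, then apply a simple off-the-shelf classifier (here, nearest centroid). The descriptors, script names, and function names below are all hypothetical stand-ins for the learned mid-level features used in the paper.

```python
# Toy script identification: max-pool local descriptors per image, then
# classify with a nearest-centroid classifier in the pooled feature space.

def pool(descriptors):
    # element-wise max over all local descriptors of one image
    return [max(col) for col in zip(*descriptors)]

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def fit(train):
    # train: {script_name: [list of per-image descriptor sets]}
    return {s: centroid([pool(d) for d in imgs]) for s, imgs in train.items()}

def predict(model, descriptors):
    v = pool(descriptors)
    # nearest centroid by squared Euclidean distance
    return min(model, key=lambda s: sum((a - b) ** 2 for a, b in zip(v, model[s])))
```

Because all the learning is in the feature representation, the classifier itself can stay this simple, which is what keeps the labeled-data requirement low.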
We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as context. For such a setting, a model must first retrieve relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). RETVQA is distinctively different from, and more challenging than, the traditionally-studied Visual Question Answering (VQA), where a question is answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our...
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch...
In this work, we study the one-shot video object localization problem, which aims to localize instances of unseen objects in a target video using a single query image of the object. Toward addressing this challenging problem, we extend a popular and successful object detection method, namely DETR (Detection Transformer), and introduce a novel approach – a query-guided detection transformer for videos (QDETRv). A distinctive feature of QDETRv is its capacity to exploit information from the spatio-temporal context of the video, which significantly aids in precisely...
Massive Open Online Courses (MOOCs) enable easy access to many educational materials, particularly lecture slides, on the web. Searching through them based on user queries becomes an essential problem due to the availability of such vast information. To address this, we present the Lecture Slide Deck Search Engine – a model that supports natural language queries and hand-drawn sketches and performs searches over a large collection of slide images on computer science topics. This search engine is trained using a novel semantic...
Sparse representation based image restoration techniques have been shown to be successful in solving various inverse problems such as denoising, inpainting, and super-resolution on natural images and videos. In this paper, we explore the use of sparse representation based methods specifically to restore degraded document images. While natural images form a very small subset of all possible images, admitting the possibility of sparse representation, document images are significantly more restricted and are expected to be ideally suited for such representation. However, the binary nature of textual...
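The core idea of sparse-representation restoration can be sketched with greedy matching pursuit: approximate a degraded patch with a few dictionary atoms and use the reconstruction from those atoms as the restored patch. Work in this area typically uses OMP with a learned dictionary; the function, dictionary, and sparsity level below are toy stand-ins under that assumption.

```python
# Toy sparse coding via greedy matching pursuit over unit-norm atoms.
# signal: the (degraded) patch as a flat list; atoms: list of unit-norm
# dictionary atoms; k: number of greedy selection steps (sparsity level).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, atoms, k=2):
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(k):
        # pick the atom most correlated with the current residual
        i = max(range(len(atoms)), key=lambda j: abs(dot(residual, atoms[j])))
        c = dot(residual, atoms[i])
        coeffs[i] += c
        residual = [r - c * a for r, a in zip(residual, atoms[i])]
    # reconstruction from the selected atoms acts as the restored signal
    recon = [sum(coeffs[j] * atoms[j][t] for j in range(len(atoms)))
             for t in range(len(signal))]
    return coeffs, recon
```

With k smaller than the signal dimension, components of the input that no atom explains (e.g. noise or degradation) are left in the residual and dropped from the reconstruction.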
Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query relationship as a <subject, predicate, object> triplet and a test video, our objective is to localize the subject and object that are connected via the predicate. With modern visio-lingual understanding capabilities, solving this problem is achievable, provided there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly...
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied complexities between them. These inter-object relationships provide strong contextual cues towards improving grounding performance compared to the traditional object query-only-based localization task. A scene graph is an efficient and structured way to represent all the objects and their relationships in the image. In an attempt at bridging...