C. V. Jawahar

ORCID: 0000-0001-6767-7057
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Handwritten Text Recognition Techniques
  • Image Retrieval and Classification Techniques
  • Multimodal Machine Learning Applications
  • Natural Language Processing Techniques
  • Advanced Vision and Imaging
  • Video Analysis and Summarization
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Robotics and Sensor-Based Localization
  • Advanced Neural Network Applications
  • Video Surveillance and Tracking Methods
  • Vehicle License Plate Recognition
  • Image Processing and 3D Reconstruction
  • Face recognition and analysis
  • Topic Modeling
  • Hand Gesture Recognition Systems
  • Music and Audio Processing
  • Algorithms and Data Compression
  • Face and Expression Recognition
  • Anomaly Detection Techniques and Applications
  • Image Processing Techniques and Applications
  • Speech and Audio Processing
  • Advanced Image Processing Techniques
  • Digital Media Forensic Detection

Indian Institute of Technology Hyderabad
2015-2024

International Institute of Information Technology, Hyderabad
2015-2024

International Institute of Information Technology
2004-2024

Indian Institute of Technology Delhi
2011-2024

Amrita Vishwa Vidyapeetham
2023

Indian Institute of Technology Mandi
2023

International Institute of Islamic Thought
2022

University of Bath
2021

Indian Institute of Technology Kanpur
2011-2019

Chinese University of Hong Kong
2017

We investigate the fine-grained object categorization problem of determining the breed of an animal from an image. To this end, we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable, and there can be quite subtle differences between breeds. We make a number of contributions: first, a model to classify a pet breed automatically, which combines shape, captured by a part model detecting the pet face, and appearance, captured by a bag-of-words model that...

10.1109/cvpr.2012.6248092 article EN 2012 IEEE Conference on Computer Vision and Pattern Recognition 2012-06-01

The automatic discovery of distinctive parts for an object or scene class is challenging, since it requires simultaneously learning the part appearance and identifying the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address the problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means...

10.1109/cvpr.2013.124 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01
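The incremental mining loop described above can be sketched in a few lines. This is a toy illustration only: a normalized mean template stands in for the Exemplar SVM, descriptors are plain vectors, and the threshold and round count are invented values, not the paper's.

```python
import numpy as np

def mine_part_instances(seed, candidates, thresh=0.8, rounds=3):
    """Grow a set of part instances from a single seed descriptor.

    seed:       (d,) descriptor of the initial part occurrence
    candidates: (n, d) descriptors of candidate patches
    A mean template replaces the Exemplar SVM of the paper; thresh and
    rounds are illustrative hyper-parameters.
    """
    template = seed / np.linalg.norm(seed)
    members = [seed]
    for _ in range(rounds):
        # Cosine similarity of every candidate against the current template
        scores = candidates @ template / np.linalg.norm(candidates, axis=1)
        keep = candidates[scores > thresh]
        if len(keep) == 0:
            break
        # Accepted instances are folded back in before the next round
        members = [seed] + list(keep)
        template = np.mean(members, axis=0)
        template /= np.linalg.norm(template)
    return np.array(members)
```

Each round only admits candidates that already align well with the current model, mirroring the "discover and align reliably before using as training examples" strategy.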

Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections in the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose the top-down cues obtained...

10.1109/cvpr.2012.6247990 preprint EN 2012 IEEE Conference on Computer Vision and Pattern Recognition 2012-06-01
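For a left-to-right chain of character detections, minimizing a CRF energy of unary (appearance) plus pairwise (interaction) costs reduces to Viterbi decoding. A minimal sketch with illustrative cost matrices, not the paper's actual energy terms:

```python
import numpy as np

def viterbi_chain(unary, pairwise):
    """Minimize sum_t unary[t, s_t] + sum_t pairwise[s_{t-1}, s_t].

    unary:    (T, S) per-detection character costs (hypothetical values)
    pairwise: (S, S) transition costs, e.g. from character co-occurrence
    Returns the minimum-energy label sequence as a list of label ids.
    """
    T, S = unary.shape
    cost = unary[0].copy()          # best cost ending in each label so far
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # total[p, s] = cost of being in p at t-1, moving to s at t
        total = cost[:, None] + pairwise + unary[t][None, :]
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0)
    # Trace the best path backwards
    seq = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]
```

The actual model in the paper is richer (it handles competing, overlapping detections rather than a fixed chain), but the joint unary-plus-pairwise minimization is the same idea.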

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations in 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...

10.1109/cvpr52688.2022.01842 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories of traffic participants, low variation in object or background appearance, and strong adherence to traffic rules. We propose IDD, a novel dataset for road scene understanding in unstructured environments where the above assumptions are largely not satisfied. It...

10.1109/wacv.2019.00190 article EN 2019-01-01

In this work, we address the problem of cross-modal retrieval in the presence of multi-label annotations. In particular, we introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, for learning shared subspaces that take into account high-level semantic information in the form of multi-label annotations. Unlike CCA, ml-CCA does not rely on explicit pairing between modalities; instead, it uses the multi-label information to establish correspondences. This results in a discriminative subspace which is better suited for cross-modal retrieval tasks. We also present Fast ml-CCA, a computationally...

10.1109/iccv.2015.466 article EN 2015-12-01
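The two-view CCA that ml-CCA extends can be computed in closed form by whitening each view and taking an SVD of the cross-covariance. A minimal numpy sketch of that base algorithm (the multi-label correspondence weighting of ml-CCA itself is not shown; the regularizer value is an assumption):

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Plain two-view CCA: find k pairs of directions maximizing
    correlation between the projected views.

    X: (n, dx), Y: (n, dy) paired samples.
    Returns projection matrices Wx, Wy and the canonical correlations.
    """
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the correlations
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]
```

When the two views share a latent signal, the leading canonical correlation approaches 1; ml-CCA replaces the explicit sample pairing with correspondences induced by shared labels.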

The ICDAR 2019 Challenge on "Scanned receipts OCR and key information extraction" (SROIE) covers important aspects related to the automated analysis of scanned receipts. The SROIE tasks play a key role in many document analysis systems and hold significant commercial potential. Although a lot of work has been published over the years on administrative document analysis, the community has advanced relatively slowly, as most datasets have been kept private. One of the contributions of SROIE is to offer the first standardized dataset of 1000 whole scanned receipt images...

10.1109/icdar.2019.00244 preprint EN 2019-09-01

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account for both reasoning...

10.1109/iccv.2019.00439 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

We focus on the problem of wearer's action recognition in first person, a.k.a. egocentric, videos. This problem is more challenging than third-person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful on limited targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of wearer's actions. The proposed network makes use of egocentric cues, capturing hand pose and saliency map....

10.1109/cvpr.2016.287 article EN 2016-06-01

Histopathological images contain morphological markers of disease progression that have diagnostic and predictive values. In this study, we demonstrate how a deep learning framework can be used for automatic classification of Renal Cell Carcinoma (RCC) subtypes, and for identification of features that predict survival outcome from digital histopathological images. Convolutional neural networks (CNNs) trained on whole-slide images distinguish clear cell and chromophobe RCC from normal tissue with an accuracy of 93.39%...

10.1038/s41598-019-46718-3 article EN cc-by Scientific Reports 2019-07-19

We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...

10.1109/wacv48630.2021.00225 article EN 2021-01-01

Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications. Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and the difficulty in enforcing topological constraints. In this paper, we propose a task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation. We also develop a stacked multi-branch...

10.1109/cvpr.2019.01063 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual clues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method which benefits from all three sources (i.e. visual clues, corpus statistics and available descriptions) simultaneously, and is capable of constructing novel...

10.1609/aaai.v26i1.8205 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-09-20

Depth information has been shown to affect the identification of visually salient regions in images. In this paper, we investigate the role of depth in saliency detection in the presence of (i) competing saliencies due to appearance, (ii) depth-induced blur and (iii) centre-bias. Having established through experiments that depth continues to be a significant contributor in the presence of these cues, we propose a 3D-saliency formulation that takes into account the structural features of objects in an indoor setting to identify salient regions at depth levels. The computed saliency is used...

10.5244/c.27.98 article EN 2013-01-01
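The interplay of the three cues above can be illustrated with a toy fusion: an appearance saliency map, a "closer is more salient" depth prior, and a Gaussian centre-bias. The weighting and the Gaussian width are invented for illustration and are not the paper's 3D-saliency formulation:

```python
import numpy as np

def center_bias(h, w, sigma=0.25):
    """Gaussian centre-bias map over an h x w grid (sigma is illustrative)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    d2 = ((ys - cy) / h) ** 2 + ((xs - cx) / w) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def depth_saliency(appearance, depth, w_depth=0.5):
    """Toy fusion of appearance saliency with a depth prior and centre-bias.

    appearance: (h, w) appearance saliency in [0, 1]
    depth:      (h, w) depth map (smaller = closer to the camera)
    """
    # Closer regions get a higher prior
    near = 1.0 - (depth - depth.min()) / (np.ptp(depth) + 1e-12)
    s = (1 - w_depth) * appearance + w_depth * near
    return s * center_bias(*appearance.shape)
```

Even this crude combination shows the competition the paper studies: a near object off-centre can lose to a farther object at the image centre depending on the weights.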

The success of deep learning based models has centered around recent architectures and the availability of large-scale annotated data. In this work, we explore these two factors systematically for improving handwritten word recognition in scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain specific...

10.1109/icfhr-2018.2018.00023 article EN 2018-08-01

Concerns about the widespread use of biometric authentication systems are primarily centered around template security, revocability, and privacy. The use of cryptographic primitives to bolster the authentication process can alleviate some of these concerns, as shown by biometric cryptosystems. In this paper, we propose a provably secure and blind protocol, which addresses the user's privacy,...

10.1109/tifs.2010.2043188 article EN IEEE Transactions on Information Forensics and Security 2010-03-02

Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases, we propose to use the template-based model to detect a distinctive part of the object class, followed by detecting the rest of the object via segmentation on image-specific information learnt from that part. This approach is motivated by two observations: (i) many object classes...

10.1109/iccv.2011.6126398 article EN International Conference on Computer Vision 2011-11-01

In recent years, the need for semantic segmentation has arisen across several different applications and environments. However, the expense and redundancy of annotation often limits the quantity of labels available for training in any domain, while deployment is easier if a single model works well across domains. In this paper, we pose the novel problem of universal semi-supervised semantic segmentation and propose a solution framework, to meet the dual needs of lower annotation and deployment costs. In contrast to counterpoints such as fine tuning, joint training or unsupervised domain adaptation,...

10.1109/iccv.2019.00536 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

This paper presents a Bag of Visual Words (BoVW) based approach to retrieve similar word images from a large database, efficiently and accurately. We show that a text retrieval system can be adapted to build a word image retrieval solution. This helps in achieving scalability. We demonstrate the method on more than 1 million word images with sub-second retrieval time. We validate the method on four Indian languages, and report a mean average precision of 0.75. We represent a word image as a histogram of the visual words present in the image. Visual words are quantized representations of local regions; for this work,...

10.1109/das.2012.96 article EN 2012-03-01
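The adaptation of a text retrieval pipeline to word images amounts to building visual-word histograms and ranking with a standard tf-idf weighted similarity. A minimal sketch of that pipeline (the quantization of local descriptors into visual-word ids is assumed to have already happened; the idf variant shown is one common choice, not necessarily the paper's):

```python
import numpy as np

def bovw_histograms(assignments, vocab_size):
    """Build one visual-word histogram per word image.

    assignments: list of integer arrays, each holding the visual-word ids
                 assigned to the local regions of one word image.
    """
    H = np.zeros((len(assignments), vocab_size))
    for i, a in enumerate(assignments):
        np.add.at(H[i], a, 1.0)   # accumulate counts per visual word
    return H

def tfidf_retrieve(H, q):
    """Rank database histograms H against a query histogram q.

    Uses tf-idf weighting followed by cosine similarity, as in
    text retrieval engines. Returns database indices, best first.
    """
    df = (H > 0).sum(0)                      # document frequency per word
    idf = np.log((1 + len(H)) / (1 + df))    # smoothed inverse doc frequency
    Hw, qw = H * idf, q * idf
    sims = Hw @ qw / (np.linalg.norm(Hw, axis=1) * np.linalg.norm(qw) + 1e-12)
    return np.argsort(-sims)
```

Because the ranking only touches sparse histograms, the same inverted-index machinery that makes text search fast applies unchanged, which is where the claimed scalability comes from.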

Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections due to large intra-class variations, even more so than characters from scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we represent the scene text image and synthetic images generated...

10.1109/icdar.2013.87 preprint EN 2013-08-01

Recognizing human faces in the wild is emerging as a critically important, and technically challenging, computer vision problem. With a few notable exceptions, most previous works of the last several decades have focused on recognizing faces captured in a laboratory setting. However, with the introduction of databases such as LFW and Pubfigs, the face recognition community is gradually shifting its focus to much more unconstrained settings. Since its introduction, the verification benchmark has been getting a lot of attention from various researchers...

10.1109/ncvpripg.2013.6776225 article EN 2013-12-01

Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters, which can be arranged in various layouts with numerous fonts. The signboards in street views are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboards. This report presents the final results of the competition. A...

10.1109/icdar.2019.00253 preprint EN 2019-09-01

Deep convolutional features for word images and textual embedding schemes have shown great success in word spotting. In this work, we follow these motivations to propose an End2End framework which jointly learns both the word image and text embeddings using state-of-the-art deep architectures. The three major contributions of this work are: (i) a scheme to jointly learn a common representation for a word image and its label, (ii) building a compact descriptor and demonstrating its utility as an off-the-shelf feature for word spotting, (iii) use of synthetic data as a complementary modality...

10.1109/das.2018.70 article EN 2018-04-01

Humans involuntarily tend to infer parts of a conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip-to-speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip-sequence-to-speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark...

10.1109/cvpr42600.2020.01381 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01