- Advanced Image and Video Retrieval Techniques
- Handwritten Text Recognition Techniques
- Image Retrieval and Classification Techniques
- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Robotics and Sensor-Based Localization
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Vehicle License Plate Recognition
- Image Processing and 3D Reconstruction
- Face Recognition and Analysis
- Topic Modeling
- Hand Gesture Recognition Systems
- Music and Audio Processing
- Algorithms and Data Compression
- Face and Expression Recognition
- Anomaly Detection Techniques and Applications
- Image Processing Techniques and Applications
- Speech and Audio Processing
- Advanced Image Processing Techniques
- Digital Media Forensic Detection
Indian Institute of Technology Hyderabad (2015-2024)
International Institute of Information Technology, Hyderabad (2015-2024)
International Institute of Information Technology (2004-2024)
Indian Institute of Technology Delhi (2011-2024)
Amrita Vishwa Vidyapeetham (2023)
Indian Institute of Technology Mandi (2023)
International Institute of Islamic Thought (2022)
University of Bath (2021)
Indian Institute of Technology Kanpur (2011-2019)
Chinese University of Hong Kong (2017)
We investigate the fine grained object categorization problem of determining the breed of animal from an image. To this end we introduce a new annotated dataset of pets covering 37 different breeds of cats and dogs. The visual problem is very challenging as these animals, particularly cats, are very deformable and there can be quite subtle differences between the breeds. We make a number of contributions: first, a model to classify a pet breed automatically, which combines shape, captured by a deformable part model detecting the pet face, and appearance, captured by a bag-of-words model that...
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously to learn the part appearance and also to identify the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means...
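The incremental mining loop described above can be sketched as follows. This is an illustrative stand-in, not the paper's code: a closed-form ridge-regression scorer replaces the Exemplar SVM, and the toy 2-D patch features, candidate pool, and hyper-parameters are all assumptions.

```python
import numpy as np

def mine_parts(exemplar, negatives, pool, rounds=3, top_k=2, lam=1.0):
    """Sketch of incremental part mining: start from a single positive
    exemplar, fit a regularised linear scorer against the negatives, then
    add the top-scoring pool patches as new positives each round.
    (The paper trains Exemplar SVMs; ridge regression stands in here.)"""
    positives = [exemplar]
    for _ in range(rounds):
        X = np.vstack([positives, negatives])
        y = np.array([1.0] * len(positives) + [-1.0] * len(negatives))
        # Closed-form ridge solution: (X^T X + lam I)^-1 X^T y
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        # Mine the highest-scoring candidate patches as new positives
        scores = pool @ w
        idx = np.argsort(scores)[::-1][:top_k]
        positives = [exemplar] + [pool[i] for i in idx]
    return w, positives
```

The point of the loop is that new instances are only admitted as training examples once the current scorer ranks them highly, mirroring the "discover and align reliably" step in the abstract.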
Scene text recognition has gained significant attention from the computer vision community in recent years. Recognizing such text is a challenging problem, even more so than the recognition of scanned documents. In this work, we focus on the problem of recognizing text extracted from street images. We present a framework that exploits both bottom-up and top-down cues. The bottom-up cues are derived from individual character detections in the image. We build a Conditional Random Field model on these detections to jointly model the strength of the detections and the interactions between them. We impose top-down cues obtained...
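For a chain-structured model over character detections, the jointly best labeling can be found by Viterbi decoding. The sketch below is a minimal illustration under that assumption; the unary and pairwise scores are toy stand-ins for the detection and interaction potentials, not the paper's learned energies.

```python
def viterbi(unary, pairwise):
    """unary: list of dicts {char: score}, one per character position.
    pairwise: dict[(prev_char, char)] -> bonus for adjacent characters
    (missing pairs default to 0). Returns the max-scoring word."""
    best = dict(unary[0])     # best score of any path ending in each char
    back = [dict()]           # backpointers per position
    for t in range(1, len(unary)):
        cur, bp = {}, {}
        for c, u in unary[t].items():
            prev, score = max(
                ((p, s + pairwise.get((p, c), 0.0)) for p, s in best.items()),
                key=lambda kv: kv[1])
            cur[c] = score + u
            bp[c] = prev
        best, back = cur, back + [bp]
    # Backtrack from the best final character
    c = max(best, key=best.get)
    word = [c]
    for bp in reversed(back[1:]):
        c = bp[c]
        word.append(c)
    return "".join(reversed(word))
```

With a lexicon-derived bonus on plausible bigrams, the decoder can prefer a coherent word over the greedy per-character choices.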
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...
While several datasets for autonomous navigation have become available in recent years, they have tended to focus on structured driving environments. This usually corresponds to well-delineated infrastructure such as lanes, a small number of well-defined categories of traffic participants, low variation in object or background appearance and strong adherence to traffic rules. We propose IDD, a novel dataset for road scene understanding in unstructured environments where the above assumptions are largely not satisfied. It...
In this work, we address the problem of cross-modal retrieval in the presence of multi-label annotations. In particular, we introduce multi-label Canonical Correlation Analysis (ml-CCA), an extension of CCA, for learning shared subspaces taking into account high level semantic information in the form of multi-label annotations. Unlike CCA, ml-CCA does not rely on explicit pairing between modalities; instead it uses the multi-label information to establish correspondences. This results in a discriminative subspace which is better suited for cross-modal retrieval tasks. We also present Fast ml-CCA, a computationally...
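As background, classical two-view CCA (which ml-CCA extends) can be computed by whitening each view and taking an SVD of the cross-covariance. This minimal numpy sketch assumes dense features and adds a small ridge term for numerical stability; it is plain CCA, not the paper's multi-label variant.

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Classical CCA: returns projections Wx, Wy and the top-k canonical
    correlations for two paired views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten both views via Cholesky factors, then SVD the cross-covariance
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt[:k].T)
    return Wx, Wy, s[:k]
```

Projecting both modalities with Wx and Wy places them in the shared subspace where nearest-neighbour retrieval is performed.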
The ICDAR 2019 Challenge on "Scanned receipts OCR and key information extraction" (SROIE) covers important aspects related to the automated analysis of scanned receipts. The SROIE tasks play a key role in many document analysis systems and hold significant commercial potential. Although a lot of work has been published over the years on administrative document analysis, the community has advanced relatively slowly, as most datasets have been kept private. One of the contributions of SROIE is to offer a first, standardized dataset of 1000 whole scanned receipt images...
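As a toy illustration of the key information extraction task (challenge entries use learned models, not rules), a regex baseline for pulling a date and a total out of OCR'd receipt text might look like this; the field names and patterns are illustrative assumptions.

```python
import re

def extract_fields(receipt_text):
    """Toy key-information extractor for receipt text: finds the first
    date-like token and the amount following the word 'total'."""
    fields = {}
    m = re.search(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b", receipt_text)
    if m:
        fields["date"] = m.group(1)
    m = re.search(r"(?i)total\s*[:$]?\s*(\d+(?:\.\d{2})?)", receipt_text)
    if m:
        fields["total"] = m.group(1)
    return fields
```

A rule baseline like this breaks quickly on layout and OCR noise, which is exactly the gap the SROIE benchmark was designed to measure.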
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning...
We focus on the problem of wearer's action recognition in first person, a.k.a. egocentric, videos. This is more challenging than third person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful for limited targeted datasets. We propose convolutional neural networks (CNNs) for end to end learning and classification of wearer's actions. The proposed network makes use of egocentric cues, capturing hand pose and saliency map...
Histopathological images contain morphological markers of disease progression that have diagnostic and predictive values. In this study, we demonstrate how a deep learning framework can be used for automatic classification of Renal Cell Carcinoma (RCC) subtypes, and for identification of features that predict survival outcome from digital histopathological images. Convolutional neural networks (CNNs) trained on whole-slide images distinguish clear cell and chromophobe RCC from normal tissue with an accuracy of 93.39%...
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding...
Road network extraction from satellite images often produces fragmented road segments, leading to road maps unfit for real applications. Pixel-wise classification fails to predict topologically correct and connected road masks due to the absence of connectivity supervision and the difficulty in enforcing topological constraints. In this paper, we propose a connectivity task called Orientation Learning, motivated by the human behavior of annotating roads by tracing them at a specific orientation. We also develop a stacked multi-branch...
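The orientation supervision can be illustrated by quantizing the direction of each traced road segment into discrete classes. This is a sketch under stated assumptions: the bin count, angle convention, and function name are illustrative, not the paper's specification.

```python
import math

def orientation_bin(p, q, n_bins=36):
    """Quantise the direction of road segment p -> q (2-D points) into one
    of n_bins orientation classes covering 0-360 degrees."""
    ang = math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360.0
    return int(ang // (360.0 / n_bins))
```

Predicting such per-pixel orientation classes alongside the segmentation mask gives the network a connectivity signal that plain pixel-wise labels lack.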
In this paper, we address the problem of automatically generating human-like descriptions for unseen images, given a collection of images and their corresponding human-generated descriptions. Previous attempts at this task mostly rely on visual clues and corpus statistics, but do not take much advantage of the semantic information inherent in the available image descriptions. Here, we present a generic method which benefits from all these three sources (i.e. visual clues, corpus statistics and available descriptions) simultaneously, and is capable of constructing novel...
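One minimal way to combine such sources is to score candidate (object, action, scene) triples with visual evidence plus corpus statistics and render the winner into a fixed linguistic template. The additive scoring rule and the template below are illustrative assumptions, not the paper's method.

```python
def describe(triples, corpus_freq):
    """triples: list of ((object, action, scene), visual_score) candidates.
    corpus_freq: dict mapping a triple to a corpus-statistics bonus.
    Picks the best-scoring triple and fills a fixed sentence template."""
    def score(item):
        (o, a, s), vis = item
        return vis + corpus_freq.get((o, a, s), 0.0)
    (o, a, s), _ = max(triples, key=score)
    return f"A {o} is {a} in the {s}."
```

The corpus bonus lets linguistically plausible triples override slightly stronger but implausible visual detections.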
Depth information has been shown to affect the identification of visually salient regions in images. In this paper, we investigate the role of depth in saliency detection in the presence of (i) competing saliencies due to appearance, (ii) depth-induced blur and (iii) centre-bias. Having established through experiments that depth continues to be a significant contributor to saliency in the presence of these cues, we propose a 3D-saliency formulation that takes into account the structural features of objects in an indoor setting to identify regions at salient depth levels. The computed 3D-saliency is used...
The success of deep learning based models has centered around recent architectures and the availability of large scale annotated data. In this work, we explore these two factors systematically for improving handwritten word recognition on scanned off-line document images. We propose a modified CNN-RNN hybrid architecture with a major focus on effective training using: (i) efficient initialization of the network using synthetic data for pretraining, (ii) image normalization for slant correction and (iii) domain specific...
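Slant correction of the kind mentioned in (ii) is typically a horizontal shear of the ink. The sketch below picks the shear angle that makes the ink most upright, using overall horizontal spread as a deliberately simplified criterion (real deslanting methods use per-column vertical-run statistics); the names and search range are assumptions.

```python
import math

def deslant(pixels, angle_deg):
    """Shear ink pixel coordinates (x, y) horizontally to undo a slant of
    angle_deg degrees; y is measured downward from the top of the image."""
    t = math.tan(math.radians(angle_deg))
    return [(x - y * t, y) for x, y in pixels]

def estimate_slant(pixels, candidates=range(-45, 46)):
    """Pick the shear angle that makes near-vertical strokes most upright,
    approximated here by minimising the horizontal spread of the sheared
    ink. Real methods score per-column vertical-run histograms instead."""
    def spread(angle):
        xs = [x for x, _ in deslant(pixels, angle)]
        return max(xs) - min(xs)
    return min(candidates, key=spread)
```

Applying `deslant(pixels, estimate_slant(pixels))` normalises the word image before it is fed to the recognizer.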
Concerns on the widespread use of biometric authentication systems are primarily centered around template security, revocability, and privacy. The use of cryptographic primitives to bolster the authentication process can alleviate some of these concerns, as shown by biometric cryptosystems. In this paper, we propose a provably secure and blind protocol, which addresses the user's privacy,...
Template-based object detectors such as the deformable parts model of Felzenszwalb et al. [11] achieve state-of-the-art performance for a variety of object categories, but are still outperformed by simpler bag-of-words models for highly flexible objects such as cats and dogs. In these cases we propose to use the template-based model to detect a distinctive part of the object class, followed by detecting the rest of the object via segmentation on image specific information learnt from that part. This approach is motivated by two observations: (i) many object classes...
In recent years, the need for semantic segmentation has arisen across several different applications and environments. However, the expense and redundancy of annotation often limits the quantity of labels available for training in any domain, while deployment is easier if a single model works well across domains. In this paper, we pose the novel problem of universal semi-supervised semantic segmentation and propose a solution framework, to meet the dual needs of lower annotation and deployment costs. In contrast to counterpoints such as fine tuning, joint training or unsupervised domain adaptation,...
This paper presents a Bag of Visual Words (BoVW) based approach to retrieve similar word images from a large database, efficiently and accurately. We show that a text retrieval system can be adapted to build a word image retrieval solution. This helps in achieving scalability. We demonstrate the method on more than 1 Million word images with sub-second retrieval time. We validate the method on four Indian languages, and report a mean average precision of 0.75. We represent a word image as a histogram of the visual words present in the image. These are quantized representations of local regions; for this work,...
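The retrieval pipeline above (quantize local descriptors to visual words, build a histogram, rank by similarity) can be sketched in a few lines. The 2-D descriptors and two-word vocabulary below are toy stand-ins for SIFT-like features and a large learned codebook.

```python
import math

def quantize(desc, centroids):
    """Index of the nearest visual word by Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(desc, centroids[i])))

def bovw_histogram(descriptors, centroids):
    """Histogram of visual-word occurrences for one image."""
    hist = [0.0] * len(centroids)
    for d in descriptors:
        hist[quantize(d, centroids)] += 1
    return hist

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)) or 1.0
    return num / den

def retrieve(query_desc, database, centroids):
    """database: list of (image_id, descriptors); returns ids ranked by
    cosine similarity between BoVW histograms."""
    q = bovw_histogram(query_desc, centroids)
    scored = [(cosine(q, bovw_histogram(d, centroids)), iid) for iid, d in database]
    return [iid for _, iid in sorted(scored, reverse=True)]
```

In a real system the histograms would be tf-idf weighted and served from an inverted index, which is what makes sub-second retrieval over a million images feasible.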
Recognizing text in images taken in the wild is a challenging problem that has received great attention in recent years. Previous methods addressed this problem by first detecting individual characters, and then forming them into words. Such approaches often suffer from weak character detections due to large intra-class variations, even more so than characters in scanned documents. We take a different view of the problem and present a holistic word recognition framework. In this, we represent the scene text image and synthetic images generated...
Recognizing human faces in the wild is emerging as a critically important, and technically challenging, computer vision problem. With a few notable exceptions, most previous works over the last several decades have focused on recognizing faces captured in a laboratory setting. However, with the introduction of databases such as LFW and Pubfig, the face recognition community is gradually shifting its focus to much more unconstrained settings. Since its introduction, the verification benchmark has been getting a lot of attention from various researchers...
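Verification benchmarks like LFW reduce to a same/different decision on a face pair. A common recipe is to threshold the cosine similarity of face embeddings, as sketched below; the embedding vectors and the threshold value are illustrative assumptions, not tied to any particular system.

```python
import math

def verify(emb_a, emb_b, threshold=0.5):
    """Same/different face decision by thresholding cosine similarity of
    two embedding vectors. Returns (decision, similarity)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    na = math.sqrt(sum(a * a for a in emb_a))
    nb = math.sqrt(sum(b * b for b in emb_b))
    sim = dot / (na * nb)
    return sim >= threshold, sim
```

The threshold is normally tuned on a held-out split to trade off false accepts against false rejects.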
Chinese scene text reading is one of the most challenging problems in computer vision and has attracted great interest. Different from English text, Chinese has more than 6000 commonly used characters, which can be arranged in various layouts with numerous fonts. The signboards in street view are a good choice for Chinese scene text images since they have different backgrounds, fonts and layouts. We organized a competition called ICDAR2019-ReCTS, which mainly focuses on reading Chinese text on signboards. This report presents the final results of the competition. A...
Deep convolutional features for word images and textual embedding schemes have shown great success in word spotting. In this work, we follow these motivations to propose an End2End embedding framework which jointly learns both the text and word image embeddings using state of the art deep architectures. The three major contributions of this work are: (i) an embedding scheme to learn a common representation for word images and their labels, (ii) building a compact descriptor and demonstrating its utility as an off-the-shelf descriptor for word spotting, (iii) use of synthetic data as a complementary modality...
Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequence to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark...