- Video Analysis and Summarization
- Advanced Neural Network Applications
- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Video Surveillance and Tracking Methods
- Advanced Image and Video Retrieval Techniques
- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Visual Attention and Saliency Detection
- Human Pose and Action Recognition
- Image and Video Stabilization
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Face Recognition and Analysis
- Robotics and Sensor-Based Localization
- Autonomous Vehicle Technology and Safety
- Topic Modeling
- Image Processing Techniques and Applications
- Human Motion and Animation
- Image Enhancement Techniques
- Advanced Image Processing Techniques
- Anomaly Detection Techniques and Applications
- Multisensory Perception and Integration
Indian Institute of Technology Hyderabad
2016-2025
International Institute of Information Technology, Hyderabad
2017-2025
International Institute of Information Technology
2016-2021
Bentley University
2015
Narrative (Sweden)
2015
Laboratoire Jean Kuntzmann
2014
Institut national de recherche en informatique et en automatique
2012-2013
Centre Inria de l'Université Grenoble Alpes
2012
We propose ViNet, a fully convolutional encoder-decoder architecture for audio-visual saliency prediction. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms state-of-the-art audio-visual saliency prediction models on nine different datasets (three...
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity...
The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than that of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to...
In this paper, we propose a fully automatic method to register football broadcast video frames on the static top-view model of the playing surface. Automatic registration has been difficult due to the difficulty of finding sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest-neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic generation allows us to exhaustively...
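The dictionary-based formulation above can be illustrated with a minimal sketch. All names, the toy binary edge maps, and the Hamming-distance matching are illustrative assumptions; the paper's actual edge features and distance measure are richer:

```python
from math import inf

def nearest_homography(query_edges, dictionary):
    """Return the homography whose synthetic edge map best matches the query.

    dictionary: list of (edge_map, homography) pairs, where each edge map is a
    flat tuple of 0/1 pixels. Hamming distance stands in for the edge-based
    matching cost used in practice.
    """
    best_h, best_d = None, inf
    for edge_map, homography in dictionary:
        d = sum(a != b for a, b in zip(edge_map, query_edges))
        if d < best_d:
            best_h, best_d = homography, d
    return best_h

# Toy dictionary: two synthetic edge maps paired with 3x3 homographies.
H_identity = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
H_shift = ((1, 0, 5), (0, 1, 0), (0, 0, 1))
dictionary = [
    ((1, 1, 0, 0), H_identity),
    ((0, 0, 1, 1), H_shift),
]

# A query edge map closest to the second entry retrieves H_shift.
retrieved = nearest_homography((0, 1, 1, 1), dictionary)
```

In the actual method, the dictionary is generated by synthetically warping the top-view field model under many candidate homographies, so the retrieved pair directly yields a registration estimate.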
Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping the labeling costs entirely required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID...
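The pseudo-labeling step described above can be sketched in a few lines. The helper names and the toy tracker output are illustrative assumptions; only the idea (tracker IDs become classification targets, trained with cross-entropy) comes from the abstract:

```python
import math

def pseudo_labels(tracker_output):
    """Map tracker track IDs to contiguous class labels 0..K-1.

    tracker_output: list of (crop_id, track_id) pairs, e.g. from SORT.
    """
    id_to_class = {}
    labels = []
    for crop_id, track_id in tracker_output:
        cls = id_to_class.setdefault(track_id, len(id_to_class))
        labels.append((crop_id, cls))
    return labels

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample (pure-Python stand-in for a
    framework loss), computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# SORT assigns track IDs 7 and 9; they become ReID classes 0 and 1.
labels = pseudo_labels([("crop0", 7), ("crop1", 9), ("crop2", 7)])
# Uniform logits over 2 classes give loss log(2).
loss = cross_entropy([0.0, 0.0], target=labels[0][1])
```

A real ReID network would then be trained to predict these pseudo-classes from the person crops; at test time its embedding, not the classifier head, is used for re-identification.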
This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, obtained by averaging the predicted saliency maps, achieves state-of-the-art performance...
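The ensembling rule is simple elementwise averaging of the two models' predicted saliency maps; a minimal sketch (the function name and toy 2D maps are illustrative):

```python
def ensemble_saliency(map_a, map_b):
    """Average two predicted saliency maps elementwise, the ensembling
    rule described for combining ViNet-S and ViNet-A outputs."""
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]

# Two toy 1x2 saliency maps from the two models.
s = ensemble_saliency([[0.2, 0.8]], [[0.4, 0.6]])
```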
We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual shots, termed rushes, are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand the key scene elements that guide the editing process, we employ a two-pronged approach: (1) a large language model...
Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles (yaw, pitch, roll) from an input image of a human face. Annotating ground truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provide only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches learn a regression function directly...
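The (yaw, pitch, roll) parameterization mentioned above corresponds to Tait-Bryan angles of a head rotation matrix. As a worked sketch, here is the round trip between angles and a rotation matrix; the ZYX composition order is an assumption for illustration (head pose papers vary in axis conventions):

```python
import math

def rot_zyx(yaw, pitch, roll):
    """Compose R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (Tait-Bryan ZYX)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def euler_from_matrix(R):
    """Recover (yaw, pitch, roll) from a ZYX rotation matrix
    (valid away from the gimbal-lock case |pitch| = 90 degrees)."""
    pitch = -math.asin(R[2][0])
    yaw = math.atan2(R[1][0], R[0][0])
    roll = math.atan2(R[2][1], R[2][2])
    return yaw, pitch, roll

# Round trip: angles -> matrix -> angles.
angles = euler_from_matrix(rot_zyx(0.3, -0.1, 0.2))
```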
Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to models more complex than necessary. The complexity, in turn, hinders application requirements. In this paper, we identify four key components of saliency models, i.e.,...
Multi-View Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art...
We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming that the important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into...
We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle recording or a close-up movie sequence captured with a static...
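The cropping-window idea above can be sketched minimally: center a window on per-frame gaze points, clamp it to the frame, and smooth the resulting path. The moving-average smoothing here is a deliberately simple stand-in for the paper's optimization, which solves for cut, pan and zoom jointly; all names and numbers are illustrative:

```python
def crop_window_path(gaze_x, crop_w, frame_w, smooth=3):
    """Return per-frame left edges of a crop window that follows gaze.

    Each window is centered on the gaze point, clamped inside the frame,
    then the path is smoothed with a moving average (a crude proxy for an
    optimized cinematic pan).
    """
    raw = [min(max(g - crop_w / 2, 0), frame_w - crop_w) for g in gaze_x]
    path = []
    for i in range(len(raw)):
        lo, hi = max(0, i - smooth // 2), min(len(raw), i + smooth // 2 + 1)
        path.append(sum(raw[lo:hi]) / (hi - lo))
    return path

# Gaze jumps from the left to the right of a 1280-wide frame.
path = crop_window_path([100, 110, 120, 500, 510], crop_w=200, frame_w=1280)
```

A proper implementation would additionally decide when a hard cut is preferable to a long pan, which is exactly what the optimization over editing operations addresses.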
We present here a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less amount of data, since the physical setup and the annotation process are hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low- and high-level features from the RGBD input, and a refining stage which learns to fuse the obtained complementary...
Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that...
We introduce a generative model for learning person- and costume-specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, an actor's head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of labeled key frames or tracks, and how to detect novel appearances in a maximum-likelihood framework. We present results...
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through...
We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from visual imagery to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a waypoint generator neural network. The waypoint generator network (WGN) maps the semantics and natural language encodings (NLE) to local waypoints. A local planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints, while a low-level controller executes these plans faithfully. The efficacy...
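The planner stage described above turns each generated waypoint into a local trajectory for the controller to track. As a minimal placeholder (the function name is illustrative, and a real planner would respect vehicle kinematics rather than drive a straight line):

```python
def local_trajectory(ego, waypoint, steps=5):
    """Generate a straight-line local trajectory from the ego position to a
    generated waypoint, as a stand-in for a kinematics-aware local planner.

    ego, waypoint: (x, y) positions; returns steps+1 evenly spaced points.
    """
    (x0, y0), (x1, y1) = ego, waypoint
    return [(x0 + (x1 - x0) * t / steps, y0 + (y1 - y0) * t / steps)
            for t in range(steps + 1)]

# Trajectory from the origin to a waypoint 10 m ahead and 5 m to the side.
traj = local_trajectory((0.0, 0.0), (10.0, 5.0), steps=5)
```

In the full pipeline, the WGN regenerates waypoints as the vehicle moves, so these short local trajectories chain into a complete language-conditioned route without a prior map.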