Vineet Gandhi

ORCID: 0000-0001-8861-7731
Research Areas
  • Video Analysis and Summarization
  • Advanced Neural Network Applications
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Video Surveillance and Tracking Methods
  • Advanced Image and Video Retrieval Techniques
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Visual Attention and Saliency Detection
  • Human Pose and Action Recognition
  • Image and Video Stabilization
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Face Recognition and Analysis
  • Robotics and Sensor-Based Localization
  • Autonomous Vehicle Technology and Safety
  • Topic Modeling
  • Image Processing Techniques and Applications
  • Human Motion and Animation
  • Image Enhancement Techniques
  • Advanced Image Processing Techniques
  • Anomaly Detection Techniques and Applications
  • Multisensory Perception and Integration

Indian Institute of Technology Hyderabad
2016-2025

International Institute of Information Technology, Hyderabad
2017-2025

International Institute of Information Technology
2016-2021

Bentley University
2015

Narrative (Sweden)
2015

Laboratoire Jean Kuntzmann
2014

Institut national de recherche en informatique et en automatique
2012-2013

Centre Inria de l'Université Grenoble Alpes
2012

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms state-of-the-art saliency prediction models on nine different datasets (three...
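As a rough illustration of the decoder's trilinear upsampling step (a numpy sketch under assumed tensor shapes, not the actual ViNet code), separable linear interpolation over the temporal and spatial axes upscales a coarse (T, H, W) feature volume toward frame resolution:

```python
import numpy as np

def upsample_linear(a, axis, factor):
    """Linearly interpolate array `a` along `axis` to `factor`x its length."""
    n = a.shape[axis]
    pos = np.linspace(0, n - 1, n * factor)      # target positions in source coords
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    # reshape weights so they broadcast along the chosen axis
    w = (pos - lo).reshape([-1 if i == axis else 1 for i in range(a.ndim)])
    return np.take(a, lo, axis=axis) * (1 - w) + np.take(a, hi, axis=axis) * w

def trilinear_upsample(vol, factor=2):
    """Trilinear interpolation = separable linear interpolation over T, H, W."""
    for ax in range(3):
        vol = upsample_linear(vol, ax, factor)
    return vol

feats = np.random.rand(4, 8, 8)     # coarse (T, H, W) decoder features (toy sizes)
sal = trilinear_upsample(feats, 2)  # upsampled saliency volume, shape (8, 16, 16)
```

In the real model this interpolation is interleaved with 3D convolutions that fuse features from multiple encoder hierarchies.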

10.1109/iros51168.2021.9635989 article EN 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021-09-27

We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve it by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity...

10.1109/icra48891.2023.10160614 article EN 2023-05-29

The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than that of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to...

10.1109/icra.2012.6224771 preprint EN 2012-05-01

In this paper, we propose a fully automatic method to register football broadcast video frames on the static top-view model of the playing surface. Automatic registration has been difficult due to the difficulty of finding sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic generation allows us to exhaustively...
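The dictionary lookup can be sketched as follows (a toy numpy stand-in: random binary vectors play the role of rendered edge maps, and matching is a plain L2 nearest-neighbour search rather than the paper's actual distance measure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionary: each entry pairs a synthetically rendered edge map
# (flattened to a binary vector here) with the homography used to render it.
dictionary = []
for _ in range(100):
    H = np.eye(3) + 0.01 * rng.standard_normal((3, 3))
    edge_map = (rng.random(64) < 0.2).astype(float)  # stand-in for line markings
    dictionary.append((edge_map, H))

def nearest_homography(query_edges, dictionary):
    """Return the homography paired with the nearest dictionary edge map."""
    dists = [np.sum((query_edges - e) ** 2) for e, _ in dictionary]
    return dictionary[int(np.argmin(dists))][1]

# Querying with a known edge map recovers its registered homography.
H_hat = nearest_homography(dictionary[7][0], dictionary)
```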

10.1109/wacv.2018.00040 article EN 2018-03-01

Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping entirely the labeling costs required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID...

10.48550/arxiv.2006.02609 preprint EN cc-by-nc-sa arXiv (Cornell University) 2020-01-01

This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, obtained by averaging the predicted saliency maps, achieves state-of-the-art performance...
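The ensemble step itself is simple: average the two predicted saliency maps. A minimal numpy sketch (the unit-sum normalization here is an assumption, not necessarily the paper's choice):

```python
import numpy as np

def normalize(smap):
    """Shift to non-negative and scale to unit sum (assumed normalization)."""
    smap = smap - smap.min()
    s = smap.sum()
    return smap / s if s > 0 else smap

def ensemble(map_s, map_a):
    # Average the normalized maps predicted by ViNet-S and ViNet-A.
    return normalize(0.5 * (normalize(map_s) + normalize(map_a)))

out = ensemble(np.random.rand(24, 32), np.random.rand(24, 32))  # toy maps
```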

10.48550/arxiv.2502.00397 preprint EN arXiv (Cornell University) 2025-02-01

We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual shots, termed rushes, are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand the key scene elements that guide the editing process, we employ a two-pronged approach: (1) a language model...

10.1145/3708359.3712113 preprint EN 2025-03-19

10.1109/icassp49660.2025.10888852 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890101 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889824 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889895 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles (yaw, pitch, roll) from an input image of a human face. Annotating ground truth head pose for images in the wild is difficult and relies on ad-hoc fitting procedures (which provide only coarse, approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches learn a regression function directly...
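For reference, the three predicted Euler angles determine a head rotation matrix. A numpy sketch under an assumed intrinsic yaw-pitch-roll composition order (conventions differ across papers, so treat the ordering as an illustrative assumption):

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose yaw (about Y), pitch (about X), roll (about Z); radians.
    The Ry @ Rx @ Rz order is an assumed convention for illustration."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

R = euler_to_rotation(0.3, -0.2, 0.1)   # a sample head pose
```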

10.1109/icassp.2019.8683503 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to models more complex than necessary. The complexity, in turn, hinders application requirements. In this paper, we identify four key components of saliency models, i.e.,...

10.1109/iros45743.2020.9341574 article EN 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2020-10-24

Multi-View Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art...

10.1109/wacvw58289.2023.00016 article EN 2023-01-01

We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming the important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into...

10.1145/2668904.2668936 preprint EN 2014-11-13

We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. It is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle recording or a close-up movie sequence captured with a static...
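A toy stand-in for the cropping-window idea (greedy, velocity-limited panning toward per-frame gaze positions; the paper instead solves a global optimization over cut, pan and zoom operations, so this only illustrates the constraint structure):

```python
import numpy as np

def crop_path(gaze_x, frame_w, crop_w, max_pan=30.0):
    """Track gaze with a cropping window whose centre pans at most
    `max_pan` px/frame and stays inside the original frame."""
    half = crop_w / 2
    xs = [float(np.clip(gaze_x[0], half, frame_w - half))]
    for g in gaze_x[1:]:
        target = np.clip(g, half, frame_w - half)
        step = np.clip(target - xs[-1], -max_pan, max_pan)  # pan-speed limit
        xs.append(xs[-1] + step)
    return np.array(xs)

gaze = np.array([100.0, 300.0, 320.0, 900.0, 880.0])  # per-frame gaze x (toy)
path = crop_path(gaze, frame_w=1280, crop_w=480, max_pan=30.0)
```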

10.1111/cgf.13354 article EN Computer Graphics Forum 2018-05-01

We present here a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with a small amount of data, since the physical setup and the annotation process are hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input, and a refining stage which learns to fuse the obtained complementary...

10.1109/icra.2018.8461065 article EN 2018-05-01

Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that...

10.1109/wacv57701.2024.00406 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03

We introduce a generative model for learning person- and costume-specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor's head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of key frames or tracks, and how to detect novel appearances in a maximum likelihood framework. We present results...

10.1109/cvpr.2013.475 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a natural language description. Addressing RIS efficiently requires considering the interactions happening across the visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute the different forms of interactions sequentially (leading to error propagation) or ignore intramodal interactions. We address this limitation by performing all three interactions simultaneously through...
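One simple way to realize simultaneous cross-modal and intra-modal interactions is self-attention over the concatenated visual and word tokens; a numpy sketch with hypothetical token counts and feature dimensions (not the paper's actual module):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ V

# Concatenating visual and word tokens lets one attention pass cover
# vision-to-language, language-to-vision, and intra-modal interactions at once.
vis = np.random.rand(16, 32)   # hypothetical visual region tokens
txt = np.random.rand(5, 32)    # hypothetical word tokens
tokens = np.vstack([vis, txt])
out = attention(tokens, tokens, tokens)   # shape (21, 32)
```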

10.18653/v1/2022.findings-acl.270 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from visual imagery to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints, while the low-level controller executes the plans faithfully. The efficacy...

10.1109/iros40897.2019.8967929 article EN 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019-11-01