Vineet Gandhi

ORCID: 0000-0001-8861-7731
Research Areas
  • Video Analysis and Summarization
  • Advanced Neural Network Applications
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Video Surveillance and Tracking Methods
  • Advanced Image and Video Retrieval Techniques
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Visual Attention and Saliency Detection
  • Human Pose and Action Recognition
  • Image and Video Stabilization
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Face Recognition and Analysis
  • Robotics and Sensor-Based Localization
  • Autonomous Vehicle Technology and Safety
  • Topic Modeling
  • Image Processing Techniques and Applications
  • Human Motion and Animation
  • Image Enhancement Techniques
  • Advanced Image Processing Techniques
  • Anomaly Detection Techniques and Applications
  • Multisensory Perception and Integration

Indian Institute of Technology Hyderabad
2016-2025

International Institute of Information Technology, Hyderabad
2017-2025

International Institute of Information Technology
2016-2021

Bentley University
2015

Narrative (Sweden)
2015

Laboratoire Jean Kuntzmann
2014

Institut national de recherche en informatique et en automatique
2012-2013

Centre Inria de l'Université Grenoble Alpes
2012

We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms state-of-the-art saliency prediction models on nine different datasets (three...
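As a rough illustration of the decoder's trilinear upsampling step (a numpy sketch under assumed tensor shapes, not the actual ViNet code), separable linear interpolation over the temporal and spatial axes upscales a coarse (T, H, W) feature volume toward frame resolution:

```python
import numpy as np

def upsample_linear(a, axis, factor):
    """Linearly interpolate array `a` along `axis` to `factor`x its length."""
    n = a.shape[axis]
    pos = np.linspace(0, n - 1, n * factor)      # target positions in source coords
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    # reshape weights so they broadcast along the chosen axis
    w = (pos - lo).reshape([-1 if i == axis else 1 for i in range(a.ndim)])
    return np.take(a, lo, axis=axis) * (1 - w) + np.take(a, hi, axis=axis) * w

def trilinear_upsample(vol, factor=2):
    """Trilinear interpolation = separable linear interpolation over T, H, W."""
    for ax in range(3):
        vol = upsample_linear(vol, ax, factor)
    return vol

feats = np.random.rand(4, 8, 8)     # coarse (T, H, W) decoder features (toy sizes)
sal = trilinear_upsample(feats, 2)  # upsampled saliency volume, shape (8, 16, 16)
```

In the real model this interpolation is interleaved with 3D convolutions that fuse features from multiple encoder hierarchies.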

10.1109/iros51168.2021.9635989 article EN 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2021-09-27

We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve it by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity...

10.1109/icra48891.2023.10160614 article EN 2023-05-29

The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than that of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to...

10.1109/icra.2012.6224771 preprint EN 2012-05-01

In this paper, we propose a fully automatic method to register football broadcast video frames on the static top-view model of the playing surface. Automatic registration has been difficult due to the difficulty of finding sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic generation allows us to exhaustively...
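The dictionary lookup can be sketched as follows (a toy numpy stand-in: random binary vectors play the role of rendered edge maps, and matching is a plain L2 nearest-neighbour search rather than the paper's actual distance measure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionary: each entry pairs a synthetically rendered edge map
# (flattened to a binary vector here) with the homography used to render it.
dictionary = []
for _ in range(100):
    H = np.eye(3) + 0.01 * rng.standard_normal((3, 3))
    edge_map = (rng.random(64) < 0.2).astype(float)  # stand-in for line markings
    dictionary.append((edge_map, H))

def nearest_homography(query_edges, dictionary):
    """Return the homography paired with the nearest dictionary edge map."""
    dists = [np.sum((query_edges - e) ** 2) for e, _ in dictionary]
    return dictionary[int(np.argmin(dists))][1]

# Querying with a known edge map recovers its registered homography.
H_hat = nearest_homography(dictionary[7][0], dictionary)
```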

10.1109/wacv.2018.00040 article EN 2018-03-01

Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping entirely the labeling costs required for training. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID...

10.48550/arxiv.2006.02609 preprint EN cc-by-nc-sa arXiv (Cornell University) 2020-01-01

This paper introduces ViNet-S, a 36MB model based on the ViNet architecture with a U-Net design, featuring a lightweight decoder that significantly reduces model size and parameter count without compromising performance. Additionally, ViNet-A (148MB) incorporates spatio-temporal action localization (STAL) features, differing from traditional video saliency models that use classification backbones. Our studies show that an ensemble of ViNet-S and ViNet-A, obtained by averaging the predicted saliency maps, achieves state-of-the-art performance...
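The ensemble step itself is simple: average the two predicted saliency maps. A minimal numpy sketch (the unit-sum normalization here is an assumption, not necessarily the paper's choice):

```python
import numpy as np

def normalize(smap):
    """Shift to non-negative and scale to unit sum (assumed normalization)."""
    smap = smap - smap.min()
    s = smap.sum()
    return smap / s if s > 0 else smap

def ensemble(map_s, map_a):
    # Average the normalized maps predicted by ViNet-S and ViNet-A.
    return normalize(0.5 * (normalize(map_s) + normalize(map_a)))

out = ensemble(np.random.rand(24, 32), np.random.rand(24, 32))  # toy maps
```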

10.48550/arxiv.2502.00397 preprint EN arXiv (Cornell University) 2025-02-01

We present EditIQ, a completely automated framework for cinematically editing scenes captured via a stationary, large field-of-view and high-resolution camera. From the static camera feed, EditIQ initially generates multiple virtual feeds, emulating a team of cameramen. These virtual shots, termed rushes, are subsequently assembled using an automated editing algorithm, whose objective is to present the viewer with the most vivid scene content. To understand the key scene elements that guide the editing process, we employ a two-pronged approach: (1) a language model...

10.1145/3708359.3712113 preprint EN 2025-03-19

10.1109/icassp49660.2025.10888852 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890101 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889824 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889895 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles (yaw, pitch, roll) from an input image of a human face. Annotating ground truth head pose for images in the wild is difficult and relies on ad-hoc fitting procedures (which provide only coarse, approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches learn a regression function directly...
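For reference, the three predicted Euler angles determine a head rotation matrix. A numpy sketch under an assumed intrinsic yaw-pitch-roll composition order (conventions differ across papers, so treat the ordering as an illustrative assumption):

```python
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose yaw (about Y), pitch (about X), roll (about Z); radians.
    The Ry @ Rx @ Rz order is an assumed convention for illustration."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Ry @ Rx @ Rz

R = euler_to_rotation(0.3, -0.2, 0.1)   # a sample head pose
```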

10.1109/icassp.2019.8683503 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to models more complex than necessary. The complexity, in turn, hinders application requirements. In this paper, we identify four key components of saliency models, i.e.,...

10.1109/iros45743.2020.9341574 article EN 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2020-10-24

Multi-View Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art...

10.1109/wacvw58289.2023.00016 article EN 2023-01-01

We propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Assuming the important actors and objects can be localized using computer vision techniques, our method requires only minimal user input to define the subject matter of each sub-clip. The composition of each sub-clip is computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into...

10.1145/2668904.2668936 preprint EN 2014-11-13

We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. It is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle recording or a close-up movie sequence captured with a static...
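A toy stand-in for the cropping-window idea (greedy, velocity-limited panning toward per-frame gaze positions; the paper instead solves a global optimization over cut, pan and zoom operations, so this only illustrates the constraint structure):

```python
import numpy as np

def crop_path(gaze_x, frame_w, crop_w, max_pan=30.0):
    """Track gaze with a cropping window whose centre pans at most
    `max_pan` px/frame and stays inside the original frame."""
    half = crop_w / 2
    xs = [float(np.clip(gaze_x[0], half, frame_w - half))]
    for g in gaze_x[1:]:
        target = np.clip(g, half, frame_w - half)
        step = np.clip(target - xs[-1], -max_pan, max_pan)  # pan-speed limit
        xs.append(xs[-1] + step)
    return np.array(xs)

gaze = np.array([100.0, 300.0, 320.0, 900.0, 880.0])  # per-frame gaze x (toy)
path = crop_path(gaze, frame_w=1280, crop_w=480, max_pan=30.0)
```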

10.1111/cgf.13354 article EN Computer Graphics Forum 2018-05-01

We present here a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with a small amount of data, since the physical setup and the annotation process are hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input, and a refining stage which learns to fuse the obtained complementary...

10.1109/icra.2018.8461065 article EN 2018-05-01

Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that...

10.1109/wacv57701.2024.00406 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03

We introduce a generative model for learning person- and costume-specific detectors from labeled examples. We demonstrate the model on the task of localizing and naming actors in long video sequences. More specifically, the actor's head and shoulders are each represented as a constellation of optional color regions. Detection can proceed despite changes in view-point and partial occlusions. We explain how to learn the models from a small number of key frames or tracks, and how to detect novel appearances in a maximum likelihood framework. We present results...

10.1109/cvpr.2013.475 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a natural language description. Addressing RIS efficiently requires considering the interactions happening across the visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute the different forms of interactions sequentially (leading to error propagation) or ignore intramodal interactions. We address this limitation by performing all three interactions simultaneously through...
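One simple way to realize simultaneous cross-modal and intra-modal interactions is self-attention over the concatenated visual and word tokens; a numpy sketch with hypothetical token counts and feature dimensions (not the paper's actual module):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ V

# Concatenating visual and word tokens lets one attention pass cover
# vision-to-language, language-to-vision, and intra-modal interactions at once.
vis = np.random.rand(16, 32)   # hypothetical visual region tokens
txt = np.random.rand(5, 32)    # hypothetical word tokens
tokens = np.vstack([vis, txt])
out = attention(tokens, tokens, tokens)   # shape (21, 32)
```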

10.18653/v1/2022.findings-acl.270 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from visual imagery to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints, while the low-level controller executes the plans faithfully. The efficacy...

10.1109/iros40897.2019.8967929 article EN 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019-11-01