- Advanced Vision and Imaging
- Human Pose and Action Recognition
- Robotics and Sensor-Based Localization
- Video Surveillance and Tracking Methods
- Music and Audio Processing
- Computer Graphics and Visualization Techniques
- Image Enhancement Techniques
- Multimodal Machine Learning Applications
- 3D Shape Modeling and Analysis
- Video Analysis and Summarization
- Optical Measurement and Interference Techniques
- Advanced Image Processing Techniques
- Speech and Audio Processing
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
- Digital Media Forensic Detection
- Advanced Neural Network Applications
- Hand Gesture Recognition Systems
- Generative Adversarial Networks and Image Synthesis
- Image and Signal Denoising Methods
- Infrared Thermography in Medicine
- 3D Surveying and Cultural Heritage
- Autonomous Vehicle Technology and Safety
- Music Technology and Sound Studies
- Domain Adaptation and Few-Shot Learning
- University of Surrey, 2015-2024
- Signal Processing (United States), 2024
- Samsung (India), 2012
- Indian Institute of Technology Kanpur, 2011
With the increasing global popularity of self-driving cars, there is an immediate need for challenging real-world datasets for benchmarking and training various computer vision tasks such as 3D object detection. Existing datasets either represent simple scenarios or provide only day-time data. In this paper, we introduce a new A*3D dataset which consists of RGB images and LiDAR data with significant diversity of scene, time, and weather. The dataset consists of high-density images (≈ 10 times more than the pioneering KITTI dataset), heavy...
This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge of, or limiting constraints on, the scene structure, appearance, or illumination. Existing techniques for wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and the background is known. These approaches are not robust for scenes captured with sparse moving cameras. Previous approaches for outdoor scenes assume prior knowledge of the static background appearance and structure. The primary contributions...
This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction. Robust segmentation of dynamic objects is achieved by introducing a geodesic star...
Text-to-image (T2I) models such as StableDiffusion have been used to generate high-quality images of people. However, due to the random nature of the generation process, the person has a different appearance, e.g. pose, face, and clothing, despite using the same text prompt. This inconsistency makes T2I models unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Our model is the first unified method to perform all person image tasks: generation, pose transfer, and mask-less...
Dense action detection involves detecting multiple co-occurring actions in an untrimmed video, where action classes are often ambiguous and represent overlapping concepts. To address this challenging task, we introduce a novel perspective inspired by how humans tackle complex tasks by breaking them into manageable sub-tasks. Instead of relying on a single network to solve the entire problem, as in current approaches, we propose decomposing the problem into the key concepts present in the action classes, specifically, dense static and dynamic...
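To make the decomposition idea concrete, here is a minimal, hypothetical sketch: two separate classification heads, one for static concepts and one for dynamic concepts, over shared per-frame features. The module and head names are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch of decomposing dense action detection into static and
# dynamic concept heads. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class DecomposedActionHead(nn.Module):
    def __init__(self, feat_dim, n_static, n_dynamic):
        super().__init__()
        self.static_head = nn.Linear(feat_dim, n_static)    # e.g. scene/object concepts
        self.dynamic_head = nn.Linear(feat_dim, n_dynamic)  # e.g. motion concepts

    def forward(self, x):                      # x: (batch, time, feat_dim)
        # Sigmoid rather than softmax: several actions can co-occur
        # at the same time step in dense detection.
        return (torch.sigmoid(self.static_head(x)),
                torch.sigmoid(self.dynamic_head(x)))

head = DecomposedActionHead(feat_dim=256, n_static=12, n_dynamic=20)
scores_s, scores_d = head(torch.randn(2, 100, 256))
print(scores_s.shape, scores_d.shape)          # (2, 100, 12) (2, 100, 20)
```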
In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of objects with similar shape and appearance. We demonstrate that this results in improved segmentation of dynamic scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation by enforcing consistent...
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities could be learnt on three levels: 1) spatial, 2) temporal, and 3) semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned at the spatial and temporal levels; and the inter-modal (audio and visual) semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment...
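As a rough illustration of what temporally aligning the two streams can look like, the sketch below uses standard cross-modal attention in PyTorch: audio queries attend over visual features and vice versa. This shows the general technique only, under assumed feature shapes; it is not the proposed alignment network.

```python
# Minimal cross-modal alignment sketch: each modality attends over the
# other, yielding mutually informed features. Illustrative, not the
# paper's code; feature dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalAlign(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):           # (B, Ta, D), (B, Tv, D)
        a_aligned, _ = self.a2v(audio, video, video)  # audio queries visual
        v_aligned, _ = self.v2a(video, audio, audio)  # visual queries audio
        return a_aligned, v_aligned

align = CrossModalAlign(dim=128)
a, v = align(torch.randn(2, 50, 128), torch.randn(2, 60, 128))
print(a.shape, v.shape)                        # (2, 50, 128) (2, 60, 128)
```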
A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, A-KAZE, and MSER. In this paper, we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of accurate features for matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for reconstruction. All input images are over-segmented into regions using any existing...
In the domain of audio transformer architectures, prior research has extensively investigated isotropic architectures that capture global context through full self-attention, and hierarchical architectures that progressively transition from local to global context utilising hierarchical structures with convolutions or window-based attention. However, the idea of imbuing each individual block with both contexts, thereby creating a hybrid transformer block, remains relatively under-explored in the field. To facilitate this exploration, we introduce Multi Axis Audio...
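One way to picture such a hybrid block is a layer with two sequential attention paths: windowed attention for local context, then full attention for global context. The sketch below is a schematic assumption about this general structure, not the published architecture.

```python
# Hedged sketch of a hybrid transformer block combining a local
# (windowed) and a global (full) attention path in every layer.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, D), T % window == 0
        B, T, D = x.shape
        # Local path: attention restricted to non-overlapping windows.
        w = self.norm1(x).reshape(B * T // self.window, self.window, D)
        local, _ = self.local_attn(w, w, w)
        x = x + local.reshape(B, T, D)
        # Global path: full self-attention over the whole sequence.
        g = self.norm2(x)
        glob, _ = self.global_attn(g, g, g)
        return x + glob

block = HybridBlock(dim=64)
print(block(torch.randn(2, 32, 64)).shape)     # (2, 32, 64)
```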
We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions and interactions. Our method addresses both limitations by introducing the first approach to perform model-free implicit reconstruction for realistic...
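For readers unfamiliar with model-free implicit reconstruction: instead of fitting a fixed body template, a network maps any 3D query point (plus conditioning features) to an inside/outside value, and the surface is recovered as a level set. The toy sketch below illustrates only that query mechanism; the network and feature names are assumptions, not the paper's model.

```python
# Toy implicit occupancy network: 3D point + conditioning feature in,
# inside/outside probability out. Entirely schematic.
import torch
import torch.nn as nn

class ImplicitOccupancy(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, points, feat):           # (N, 3), (N, feat_dim)
        occ = self.mlp(torch.cat([points, feat], dim=-1))
        return torch.sigmoid(occ)              # inside/outside probability

net = ImplicitOccupancy()
pts = torch.rand(1000, 3) * 2 - 1              # query points in [-1, 1]^3
occ = net(pts, torch.randn(1000, 64))
print(occ.shape)    # (1000, 1); a mesh would follow via marching cubes
```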
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent approaches that use a hierarchical structure. We argue that joining self-attention with...
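Relative positional encoding, mentioned in (i), is commonly realised as a learned bias indexed by the (clipped) time offset between query and key, added to the attention logits. The sketch below shows that general technique under assumed shapes; it is not PAT's exact formulation.

```python
# Hedged sketch of relative positional encoding in temporal
# self-attention: a learned per-offset bias added to the logits.
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    def __init__(self, dim, max_dist=32):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.max_dist = max_dist
        self.rel_bias = nn.Embedding(2 * max_dist + 1, 1)

    def forward(self, x):                      # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(1, 2) / D ** 0.5          # (B, T, T)
        # Offset between each query and key position, clipped to range.
        offsets = torch.arange(T)[:, None] - torch.arange(T)[None, :]
        offsets = offsets.clamp(-self.max_dist, self.max_dist) + self.max_dist
        logits = logits + self.rel_bias(offsets).squeeze(-1)  # (T, T) bias
        return torch.softmax(logits, dim=-1) @ v

attn = RelPosSelfAttention(dim=64)
print(attn(torch.randn(2, 100, 64)).shape)     # (2, 100, 64)
```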
Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression, and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent representation. A novel method to obtain Epipolar Plane...
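As background on the Epipolar Plane Image (EPI) construction referenced here: an EPI is a 2D slice of the light field taken along one angular and one spatial axis, in which scene points trace lines whose slope encodes depth. The snippet below shows that slicing under an assumed axis layout; it is not the paper's pipeline.

```python
# Simple EPI extraction from a toy 3D light field (views along a
# horizontal camera line). Axis layout is an assumption for the sketch.
import numpy as np

def epi_slice(light_field, row):
    """light_field: (U, Y, X) grayscale views; returns the (U, X) EPI
    for image row `row`. Line slopes in this slice encode depth."""
    return light_field[:, row, :]

lf = np.random.rand(9, 64, 64)                 # 9 views of a 64x64 image
epi = epi_slice(lf, row=32)
print(epi.shape)                               # (9, 64)
```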
Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored to the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled...
We introduce the first approach to solve challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our simultaneously estimates a detailed model that includes per-pixel semantically and temporally coherent reconstruction, together instance-level segmentation exploiting photo-consistency, semantic motion information. further leverage recent advances in 3D pose estimation constrain joint instance...
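Photo-consistency, one of the cues named above, can be illustrated in a few lines: project a 3D point into several calibrated views, sample the colours, and score their agreement (low variance means consistent). The cameras and scoring below are synthetic assumptions, not the paper's formulation.

```python
# Minimal photo-consistency check across calibrated views.
import numpy as np

def photo_consistency(point, cameras, images):
    """point: (3,); cameras: list of 3x4 projection matrices;
    images: list of (H, W, 3) arrays. Returns mean colour variance."""
    colours = []
    for P, img in zip(cameras, images):
        x = P @ np.append(point, 1.0)          # homogeneous projection
        u, v = int(x[0] / x[2]), int(x[1] / x[2])
        if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
            colours.append(img[v, u])
    if len(colours) < 2:
        return np.inf                          # unobserved: no evidence
    return float(np.var(np.stack(colours), axis=0).mean())

# Two toy cameras looking down the z-axis, 0.1 m apart.
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[0.1], [0], [0]])])
img = np.random.rand(64, 64, 3)
print(photo_consistency(np.array([0.0, 0.0, 2.0]), [P1, P2], [img, img]))
```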
We present a generalised self-supervised learning approach for monocular estimation of the real depth across scenes with diverse depth ranges from 1-100s of meters. Existing supervised methods require accurate depth measurements for training. This limitation has led to the introduction of methods that are trained on stereo image pairs with a fixed camera baseline to estimate disparity, which is transformed to depth given known calibration. Self-supervised approaches have demonstrated impressive results but do not generalise to scenes with different depth ranges or...
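The disparity-to-depth transform referred to above follows directly from rectified stereo geometry: depth = f·B/d, for focal length f in pixels, baseline B in meters, and disparity d in pixels. A minimal sketch, with illustrative values:

```python
# Rectified-stereo disparity-to-depth conversion; numbers are examples.
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    depth = np.full_like(disparity, np.inf, dtype=float)
    valid = disparity > 0                      # zero disparity = infinitely far
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

disp = np.array([[50.0, 10.0], [1.0, 0.0]])    # disparities in pixels
print(disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54))
# 720 * 0.54 / 50 ≈ 7.8 m, / 10 ≈ 38.9 m, / 1 ≈ 388.8 m, and inf
```

Note how a fixed baseline bakes a particular depth range into the learned disparities, which is exactly why such models struggle to generalise across scenes with different ranges.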
A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of 'good' features for accurate reconstruction. Each image is segmented into regions by over-segmentation and feature points are detected at the intersection of the boundaries of three or more regions. Segmentation-based detection locates features at local maxima...
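The detection rule described, points where the boundaries of three or more regions meet, is easy to prototype. The toy version below over-segments with SLIC from scikit-image (the abstract allows any over-segmentation) and flags pixels whose 2x2 neighbourhood contains three or more distinct labels; it is a sketch of the stated idea, not the authors' implementation.

```python
# Toy SFD-style junction detection on an over-segmented image.
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic

img = astronaut()
labels = slic(img, n_segments=300, compactness=10)  # over-segmentation

# A pixel is a junction candidate if its 2x2 neighbourhood contains
# three or more distinct region labels.
blocks = np.stack([labels[:-1, :-1], labels[:-1, 1:],
                   labels[1:, :-1], labels[1:, 1:]], axis=-1)
s = np.sort(blocks, axis=-1)
n_distinct = 1 + (np.diff(s, axis=-1) != 0).sum(axis=-1)
ys, xs = np.where(n_distinct >= 3)
print(f"{len(ys)} junction feature candidates")
```

Because junctions exist wherever regions meet, this yields a far denser and more uniform spread of candidates than corner-style detectors, which is the motivation stated in the abstract.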
Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex scenes from multi-view static or moving cameras without knowledge of the scene structure, appearance, or illumination. Contributions of this work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal...
Simultaneous semantically coherent object-based long-term 4D scene flow estimation, co-segmentation and reconstruction is proposed, exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent reconstruction of general scenes from multiple view videos captured with a network of static or moving cameras. Semantic coherence results...
Existing methods for stereo work on narrow-baseline image pairs, giving limited performance between wide views. This paper proposes a framework to learn and estimate dense stereo for people from wide-baseline image pairs. A synthetic patch dataset (S2P2) is introduced for matching people. The proposed network not only learns human-specific features from synthetic data but also exploits a pooling layer and data augmentation to adapt to real data. The network matches the patches for wide-baseline stereo estimation. In addition to patch-match learning, a constraint is introduced to solve the reconstruction of humans...