Armin Mustafa

ORCID: 0000-0002-1779-2775
Research Areas
  • Advanced Vision and Imaging
  • Human Pose and Action Recognition
  • Robotics and Sensor-Based Localization
  • Video Surveillance and Tracking Methods
  • Music and Audio Processing
  • Computer Graphics and Visualization Techniques
  • Image Enhancement Techniques
  • Multimodal Machine Learning Applications
  • 3D Shape Modeling and Analysis
  • Video Analysis and Summarization
  • Optical Measurement and Interference Techniques
  • Advanced Image Processing Techniques
  • Speech and Audio Processing
  • Anomaly Detection Techniques and Applications
  • Advanced Image and Video Retrieval Techniques
  • Digital Media Forensic Detection
  • Advanced Neural Network Applications
  • Hand Gesture Recognition Systems
  • Generative Adversarial Networks and Image Synthesis
  • Image and Signal Denoising Methods
  • Infrared Thermography in Medicine
  • 3D Surveying and Cultural Heritage
  • Autonomous Vehicle Technology and Safety
  • Music Technology and Sound Studies
  • Domain Adaptation and Few-Shot Learning

University of Surrey
2015-2024

Signal Processing (United States)
2024

Samsung (India)
2012

Indian Institute of Technology Kanpur
2011

With the increasing global popularity of self-driving cars, there is an immediate need for challenging real-world datasets for benchmarking and training various computer vision tasks such as 3D object detection. Existing datasets either represent simple scenarios or provide only day-time data. In this paper, we introduce a new challenging A*3D dataset which consists of RGB images and LiDAR data with a significant diversity of scene, time, and weather. The dataset consists of high-density images (≈ 10 times more than the pioneering KITTI dataset), heavy...

10.1109/icra40945.2020.9197385 article EN 2020-05-01

This paper introduces a general approach to dynamic scene reconstruction from multiple moving cameras without prior knowledge or limiting constraints on the scene structure, appearance, or illumination. Existing techniques for reconstruction from wide-baseline camera views primarily focus on accurate reconstruction in controlled environments, where the cameras are fixed and calibrated and the background is known. These approaches are not robust for general dynamic scenes captured with sparse moving cameras. Previous approaches for outdoor scenes assume prior knowledge of the static appearance and structure. The primary contributions...

10.1109/iccv.2015.109 article EN 2015-12-01

This paper presents an approach for reconstruction of 4D temporally coherent models of complex dynamic scenes. No prior knowledge is required of scene structure or camera calibration, allowing reconstruction from multiple moving cameras. Sparse-to-dense temporal correspondence is integrated with joint multi-view segmentation and reconstruction to obtain a complete representation of static and dynamic objects. Temporal coherence is exploited to overcome visual ambiguities, resulting in improved reconstruction. Robust initialization of dynamic objects is achieved by introducing geodesic star...

10.1109/cvpr.2016.504 article EN 2016-06-01

Text-to-image models (T2I) such as StableDiffusion have been used to generate high-quality images of people. However, due to the random nature of the generation process, the person has a different appearance, e.g. pose, face, and clothing, despite using the same text prompt. The appearance inconsistency makes T2I unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompting. Ours is the first unified method to perform all person image tasks - generation, pose transfer, and mask-less...

10.1109/iccvw60793.2023.00451 article EN 2023-10-02

Dense action detection involves detecting multiple co-occurring actions in an untrimmed video, while action classes are often ambiguous and represent overlapping concepts. To address this challenging task, we introduce a novel perspective inspired by how humans tackle complex tasks by breaking them into manageable sub-tasks. Instead of relying on a single network to address the entire problem, as in current approaches, we propose decomposing the problem into detecting the key concepts present in action classes, specifically, dense static and dynamic...

10.48550/arxiv.2501.18509 preprint EN arXiv (Cornell University) 2025-01-30

In this paper we propose a framework for spatially and temporally coherent semantic co-segmentation and reconstruction of complex dynamic scenes from multiple static or moving cameras. Semantic co-segmentation exploits the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants for objects with similar shape and appearance. We demonstrate that this results in improved segmentation and reconstruction for complex scenes. A joint formulation is proposed for semantically coherent object-based co-segmentation and reconstruction by enforcing consistent...

10.1109/cvpr.2017.592 article EN 2017-07-01

In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network isn't aligned on the Spatial and Temporal levels; and the inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment...

10.1109/wacv57701.2024.00709 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03

A common problem in wide-baseline matching is the sparse and non-uniform distribution of correspondences when using conventional detectors, such as SIFT, SURF, FAST, A-KAZE, and MSER. In this paper, we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of accurate features for matching. A multi-scale SFD is proposed using bilateral image decomposition to produce a large number of scale-invariant features for reconstruction. All input images are over-segmented into regions using any existing...

10.1109/tip.2018.2872906 article EN IEEE Transactions on Image Processing 2018-09-28
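The core SFD idea of taking candidate features where several over-segmented regions meet can be sketched in a few lines. The toy label map and the function below are illustrative assumptions for exposition, not the paper's implementation (which operates on real over-segmentations and refines candidates further):

```python
def junction_points(labels):
    """Return (row, col) of 2x2 cell corners where three or more
    distinct segmentation regions meet - the SFD-style candidate
    feature locations."""
    h, w = len(labels), len(labels[0])
    points = []
    for r in range(h - 1):
        for c in range(w - 1):
            cell = {labels[r][c], labels[r][c + 1],
                    labels[r + 1][c], labels[r + 1][c + 1]}
            if len(cell) >= 3:
                points.append((r, c))
    return points

# Toy 4x4 over-segmentation: regions 0, 1 and 2 meet near the centre.
seg = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 2],
]
print(junction_points(seg))  # -> [(1, 1)]
```

Because every region boundary in an over-segmentation tends to follow image structure, such junctions are densely and fairly uniformly distributed, which is the property the abstract contrasts with sparse SIFT/SURF-style detections.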

In the domain of audio transformer architectures, prior research has extensively investigated isotropic architectures that capture global context through full self-attention, and hierarchical architectures that progressively transition from local to global context utilising hierarchical structures with convolutions or window-based attention. However, the idea of imbuing each individual block with both contexts, thereby creating a hybrid transformer block, remains relatively under-explored in the field. To facilitate this exploration, we introduce Multi Axis Audio...

10.1109/icassp48485.2024.10447697 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
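The local-plus-global idea of a hybrid block can be illustrated by the attention pattern it admits: positions attend within a local window (block attention) and, additionally, across a strided grid (sparse global attention), loosely in the spirit of multi-axis (MaxViT-style) designs. The window size and grid rule below are illustrative assumptions, not the paper's exact configuration:

```python
def hybrid_attention_mask(seq_len, window):
    """Boolean mask: position i may attend to j if j lies in i's local
    window (block-local axis) OR j shares i's offset on a stride-
    `window` grid (sparse global axis)."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            same_window = (i // window) == (j // window)   # local axis
            on_grid = (j % window) == (i % window)         # global axis
            mask[i][j] = same_window or on_grid
    return mask

m = hybrid_attention_mask(4, 2)
# Position 0 reaches its window {0, 1} plus grid position 2, not 3.
print(m[0])  # -> [True, True, True, False]
```

Stacking such blocks lets every position mix local detail and long-range context at every depth, instead of reserving global mixing for the late stages of a hierarchy.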

We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: they are often model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair; or they require manual intervention to resolve occlusions and interactions. Our method addresses both limitations by introducing the first approach to perform model-free implicit reconstruction for realistic...

10.1109/cvpr46437.2021.01424 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism of transformers loses positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with...

10.1109/iccvw60793.2023.00321 article EN 2023-10-02
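The relative positional encoding the abstract refers to can be sketched as a bias term b[i - j] added to the raw attention scores before the softmax, which restores the temporal ordering that plain dot-product attention is invariant to. The toy scores and bias values below are illustrative, not PAT's learned parameters:

```python
import math

def attention_with_rel_bias(scores, bias):
    """Softmax attention where a relative-position bias bias[i - j]
    is added to the raw scores row by row."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] + bias[i - j] for j in range(n)]
        m = max(row)                      # stabilise the softmax
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([v / s for v in exps])
    return out

# Uniform scores; a bias that prefers offset 0 (attending to self).
scores = [[0.0] * 3 for _ in range(3)]
bias = {-2: 0.0, -1: 0.0, 0: 1.0, 1: 0.0, 2: 0.0}
probs = attention_with_rel_bias(scores, bias)
```

With uniform content scores, the bias alone shifts attention mass toward offset 0, demonstrating that the mechanism is position-aware even when the content gives no preference.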

Light-field video has recently been used in virtual and augmented reality applications to increase realism and immersion. However, existing light-field methods are generally limited to static scenes due to the requirement to acquire a dense scene representation. The large amount of data and the absence of methods to infer temporal coherence pose major challenges in storage, compression, and editing compared to conventional video. In this paper, we propose the first method to extract a spatio-temporally coherent representation. A novel method to obtain Epipolar Plane...

10.1109/3dv.2017.00014 article EN 2017 International Conference on 3D Vision (3DV) 2017-10-01

Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored to audio. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled...

10.1609/aaai.v38i16.29716 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

10.1109/cvprw63382.2024.00597 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2024-06-17

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic, and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint instance...

10.1109/iccv.2019.01052 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

We present a generalised self-supervised learning approach for monocular estimation of the real depth across scenes with diverse depth ranges from 1--100s of meters. Existing supervised methods require accurate depth measurements for training. This limitation has led to the introduction of self-supervised methods that are trained on stereo image pairs with a fixed camera baseline to estimate disparity, which is transformed to depth given known calibration. Self-supervised approaches have demonstrated impressive results but do not generalise to scenes with different depth ranges or...

10.48550/arxiv.2004.06267 preprint EN other-oa arXiv (Cornell University) 2020-01-01
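The disparity-to-depth transform the abstract mentions is the standard rectified-stereo relation depth = f * B / d; it is what makes the fixed baseline and known calibration necessary in the stereo-trained setting. The calibration values below are illustrative KITTI-like numbers assumed for the example, not taken from the paper:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation: metric depth = f * B / d, where f is
    the focal length in pixels, B the stereo baseline in meters, and
    d the disparity in pixels."""
    return focal_px * baseline_m / disparity_px

# Illustrative calibration: focal length 721 px, baseline 0.54 m.
depth_m = depth_from_disparity(50.0, 721.0, 0.54)
print(round(depth_m, 2))  # -> 7.79
```

Because depth scales with f and B, a network trained on one rig's disparity does not transfer to scenes with a different range, which is the generalisation gap the paper targets.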

A common problem in wide-baseline stereo is the sparse and non-uniform distribution of correspondences when using conventional detectors such as SIFT, SURF, FAST, and MSER. In this paper we introduce a novel segmentation-based feature detector (SFD) that produces an increased number of 'good' features for accurate reconstruction. Each image is segmented into regions by over-segmentation, and feature points are detected at the intersection of the boundaries of three or more regions. Segmentation-based detection locates features at local maxima...

10.1109/3dv.2015.39 article EN International Conference on 3D Vision 2015-10-01

Abstract Existing techniques for dynamic scene reconstruction from multiple wide-baseline cameras primarily focus on reconstruction in controlled environments, with fixed calibrated cameras and strong prior constraints. This paper introduces a general approach to obtain a 4D representation of complex dynamic scenes from multi-view static or moving cameras without prior knowledge of the scene structure, appearance, or illumination. Contributions of this work are: an automatic method for initial coarse reconstruction to initialize joint estimation; sparse-to-dense temporal...

10.1007/s11263-020-01367-2 article EN cc-by International Journal of Computer Vision 2020-08-18

Abstract Simultaneous semantically coherent object-based long-term 4D scene flow estimation, co-segmentation, and reconstruction is proposed, exploiting the coherence in semantic class labels both spatially, between views at a single time instant, and temporally, between widely spaced time instants of dynamic objects with similar shape and appearance. In this paper we propose a framework for spatially and temporally coherent semantic 4D scene flow of general dynamic scenes from multiple view videos captured with a network of static or moving cameras. Semantic co-segmentation results...

10.1007/s11263-019-01241-w article EN cc-by International Journal of Computer Vision 2019-10-03

Existing methods for stereo work on narrow baseline image pairs, giving limited performance between wide views. This paper proposes a framework to learn and estimate dense stereo for people from wide-baseline image pairs. A synthetic people stereo patch dataset (S2P2) is introduced for matching people. The proposed framework not only learns human-specific features from the data but also exploits pooling layers and data augmentation to adapt to real data. The network matches the patches for wide-baseline stereo estimation. In addition to patch match learning, a constraint is introduced in the framework to solve the reconstruction of humans. ...

10.1109/iccvw.2019.00271 article EN 2019-10-01


10.48550/arxiv.1909.07541 preprint EN other-oa arXiv (Cornell University) 2019-01-01