Muhammad Maaz

ORCID: 0000-0002-3869-631X
Research Areas
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Meta-analysis and systematic reviews
  • Orthodontics and Dentofacial Orthopedics
  • Temporomandibular Joint Disorders
  • COVID-19 diagnosis using AI
  • Topic Modeling
  • Advanced Image and Video Retrieval Techniques
  • Artificial Intelligence in Healthcare and Education
  • Human Pose and Action Recognition
  • Game Theory and Voting Systems
  • Natural Language Processing Techniques
  • Facial Rejuvenation and Surgery Techniques
  • Healthcare Policy and Management
  • Names, Identity, and Discrimination Research
  • Cancer-related molecular mechanisms research
  • Brain Tumor Detection and Classification
  • COVID-19 epidemiological studies
  • Dental Implant Techniques and Outcomes
  • Speech and dialogue systems
  • scientometrics and bibliometrics research
  • Ethics in Clinical Research
  • Medical Image Segmentation Techniques
  • COVID-19 Pandemic Impacts

Mohamed bin Zayed University of Artificial Intelligence
2022-2025

National University of Computer and Emerging Sciences
2019-2024

University of Toronto
2020-2024

Aga Khan University Hospital
2022-2024

Creative Commons
2023

Hazara University
2023

Bahria University
2023

Capital University of Science and Technology
2022

MILA University
2022

Canada Research Chairs
2022

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by Natural Language Processing (NLP) literature, recent adaptation approaches learn textual inputs as prompts to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch (language or vision) is sub-optimal since it does not allow...
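The prompt sensitivity described above is commonly mitigated at inference time by ensembling hand-crafted templates; the paper instead learns prompts, so the following is only a minimal sketch of the ensembling baseline. The template strings and the dummy `embed_text` function are illustrative stand-ins for CLIP's text encoder, not anything from the paper:

```python
import zlib
import numpy as np

# Hand-crafted templates of the kind the abstract refers to;
# the exact strings here are illustrative.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
]

def embed_text(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for CLIP's text encoder: a deterministic dummy unit vector."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def class_embedding(class_name: str) -> np.ndarray:
    """Average the text embeddings over all templates, then re-normalize
    (the standard zero-shot prompt-ensembling recipe)."""
    vecs = [embed_text(t.format(class_name)) for t in TEMPLATES]
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)

# A single awkward template can shift a class's direction noticeably;
# averaging over several templates dampens that sensitivity.
weights = {c: class_embedding(c) for c in ["dog", "cat"]}
```

The resulting per-class unit vectors play the role of classifier weights in zero-shot classification.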

10.1109/cvpr52729.2023.01832 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within these models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both...
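Why quadratic self-attention is especially painful for volumetric inputs can be seen with a back-of-the-envelope count; the patch size and volume shapes below are illustrative, not the paper's settings:

```python
def num_tokens(depth: int, height: int, width: int, patch: int = 16) -> int:
    """Tokens for a 3D volume split into non-overlapping cubic patches."""
    return (depth // patch) * (height // patch) * (width // patch)

def attention_matrix_entries(n_tokens: int) -> int:
    """Entries in a single-head attention map: quadratic in sequence length."""
    return n_tokens * n_tokens

# Doubling the number of slices doubles the token count ...
n1 = num_tokens(64, 128, 128)   # 4 * 8 * 8 = 256 tokens
n2 = num_tokens(128, 128, 128)  # 8 * 8 * 8 = 512 tokens
# ... but quadruples the attention map.
assert attention_matrix_entries(n2) == 4 * attention_matrix_entries(n1)
```

This is why slice count, not just in-plane resolution, drives the bottleneck in volumetric imaging.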

10.1109/tmi.2024.3398728 article EN cc-by IEEE Transactions on Medical Imaging 2024-05-09

Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit the given task distribution and lack in the generalization aspect. This begs...
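One simple, widely used baseline for image-to-video transfer is to encode frames independently with the image model and average-pool over time, with no new temporal modules. A minimal sketch, where the dummy `encode_frame` stands in for an image encoder such as CLIP's and the shapes are illustrative:

```python
import numpy as np

def encode_frame(frame: np.ndarray, dim: int = 512) -> np.ndarray:
    """Stand-in for an image encoder; returns a deterministic unit vector."""
    rng = np.random.default_rng(int(frame.sum()) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def video_embedding(frames: list) -> np.ndarray:
    """Encode each frame independently, then average-pool over time."""
    feats = np.stack([encode_frame(f) for f in frames])  # (T, dim)
    pooled = feats.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# A toy 8-frame clip of 224x224 RGB frames.
clip = [np.full((224, 224, 3), i, dtype=np.float32) for i in range(8)]
v = video_embedding(clip)
```

Temporal pooling discards frame order, which is exactly the gap that the parametric temporal modules mentioned above try to close.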

10.1109/cvpr52729.2023.00633 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work,...
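The quadratic-versus-linear trade-off can be made concrete by counting multiply-accumulate operations (MACs). The `linear_mixer_macs` model below is a generic linear-complexity stand-in, not the specific design proposed in the paper:

```python
def self_attention_macs(n: int, d: int) -> int:
    """MACs for Q*K^T plus attn*V in one head: 2 * n^2 * d, quadratic in n."""
    return 2 * n * n * d

def linear_mixer_macs(n: int, d: int) -> int:
    """A linear-complexity token mixer touches each token O(1) times: O(n*d)."""
    return 2 * n * d

# Going from a 14x14 to a 28x28 token grid quadruples n ...
n_small, n_large, d = 14 * 14, 28 * 28, 64
# ... which multiplies self-attention cost by 16 but the linear mixer by 4.
assert self_attention_macs(n_large, d) == 16 * self_attention_macs(n_small, d)
assert linear_mixer_macs(n_large, d) == 4 * linear_mixer_macs(n_small, d)
```

On mobile hardware the gap is amplified further, since large matrix multiplications are also memory-bandwidth-bound.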

10.1109/iccv51070.2023.01598 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Systematic reviews are a cornerstone for synthesizing the available evidence on a given topic. They simultaneously allow gaps in the literature to be identified and provide direction for future research. However, due to the ever-increasing volume and complexity of the available literature, traditional methods of conducting systematic reviews are less efficient and more time-consuming. Numerous artificial intelligence (AI) tools are being released with the potential to optimize the efficiency of academic writing and assist in various stages of the systematic review process, including...

10.1002/jcv2.12234 article EN cc-by JCPP Advances 2024-04-23

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include the pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In...

10.48550/arxiv.2207.03482 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within these models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms...

10.48550/arxiv.2212.04497 preprint EN cc-by arXiv (Cornell University) 2022-01-01

I introduce a new graph-theoretic property called abundant neighborhoods. This is motivated by studying the thickness of economic markets. A vertex is, roughly, guaranteed to match if and only if it has an abundant neighborhood. This fact holds across numerous variants of two-sided markets that are studied in the economics, operations research, and computer science literature. I formalize the study of these markets under a unifying framework, which I call matching rules, allowing us to study hitherto different types of markets (equivalently, graph...
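As a toy illustration of the "guaranteed to match" notion (using brute force over matchings rather than the paper's abundant-neighborhood characterization), the following checks whether a vertex is covered by every maximum matching of a small two-sided market; the example market is hypothetical:

```python
from itertools import combinations

def is_matching(edge_set) -> bool:
    """A set of edges is a matching iff no vertex appears twice on either side."""
    left = [u for u, _ in edge_set]
    right = [v for _, v in edge_set]
    return len(set(left)) == len(left) and len(set(right)) == len(right)

def maximum_matchings(edges):
    """All maximum matchings of a small bipartite graph, by brute force."""
    for k in range(len(edges), 0, -1):
        found = [set(c) for c in combinations(edges, k) if is_matching(c)]
        if found:
            return found
    return [set()]

def always_matched(vertex, edges) -> bool:
    """True iff the left vertex is covered by every maximum matching."""
    return all(any(u == vertex for u, _ in m) for m in maximum_matchings(edges))

# Toy market: applicants "a", "b", "c" compete for positions 1 and 2.
edges = [("a", 1), ("b", 2), ("c", 2)]
assert always_matched("a", edges)      # position 1 is a's alone
assert not always_matched("b", edges)  # b and c compete for position 2
```

Brute force is exponential; the point of a structural property like abundant neighborhoods is precisely to avoid this enumeration.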

10.1002/nav.22265 article EN cc-by Naval Research Logistics (NRL) 2025-04-30

Health and scientific researchers in non-English speaking countries such as Pakistan are often not proficient in English, which limits their ability to communicate their ideas and findings to the international community. ChatGPT is a large language model that can help non-native English speakers write high-quality papers much faster by assisting them in conveying their ideas in a clear and understandable manner, as well as in avoiding common errors. In fact, ChatGPT has already been used in the publication of research papers, literature reviews, and editorials....

10.29271/jcpsp.2023.10.1198 article EN Journal of College of Physicians And Surgeons Pakistan 2023-10-01

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts for image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT...

10.48550/arxiv.2306.05424 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While current LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often...

10.48550/arxiv.2406.09418 preprint EN arXiv (Cornell University) 2024-06-13

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to only a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with...

10.48550/arxiv.2311.03356 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called \textsc{Palo}. \textsc{Palo} offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of $\sim$5B people (65\% of the world population). Our approach involves semi-automated translation to adapt the multimodal instruction dataset from English to the target languages...

10.48550/arxiv.2402.14818 preprint EN arXiv (Cornell University) 2024-02-22

A contemporary concept states that dental midline deviation towards the direction of the facial flow line (FFL) can mask compromised smile esthetics. This study aimed to identify a range of midline deviations, towards or away from the FFL, that can be perceived as influencing...

10.1111/jerd.13298 article EN Journal of Esthetic and Restorative Dentistry 2024-08-16