- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Meta-analysis and systematic reviews
- Orthodontics and Dentofacial Orthopedics
- Temporomandibular Joint Disorders
- COVID-19 diagnosis using AI
- Topic Modeling
- Advanced Image and Video Retrieval Techniques
- Artificial Intelligence in Healthcare and Education
- Human Pose and Action Recognition
- Game Theory and Voting Systems
- Natural Language Processing Techniques
- Facial Rejuvenation and Surgery Techniques
- Healthcare Policy and Management
- Names, Identity, and Discrimination Research
- Cancer-related molecular mechanisms research
- Brain Tumor Detection and Classification
- COVID-19 epidemiological studies
- Dental Implant Techniques and Outcomes
- Speech and dialogue systems
- scientometrics and bibliometrics research
- Ethics in Clinical Research
- Medical Image Segmentation Techniques
- COVID-19 Pandemic Impacts
Mohamed bin Zayed University of Artificial Intelligence
2022-2025
National University of Computer and Emerging Sciences
2019-2024
University of Toronto
2020-2024
Aga Khan University Hospital
2022-2024
Creative Commons
2023
Hazara University
2023
Bahria University
2023
Capital University of Science and Technology
2022
MILA University
2022
Canada Research Chairs
2022
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent adaptation approaches learn prompts as textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch (language or vision) is sub-optimal since it does not allow...
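The single-branch prompt tuning this abstract critiques can be sketched in a few lines — a hypothetical, minimal illustration (not the paper's actual implementation), where learnable context vectors replace a hand-crafted template and are prepended to frozen class-name embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim = 512      # CLIP-like text embedding width (illustrative)
n_ctx = 4            # number of learnable context ("prompt") vectors
class_tokens = rng.normal(size=(3, embed_dim))  # frozen embeddings for 3 class names

# Learnable context vectors stand in for templates like "a photo of a {}":
# they are optimized on the downstream task while the encoder stays frozen.
ctx = rng.normal(scale=0.02, size=(n_ctx, embed_dim))

def build_prompt(class_token: np.ndarray) -> np.ndarray:
    """Prepend the shared learnable context to one frozen class embedding."""
    return np.concatenate([ctx, class_token[None, :]], axis=0)

prompts = np.stack([build_prompt(t) for t in class_tokens])
print(prompts.shape)  # (3, n_ctx + 1, embed_dim) -> (3, 5, 512)
```

Because only `ctx` is shared across all classes and tuned, this adapts the language branch alone — exactly the single-branch setup the abstract calls sub-optimal.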
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within these models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric imaging, where inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both...
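Why volumetric inputs make quadratic self-attention a bottleneck can be seen from simple token-count arithmetic (illustrative patch and volume sizes, not taken from the paper):

```python
# Self-attention cost grows with the square of the token count N,
# and a 3D volume produces far more tokens than a single 2D slice.
def num_tokens_3d(depth: int, height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping cubic patches tiling a volume."""
    return (depth // patch) * (height // patch) * (width // patch)

tokens_2d = (256 // 16) * (256 // 16)     # one 256x256 slice -> 256 tokens
tokens_3d = num_tokens_3d(128, 256, 256)  # a 128-slice volume -> 2048 tokens

# The N x N attention matrix grows quadratically:
print(tokens_2d, tokens_2d ** 2)  # 256  -> 65,536 entries
print(tokens_3d, tokens_3d ** 2)  # 2048 -> 4,194,304 entries
```

An 8x increase in tokens (from adding the depth axis) inflates the attention matrix 64x, which is the bottleneck efficient designs such as UNETR++ aim to avoid.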
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit to the given task distribution and lack in the generalization aspect. This begs...
Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations remain a bottleneck. In this work,...
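The quadratic scaling with resolution mentioned here is easy to make concrete — a rough sketch with assumed patch size and resolutions (not figures from the paper): doubling the image side length quadruples the token count, so the attention matrix grows 16x.

```python
# Illustrative scaling of the self-attention matrix with input resolution.
def attention_entries(h: int, w: int, patch: int = 16) -> tuple[int, int]:
    """Return (token count N, entries in the N x N attention matrix)."""
    n = (h // patch) * (w // patch)
    return n, n * n

for res in (224, 448, 896):
    n, entries = attention_entries(res, res)
    print(f"{res}x{res}: {n} tokens, {entries:,} attention entries")
# 224 -> 196 tokens; 448 -> 784 tokens (16x the entries of 224); 896 -> 3136 tokens
```

This 16x-per-doubling growth is what makes full self-attention prohibitive on mobile hardware and motivates the efficient additive attention alternatives such hybrid designs pursue.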
Systematic reviews are a cornerstone for synthesizing the available evidence on a given topic. They simultaneously allow gaps in the literature to be identified and provide direction for future research. However, due to the ever-increasing volume and complexity of the literature, traditional methods of conducting systematic reviews have become less efficient and more time-consuming. Numerous artificial intelligence (AI) tools are being released with the potential to optimize efficiency in academic writing and assist in various stages of the review process, including...
Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include pretrained CLIP models and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In...
I introduce a new graph-theoretic property called abundant neighborhoods. This is motivated by studying the thickness of economic markets. A vertex is, roughly, guaranteed to match if and only if it has an abundant neighborhood. This fact holds across numerous variants of two-sided markets that are studied in the economics, operations research, and computer science literature. I introduce a formalism to study these variants under a unifying framework, which I call matching rules, allowing us to study hitherto different types of markets (equivalently, graph...
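The "guaranteed to match" notion can be made concrete with a small bipartite-matching check — a hypothetical toy illustration, not the paper's formal machinery: a vertex is matched in every maximum matching exactly when deleting it shrinks the maximum matching size.

```python
def max_matching(adj: dict, left_nodes) -> int:
    """Size of a maximum matching in a bipartite graph via augmenting paths.
    `adj` maps each left vertex to the set of its right-side neighbors."""
    match_right = {}  # right vertex -> currently matched left vertex

    def augment(u, seen):
        for v in adj.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if free, or re-route its current partner elsewhere.
            if v not in match_right or augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in left_nodes)

def guaranteed_to_match(adj: dict, left_nodes, u) -> bool:
    """u is covered by *every* maximum matching iff removing u shrinks it."""
    full = max_matching(adj, left_nodes)
    without = max_matching(adj, [x for x in left_nodes if x != u])
    return without < full

# Toy two-sided market: workers a, b, c and jobs 1, 2.
adj = {"a": {1, 2}, "b": {1}, "c": {1}}
print(guaranteed_to_match(adj, ["a", "b", "c"], "a"))  # True: every max matching uses a
print(guaranteed_to_match(adj, ["a", "b", "c"], "b"))  # False: c can take b's place
```

Here `a` has two neighbors to itself while `b` and `c` compete for one job — an informal flavor of how a "rich enough" neighborhood guarantees matching.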
Health and scientific researchers in non-English-speaking countries such as Pakistan are often not proficient in English, which limits their ability to communicate their ideas and findings to the international community. ChatGPT is a large language model that can help non-native English speakers write high-quality papers much faster by assisting them in conveying their ideas in a clear and understandable manner, as well as in avoiding common errors. In fact, it has already been used in the publication of research papers, literature reviews, and editorials....
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The resulting model is capable of understanding and generating human-like conversations about videos. We introduce a dataset of 100,000 video-instruction pairs used to train Video-ChatGPT...
Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often...
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to referring to only a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with...
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called \textsc{Palo}. \textsc{Palo} offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that in total span $\sim$5B people (65\% of the world population). Our approach involves a semi-automated translation approach to adapt the multimodal instruction dataset from English to the target languages...
A contemporary concept states that dental midline deviation towards the direction of the facial flow line (FFL) can mask compromised smile esthetics. This study aimed to identify a range of deviations to be perceived towards or away from the FFL influencing...