- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Adversarial Robustness in Machine Learning
- Advanced Image and Video Retrieval Techniques
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Semantic Web and Ontologies
- Natural Language Processing Techniques
- Mathematics, Computing, and Information Processing
- COVID-19 diagnosis using AI
- Generative Adversarial Networks and Image Synthesis
- Software System Performance and Reliability
- Numerical Methods and Algorithms
- Competitive and Knowledge Intelligence
- Educational Environments and Student Outcomes
- Elevator Systems and Control
- Robot Manipulation and Learning
- Education and Technology Integration
- Multimedia Communication and Technology
- AI-based Problem Solving and Planning
- Advanced Software Engineering Methodologies
- Advanced Database Systems and Queries
- Video Analysis and Summarization
- Model Reduction and Neural Networks
IBM Research - Haifa
2019-2024
Weizmann Institute of Science
2021-2023
Tel Aviv University
2019-2021
IBM (United States)
2021
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A...
Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some of the early methods optimized over a discrete search space, thousands of GPU days were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, which reduces the search time to a few days. While successful, it still includes noncontinuous steps, e.g., pruning many...
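The continuous relaxation mentioned above can be illustrated with a minimal sketch in the spirit of differentiable NAS (e.g., DARTS-style mixed operations). This is an assumed, simplified illustration, not the paper's implementation; the candidate operation set and the final argmax discretization are placeholders.

```python
# Minimal sketch (assumed) of a differentiable NAS edge: each candidate
# operation is mixed with softmax-weighted architecture parameters, so the
# discrete operation choice becomes amenable to gradient-based optimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Toy candidate set; real search spaces use many more operations.
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # One architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Continuous relaxation: weighted sum over all candidate operations.
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After the search, an edge is typically discretized by keeping only the
# operation with the largest alpha -- one of the "noncontinuous" steps
# (e.g., pruning) that the abstract refers to.
```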
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling the replacement of a fixed set of supported classes with zero-shot open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models, for example, their difficulty in understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states,...
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter...
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or ground) arbitrary text phrases in an image without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on Detector-Free WSG (DF-WSG), solving WSG without a pre-trained detector. The key idea behind our...
Few-shot detection and classification have advanced significantly in recent years. Yet, existing approaches require strong annotation (bounding boxes) both for pre-training and for adaptation to novel classes, and rarely provide localization of the objects in the scene. In this paper, we introduce StarNet - a few-shot model featuring an end-to-end differentiable non-parametric star-model head. Through this head, the backbone is meta-trained using only image-level labels to produce good features for jointly localizing and classifying...
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to their reliance on an LLM-only negative text generation pipeline....
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with knowledge of the type of text...
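The described optimization loop can be sketched as follows. This is a hedged, schematic reading of the abstract, not the GLOV implementation: `query_llm`, `evaluate_prompt_accuracy`, the meta-prompt wording, and the hyperparameters are hypothetical placeholders.

```python
# Schematic sketch of an LLM-as-optimizer loop for VLM prompts (assumed, not
# the paper's code): the LLM proposes prompts, a fitness function scores them
# on a downstream task, and the best-scoring prompts are fed back in-context.
from typing import Callable, List, Tuple

def optimize_vlm_prompts(
    task_description: str,
    query_llm: Callable[[str], List[str]],             # hypothetical: returns candidate prompts
    evaluate_prompt_accuracy: Callable[[str], float],  # hypothetical: fitness on a held-out set
    steps: int = 5,
    keep_top: int = 4,
) -> List[Tuple[str, float]]:
    history: List[Tuple[str, float]] = []
    for _ in range(steps):
        # Show the best prompts so far, with their scores, as in-context
        # examples so the LLM learns what kind of text helps the VLM.
        context = "\n".join(f"{acc:.3f}: {p}" for p, acc in history[:keep_top])
        meta_prompt = (
            f"Task: {task_description}\n"
            f"Previously scored prompts:\n{context}\n"
            "Propose better prompts for a zero-shot CLIP classifier."
        )
        for candidate in query_llm(meta_prompt):
            history.append((candidate, evaluate_prompt_accuracy(candidate)))
        history.sort(key=lambda pair: pair[1], reverse=True)
    return history[:keep_top]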
Vision and Language (VL) models offer an effective method for aligning the representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called `object bias' - their representations behave like `bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great...
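The `bag of nouns' behavior can be probed with a minimal check: score one image against two captions that share the same nouns but swap an attribute, and see whether the model prefers the correct one. The sketch below is an assumed illustration using the Hugging Face `transformers` CLIP API; the image path and captions are hypothetical.

```python
# Minimal probe (assumed, not from the paper): if a VL model scores the two
# attribute-swapped captions nearly the same, it is behaving like a
# "bag of nouns" and ignoring attribute binding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image of a red car and a blue bus
captions = [
    "a red car parked next to a blue bus",   # correct attribute binding
    "a blue car parked next to a red bus",   # same nouns, swapped attributes
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # near-uniform probabilities indicate attribute-insensitive behavior
```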
Few-Shot Learning (FSL) is a topic of rapidly growing interest. Typically, in FSL a model is trained on a dataset consisting of many small tasks (meta-tasks) and learns to adapt to novel tasks that it will encounter during test time. This is also referred to as meta-learning. Another closely related topic of meta-learning that has drawn a lot of interest in the community is Neural Architecture Search (NAS), automatically finding the optimal architecture instead of engineering it manually. In this work, we combine these two aspects of meta-learning. So far, methods have...
Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language, modalities to be `understandable' by an LLM, primarily via converting their samples into a sequence of embedded language-like tokens directly fed into the LLM (decoder) input stream. However, so far limited attention has been given to transferring (and evaluating) one of the core capabilities of LLMs in the emerging VLMs, namely the In-Context Learning (ICL) ability, or other...
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated by a causal language model, it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented, including the count of digits before each number. For instance, instead...
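One plausible rendering of this idea is sketched below. Since the abstract is truncated before giving the exact format, the `<digit count>:<number>` encoding and the helper name are assumptions for illustration only.

```python
# Minimal sketch (assumed format) of prepending a digit count to each number,
# so a causal LM knows the place value of the first digit it reads.
import re

def annotate_digit_counts(text: str) -> str:
    """Rewrite every integer as '<len>:<number>', e.g. '42' -> '2:42'."""
    return re.sub(r"\d+", lambda m: f"{len(m.group(0))}:{m.group(0)}", text)

print(annotate_digit_counts("The answer is 42, not 1337."))
# -> "The answer is 2:42, not 4:1337."
```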
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for generating detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these concepts in the best current mimics of human intelligence - Large Multimodal Models (LMMs). We develop and contribute a new...
It has been shown that Large Language Models' (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task with a few examples. However, while datasets with input-output pairs are relatively easy to produce, providing demonstrations that include intermediate steps requires cumbersome manual work. These steps may be executable programs, as in agentic flows, or step-by-step reasoning, as in CoT. In this work, we propose...
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping can be the potential contamination of the benchmarks on which these abilities are often evaluated. To safeguard against test contamination and to truly evaluate these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific...
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images...