Sivan Doveh

ORCID: 0000-0003-2431-0620
Research Areas
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Adversarial Robustness in Machine Learning
  • Advanced Image and Video Retrieval Techniques
  • Anomaly Detection Techniques and Applications
  • Topic Modeling
  • Semantic Web and Ontologies
  • Natural Language Processing Techniques
  • Mathematics, Computing, and Information Processing
  • COVID-19 diagnosis using AI
  • Generative Adversarial Networks and Image Synthesis
  • Software System Performance and Reliability
  • Numerical Methods and Algorithms
  • Competitive and Knowledge Intelligence
  • Educational Environments and Student Outcomes
  • Elevator Systems and Control
  • Robot Manipulation and Learning
  • Education and Technology Integration
  • Multimedia Communication and Technology
  • AI-based Problem Solving and Planning
  • Advanced Software Engineering Methodologies
  • Advanced Database Systems and Queries
  • Video Analysis and Summarization
  • Model Reduction and Neural Networks

IBM Research - Haifa
2019-2024

Weizmann Institute of Science
2021-2023

Tel Aviv University
2019-2021

IBM (United States)
2021

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A...

10.1109/cvpr52729.2023.00261 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some primal methods optimized over a discrete search space, thousands of GPU days were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, which reduces the search time to a few days. While successful, it still includes some noncontinuous steps, e.g., the pruning of many...

10.48550/arxiv.1904.04123 preprint EN other-oa arXiv (Cornell University) 2019-01-01
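
The differentiable-search idea above can be illustrated with a DARTS-style mixed operation: the discrete choice between candidate layers is relaxed into a softmax-weighted sum, so architecture parameters can be trained by gradient descent alongside the network weights. This is a minimal sketch under that general recipe, not the paper's implementation; all names are illustrative.

```python
# Illustrative sketch (not the paper's code): a DARTS-style "mixed operation"
# that makes a discrete choice of ops differentiable via softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of a categorical choice between candidate ops."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),   # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2),   # 5x5 conv
            nn.AvgPool2d(3, stride=1, padding=1),          # average pooling
            nn.Identity(),                                 # skip connection
        ])
        # Architecture parameters: one logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        # Softmax turns the discrete selection into a weighted sum, so
        # alpha can be optimized by gradient descent with the weights.
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

op = MixedOp(channels=16)
y = op(torch.randn(2, 16, 8, 8))  # architecture and weights train jointly
```

After search converges, typically only the op with the largest alpha is kept per edge; that discrete pruning is exactly the kind of noncontinuous step the abstract refers to.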

Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states,...

10.1109/iccv51070.2023.01844 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

10.18653/v1/2024.emnlp-main.12 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter...

10.48550/arxiv.2502.09927 preprint EN arXiv (Cornell University) 2025-02-14

Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG), aiming to solve WSG without a pre-trained detector. The key idea behind our...

10.1109/iccv48922.2021.00182 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

10.1007/s00521-021-06309-8 article EN Neural Computing and Applications 2021-07-20

Few-shot detection and classification have advanced significantly in recent years. Yet, detection approaches require strong annotation (bounding boxes) both for pre-training and for adaptation to novel classes, while classification approaches rarely provide localization of objects in the scene. In this paper, we introduce StarNet - a few-shot model featuring an end-to-end differentiable non-parametric star-model head. Through this head, the backbone is meta-trained using only image-level labels to produce good features for jointly localizing and classifying...

10.1609/aaai.v35i2.16268 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18
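
The core intuition of such a matching-based head, correlating local features of a support and a query image to obtain localization evidence without any box labels, can be sketched roughly as below. This illustrates only the matching step, under assumed shapes; it is not the AAAI'21 StarNet implementation.

```python
# Hedged sketch: localize a class from image-level supervision only, by
# correlating local (grid) features of a support and a query image.
import torch
import torch.nn.functional as F

def match_heatmap(support_feats, query_feats):
    """
    support_feats, query_feats: (C, H, W) backbone feature maps.
    Returns an (H, W) heatmap: for each query cell, the best cosine
    similarity to any support cell -- high values suggest where the
    support object re-appears in the query image.
    """
    C, H, W = query_feats.shape
    s = F.normalize(support_feats.reshape(C, -1), dim=0)  # (C, HW)
    q = F.normalize(query_feats.reshape(C, -1), dim=0)    # (C, HW)
    sim = q.t() @ s                  # (HW_query, HW_support) cosine sims
    heat = sim.max(dim=1).values     # best support match per query cell
    return heat.reshape(H, W)

heat = match_heatmap(torch.randn(64, 12, 12), torch.randn(64, 12, 12))
print(heat.argmax())  # coarse localization without any box labels
```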

Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to their reliance on an LLM-only negative text generation pipeline....

10.48550/arxiv.2406.08164 preprint EN arXiv (Cornell University) 2024-06-12
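
For context, a toy example of the kind of negative text a CR benchmark relies on: a minimal, rule-based perturbation of a caption (an attribute swap) that keeps the same "bag of nouns" but changes the meaning. Real generation pipelines are LLM- or VLM-based and far more sophisticated; the abstract's point is that text-only pipelines may no longer challenge modern VLMs.

```python
# Toy illustration of hard-negative caption generation for CR evaluation:
# swap two attributes so only compositional understanding can tell the
# negative from the original.
import re

def swap_first_two(caption: str, words: list[str]) -> str:
    """Swap the first two words from `words` that occur in `caption`."""
    found = [w for w in words if re.search(rf"\b{w}\b", caption)]
    if len(found) < 2:
        return caption  # not enough matches to build a hard negative
    a, b = found[0], found[1]
    # Temporary placeholder avoids re-replacing the swapped-in word.
    out = re.sub(rf"\b{a}\b", "\x00", caption)
    out = re.sub(rf"\b{b}\b", a, out)
    return out.replace("\x00", b)

ATTRIBUTES = ["red", "blue", "small", "wooden"]
pos = "a red cup on a blue table"
neg = swap_first_two(pos, ATTRIBUTES)
print(neg)  # "a blue cup on a red table" -- same nouns, different meaning
```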

In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text...

10.48550/arxiv.2410.06154 preprint EN arXiv (Cornell University) 2024-10-08
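
The optimization loop the abstract describes can be sketched as below. Note `ask_llm` and `zero_shot_accuracy` are placeholder stand-ins invented for this sketch: in practice the former would query an actual LLM and the latter would evaluate a VLM (e.g., CLIP) with the candidate prompt on a held-out labeled set.

```python
# Minimal, hypothetical sketch of an LLM-as-optimizer loop for VLM prompts.
import random

def ask_llm(meta_prompt: str) -> str:
    # Placeholder: a real system would query an LLM here.
    return random.choice([
        "a photo of a {}.",
        "a cropped photo of the {}.",
        "a blurry picture of a {}.",
    ])

def zero_shot_accuracy(prompt_template: str) -> float:
    # Placeholder fitness function: a real one would score the VLM's
    # zero-shot classification accuracy using this prompt template.
    return random.random()

history: list[tuple[str, float]] = []   # (prompt, fitness), best last
for step in range(5):
    # Feed previously scored prompts back as in-context examples so the
    # LLM learns what kind of prompt text scores well.
    examples = "\n".join(f"{p!r} -> {acc:.3f}" for p, acc in history[-3:])
    meta_prompt = (
        "Task: zero-shot image classification with a VLM.\n"
        f"Scored prompt templates so far:\n{examples}\n"
        "Propose a better prompt template:"
    )
    candidate = ask_llm(meta_prompt)
    history.append((candidate, zero_shot_accuracy(candidate)))
    history.sort(key=lambda x: x[1])    # keep ranked by fitness

print("best prompt:", history[-1])
```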

Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text representation spaces learned by all the popular VL models are still suffering from the so-called 'object bias' - their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great...

10.48550/arxiv.2305.19595 preprint EN other-oa arXiv (Cornell University) 2023-01-01
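
A quick way to probe the 'bags of nouns' behavior is to compare a VL model's image-text score for a caption against a word-shuffled copy: if the score barely changes, the model is ignoring word order and hence relations. The snippet below is a hedged sketch using the public CLIP checkpoint on Hugging Face, with any local test image.

```python
# Probe for object bias: does shuffling a caption's words change the score?
import random
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # supply any test image
caption = "a dog sitting on a red sofa"
words = caption.split()
random.shuffle(words)
shuffled = " ".join(words)  # same "bag of nouns", scrambled structure

inputs = processor(text=[caption, shuffled], images=image,
                   return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image[0]
print(f"original: {scores[0]:.2f}  shuffled: {scores[1]:.2f}")
# A near-identical score for the shuffled caption indicates object bias.
```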

Few-Shot Learning (FSL) is a topic of rapidly growing interest. Typically, in FSL a model is trained on a dataset consisting of many small tasks (meta-tasks) and learns to adapt to novel tasks that it will encounter during test time. This is also referred to as meta-learning. Another topic closely related to meta-learning, with a lot of interest in the community, is Neural Architecture Search (NAS), automatically finding the optimal architecture instead of engineering it manually. In this work, we combine these two aspects of meta-learning. So far, NAS methods have...

10.48550/arxiv.1912.00412 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language, modalities to be 'understandable' by an LLM, primarily via converting their samples into a sequence of embedded language-like tokens directly fed into the LLM (decoder) input stream. However, so far limited attention has been given to transferring (and evaluating) one of the core LLM capabilities to the emerging VLMs, namely the In-Context Learning (ICL) ability, or in other...

10.48550/arxiv.2403.12736 preprint EN arXiv (Cornell University) 2024-03-19
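
The alignment recipe described above (projecting non-language samples into the LLM's token-embedding space and splicing them into the decoder's input stream) can be sketched as follows. Dimensions and modules are illustrative stand-ins; a real VLM would use a trained vision encoder, projector, and LLM.

```python
# Sketch: non-language samples become language-like token embeddings that
# are interleaved with text embeddings in the LLM decoder's input stream.
import torch
import torch.nn as nn

d_vision, d_llm, vocab = 512, 1024, 32000
projector = nn.Linear(d_vision, d_llm)       # maps image feats -> "tokens"
embed = nn.Embedding(vocab, d_llm)           # the LLM's token embeddings

def build_icl_stream(shots, query_feats, query_text_ids):
    """Interleave (image, text) in-context shots, then the query."""
    parts = []
    for img_feats, text_ids in shots:
        parts.append(projector(img_feats))    # image as language-like tokens
        parts.append(embed(text_ids))         # its textual label/answer
    parts.append(projector(query_feats))      # query image, answer pending
    parts.append(embed(query_text_ids))       # e.g., the question tokens
    return torch.cat(parts, dim=0)            # (seq_len, d_llm) for the LLM

shots = [(torch.randn(16, d_vision), torch.randint(0, vocab, (5,)))
         for _ in range(2)]                    # two in-context examples
stream = build_icl_stream(shots, torch.randn(16, d_vision),
                          torch.randint(0, vocab, (5,)))
print(stream.shape)  # torch.Size([63, 1024]), fed to the LLM decoder
```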

Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated by a causal language model, it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented: including the count of digits before each number. For instance, instead...

10.48550/arxiv.2404.00459 preprint EN arXiv (Cornell University) 2024-03-30
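
The proposed adjustment is easy to sketch: prefix each number with its digit count, so a causal model knows every digit's place value before reading the digits. The exact surface format below is illustrative, not necessarily the paper's.

```python
# Hedged sketch of digit-count prefixing for numbers in text.
import re

def add_digit_counts(text: str) -> str:
    """Rewrite every integer N as {len(N):N}."""
    return re.sub(r"\d+", lambda m: f"{{{len(m.group())}:{m.group()}}}", text)

print(add_digit_counts("12 + 345 = 357"))
# -> {2:12} + {3:345} = {3:357}
```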

Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new...

10.48550/arxiv.2406.09240 preprint EN arXiv (Cornell University) 2024-06-13

It has been shown that Large Language Models' (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task using a few examples. However, while datasets with input-output pairs are relatively easy to produce, providing demonstrations which include intermediate steps requires cumbersome manual work. These steps may be executable programs, as in agentic flows, or step-by-step reasoning, as in CoT. In this work, we propose...

10.48550/arxiv.2410.10348 preprint EN arXiv (Cornell University) 2024-10-14

The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of web scraping can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific...

10.48550/arxiv.2410.10783 preprint EN arXiv (Cornell University) 2024-10-14
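
As a flavor of the "live" ingredient, the snippet below pulls the newest cs.CV submissions from the public arXiv Atom API, the kind of continuously refreshed data source such a benchmark can build on; the paper's pipeline then goes much further, automatically building evaluation questions from the papers' content. The query parameters are standard arXiv API usage, not LiveXiv code.

```python
# Illustrative only: fetch fresh paper metadata from the public arXiv API.
import urllib.request
import xml.etree.ElementTree as ET

URL = ("http://export.arxiv.org/api/query?"
       "search_query=cat:cs.CV&sortBy=submittedDate&sortOrder=descending"
       "&max_results=5")

with urllib.request.urlopen(URL) as resp:
    feed = resp.read()

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in ET.fromstring(feed).findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text.strip()
    print(title)  # newest cs.CV submissions, fetched at evaluation time
```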

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images...

10.48550/arxiv.2411.13317 preprint EN arXiv (Cornell University) 2024-11-20
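
One hypothetical way to pose few-shot personalized localization to a VLM is to serialize the annotated support images as in-context examples, pairing each with the target's bounding box in normalized coordinates, before asking about the query image. The prompt format, tags, and coordinates below are invented for illustration.

```python
# Hypothetical prompt construction for few-shot personalized localization.
def format_box(box):
    x1, y1, x2, y2 = box
    return f"[{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"

def build_prompt(name, support_boxes):
    lines = [f"Locate '{name}'. Boxes are (x1, y1, x2, y2) in [0, 1]."]
    for i, box in enumerate(support_boxes, 1):
        lines.append(f"Example {i}: <image> -> {format_box(box)}")
    lines.append("Query: <image> -> ?")
    return "\n".join(lines)

# Two annotated support images for the personal object "my mug".
prompt = build_prompt("my mug", [(0.10, 0.20, 0.35, 0.55),
                                 (0.55, 0.30, 0.80, 0.70)])
print(prompt)
```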