- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Adversarial Robustness in Machine Learning
- Advanced Image and Video Retrieval Techniques
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Semantic Web and Ontologies
- Natural Language Processing Techniques
- Mathematics, Computing, and Information Processing
- COVID-19 diagnosis using AI
- Generative Adversarial Networks and Image Synthesis
- Software System Performance and Reliability
- Numerical Methods and Algorithms
- Competitive and Knowledge Intelligence
- Educational Environments and Student Outcomes
- Elevator Systems and Control
- Robot Manipulation and Learning
- Education and Technology Integration
- Multimedia Communication and Technology
- AI-based Problem Solving and Planning
- Advanced Software Engineering Methodologies
- Advanced Database Systems and Queries
- Video Analysis and Summarization
- Model Reduction and Neural Networks
IBM Research - Haifa
2019-2024
Weizmann Institute of Science
2021-2023
Tel Aviv University
2019-2021
IBM (United States)
2021
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A...
Automatic methods for Neural Architecture Search (NAS) have been shown to produce state-of-the-art network models. Yet, their main drawback is the computational complexity of the search process. As some of the early methods optimized over a discrete search space, thousands of GPU days were required for convergence. A recent approach is based on constructing a differentiable search space that enables gradient-based optimization, which reduces the search time to a few days. While successful, it still includes noncontinuous steps, e.g., pruning many...
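The continuous relaxation mentioned above can be illustrated with a minimal sketch in the spirit of differentiable NAS (e.g., DARTS-style mixed operations). This is an assumed, simplified illustration, not the paper's implementation; the candidate operation set and the final argmax discretization are placeholders.

```python
# Minimal sketch (assumed) of a differentiable NAS edge: each candidate
# operation is mixed with softmax-weighted architecture parameters, so the
# discrete operation choice becomes amenable to gradient-based optimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Toy candidate set; real search spaces use many more operations.
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # One architecture weight per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Continuous relaxation: weighted sum over all candidate operations.
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After the search, an edge is typically discretized by keeping only the
# operation with the largest alpha -- one of the "noncontinuous" steps
# (e.g., pruning) that the abstract refers to.
```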
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling the replacement of a fixed set of supported classes with zero-shot open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models, for example, their difficulty in understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states,...
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter...
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or ground) arbitrary text phrases in an image without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on Detector-Free WSG (DF-WSG), solving WSG without a pre-trained detector. The key idea behind our...
Few-shot detection and classification have advanced significantly in recent years. Yet, existing approaches require strong annotation (bounding boxes) both for pre-training and for adaptation to novel classes, and rarely provide localization of the objects in the scene. In this paper, we introduce StarNet - a few-shot model featuring an end-to-end differentiable non-parametric star-model head. Through this head, the backbone is meta-trained using only image-level labels to produce good features for jointly localizing and classifying...
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to their reliance on an LLM-only negative text generation pipeline....
In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with knowledge of the type of text...
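The described optimization loop can be sketched as follows. This is a hedged, schematic reading of the abstract, not the GLOV implementation: `query_llm`, `evaluate_prompt_accuracy`, the meta-prompt wording, and the hyperparameters are hypothetical placeholders.

```python
# Schematic sketch of an LLM-as-optimizer loop for VLM prompts (assumed, not
# the paper's code): the LLM proposes prompts, a fitness function scores them
# on a downstream task, and the best-scoring prompts are fed back in-context.
from typing import Callable, List, Tuple

def optimize_vlm_prompts(
    task_description: str,
    query_llm: Callable[[str], List[str]],             # hypothetical: returns candidate prompts
    evaluate_prompt_accuracy: Callable[[str], float],  # hypothetical: fitness on a held-out set
    steps: int = 5,
    keep_top: int = 4,
) -> List[Tuple[str, float]]:
    history: List[Tuple[str, float]] = []
    for _ in range(steps):
        # Show the best prompts so far, with their scores, as in-context
        # examples so the LLM learns what kind of text helps the VLM.
        context = "\n".join(f"{acc:.3f}: {p}" for p, acc in history[:keep_top])
        meta_prompt = (
            f"Task: {task_description}\n"
            f"Previously scored prompts:\n{context}\n"
            "Propose better prompts for a zero-shot CLIP classifier."
        )
        for candidate in query_llm(meta_prompt):
            history.append((candidate, evaluate_prompt_accuracy(candidate)))
        history.sort(key=lambda pair: pair[1], reverse=True)
    return history[:keep_top]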
Vision and Language (VL) models offer an effective method for aligning the representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called `object bias' - their representations behave like `bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great...
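The `bag of nouns' behavior can be probed with a minimal check: score one image against two captions that share the same nouns but swap an attribute, and see whether the model prefers the correct one. The sketch below is an assumed illustration using the Hugging Face `transformers` CLIP API; the image path and captions are hypothetical.

```python
# Minimal probe (assumed, not from the paper): if a VL model scores the two
# attribute-swapped captions nearly the same, it is behaving like a
# "bag of nouns" and ignoring attribute binding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image of a red car and a blue bus
captions = [
    "a red car parked next to a blue bus",   # correct attribute binding
    "a blue car parked next to a red bus",   # same nouns, swapped attributes
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # near-uniform probabilities indicate attribute-insensitive behavior
```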
Few-Shot Learning (FSL) is a topic of rapidly growing interest. Typically, in FSL a model is trained on a dataset consisting of many small tasks (meta-tasks) and learns to adapt to novel tasks that it will encounter during test time. This is also referred to as meta-learning. Another closely related topic of meta-learning that has drawn a lot of interest in the community is Neural Architecture Search (NAS), automatically finding the optimal architecture instead of engineering it manually. In this work, we combine these two aspects of meta-learning. So far, methods have...
Inspired by the emergence of Large Language Models (LLMs) that can truly understand human language, significant progress has been made in aligning other, non-language, modalities to be `understandable' by an LLM, primarily via converting their samples into a sequence of embedded language-like tokens directly fed into the LLM (decoder) input stream. However, so far limited attention has been given to transferring (and evaluating) one of the core capabilities of LLMs in the emerging VLMs, namely the In-Context Learning (ICL) ability, or other...
Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated by a causal language model, it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented, including the count of digits before each number. For instance, instead...
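One plausible rendering of this idea is sketched below. Since the abstract is truncated before giving the exact format, the `<digit count>:<number>` encoding and the helper name are assumptions for illustration only.

```python
# Minimal sketch (assumed format) of prepending a digit count to each number,
# so a causal LM knows the place value of the first digit it reads.
import re

def annotate_digit_counts(text: str) -> str:
    """Rewrite every integer as '<len>:<number>', e.g. '42' -> '2:42'."""
    return re.sub(r"\d+", lambda m: f"{len(m.group(0))}:{m.group(0)}", text)

print(annotate_digit_counts("The answer is 42, not 1337."))
# -> "The answer is 2:42, not 4:1337."
```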
Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for generating detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these concepts in the best current mimics of human intelligence - Large Multimodal Models (LMMs). We develop and contribute a new...
It has been shown that Large Language Models' (LLMs) performance can be improved for many tasks using Chain of Thought (CoT) or In-Context Learning (ICL), which involve demonstrating the steps needed to solve a task with a few examples. However, while datasets with input-output pairs are relatively easy to produce, providing demonstrations that include intermediate steps requires cumbersome manual work. These steps may be executable programs, as in agentic flows, or step-by-step reasoning, as in CoT. In this work, we propose...
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping can be the potential contamination of the benchmarks on which these abilities are often evaluated. To safeguard against test contamination and to truly evaluate these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific...
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA), when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images...