Nina Shvetsova

ORCID: 0009-0004-9848-3238
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Economic Issues in Ukraine
  • Advanced Image and Video Retrieval Techniques
  • Soil and Environmental Studies
  • Natural Language Processing Techniques
  • Bryophyte Studies and Records
  • Music and Audio Processing
  • Geological Studies and Exploration
  • Economic and Business Development Strategies
  • Economic Development and Digital Transformation
  • Environmental Science and Water Management
  • Peatlands and Wetlands Ecology
  • Subtitles and Audiovisual Media
  • Economic Growth and Productivity
  • Thermoregulation and physiological responses
  • Agricultural Productivity and Crop Improvement
  • Business and Economic Development
  • Water Resources and Management
  • Text and Document Classification Technologies
  • COVID-19 diagnosis using AI
  • Botany and Plant Ecology Studies
  • Infrared Thermography in Medicine

Goethe University Frankfurt
2023-2024

Max Planck Institute for Informatics
2023

University of Bonn
2023

Helmholtz Moscow Research Institute of Eye Diseases
2023

Northern (Arctic) Federal University
2020

Ural Branch of the Russian Academy of Sciences
2020

Vasyl' Stus Donetsk National University
2014

Siberian Branch of the Russian Academy of Sciences
2009

Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an...

10.1109/iccv51070.2023.00267 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Vision-language foundation models have shown impressive capabilities across various zero-shot tasks, including training-free localization and grounding, primarily focusing on localizing objects in images. However, leveraging those capabilities to localize actions and events in videos is challenging, as actions have less physical outline and are usually described by higher-level concepts. In this work, we propose VideoGEM, the first spatial action grounding method based on pretrained image- and video-language backbones. Namely, we adapt...

10.48550/arxiv.2503.20348 preprint EN arXiv (Cornell University) 2025-03-26

Disorders of ocular perfusion are associated with a huge number of diseases, including such socially significant ones as diabetic retinopathy and glaucoma. To date, there is no gold standard for measuring perfusion. An innovative method for two-dimensional assessment of eye blood flow, laser speckle flowgraphy (LSFG), has been developed in recent years and implemented in ophthalmological practice. Purpose: to evaluate the capabilities of LSFG and to find out the age dependence of the obtained blood flow indicators. Materials...

10.21516/2072-0076-2023-16-2-54-62 article EN cc-by Russian Ophthalmological Journal 2023-06-30

Contrastive learning has become an important tool for learning representations from unlabeled data, mainly relying on the idea of minimizing the distance between positive pairs, e.g., views of the same image, and maximizing the distance between negatives from different images. This paper proposes a new variation of the contrastive objective, Group Ordering Constraints (GroCo), that leverages the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many pairs have a larger distance than they should and thus are not ordered correctly. To this end, GroCo is differentiable...

10.1109/iccv51070.2023.01508 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
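The ordering idea behind an objective like GroCo can be illustrated with a toy sketch. This is not the paper's implementation (which builds on differentiable sorting); it is a minimal stand-in that softly counts how many (positive, negative) distance pairs violate the constraint that every positive pair should be closer than every negative pair. All names and values are illustrative.

```python
import numpy as np

def soft_ordering_loss(pos_dists, neg_dists, tau=0.1):
    """Soft count of ordering violations between positive and negative
    pair distances. Each (positive, negative) combination where the
    positive distance exceeds the negative one contributes ~1 via a
    sigmoid relaxation; a well-ordered batch yields a loss near 0."""
    pos = np.asarray(pos_dists, dtype=float)[:, None]   # shape (P, 1)
    neg = np.asarray(neg_dists, dtype=float)[None, :]   # shape (1, N)
    # sigmoid((pos - neg) / tau): ~1 when mis-ordered (pos > neg), ~0 otherwise
    violations = 1.0 / (1.0 + np.exp(-(pos - neg) / tau))
    return float(violations.mean())

# Well-ordered: all positives closer than all negatives -> loss near 0
low = soft_ordering_loss([0.1, 0.2], [0.9, 1.0])
# Mis-ordered: positives farther than negatives -> loss near 1
high = soft_ordering_loss([0.9, 1.0], [0.1, 0.2])
```

Because the sigmoid is smooth, the violation count stays differentiable and could in principle drive gradient-based training, which is the property the actual GroCo objective obtains with sorting networks.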

Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition (ASR) systems from the audio signal in videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage...

10.48550/arxiv.2310.04900 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Large-scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an...

10.48550/arxiv.2303.08914 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We use a cross-entropy based objective which forces the distribution over the student's similarity...

10.1109/icassp49357.2023.10094821 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
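The distillation objective described above can be sketched in a few lines. This is a simplified illustration, not the paper's code: a teacher's text-video similarity scores (for an English query) are turned into a distribution, and the student (receiving the same query in another language) is trained to match it via cross-entropy. All names and the similarity values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = np.asarray(x, dtype=float)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_sims, teacher_sims, temp=1.0):
    """Cross-entropy between teacher and student distributions over
    text-video similarity scores; the teacher is a fixed target."""
    t = softmax(np.asarray(teacher_sims) / temp)
    s = softmax(np.asarray(student_sims) / temp)
    return float(-(t * np.log(s + 1e-12)).sum())

# Hypothetical scores: one query text against 3 candidate videos.
teacher = [5.0, 1.0, 0.5]                          # teacher prefers video 0
aligned = distillation_loss([5.0, 1.0, 0.5], teacher)
mismatch = distillation_loss([0.5, 1.0, 5.0], teacher)  # student prefers video 2
```

Matching distributions rather than hard labels lets the student inherit the teacher's relative ranking over all candidate videos, not just its top choice.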

Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, to transfer them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without hand-annotated pairs, we propose a new setting, retrieval with uncurated & unpaired data, that uses only text queries together with uncurated web videos during training, without any paired data. To this end, we propose an approach, In-Style,...

10.1109/iccv51070.2023.02009 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses the task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information,...

10.48550/arxiv.2303.16990 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint representation space without relying on human annotations. These embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data, as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve...

10.1109/iccv51070.2023.02010 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
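The notion of preserving semantic structure across embedding spaces can be sketched as follows. This is a simplified stand-in for the paper's objective, not its actual loss: it penalizes, for one modality, the mismatch between the pairwise cosine-similarity structure of the modality-specific embeddings and that of the corresponding joint-space embeddings. Function names and data are illustrative.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def structure_consistency_loss(modality_emb, joint_emb):
    """Mean squared difference between the similarity structure of
    modality-specific embeddings and their joint-space counterparts:
    0 when the joint space preserves all pairwise relations."""
    S_mod = cosine_sim_matrix(np.asarray(modality_emb, dtype=float))
    S_joint = cosine_sim_matrix(np.asarray(joint_emb, dtype=float))
    return float(np.mean((S_mod - S_joint) ** 2))

A = np.array([[1., 0., 0.], [0., 1., 0.], [1., 1., 0.], [0., 0., 1.]])
B = np.array([[0., 0., 1.], [1., 0., 0.], [0., 1., 0.], [1., 1., 1.]])
same = structure_consistency_loss(A, 3 * A)  # cosine is scale-invariant
diff = structure_consistency_loss(A, B)      # structure changed
```

Minimizing such a term alongside the usual cross-modal alignment loss would encourage the joint space to keep the neighborhood relations of each modality, which is the intuition behind the consistency approach described above.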

Vision-language models trained on large, randomly collected data have had significant impact in many areas since they appeared. But while they show great performance on various tasks, such as image-text retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start from an analysis of the training corpus, assessing to what extent (and which of) the test classes are really zero-shot and how this correlates with individual class performance. We follow up...

10.48550/arxiv.2209.06103 preprint EN other-oa arXiv (Cornell University) 2022-01-01

The task of multimodal learning has seen growing interest recently, as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in the context of capturing the relation between low-level input features and higher-level concepts. However, capsules have so far mainly been used only on small-scale, fully...

10.48550/arxiv.2112.00775 preprint EN other-oa arXiv (Cornell University) 2021-01-01

The paper shows the evolution of scientific views in determining the intension of the 'investment' concept and its role in the development of the economy. The main regularities of the historical process related to the formation of knowledge about investment are examined. The author ascertains the interrelation between theoretical conceptions and their impact on the transformation of economic processes. Four stages in identifying the essence and meaning of the category are distinguished: Stage 1: from the first commercial relations to the great geographical discoveries. At that stage the concept was not yet...

10.21847/1728-9343.2014.6(132).36501 article EN Схід 2014-01-01

Contrastive learning has become an important tool for learning representations from unlabeled data, mainly relying on the idea of minimizing the distance between positive pairs, e.g., views of the same image, and maximizing the distance between negatives from different images. This paper proposes a new variation of the contrastive objective, Group Ordering Constraints (GroCo), that leverages the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many pairs have a larger distance than they should and thus are not ordered correctly. To this end, GroCo is differentiable...

10.48550/arxiv.2301.02009 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint representation space without relying on human annotations. These embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data, as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve...

10.48550/arxiv.2308.13077 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without hand-annotated pairs, we propose a new setting, retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos, without any paired data. To this end, we propose an approach,...

10.48550/arxiv.2309.08928 preprint EN other-oa arXiv (Cornell University) 2023-01-01