Yin Cui

ORCID: 0000-0003-0070-5118
Research Areas
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Human Pose and Action Recognition
  • COVID-19 diagnosis using AI
  • Generative Adversarial Networks and Image Synthesis
  • Video Surveillance and Tracking Methods
  • Face recognition and analysis
  • Advanced Vision and Imaging
  • Video Analysis and Summarization
  • Recommender Systems and Techniques
  • Education and Work Dynamics
  • Medical Imaging and Analysis
  • Image Retrieval and Classification Techniques
  • Higher Education and Teaching Methods
  • Music and Audio Processing
  • Anomaly Detection Techniques and Applications
  • 3D Surveying and Cultural Heritage
  • Brain Tumor Detection and Classification
  • Smart Agriculture and AI
  • Optical measurement and interference techniques
  • Identification and Quantification in Food
  • Species Distribution and Climate Change
  • Privacy-Preserving Technologies in Data

Shanghai Electric (China)
2025

Google (United States)
2019-2023

Beijing Sport University
2021-2022

Brain (Germany)
2019-2021

Columbia University
2014-2021

North China University of Science and Technology
2021

Cornell University
2015-2019

Jilin University
2009-2019

Yangzhou University
2013

Yanshan University
2006-2012

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure...

10.1109/cvpr.2019.00949 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
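The effective-number idea in the abstract above admits a compact sketch: each class is weighted in inverse proportion to its effective number of samples E_n = (1 − β^n)/(1 − β). A minimal NumPy version (the function name and the normalization choice are mine, not from the paper):

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to 1 / E_n, where
    E_n = (1 - beta^n) / (1 - beta) is the effective number of samples."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights / weights.sum() * len(n)
```

Rarer classes receive larger weights; β → 0 recovers uniform weighting, while β → 1 approaches inverse-frequency weighting.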

Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real world conditions, we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, all...

10.1109/cvpr.2018.00914 article EN 2018-06-01

Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13], [12]) for instance segmentation, where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid...

10.1109/cvpr46437.2021.00294 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
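As a toy illustration of the random pasting mechanism, one can composite the masked pixels of a source object directly onto a target image (a simplification of mine; the paper additionally applies large-scale jittering and pastes full instance masks with their annotations):

```python
import numpy as np

def copy_paste(dst_img, src_img, src_mask):
    """Paste the pixels of src_img selected by the boolean src_mask onto dst_img.
    dst_img, src_img: (H, W, C) arrays of the same shape; src_mask: (H, W) bool."""
    out = dst_img.copy()
    out[src_mask] = src_img[src_mask]   # object pixels overwrite the target
    return out
```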

Metric learning algorithms produce distance metrics that capture the important relationships among data. In this work, we study the connection between metric learning and collaborative filtering. We propose Collaborative Metric Learning (CML), which learns a joint metric space to encode not only users' preferences but also the user-user and item-item similarity. The proposed algorithm outperforms state-of-the-art collaborative filtering algorithms on a wide range of recommendation tasks and uncovers the underlying spectrum of users' fine-grained preferences. CML achieves...

10.1145/3038912.3052639 article EN 2017-04-03
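The pull-push objective of a metric-learning recommender can be sketched as a margin loss in the joint user-item space. This simplified form omits the rank-based weighting and regularizers the CML paper uses; the names are mine:

```python
import numpy as np

def metric_hinge_loss(user, pos_item, neg_item, margin=1.0):
    """Hinge loss: an interacted (positive) item should sit closer to the user
    than an un-interacted (negative) item by at least `margin`,
    measured in squared Euclidean distance."""
    d_pos = float(np.sum((user - pos_item) ** 2))
    d_neg = float(np.sum((user - neg_item) ** 2))
    return max(0.0, margin + d_pos - d_neg)
```

Minimizing this loss pulls liked items toward the user and pushes other items beyond the margin, so nearest-neighbor retrieval in the learned space yields recommendations.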

Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and thus is difficult to scale. In this work, we first tackle a problem in large scale FGVC. Our method won first place in the iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme...

10.1109/cvpr.2018.00432 article EN 2018-06-01

Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled...

10.48550/arxiv.2006.06882 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Convolutional Neural Networks (CNNs) with Bilinear Pooling, initially in their full form and later using compact representations, have yielded impressive performance gains on a wide range of visual tasks, including fine-grained visual categorization, visual question answering, face recognition, and description of texture and style. The key to their success lies in the spatially invariant modeling of pairwise (2nd order) feature...

10.1109/cvpr.2017.325 article EN 2017-07-01
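Full bilinear pooling, the 2nd-order baseline that kernel pooling generalizes, can be sketched in a few lines. The signed-sqrt and L2 normalization steps are a common convention in bilinear pooling pipelines, not necessarily this paper's exact recipe:

```python
import numpy as np

def bilinear_pool(features):
    """Spatially invariant 2nd-order pooling.
    features: (N, C) local descriptors from N spatial locations.
    Returns a length C*C vector of averaged pairwise feature products."""
    n, c = features.shape
    second_order = features.T @ features / n       # (C, C) averaged outer products
    vec = second_order.reshape(-1)
    vec = np.sign(vec) * np.sqrt(np.abs(vec))      # signed square root
    return vec / (np.linalg.norm(vec) + 1e-12)     # L2 normalization
```

Because the outer products are averaged over all locations, the result is invariant to where in the image each descriptor came from, which is the "spatially invariant" property the abstract refers to.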

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally...

10.1109/cvpr46437.2021.00689 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
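The clip-level pull-together/push-apart objective described above is an InfoNCE-style contrastive loss; a minimal batch version (the temperature value and naming are mine):

```python
import numpy as np

def clip_contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE over a batch: z1[i] and z2[i] embed two augmented clips of
    video i (a positive pair); all other rows serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # diagonal = positive pairs
```

The loss is small when each clip's nearest neighbor in the batch is the other augmentation of the same video, and large when embeddings from different videos collide.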

The recent availability of geo-tagged images and rich geospatial data has inspired a number of algorithms for image based geolocalization. Most approaches predict the location of a query image by matching it to ground-level images with known locations (e.g., street-view data). However, most of the Earth does not have ground-level reference photos available. Fortunately, more complete coverage is provided by oblique aerial or "bird's eye" imagery. In this work, we localize a ground-level query image by matching it to a database of aerial imagery. We use publicly available data to build a dataset of 78K aligned...

10.1109/cvpr.2015.7299135 article EN 2015-06-01

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic,...

10.48550/arxiv.2104.11178 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Existing fine-grained visual categorization methods often suffer from three challenges: lack of training data, large number of fine-grained categories, and high intra-class vs. low inter-class variance. In this work we propose a generic iterative framework for fine-grained categorization and dataset bootstrapping that handles these three challenges. Using deep metric learning with humans in the loop, we learn a low dimensional feature embedding with anchor points on manifolds for each category. These anchor points capture intra-class variances and remain discriminative between...

10.1109/cvpr.2016.130 preprint EN 2016-06-01

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher...

10.48550/arxiv.2104.13921 preprint EN other-oa arXiv (Cornell University) 2021-01-01
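The student's open-vocabulary classification step can be sketched as cosine similarity between a region embedding and text embeddings of arbitrary class names (a simplified stand-in of mine; the full ViLD pipeline also distills the teacher's image embeddings into the region features):

```python
import numpy as np

def region_class_scores(region_embed, text_embeds, temperature=0.01):
    """Score a detected region against class-name text embeddings by cosine
    similarity; argmax gives the predicted (open-vocabulary) class.
    region_embed: (D,); text_embeds: (num_classes, D)."""
    r = region_embed / np.linalg.norm(region_embed)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return t @ r / temperature
```

Because the class set is defined only by the text embeddings passed in, new categories can be added at inference time without retraining the detector.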

Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is well suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). Encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue that the encoder-decoder architecture is ineffective in generating strong multi-scale...

10.1109/cvpr42600.2020.01161 preprint EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Implicit-feedback Recommenders (ImplicitRec) leverage positive-only user-item interactions, such as clicks, to learn personalized user preferences. They are often evaluated and compared offline using datasets collected from online platforms. These platforms are subject to popularity bias (i.e., popular items are more likely to be presented and interacted with), and therefore the logged ground truth data are Missing-Not-At-Random (MNAR). As a result, the widely used Average-Over-All (AOA) evaluator is biased toward accurately...

10.1145/3240323.3240355 article EN 2018-09-27

Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these challenges, we propose a novel learning based discriminative...

10.1109/cvpr.2018.00608 preprint EN 2018-06-01

This paper studies the problem of modeling object-based visual concepts such as "crazy car" and "shy dog" with a goal to extract emotion related information from social multimedia content. We focus on detecting adjective-noun pairs because of their strong co-occurrence relation with image tags about emotions. This problem is very challenging due to the highly subjective nature of adjectives like "crazy" and "shy" and the ambiguity of the associated annotations. However, associating adjectives with concrete physical nouns makes the combined concepts more detectable...

10.1145/2647868.2654935 article EN 2014-10-31

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure...

10.48550/arxiv.1901.05555 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Large-scale image databases such as ImageNet have significantly advanced image classification and other visual recognition tasks. However, much of these datasets are constructed only for single-label and coarse object-level classification. For real-world applications, multiple labels and fine-grained categories are often needed, yet very few such datasets exist publicly, especially those of large scale and high quality. In this work, we contribute to the community a new dataset called iMaterialist Fashion Attribute...

10.1109/iccvw.2019.00377 preprint EN 2019-10-01

Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy. However, it is not always feasible to obtain high-quality supervisory signals from users, especially for vision tasks. Unlike typical federated settings with labeled client data, we consider a more practical scenario where the client data is unlabeled, and a centralized labeled dataset is available on the server. We further take server-client and inter-client domain shifts into account and pose a domain adaptation problem with one...

10.1109/wacv51458.2022.00115 article EN 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022-01-01

Analysis and detection of complex events in videos require a semantic representation of the video content. Existing methods typically require users to pre-define an exhaustive concept lexicon and manually annotate the presence of the concepts in each video, which is infeasible for real-world video event detection problems. In this paper, we propose an automatic concept discovery scheme by exploiting Internet images and their associated tags. Given a target event and its textual descriptions, we crawl a collection of images and their associated tags by performing text based image search using the noun...

10.1145/2578726.2578729 article EN 2014-04-01

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally...

10.48550/arxiv.2008.03800 preprint EN other-oa arXiv (Cornell University) 2020-01-01