Yin Cui

ORCID: 0000-0003-0070-5118
Research Areas
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Human Pose and Action Recognition
  • COVID-19 diagnosis using AI
  • Generative Adversarial Networks and Image Synthesis
  • Video Surveillance and Tracking Methods
  • Face recognition and analysis
  • Advanced Vision and Imaging
  • Video Analysis and Summarization
  • Recommender Systems and Techniques
  • Education and Work Dynamics
  • Medical Imaging and Analysis
  • Image Retrieval and Classification Techniques
  • Higher Education and Teaching Methods
  • Music and Audio Processing
  • Anomaly Detection Techniques and Applications
  • 3D Surveying and Cultural Heritage
  • Brain Tumor Detection and Classification
  • Smart Agriculture and AI
  • Optical measurement and interference techniques
  • Identification and Quantification in Food
  • Species Distribution and Climate Change
  • Privacy-Preserving Technologies in Data

Shanghai Electric (China)
2025

Google (United States)
2019-2023

Beijing Sport University
2021-2022

Brain (Germany)
2019-2021

Columbia University
2014-2021

North China University of Science and Technology
2021

Cornell University
2015-2019

Jilin University
2009-2019

Yangzhou University
2013

Yanshan University
2006-2012

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure...

10.1109/cvpr.2019.00949 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
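The effective-number idea in the abstract above admits a compact sketch: each class is weighted in inverse proportion to its effective number of samples E_n = (1 − β^n)/(1 − β). A minimal NumPy version (the function name and the normalization choice are mine, not from the paper):

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to 1 / E_n, where
    E_n = (1 - beta^n) / (1 - beta) is the effective number of samples."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights / weights.sum() * len(n)
```

Rarer classes receive larger weights; β → 0 recovers uniform weighting, while β → 1 approaches inverse-frequency weighting.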

Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real world conditions, we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, all...

10.1109/cvpr.2018.00914 article EN 2018-06-01

Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13], [12]) for instance segmentation, where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid...

10.1109/cvpr46437.2021.00294 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
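As a toy illustration of the random pasting mechanism, one can composite the masked pixels of a source object directly onto a target image (a simplification of mine; the paper additionally applies large-scale jittering and pastes full instance masks with their annotations):

```python
import numpy as np

def copy_paste(dst_img, src_img, src_mask):
    """Paste the pixels of src_img selected by the boolean src_mask onto dst_img.
    dst_img, src_img: (H, W, C) arrays of the same shape; src_mask: (H, W) bool."""
    out = dst_img.copy()
    out[src_mask] = src_img[src_mask]   # object pixels overwrite the target
    return out
```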

Metric learning algorithms produce distance metrics that capture the important relationships among data. In this work, we study the connection between metric learning and collaborative filtering. We propose Collaborative Metric Learning (CML), which learns a joint metric space to encode not only users' preferences but also the user-user and item-item similarity. The proposed algorithm outperforms state-of-the-art collaborative filtering algorithms on a wide range of recommendation tasks and uncovers the underlying spectrum of users' fine-grained preferences. CML achieves...

10.1145/3038912.3052639 article EN 2017-04-03
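The pull-push objective of a metric-learning recommender can be sketched as a margin loss in the joint user-item space. This simplified form omits the rank-based weighting and regularizers the CML paper uses; the names are mine:

```python
import numpy as np

def metric_hinge_loss(user, pos_item, neg_item, margin=1.0):
    """Hinge loss: an interacted (positive) item should sit closer to the user
    than an un-interacted (negative) item by at least `margin`,
    measured in squared Euclidean distance."""
    d_pos = float(np.sum((user - pos_item) ** 2))
    d_neg = float(np.sum((user - neg_item) ** 2))
    return max(0.0, margin + d_pos - d_neg)
```

Minimizing this loss pulls liked items toward the user and pushes other items beyond the margin, so nearest-neighbor retrieval in the learned space yields recommendations.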

Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and thus is difficult to scale. In this work, we first tackle a problem in large scale FGVC. Our method won first place in the iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme...

10.1109/cvpr.2018.00432 article EN 2018-06-01

Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled...

10.48550/arxiv.2006.06882 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Convolutional Neural Networks (CNNs) with Bilinear Pooling, initially in their full form and later using compact representations, have yielded impressive performance gains on a wide range of visual tasks, including fine-grained visual categorization, visual question answering, face recognition, and description of texture and style. The key to their success lies in the spatially invariant modeling of pairwise (2nd order) feature...

10.1109/cvpr.2017.325 article EN 2017-07-01
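Full bilinear pooling, the 2nd-order baseline that kernel pooling generalizes, can be sketched in a few lines. The signed-sqrt and L2 normalization steps are a common convention in bilinear pooling pipelines, not necessarily this paper's exact recipe:

```python
import numpy as np

def bilinear_pool(features):
    """Spatially invariant 2nd-order pooling.
    features: (N, C) local descriptors from N spatial locations.
    Returns a length C*C vector of averaged pairwise feature products."""
    n, c = features.shape
    second_order = features.T @ features / n       # (C, C) averaged outer products
    vec = second_order.reshape(-1)
    vec = np.sign(vec) * np.sqrt(np.abs(vec))      # signed square root
    return vec / (np.linalg.norm(vec) + 1e-12)     # L2 normalization
```

Because the outer products are averaged over all locations, the result is invariant to where in the image each descriptor came from, which is the "spatially invariant" property the abstract refers to.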

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally...

10.1109/cvpr46437.2021.00689 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
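The clip-level pull-together/push-apart objective described above is an InfoNCE-style contrastive loss; a minimal batch version (the temperature value and naming are mine):

```python
import numpy as np

def clip_contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE over a batch: z1[i] and z2[i] embed two augmented clips of
    video i (a positive pair); all other rows serve as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                     # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # diagonal = positive pairs
```

The loss is small when each clip's nearest neighbor in the batch is the other augmentation of the same video, and large when embeddings from different videos collide.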

The recent availability of geo-tagged images and rich geospatial data has inspired a number of algorithms for image based geolocalization. Most approaches predict the location of a query image by matching it to ground-level images with known locations (e.g., street-view data). However, most of the Earth does not have ground-level reference photos available. Fortunately, more complete coverage is provided by oblique aerial or "bird's eye" imagery. In this work, we localize a ground-level query image by matching it to a database of aerial imagery. We use publicly available data to build a dataset of 78K aligned...

10.1109/cvpr.2015.7299135 article EN 2015-06-01

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic,...

10.48550/arxiv.2104.11178 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Existing fine-grained visual categorization methods often suffer from three challenges: lack of training data, large number of fine-grained categories, and high intra-class vs. low inter-class variance. In this work we propose a generic iterative framework for fine-grained categorization and dataset bootstrapping that handles these three challenges. Using deep metric learning with humans in the loop, we learn a low dimensional feature embedding with anchor points on manifolds for each category. These anchor points capture intra-class variances and remain discriminative between...

10.1109/cvpr.2016.130 preprint EN 2016-06-01

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher...

10.48550/arxiv.2104.13921 preprint EN other-oa arXiv (Cornell University) 2021-01-01
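The student's open-vocabulary classification step can be sketched as cosine similarity between a region embedding and text embeddings of arbitrary class names (a simplified stand-in of mine; the full ViLD pipeline also distills the teacher's image embeddings into the region features):

```python
import numpy as np

def region_class_scores(region_embed, text_embeds, temperature=0.01):
    """Score a detected region against class-name text embeddings by cosine
    similarity; argmax gives the predicted (open-vocabulary) class.
    region_embed: (D,); text_embeds: (num_classes, D)."""
    r = region_embed / np.linalg.norm(region_embed)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return t @ r / temperature
```

Because the class set is defined only by the text embeddings passed in, new categories can be added at inference time without retraining the detector.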

Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is well suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). Encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue that the encoder-decoder architecture is ineffective in generating strong multi-scale...

10.1109/cvpr42600.2020.01161 preprint EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Implicit-feedback Recommenders (ImplicitRec) leverage positive-only user-item interactions, such as clicks, to learn personalized user preferences. They are often evaluated and compared offline using datasets collected from online platforms. These platforms are subject to popularity bias (i.e., popular items are more likely to be presented and interacted with), and therefore the logged ground truth data are Missing-Not-At-Random (MNAR). As a result, the widely used Average-Over-All (AOA) evaluator is biased toward accurately...

10.1145/3240323.3240355 article EN 2018-09-27

Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these challenges, we propose a novel learning based discriminative...

10.1109/cvpr.2018.00608 preprint EN 2018-06-01

This paper studies the problem of modeling object-based visual concepts such as "crazy car" and "shy dog" with a goal to extract emotion related information from social multimedia content. We focus on detecting adjective-noun pairs because of their strong co-occurrence relation with image tags about emotions. This problem is very challenging due to the highly subjective nature of adjectives like "crazy" and "shy" and the ambiguity of the associated annotations. However, associating adjectives with concrete physical nouns makes the combined concepts more detectable...

10.1145/2647868.2654935 article EN 2014-10-31

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure...

10.48550/arxiv.1901.05555 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Large-scale image databases such as ImageNet have significantly advanced image classification and other visual recognition tasks. However, much of these datasets are constructed only for single-label and coarse object-level classification. For real-world applications, multiple labels and fine-grained categories are often needed, yet very few such datasets exist publicly, especially those of large scale and high quality. In this work, we contribute to the community a new dataset called iMaterialist Fashion Attribute...

10.1109/iccvw.2019.00377 preprint EN 2019-10-01

Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy. However, it is not always feasible to obtain high-quality supervisory signals from users, especially for vision tasks. Unlike typical federated settings with labeled client data, we consider a more practical scenario where the client data is unlabeled, and a centralized labeled dataset is available on the server. We further take server-client and inter-client domain shifts into account and pose a domain adaptation problem with one...

10.1109/wacv51458.2022.00115 article EN 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022-01-01

Analysis and detection of complex events in videos require a semantic representation of the video content. Existing methods typically require users to pre-define an exhaustive concept lexicon and manually annotate the presence of the concepts in each video, which is infeasible for real-world video event detection problems. In this paper, we propose an automatic concept discovery scheme by exploiting Internet images and their associated tags. Given a target event and its textual descriptions, we crawl a collection of images and their associated tags by performing text based image search using the noun...

10.1145/2578726.2578729 article EN 2014-04-01

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally...

10.48550/arxiv.2008.03800 preprint EN other-oa arXiv (Cornell University) 2020-01-01