- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- COVID-19 diagnosis using AI
- Generative Adversarial Networks and Image Synthesis
- Video Surveillance and Tracking Methods
- Face recognition and analysis
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Recommender Systems and Techniques
- Education and Work Dynamics
- Medical Imaging and Analysis
- Image Retrieval and Classification Techniques
- Higher Education and Teaching Methods
- Music and Audio Processing
- Anomaly Detection Techniques and Applications
- 3D Surveying and Cultural Heritage
- Brain Tumor Detection and Classification
- Smart Agriculture and AI
- Optical measurement and interference techniques
- Identification and Quantification in Food
- Species Distribution and Climate Change
- Privacy-Preserving Technologies in Data
- Shanghai Electric (China), 2025
- Google (United States), 2019-2023
- Beijing Sport University, 2021-2022
- Brain (Germany), 2019-2021
- Columbia University, 2014-2021
- North China University of Science and Technology, 2021
- Cornell University, 2015-2019
- Jilin University, 2009-2019
- Yangzhou University, 2013
- Yanshan University, 2006-2012
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure...
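The effective-number re-weighting idea above can be sketched in a few lines. The beta value and the final normalization are illustrative assumptions for this sketch, not the paper's exact recipe:

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights inversely proportional to the 'effective number'
    of samples, E_n = (1 - beta**n) / (1 - beta)."""
    samples_per_class = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, samples_per_class)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes (assumption).
    return weights * len(weights) / weights.sum()
```

Rare classes receive larger weights, but the growth saturates: going from 10 to 100 samples changes the weight far more than going from 10,000 to 100,000, which captures the diminishing-benefit argument.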
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real-world conditions, we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, all...
Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13], [12]) for instance segmentation, where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid...
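A minimal sketch of the random Copy-Paste mechanism on array images and boolean instance masks; the simple occlusion handling (subtracting the pasted region from the destination instance's mask) is a simplifying assumption:

```python
import numpy as np

def copy_paste(src_img, src_mask, dst_img, dst_mask):
    """Paste the masked object from src onto dst.

    src_img/dst_img: (H, W, 3) arrays; src_mask/dst_mask: (H, W) bool masks.
    Returns the composited image and the destination instance's mask with
    the newly occluded pixels removed.
    """
    out_img = np.where(src_mask[..., None], src_img, dst_img)
    # Pixels now covered by the pasted object no longer belong to the
    # destination instance (assumed occlusion rule).
    dst_mask_new = dst_mask & ~src_mask
    return out_img, dst_mask_new
```

In practice the pasted objects would also be randomly scaled and translated before compositing; that bookkeeping is omitted here.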
Metric learning algorithms produce distance metrics that capture the important relationships among data. In this work, we study the connection between metric learning and collaborative filtering. We propose Collaborative Metric Learning (CML), which learns a joint metric space to encode not only users' preferences but also the user-user and item-item similarity. The proposed algorithm outperforms state-of-the-art collaborative filtering algorithms on a wide range of recommendation tasks and uncovers the underlying spectrum of users' fine-grained preferences. CML also achieves...
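The joint-metric-space idea can be illustrated with a pairwise hinge loss on Euclidean distances: a liked item is pulled within a margin of the user, and an unobserved item is pushed away. The margin value and the squared-distance form are assumptions for this sketch:

```python
import numpy as np

def cml_hinge_loss(user, pos_item, neg_item, margin=1.0):
    """Pairwise hinge loss over embeddings in a shared metric space.

    Zero loss once the positive item is at least `margin` closer
    (in squared Euclidean distance) than the negative item.
    """
    d_pos = np.sum((user - pos_item) ** 2)
    d_neg = np.sum((user - neg_item) ** 2)
    return max(0.0, margin + d_pos - d_neg)
```

Because the learned space is a true metric, items liked by the same users end up near each other, which is what lets the model expose item-item and user-user similarity for free.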
Transferring the knowledge learned from large-scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and thus is difficult to scale. In this work, we first tackle a problem in large-scale FGVC. Our method won first place in the iNaturalist 2017 large-scale species classification challenge. Central to the success of our approach is a training scheme...
Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled...
Convolutional Neural Networks (CNNs) with Bilinear Pooling, initially in their full form and later using compact representations, have yielded impressive performance gains on a wide range of visual tasks, including fine-grained visual categorization, visual question answering, face recognition, and description of texture and style. The key to their success lies in the spatially invariant modeling of pairwise (2nd-order) feature...
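Full bilinear pooling over local CNN descriptors can be sketched as a sum-pooled outer product. The signed square-root and L2 normalization shown are a commonly used post-processing choice, assumed here rather than taken from the abstract:

```python
import numpy as np

def bilinear_pool(features):
    """Full bilinear pooling of local descriptors.

    features: (H*W, C) array of C-dim descriptors at H*W locations.
    Returns a (C*C,) vector of second-order statistics, invariant to
    the spatial ordering of the locations.
    """
    pooled = features.T @ features          # (C, C) sum of outer products
    pooled = pooled.reshape(-1)
    # Assumed post-processing: signed sqrt then L2 normalization.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    return pooled / (np.linalg.norm(pooled) + 1e-12)
```

The C*C output dimensionality is what motivates the compact approximations mentioned in the abstract.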
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally...
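The clip-level contrastive objective can be sketched as an InfoNCE loss over a batch, where matched clip pairs sit on the diagonal of the similarity matrix and all other rows serve as negatives. The temperature value is an illustrative assumption:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss: z1[i] and z2[i] are embeddings of two augmented
    clips of video i (shape (N, D)); other videos act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # matched pairs on diagonal
```

Minimizing this pulls the two clips of each video together while pushing apart clips drawn from different videos, exactly the pull/push behavior described above.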
The recent availability of geo-tagged images and rich geospatial data has inspired a number of algorithms for image-based geolocalization. Most approaches predict the location of a query image by matching it to ground-level images with known locations (e.g., street-view data). However, most of the Earth does not have reference photos available. Fortunately, more complete coverage is provided by oblique aerial or "bird's eye" imagery. In this work, we localize a ground-level query image by matching it to a database of aerial imagery. We use publicly available data to build a dataset of 78K aligned...
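Once ground-level queries and aerial references are embedded in a shared space, localization reduces to nearest-neighbor retrieval. This sketch assumes precomputed embeddings and cosine similarity; the function and argument names are illustrative:

```python
import numpy as np

def geolocalize(query_emb, aerial_embs, aerial_coords):
    """Return the coordinates of the aerial image whose embedding is
    most similar (cosine) to the ground-level query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    db = aerial_embs / np.linalg.norm(aerial_embs, axis=1, keepdims=True)
    best = int(np.argmax(db @ q))   # nearest neighbor in embedding space
    return aerial_coords[best]
```

The hard part, of course, is learning embeddings in which a street-level view and its overhead counterpart actually land near each other; the retrieval step itself is this simple.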
We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic,...
Existing fine-grained visual categorization methods often suffer from three challenges: lack of training data, a large number of fine-grained categories, and high intra-class vs. low inter-class variance. In this work we propose a generic iterative framework for fine-grained dataset bootstrapping that handles these three challenges. Using deep metric learning with humans in the loop, we learn a low-dimensional feature embedding with anchor points on manifolds for each category. These anchor points capture intra-class variances and remain discriminative between...
We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher...
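The open-vocabulary classification step can be sketched as cosine similarity between a region embedding and text embeddings of arbitrary class names, followed by a softmax. The temperature value and function names are illustrative assumptions, not ViLD's exact head:

```python
import numpy as np

def classify_region(region_emb, text_embs, tau=0.01):
    """Score one detected region against class-name text embeddings.

    region_emb: (D,) embedding from the detector's region head.
    text_embs: (K, D) embeddings of K arbitrary class-name prompts.
    Returns a (K,) probability vector over the K class names.
    """
    r = region_emb / np.linalg.norm(region_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = t @ r / tau                    # cosine similarity / temperature
    probs = np.exp(logits - logits.max())   # stable softmax
    return probs / probs.sum()
```

Because the class set is just a list of text embeddings, new categories can be added at inference time without retraining the detector.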
Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is well suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). Encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue that the encoder-decoder architecture is ineffective in generating strong multi-scale...
Implicit-feedback Recommenders (ImplicitRec) leverage positive-only user-item interactions, such as clicks, to learn personalized user preferences. They are often evaluated and compared offline using datasets collected from online platforms. These platforms are subject to popularity bias (i.e., popular items are more likely to be presented and interacted with), and therefore the logged ground truth data are Missing-Not-At-Random (MNAR). As a result, the widely used Average-Over-All (AOA) evaluator is biased toward accurately...
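A standard correction for MNAR feedback is inverse-propensity scoring: each observed relevant item is up-weighted by the inverse of its probability of being observed. This sketch applies the idea to recall@k and assumes item propensities are already estimated:

```python
def ips_recall_at_k(ranked_items, relevant, propensity, k=10):
    """Inverse-propensity-scored recall@k.

    ranked_items: recommended item ids, best first.
    relevant: item ids the user actually interacted with.
    propensity: dict mapping item id -> observation probability.
    Rare (low-propensity) items count for more, offsetting the
    popularity bias of the logged data.
    """
    topk = set(ranked_items[:k])
    num = sum(1.0 / propensity[i] for i in relevant if i in topk)
    den = sum(1.0 / propensity[i] for i in relevant)
    return num / den if den > 0 else 0.0
```

Compared to the plain AOA evaluator, a recommender is no longer rewarded for ranking only popular items highly, since missing a rare relevant item now costs much more.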
Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these challenges, we propose a novel learning-based discriminative...
This paper studies the problem of modeling object-based visual concepts such as "crazy car" and "shy dog" with a goal to extract emotion-related information from social multimedia content. We focus on detecting adjective-noun pairs because of their strong co-occurrence relation with image tags about emotions. This problem is very challenging due to the highly subjective nature of adjectives like "crazy" and "shy" and the ambiguity associated with the annotations. However, associating adjectives with concrete physical nouns makes the combined visual concepts more detectable...
Large-scale image databases such as ImageNet have significantly advanced image classification and other visual recognition tasks. However, much of these datasets are constructed only for single-label and coarse object-level classification. For real-world applications, multiple labels and fine-grained categories are often needed, yet very few such datasets exist publicly, especially those of large scale and high quality. In this work, we contribute to the community a new dataset called iMaterialist Fashion Attribute...
Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy. However, it is not always feasible to obtain high-quality supervisory signals from users, especially for vision tasks. Unlike typical federated settings with labeled client data, we consider a more practical scenario where the client data is unlabeled, and a centralized labeled dataset is available on the server. We further take the server-client and inter-client domain shifts into account and pose a domain adaptation problem with one...
Analysis and detection of complex events in videos require a semantic representation of the video content. Existing methods typically require users to pre-define an exhaustive concept lexicon and manually annotate the presence of the concepts in each video, which is infeasible for real-world video event detection problems. In this paper, we propose an automatic semantic concept discovery scheme by exploiting Internet images and their associated tags. Given a target event and its textual descriptions, we crawl a collection of images and their tags by performing text-based image search using the noun...