- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Image Retrieval and Classification Techniques
- Handwritten Text Recognition Techniques
- Face and Expression Recognition
- Face Recognition and Analysis
- Human Pose and Action Recognition
- Generative Adversarial Networks and Image Synthesis
- Medical Image Segmentation Techniques
- Natural Language Processing Techniques
- Remote-Sensing Image Classification
- Brain Tumor Detection and Classification
- Machine Learning and Data Classification
- Image Processing and 3D Reconstruction
- Video Surveillance and Tracking Methods
- Vehicle License Plate Recognition
- Anomaly Detection Techniques and Applications
- Video Analysis and Summarization
- Neural Networks and Applications
- Visual Attention and Saliency Detection
- Advanced Vision and Imaging
- 3D Shape Modeling and Analysis
- Organic Electronics and Photovoltaics
Tsinghua University
2025
Alibaba Group (China)
2023-2024
First Affiliated Hospital of Fujian Medical University
2024
Fujian Medical University
2024
Shenzhen University
2023
Shenzhen Academy of Robotics
2023
Wilmington University
2020-2022
National Cheng Kung University
2021-2022
Alibaba Group (United States)
2021-2022
Universitas Kristen Indonesia Maluku
2021
A family of loss functions built on pair-based computation has been proposed in the literature, providing a myriad of solutions for deep metric learning. In this paper, we provide a general weighting framework for understanding recent pair-based loss functions. Our contributions are three-fold: (1) we establish a General Pair Weighting (GPW) framework, which casts the sampling problem of deep metric learning into a unified view of pair weighting through gradient analysis, providing a powerful tool for understanding recent loss functions; (2) we show that, with GPW, various existing...
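The pair-weighting view described above can be illustrated with a minimal sketch: every pair's similarity contributes to the loss through a weighting function. The `w_pos`/`w_neg` functions and the loss form below are illustrative placeholders, not the paper's exact formulation.

```python
import numpy as np

def gpw_loss(embeddings, labels, w_pos, w_neg):
    """Generic pair-weighted loss: each pair's cosine similarity contributes
    with a weight given by the (method-specific) functions w_pos / w_neg."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T                              # cosine similarity matrix
    same = labels[:, None] == labels[None, :]  # positive-pair mask
    n = len(labels)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if same[i, j]:
                loss += w_pos(sim[i, j]) * (1.0 - sim[i, j])  # pull positives
            else:
                loss += w_neg(sim[i, j]) * sim[i, j]          # push negatives
    return loss / (n * (n - 1))

# Example: constant weights recover a simple contrastive-style objective.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
lab = np.array([0, 0, 1])
l = gpw_loss(emb, lab, w_pos=lambda s: 1.0, w_neg=lambda s: 1.0)
```

Different choices of `w_pos`/`w_neg` would recover different pair-based losses under this unified view.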
One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks. In this work, we propose Task-aligned One-stage Object Detection (TOOD), which explicitly aligns the two tasks in a learning-based manner. First, we design a novel Task-aligned Head (T-Head), which offers a better balance between learning task-interactive and task-specific features, as well as a greater flexibility to learn...
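As a toy illustration of measuring how well the two sub-tasks agree, one can combine classification confidence and localization quality (IoU) into a single alignment score. The function name and exponent defaults below are assumptions for illustration, not necessarily the paper's published configuration.

```python
def task_alignment(cls_score, iou, alpha=1.0, beta=6.0):
    """Illustrative task-alignment measure: high only when a prediction is
    both confidently classified (cls_score) and well localized (iou)."""
    return (cls_score ** alpha) * (iou ** beta)

# A confident but poorly localized box scores low, exposing misalignment.
aligned = task_alignment(1.0, 1.0)    # perfect on both sub-tasks
misaligned = task_alignment(0.9, 0.3) # confident class, weak localization
```

Such a score can then be used to rank or reweight predictions so that training favors anchors on which the two branches agree.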
Siamese-based trackers have achieved excellent performance on visual object tracking. However, the target template is not updated online, and the features of the target template and the search image are computed independently in a Siamese architecture. In this paper, we propose Deformable Siamese Attention Networks, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention and cross-attention. The self-attention learns strong context information via spatial attention, and selectively emphasizes...
Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate the true text features in the representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that...
We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances in deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a recurrent model, built on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches that recognise each character independently. Our model has a number of appealing...
We present a novel single-shot text detector that directly outputs word-level bounding boxes in a natural image. We propose an attention mechanism which roughly identifies text regions via an automatically learned attentional map. This substantially suppresses background interference in the convolutional features, which is the key to producing accurate inference of words, particularly at extremely small sizes. This results in a single model that essentially works in a coarse-to-fine manner. It departs from recent FCN-based text detectors...
In this paper, we present a new approach for text localization in natural images, by discriminating text and non-text regions at three levels: pixel, component, and text line levels. Firstly, a powerful low-level filter called the Stroke Feature Transform (SFT) is proposed, which extends the widely-used Stroke Width Transform (SWT) by incorporating color cues of text pixels, leading to significantly enhanced performance on inter-component separation and intra-component connection. Secondly, based on the output of SFT, we apply two classifiers,...
Text detection and recognition in natural images have long been considered as two separate tasks that are processed sequentially. Training them jointly is non-trivial due to significant differences in learning difficulties and convergence rates. In this work, we present a conceptually simple yet efficient framework that simultaneously processes the two tasks in a unified framework. Our main contributions are three-fold: (1) we propose a novel text-alignment layer that allows it to precisely compute convolutional features of a text instance...
Mining informative negative instances is of central importance to deep metric learning (DML). However, the hard-mining ability of existing DML methods is intrinsically limited by mini-batch training, where only a mini-batch of instances is accessible at each iteration. In this paper, we identify a “slow drift” phenomenon by observing that the embedding features drift exceptionally slowly even as the model parameters are updating throughout the training process. This suggests that the features computed at preceding iterations can considerably approximate their...
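A minimal sketch of the cross-batch idea this "slow drift" observation motivates: keep a FIFO memory of embeddings from past iterations and mine pairs against it, since slowly drifting features remain approximately valid. The class and method names below are illustrative, not the paper's code.

```python
from collections import deque
import numpy as np

class CrossBatchMemory:
    """Illustrative FIFO memory of embeddings from past iterations.
    Because features drift slowly, stored embeddings can serve as
    approximate positives/negatives for pair mining beyond the
    current mini-batch (a sketch of the idea, not an exact design)."""
    def __init__(self, capacity):
        self.feats = deque(maxlen=capacity)   # oldest entries auto-evicted
        self.labels = deque(maxlen=capacity)

    def enqueue(self, batch_feats, batch_labels):
        for f, y in zip(batch_feats, batch_labels):
            self.feats.append(f)
            self.labels.append(y)

    def get(self):
        return np.stack(list(self.feats)), np.array(self.labels)

mem = CrossBatchMemory(capacity=4)
mem.enqueue(np.ones((3, 2)), [0, 1, 2])
mem.enqueue(np.zeros((3, 2)), [3, 4, 5])  # pushes out the two oldest entries
feats, labels = mem.get()
```

At each training step, the current batch would be compared against the whole memory, greatly enlarging the pool of candidate pairs at negligible feature-storage cost.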
We present ClothFlow, an appearance-flow-based generative model that synthesizes clothed persons for pose-guided person image generation and virtual try-on. By estimating a dense flow between source and target clothing regions, ClothFlow effectively models the geometric changes and naturally transfers the appearance to novel images, as shown in Figure 1. We achieve this with a three-stage framework: 1) Conditioned on a target pose, we first estimate a person semantic layout to provide richer guidance to the generation process. 2) Built on two feature pyramid...
Recent progress has been made on developing a unified framework for joint text detection and recognition in natural images, but existing joint models were mostly built in a two-stage manner involving ROI pooling, which can degrade the performance of the recognition task. In this work, we propose convolutional character networks, referred to as CharNet, a one-stage model that processes the two tasks simultaneously in one pass. CharNet directly outputs bounding boxes of words and characters, with corresponding character labels. We utilize the character as a basic...
Fine-grained image categorization is challenging due to the subtle inter-class differences. We posit that exploiting the rich relationships between channels can help capture such differences, since different channels correspond to different semantics. In this paper, we propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images. For a single image, a self-channel interaction (SCI) module is proposed to explore the channel-wise correlation within the image. This allows the model to learn complementary features from...
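The within-image channel interplay can be sketched as computing a channel-by-channel affinity matrix and using it to mix channels. This is an illustrative SCI-style computation under assumed details (softmax-normalized affinities), not the paper's exact module.

```python
import numpy as np

def self_channel_interaction(feat):
    """Sketch of a self-channel interaction on a conv feature map
    (C x H x W): compute C x C channel affinities and let each channel
    aggregate information from related channels."""
    C = feat.shape[0]
    x = feat.reshape(C, -1)                      # C x (H*W), flatten space
    corr = x @ x.T                               # C x C channel affinity
    w = np.exp(corr - corr.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # softmax over channels
    return w @ x                                 # channels mixed by affinity

feat = np.arange(12.0).reshape(3, 2, 2)          # toy 3-channel feature map
out = self_channel_interaction(feat)
```

In a trained network this operation would sit inside the backbone, with the affinity computation learned end-to-end rather than fixed as here.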
Convolutional neural networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets, such as Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, thus leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on scene recognition...
VGGNets have turned out to be effective for object recognition in still images. However, directly adapting the VGGNet models trained on the ImageNet dataset is unable to yield good performance for scene recognition. This report describes our implementation of training VGGNet models on the large-scale Places205 dataset. Specifically, we train three models, namely VGGNet-11, VGGNet-13, and VGGNet-16, using a Multi-GPU extension of the Caffe toolbox with high computational efficiency. We verify the trained Places205-VGGNet models on the following datasets:...
Heterogeneous face recognition is an important yet challenging problem in the face recognition community. It refers to matching a probe face image to a gallery of face images taken from an alternate imaging modality. The major challenge of heterogeneous face recognition lies in the great discrepancies between different modalities. Conventional feature descriptors, e.g., local binary patterns, histograms of oriented gradients, and the scale-invariant feature transform, are mostly designed in a handcrafted way and thus generally fail to extract the common discriminant information...
Convolutional neural networks (CNNs) have recently achieved remarkable successes in various image classification and understanding tasks. The deep features obtained at the top fully-connected layer of a CNN (FC-features) exhibit rich global semantic information and are extremely effective for classification. On the other hand, the convolutional features of the middle layers also contain meaningful local information, but are not fully explored for image representation. In this paper, we propose a novel Locally-Supervised Deep Hybrid...
Conventional detectors tend to make imbalanced classifications and suffer a performance drop when the distribution of the training data is severely skewed. In this paper, we propose to use the mean classification score to indicate the classification accuracy for each category during training. Based on this indicator, we balance the classification via an Equilibrium Loss (EBL) and a Memory-augmented Feature Sampling (MFS) method. Specifically, EBL increases the intensity of the adjustment of the decision boundary for the weak classes by a designed score-guided loss margin between any two...
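The mean-score indicator above can be sketched as a running average of predicted scores per category, from which a score-guided margin between a weak and a strong class is derived. Class/method names, the momentum update, and the margin form below are simplified assumptions, not the paper's exact design.

```python
import numpy as np

class MeanScoreIndicator:
    """Running mean of predicted classification scores per category,
    serving as a proxy for per-class accuracy during training."""
    def __init__(self, num_classes, momentum=0.9):
        self.mean = np.zeros(num_classes)
        self.m = momentum

    def update(self, scores, labels):
        # Update each ground-truth class with its own predicted score.
        for s, y in zip(scores, labels):
            self.mean[y] = self.m * self.mean[y] + (1 - self.m) * s[y]

    def margin(self, weak, strong, scale=1.0):
        # Larger margin when the weak class lags behind the strong one.
        return scale * max(0.0, self.mean[strong] - self.mean[weak])

ind = MeanScoreIndicator(num_classes=2, momentum=0.5)
ind.update(np.array([[0.9, 0.1], [0.8, 0.2]]), [0, 0])  # frequent class 0
ind.update(np.array([[0.7, 0.3]]), [1])                 # rare class 1
m = ind.margin(weak=1, strong=0)
```

A loss could then enlarge the decision boundary of the weak class by `m`, pushing harder on categories whose mean score, and hence implied accuracy, is low.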
Visual compatibility is critical for fashion analysis, yet is missing in existing fashion image synthesis systems. In this paper, we propose to explicitly model visual compatibility through fashion image inpainting. We present Fashion Inpainting Networks (FiNet), a two-stage image-to-image generation framework that is able to perform compatible and diverse inpainting. Disentangling the generation of shape and appearance to ensure photorealistic results, our framework consists of a shape generation network and an appearance generation network. More importantly, for each generation network, we introduce two encoders interacting with one...
Large-scale image databases such as ImageNet have significantly advanced image classification and other visual recognition tasks. However, much of these datasets is constructed only for single-label, coarse object-level classification. For real-world applications, multiple labels and fine-grained categories are often needed, yet very few such datasets exist publicly, especially those of large scale and high quality. In this work, we contribute to the community a new dataset called iMaterialist Fashion Attribute...