- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Visual Attention and Saliency Detection
- Topic Modeling
- Adversarial Robustness in Machine Learning
- Natural Language Processing Techniques
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Handwritten Text Recognition Techniques
- Medical Image Segmentation Techniques
- Computer Graphics and Visualization Techniques
- Video Analysis and Summarization
- Olfactory and Sensory Function Studies
- Gait Recognition and Analysis
- Advanced Image Processing Techniques
- Image Retrieval and Classification Techniques
- Generative Adversarial Networks and Image Synthesis
- Advanced Vision and Imaging
- Image and Video Quality Assessment
- Image Processing and 3D Reconstruction
- Anomaly Detection Techniques and Applications
- Machine Learning and Data Classification
- COVID-19 diagnosis using AI
Adobe Systems (United States)
2020-2024
Nanyang Technological University
2016-2019
Alibaba Group (United States)
2016
Multimedia University
2015
Typical person re-identification (ReID) methods usually describe each pedestrian with a single feature vector and match them in a task-specific metric space. However, methods based on a single feature vector are not sufficient to overcome visual ambiguity, which frequently occurs in real scenarios. In this paper, we propose a novel end-to-end trainable framework, called Dual ATtention Matching network (DuATM), to learn context-aware feature sequences and perform attentive sequence comparison simultaneously. The core component of our...
In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among the different types of neural networks, convolutional networks have been the most extensively studied. Leveraging the rapid growth in the amount of annotated data and the great improvements in the strength of graphics processor units, research on convolutional networks has emerged swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of recent...
Convolutional-deconvolution networks can be adopted to perform end-to-end saliency detection, but they do not work well with objects of multiple scales. To overcome this limitation, in this work we propose a recurrent attentional convolutional-deconvolution network (RACDNN). Using spatial transformer and recurrent network units, RACDNN is able to iteratively attend to selected image sub-regions to perform saliency refinement progressively. Besides tackling the scale problem, RACDNN can also learn context-aware features from past iterations...
We develop an approach to learning visual representations that embraces multimodal data, driven by a combination of intra- and inter-modal similarity preservation objectives. Unlike existing pre-training methods, which solve a proxy prediction task in a single domain, our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously, hence improving the quality of the learned representations. By including multimodal training in a unified framework with...
Deep CNNs have achieved superior performance in many tasks of computer vision and image understanding. However, it is still difficult to effectively apply deep CNNs to video object segmentation (VOS), since treating video frames as separate static images loses the information hidden in motion. To tackle this problem, we propose a Motion-guided Cascaded Refinement Network for VOS. By assuming the object motion is normally different from the background motion, for a video frame we first apply an active contour model on optical flow to coarsely segment...
We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained instead of treating individual words as input, therefore avoiding an overly fine-grained with excessive contextualization....
In instance-level detection tasks (e.g., object detection), reducing input resolution is an easy option to improve runtime efficiency. However, this option traditionally hurts detection performance considerably. This paper focuses on boosting the performance of low-resolution models by distilling knowledge from a high- or multi-resolution model. We first identify the challenge of applying knowledge distillation (KD) to teacher and student networks that act on different input resolutions. To tackle it, we explore the idea of spatially aligning feature maps between...
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations. It is an important step toward reducing laborious human supervision. Most existing works first pretrain a model on captioned images covering many novel classes and then finetune it on limited base classes with mask annotations. However, the high-level textual information learned from caption pretraining alone cannot effectively encode the details required for pixelwise segmentation. To address this, we propose a cross-modal pseudo-labeling...
We propose a new framework for conditional image synthesis from semantic layouts of any precision level, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set based on the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in-between, our framework is flexible...
We introduce a new image segmentation task, called Entity Segmentation (ES), which aims to segment all visual entities (objects and stuffs) in an image without predicting their semantic labels. By removing the need for class label prediction, models trained for such a task can focus more on improving segmentation quality. It has many practical applications, such as image manipulation and editing, where the quality of segmentation masks is crucial but class labels are less important. We conduct the first-ever study to investigate the feasibility of convolutional...
Segmenting 4K or 6K ultra high-resolution images requires extra consideration of computation cost. Common strategies, such as downsampling, patch cropping, and cascade models, do not balance accuracy and computation cost well. Motivated by the fact that humans distinguish among objects continuously from coarse to precise levels, we propose the Continuous Refinement Model (CRM) for the ultra high-resolution segmentation refinement task. CRM aligns the feature map with the refinement target and aggregates features to reconstruct...
Dense image segmentation tasks (e.g., semantic, panoptic) are useful for image editing, but existing methods can hardly generalize well in an in-the-wild setting where there are unrestricted image domains, classes, and image resolution and quality variations. Motivated by these observations, we construct a new entity segmentation dataset, with a strong focus on high-quality dense segmentation in the wild. The dataset contains images spanning diverse image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Given...
In this work, we propose a motion-guided cascaded refinement network for video object segmentation. By assuming the foreground objects show different motion patterns from the background, for each video frame we first apply an active contour model on optical flow to coarsely segment the foreground. The proposed Cascaded Refinement Network (CRN) then takes the coarse segmentation as guidance to generate an accurate segmentation in full resolution. In this way, the motion information and the deep CNNs can complement each other well to accurately segment objects from video frames. To deal with...
Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support document understanding tasks, extending...
Large-scale object detection datasets are constantly increasing in size, in terms of both the number of classes and the annotation count. Yet the number of object-level categories annotated in detection datasets is an order of magnitude smaller than the number of image-level classification labels. State-of-the-art object detection models are trained in a supervised fashion, and this limits the number of object classes they can detect. In this paper, we propose a novel weight transfer network (WTN) to effectively and efficiently transfer knowledge from the classification network's weights to the detection network's weights, to allow detection of novel classes without box supervision. We first introduce input...
Deluge Networks (DelugeNets) are deep neural networks which efficiently facilitate massive cross-layer information inflows from preceding layers to succeeding layers. The connections between layers in DelugeNets are established through cross-layer depthwise convolutional layers with learnable filters, acting as a flexible yet efficient selection mechanism. DelugeNets can propagate information across many layers with greater flexibility and utilize network parameters more effectively compared to ResNets, whilst being more efficient than DenseNets. Remarkably, DelugeNet...
It is desirable to train convolutional networks (CNNs) to run more efficiently during inference. In many cases, however, the computational budget that the system has for inference cannot be known beforehand during training, or the inference budget is dependent on changing real-time resource availability. Thus, it is inadequate to train just inference-efficient CNNs, whose inference costs are not adjustable and cannot adapt to varied inference budgets. We propose a novel approach for cost-adjustable inference in CNNs - Stochastic Downsampling Point (SDPoint). During training, SDPoint applies...