- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Image Retrieval and Classification Techniques
- Topic Modeling
- Face and Expression Recognition
- Data Stream Mining Techniques
- Advanced Bandit Algorithms Research
- Machine Learning and Algorithms
- Machine Learning and Data Classification
- Advanced Neural Network Applications
- Natural Language Processing Techniques
- Machine Learning and ELM
- Speech and dialogue systems
- Text and Document Classification Technologies
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Face recognition and analysis
- Video Analysis and Summarization
- Spam and Phishing Detection
- Anomaly Detection Techniques and Applications
- Sparse and Compressive Sensing Techniques
- Software Engineering Research
- Recommender Systems and Techniques
- Complex Network Analysis Techniques
Singapore Management University
2015-2024
Salesforce (United States)
2019-2023
A*STAR Graduate Academy
2023
Institute for Infocomm Research
2023
Nanyang Technological University
2009-2022
Singapore University of Technology and Design
2022
National University of Singapore
2011-2020
Agency for Science, Technology and Research
2020
Hong Kong University of Science and Technology
2020
University of Hong Kong
2020
Image Super-Resolution (SR) is an important class of image processing techniques to enhance the resolution of images and videos in computer vision. Recent years have witnessed remarkable progress of image super-resolution using deep learning techniques. This article aims to provide a comprehensive survey on recent advances of image super-resolution using deep learning approaches. In general, we can roughly group the existing studies of SR techniques into three major categories: supervised SR, unsupervised SR, and domain-specific SR. In addition, we also cover some other...
Learning effective feature representations and similarity measures are crucial to the retrieval performance of a content-based image retrieval (CBIR) system. Despite extensive research efforts for decades, it remains one of the most challenging open problems that considerably hinders the successes of real-world CBIR systems. The key challenge has been attributed to the well-known "semantic gap" issue that exists between low-level image pixels captured by machines and high-level semantic concepts perceived by humans. Among various...
Pre-trained models for Natural Languages (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks, or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder...
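As a quick illustration of how such an encoder-decoder code model can be used, the sketch below loads a CodeT5 checkpoint with Hugging Face Transformers and runs span infilling on a code snippet. The checkpoint name and the sentinel-token behaviour are assumptions based on the public release, not part of the abstract above.

```python
# Minimal sketch (assumed checkpoint name "Salesforce/codet5-base"):
# load a CodeT5 encoder-decoder and fill in a masked span of code.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# Mask part of the code with a sentinel token and let the model generate the span.
code = "def greet(user): print(f'hello <extra_id_0>!')"
inputs = tokenizer(code, return_tensors="pt")
outputs = model.generate(**inputs, max_length=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```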
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more...
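To make the "align before fuse" idea concrete, the following is a rough sketch of an image-text contrastive objective of the kind used to align unimodal representations before cross-modal fusion. The projection shapes and temperature are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative image-text contrastive (ITC) loss: pull matched image/text pairs
# together and push apart mismatched pairs within a batch (InfoNCE in both directions).
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) projected features from the unimodal encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# toy usage with random features
print(itc_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```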
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves...
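The sketch below is a heavily simplified illustration of the "learned queries + cross-attention" idea behind a Querying Transformer: a small, fixed set of query vectors attends to frozen image features and yields a handful of tokens that can be projected into a language model's embedding space. Layer counts, dimensions, and the single-block design are assumptions for illustration only.

```python
# Toy Q-Former-style module: learned queries cross-attend to frozen image features.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=2048):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.to_llm = nn.Linear(dim, llm_dim)   # projects query outputs toward the LLM's input space

    def forward(self, frozen_image_feats):       # (batch, num_patches, dim), encoder kept frozen
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        q = q + self.ffn(q)
        return self.to_llm(q)                    # (batch, num_queries, llm_dim)

feats = torch.randn(2, 257, 768)                 # e.g. ViT patch features
print(TinyQFormer()(feats).shape)
```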
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively...
This paper presents a new method for detecting salient objects in images using convolutional neural networks (CNNs). The proposed network, named PAGE-Net, offers two key contributions. The first is the exploitation of an essential pyramid attention structure for salient object detection. It enables the network to concentrate more on salient regions while considering multi-scale saliency information. Such a stacked attention design provides a powerful tool to efficiently improve the representation ability of the corresponding network layer with an enlarged...
With the popularity of smartphones and mobile devices, mobile application (a.k.a. "app") markets have been growing exponentially in terms of the number of users and downloads. App developers spend considerable effort on collecting and exploiting user feedback to improve user satisfaction, but suffer from the absence of effective review analytics tools. To facilitate app developers in discovering the most "informative" user reviews from a large and rapidly increasing pool of reviews, we present "AR-Miner", a novel computational framework for App Review Mining, which...
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of instance-wise contrastive learning. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it implicitly encodes semantic structures of the data into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in...
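A rough sketch of the prototype idea follows: cluster the current embeddings to obtain prototypes, then contrast each sample against its own cluster's prototype versus the others. The k-means backend, number of prototypes, and temperature are assumptions; the published method uses an EM-style formulation around this intuition.

```python
# Sketch: prototypes from k-means over embeddings, plus a prototype-level contrastive loss.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def proto_contrastive_loss(embeddings, num_prototypes=10, temperature=0.1):
    emb = F.normalize(embeddings, dim=-1)
    km = KMeans(n_clusters=num_prototypes, n_init=10).fit(emb.detach().cpu().numpy())
    prototypes = F.normalize(torch.tensor(km.cluster_centers_, dtype=emb.dtype), dim=-1)
    assignments = torch.tensor(km.labels_, dtype=torch.long)
    logits = emb @ prototypes.t() / temperature   # similarity of each sample to every prototype
    return F.cross_entropy(logits, assignments)   # pull each sample toward its own prototype

print(proto_contrastive_loss(torch.randn(64, 32)).item())
```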
Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in...
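The clean/noisy split step described above can be illustrated with a small sketch: fit a two-component Gaussian mixture to per-sample training losses and treat samples likely to belong to the low-loss component as "clean". The threshold value and the use of scikit-learn's GaussianMixture are assumptions for illustration.

```python
# Sketch: divide training samples into clean/noisy sets via a 2-component GMM on losses.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_clean_noisy(per_sample_losses, threshold=0.5):
    losses = np.asarray(per_sample_losses).reshape(-1, 1)
    losses = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)  # normalize to [0, 1]
    gmm = GaussianMixture(n_components=2, max_iter=100, reg_covar=1e-3).fit(losses)
    clean_component = gmm.means_.argmin()                 # component with the smaller mean loss
    prob_clean = gmm.predict_proba(losses)[:, clean_component]
    return prob_clean > threshold, prob_clean             # boolean "clean" mask and probabilities

toy_losses = np.concatenate([np.random.rand(90) * 0.3, 0.7 + np.random.rand(10) * 0.3])
mask, probs = split_clean_noisy(toy_losses)
print(mask.sum(), "samples flagged as clean")
```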
The goal of active learning is to select the most informative examples for manual labeling. Most previous studies in active learning have focused on selecting a single unlabeled example in each iteration. This could be inefficient since the classification model has to be retrained for every labeled example. In this paper, we present a framework for "batch mode active learning" that applies the Fisher information matrix to select a number of informative examples simultaneously. The key computational challenge is how to efficiently identify the subset of unlabeled examples that can result in the largest reduction...
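As a simplified illustration of Fisher-information-based batch selection (not the paper's exact optimization), the sketch below greedily picks unlabeled examples whose inclusion most reduces trace(I_batch^{-1} I_pool) under a logistic model, where I(.) = sum_i p_i(1 - p_i) x_i x_i^T. The regularizer and greedy strategy are assumptions.

```python
# Greedy batch selection guided by the Fisher information matrix of a logistic model.
import numpy as np

def fisher_info(X, w, delta=1e-3):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    weights = p * (1 - p)
    return (X * weights[:, None]).T @ X + delta * np.eye(X.shape[1])

def select_batch(X_pool, w, batch_size=5):
    I_pool = fisher_info(X_pool, w)
    chosen = []
    for _ in range(batch_size):
        best_idx, best_score = None, np.inf
        for i in range(len(X_pool)):
            if i in chosen:
                continue
            I_batch = fisher_info(X_pool[chosen + [i]], w)
            score = np.trace(np.linalg.solve(I_batch, I_pool))  # tr(I_batch^{-1} I_pool)
            if score < best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return chosen

X = np.random.randn(50, 4)
print(select_batch(X, w=np.zeros(4)))
```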
Most modern trackers typically employ a bounding box given in the first frame to track visual objects, where their tracking results are often sensitive to the initialization. In this paper, we propose a new tracking method, Reliable Patch Trackers (RPT), which attempts to identify and exploit the reliable patches that can be tracked effectively through the whole tracking process. Specifically, we present a tracking reliability metric to measure how reliably a patch can be tracked, where a probability model is proposed to estimate the distribution of reliable patches under a sequential...
Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternatively passes dynamic information between and across the vision and language modalities. It can robustly capture the high-level interactions between the language and vision domains, and thus significantly improves the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can modulate the intra-modality attention of the current modality, which is vital...
Relevant Component Analysis (RCA) has been proposed for learning distance metrics with contextual constraints for image retrieval. However, RCA has two important disadvantages. One is the lack of exploiting negative constraints, which can also be informative, and the other is its incapability of capturing complex nonlinear relationships between data instances and the contextual information. In this paper, we propose two algorithms to overcome these disadvantages, i.e., Discriminative Component Analysis (DCA) and Kernel DCA. Compared with other complicated methods for distance metric...
Feature selection is an important technique for data mining. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require accessing all the attributes/features of the training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire...
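The sketch below illustrates an online-feature-selection-style update: a perceptron-like online learner that, after each gradient step, keeps only the B largest-magnitude weights so that at most B features are ever active. The exact update rule and scaling in the published algorithm differ; this only demonstrates the truncation idea.

```python
# Online learning with a hard budget on the number of active features.
import numpy as np

def truncate(w, num_features):
    if np.count_nonzero(w) > num_features:
        keep = np.argsort(np.abs(w))[-num_features:]   # indices of the B largest weights
        mask = np.zeros_like(w)
        mask[keep] = 1.0
        w = w * mask
    return w

def online_feature_selection(stream, num_features=10, lr=0.1):
    w = None
    for x, y in stream:                                 # y in {-1, +1}
        if w is None:
            w = np.zeros_like(x)
        if y * (w @ x) <= 1:                            # hinge-loss violation triggers an update
            w = w + lr * y * x
            w = truncate(w, num_features)
        yield w

X = np.random.randn(200, 100)
y = np.sign(X[:, 0] + 0.1 * np.random.randn(200))
*_, w_final = online_feature_selection(zip(X, y), num_features=10)
print(np.count_nonzero(w_final), "features active")
```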
Deep Neural Networks (DNNs) are typically trained by backpropagation in a batch setting, requiring the entire training data to be made available prior to the learning task. This is not scalable for many real-world scenarios where new data arrives sequentially in a stream. We aim to address an open challenge of "Online Deep Learning" (ODL) for learning DNNs on the fly in an online setting. Unlike traditional online learning that often optimizes some convex objective function with respect to a shallow model (e.g., a linear/kernel-based hypothesis), ODL is more...
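One way to learn a DNN on the fly, in the spirit of the hedging idea used for online deep learning, is sketched below: attach a classifier to every hidden layer, predict with a weighted vote over these per-depth classifiers, and discount the weight of classifiers that incur loss on each incoming example. Layer sizes, the discount rate, and the missing optimizer step are illustrative assumptions.

```python
# Sketch: an MLP with per-depth output heads whose ensemble weights are hedged online.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HedgedMLP(nn.Module):
    def __init__(self, in_dim=20, hidden=64, depth=4, num_classes=2, beta=0.99):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else hidden, hidden) for i in range(depth)])
        self.heads = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in range(depth)])
        self.register_buffer("alpha", torch.full((depth,), 1.0 / depth))  # depth-ensemble weights
        self.beta = beta

    def forward(self, x):
        outs = []
        for block, head in zip(self.blocks, self.heads):
            x = torch.relu(block(x))
            outs.append(head(x))                 # one prediction per depth
        return outs

    def predict(self, x):
        outs = self.forward(x)
        return sum(a * F.softmax(o, dim=-1) for a, o in zip(self.alpha, outs))

    def hedge_update(self, per_head_losses):
        # shrink the weight of heads with larger loss, then renormalize
        self.alpha *= self.beta ** torch.tensor(per_head_losses)
        self.alpha /= self.alpha.sum()

model = HedgedMLP()
x, y = torch.randn(1, 20), torch.tensor([1])
losses = [F.cross_entropy(o, y) for o in model(x)]
sum(a * l for a, l in zip(model.alpha, losses)).backward()   # gradients for all depths jointly
model.hedge_update([l.item() for l in losses])                # reweight the depth ensemble
```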
Learning graphs from data automatically has shown encouraging performance on clustering and semi-supervised learning tasks. However, real data are often corrupted, which may cause the learned graph to be inexact or unreliable. In this paper, we propose a novel robust graph learning scheme to learn reliable graphs from real-world noisy data by adaptively removing noise and errors in the raw data. We show that our proposed model can also be viewed as a robust version of manifold regularized robust PCA, where the quality of the graph plays a critical role. The proposed model is able to boost...
Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users into becoming victims of scams (monetary loss, theft of private information, malware installation), causing losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack...
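As a toy illustration of the machine-learning alternative to blacklists, the sketch below extracts simple lexical features from URL strings and trains a linear classifier. The feature set, the toy URLs, and the model choice are assumptions for demonstration only, not the survey's recommendation.

```python
# Toy lexical-feature baseline for URL classification (illustrative data and features).
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def lexical_features(url):
    return np.array([
        len(url),                                           # overall URL length
        url.count('.'),                                     # number of dots (subdomain depth)
        url.count('-'),
        url.count('@'),                                     # '@' is often used to obscure the real host
        sum(ch.isdigit() for ch in url),
        int(bool(re.search(r'\d+\.\d+\.\d+\.\d+', url))),   # raw IP address in the URL
        int('https' not in url.split('//')[0]),             # scheme is not https
    ], dtype=float)

# toy training data (1 = malicious, 0 = benign); a real system would use large labeled feeds
urls = ["http://paypa1-login.example-verify.com/update@192.168.0.1",
        "https://www.wikipedia.org/wiki/URL",
        "http://203.0.113.7/free-gift/claim-now",
        "https://github.com/salesforce"]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit([lexical_features(u) for u in urls], labels)
print(clf.predict([lexical_features("http://198.51.100.2/acc0unt-verify@login")]))
```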
This paper conducts a systematic study on the role of visual attention in Unsupervised Video Object Segmentation (UVOS) tasks. By elaborately annotating three popular video segmentation datasets (DAVIS, Youtube-Objects and SegTrack V2) with dynamic eye-tracking data in the UVOS setting, for the first time we quantitatively verified the high consistency of visual attention behavior among human observers, and found strong correlation between human attention and explicit primary object judgements during dynamic, task-driven viewing. Such novel...
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide...