- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Collaboration in agile enterprises
- Adversarial Robustness in Machine Learning
- Image Retrieval and Classification Techniques
- COVID-19 diagnosis using AI
- Innovation and Knowledge Management
- Topic Modeling
- Sparse and Compressive Sensing Techniques
- Cryptography and Data Security
- Optical measurement and interference techniques
- Video Analysis and Summarization
- Supply Chain and Inventory Management
- Metaheuristic Optimization Algorithms Research
- Image Processing Techniques and Applications
- Robotics and Sensor-Based Localization
- Service-Oriented Architecture and Web Services
- Safety and Risk Management
- Supply Chain Resilience and Risk Management
- Machine Learning and Data Classification
- Generative Adversarial Networks and Image Synthesis
- Digital Image Processing Techniques
Institute of Computing Technology, 2016-2024
University of Chinese Academy of Sciences, 2019-2024
University of Science and Technology of China, 2020-2024
Aerospace Information Research Institute, 2023
South China University of Technology, 2023
Chinese Academy of Sciences, 2016-2023
Yunnan University, 2023
Accenture (United States), 2022
China Three Gorges University, 2022
Harbin Institute of Technology, 2006-2019
Since scenes are composed in part of objects, accurate recognition of scenes requires knowledge about both scenes and objects. In this paper we address two related problems: 1) scale induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to combine effectively scene-centric and object-centric knowledge (i.e. Places and ImageNet) in CNNs. An earlier attempt, Hybrid-CNN[23], showed that incorporating ImageNet did not help much. Here we propose an alternative method taking the scale into account, resulting...
Automatically describing the content of an image has been attracting considerable research attention in the multimedia field. To represent the image, many approaches directly utilize convolutional neural networks (CNNs) to extract visual representations, which are fed into recurrent neural networks to generate natural language. Recently, some approaches have detected semantic concepts from images and then encoded them into high-level representations. Although substantial progress has been achieved, most of the previous methods treat entities...
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location following a natural language instruction in real scenes. Most of the previous approaches utilize entire features or object-centric features to represent navigable candidates. However, these representations are not efficient enough for an agent to perform actions to arrive at the target location. As knowledge provides crucial information which is complementary to visible content, in this paper, we propose a Knowledge Enhanced Reasoning...
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following a natural language instruction in 3D environments. To represent the previously visited environment, most approaches for VLN implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build an egocentric and dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited environment. From a global perspective, historical observations are projected into...
Dense captioning is a challenging task which not only detects visual elements in images but also generates natural language sentences to describe them. Previous approaches do not leverage object information for this task. However, objects provide valuable cues to help predict the locations of caption regions, as caption regions often highly overlap with objects (i.e. caption regions are usually parts of objects or combinations of them). Meanwhile, objects are also important for describing a target caption region, as the corresponding description not only depicts its properties, but also involves its interactions...
Tabular data synthesis is crucial in machine learning, yet existing general methods, primarily based on statistical or deep learning models, are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding complicated feature relations from sparse and limited data, along with their inability to grasp semantic feature relations. Recently, Large Language Models (LLMs) have shown potential in generating synthetic data through few-shot...
Column Generation (CG) is an effective and iterative algorithm to solve large-scale linear programs (LP). During each CG iteration, new columns are added to improve the solution of the LP. Typically, CG greedily selects one column with the most negative reduced cost, which can be improved by adding more columns at once. However, selecting all columns with negative reduced costs would lead to the addition of redundant columns that do not improve the objective value. Therefore, selecting the appropriate columns to add is still an open problem, and previous machine-learning-based approaches for CG only add a constant...
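The column-selection step described in the abstract above can be sketched as follows. This is a minimal illustration of the standard reduced-cost rule (cost minus the dual-weighted column), not the paper's selection method; the function names `reduced_cost` and `select_columns`, the parameter `k`, and the tolerance `tol` are hypothetical and chosen only for this example. The dual values are assumed to come from an already solved restricted LP.

```python
from typing import List, Sequence


def reduced_cost(cost: float, column: Sequence[float], duals: Sequence[float]) -> float:
    """Reduced cost of a candidate column: c_j - y^T a_j."""
    return cost - sum(y * a for y, a in zip(duals, column))


def select_columns(
    costs: Sequence[float],
    columns: Sequence[Sequence[float]],
    duals: Sequence[float],
    k: int = 1,
    tol: float = 1e-9,
) -> List[int]:
    """Return indices of up to k columns with the most negative reduced cost.

    k = 1 corresponds to the greedy rule mentioned in the abstract;
    larger k adds several columns in one iteration (an illustrative choice).
    """
    scored = [
        (reduced_cost(c, col, duals), j)
        for j, (c, col) in enumerate(zip(costs, columns))
    ]
    # Keep only columns that can improve the objective, most negative first.
    negative = sorted((rc, j) for rc, j in scored if rc < -tol)
    return [j for _, j in negative[:k]]


# Toy data (made up for illustration): two constraints, four candidate columns.
duals = [1.0, 2.0]
costs = [3.0, 1.0, 1.5, 5.0]
columns = [[1, 1], [1, 1], [0, 1], [1, 0]]

greedy_pick = select_columns(costs, columns, duals, k=1)
batch_pick = select_columns(costs, columns, duals, k=2)
```

With this toy data only the second and third candidates have negative reduced cost, so asking for more than two columns still returns two indices.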
The goal of few-shot image recognition (FSIR) is to identify novel categories with a small number of annotated samples by exploiting transferable knowledge from training data (base categories). Most current studies assume that the transferable knowledge can be well used to identify novel categories. However, such transferable capability may be impacted by dataset bias, and this problem has rarely been investigated before. Besides, most few-shot learning methods are biased to different datasets, which is also an important issue that needs to be investigated deeply. In this paper, we first...
Recently, automatic generation of image captions has attracted great interest not only because of its extensive applications but also because it connects computer vision and natural language processing. By combining convolutional neural networks (CNNs), which learn visual representations from images, and recurrent neural networks (RNNs), which translate the learned features into text sequences, the content of an image can be transformed into linguistic sequences. Existing approaches typically focus on visual features extracted from an object-oriented CNN (train...
Recently, object recognition techniques have been rapidly developed. Most of the existing object recognition work has focused on recognizing several independent concepts. The relationship of objects is also an important problem, which shows in-depth semantic information of images. In this work, toward general visual relationship detection, we propose a method to integrate the spatial distribution of objects to facilitate visual relation detection. Spatial distribution can not only reflect the positional relation of objects but also describe the structural information between objects. Spatial distributions are described with...
Referring expressions are natural language descriptions of objects within a given scene. Context is of crucial importance for a referring expression, as the description not only depicts the properties of the object but also involves the relationships of the referred object with other ones. Most previous work uses either the whole image or one particular contextual object as the context. However, the context in these approaches is holistic and insufficient, as a referring expression often describes relationships of multiple objects in an image. To leverage rich context information from all objects in an image,...
Many cloud platforms emerge to meet urgent requirements for large-volume personal image storage, sharing and search. Though most would agree that images contain rich sensitive information (e.g., people, location and event) and that people's privacy concerns hinder their participation in untrusted services, today's cloud platforms provide little support for image privacy protection. Facing large-scale images from multiple users, it is extremely challenging for the cloud to maintain the index structure and schedule parallel computation without learning anything about...
In high-voltage power systems, insulators are essential components in transmission lines for increasing shooting distance and securing wires. Unmanned aerial vehicle imaging has become a common way of inspecting the state of insulators. However, automatic detection of insulators against complex backgrounds is still a challenging task. Most existing object detection methods are based on anchors, which do not have sufficient ability to describe objects that have a string-like structure. To tackle it, inspired by the keypoints-based detection method, we...
Learning the similarity of two images is an important problem in computer vision and has many potential applications. Most of the previous works focus on generating image similarities in three aspects: global feature distance computing, local feature matching, and image concepts comparison. However, the task of directly detecting class agnostic common objects from two images has not been studied before, which goes one step further to capture image similarities at the region level. In this paper, we propose an end-to-end Common Object Detection Network (CODN) to detect...
While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops...
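The abstract above describes an attention layer that averages several branches and randomly drops branches during training. The sketch below illustrates only that averaging step, assuming each branch output has already been computed as a plain vector. The function name `average_branches`, the `drop_prob` parameter, and the rescaling of surviving branches by 1 / (1 - drop_prob) are assumptions made for this example; the abstract does not confirm them, and no attention computation is implemented here.

```python
import random
from typing import List, Optional, Sequence


def average_branches(
    branch_outputs: Sequence[Sequence[float]],
    drop_prob: float = 0.0,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Average per-branch output vectors, optionally dropping branches.

    In training mode each branch is skipped with probability drop_prob and
    the surviving branches are rescaled (an assumed, dropout-style choice).
    In inference mode the result is the plain mean over all branches.
    """
    if not branch_outputs:
        raise ValueError("at least one branch output is required")
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()
    n_branches = len(branch_outputs)
    total = [0.0] * len(branch_outputs[0])
    scale = 1.0 / (1.0 - drop_prob) if training else 1.0
    for output in branch_outputs:
        if training and rng.random() < drop_prob:
            continue  # this branch is dropped for the current step
        for i, value in enumerate(output):
            total[i] += scale * value
    return [t / n_branches for t in total]


# Toy branch outputs (made up): two branches, two features each.
branches = [[1.0, 2.0], [3.0, 4.0]]
inference_out = average_branches(branches)
train_out = average_branches(
    branches, drop_prob=0.5, training=True, rng=random.Random(0)
)
```

Dividing by the total number of branches rather than the number of survivors keeps the expected training output equal to the inference output under the assumed rescaling.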
We propose a novel data augmentation method named 'FenceMask' that exhibits outstanding performance in various computer vision tasks. It is based on the 'simulation of object occlusion' strategy, which aims to achieve a balance between object occlusion and information retention of the input data. By enhancing the sparsity and regularity of the occlusion block, our augmentation method overcomes the difficulty of small object augmentation and notably improves performance over baselines. Sufficient experiments prove the performance of our method is better than other simulated object occlusion approaches. We tested it on CIFAR10, CIFAR100 and ImageNet datasets...
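The occlusion-based augmentation idea in the abstract above can be illustrated with a generic regular grid mask. This is not the FenceMask algorithm itself, whose details the abstract does not give; the functions `grid_mask`, `apply_mask`, and `kept_fraction` and the parameters `period` and `bar` are hypothetical names for a simple sketch in which thin horizontal and vertical bars are zeroed at a fixed spacing.

```python
from typing import List


def grid_mask(height: int, width: int, period: int, bar: int) -> List[List[int]]:
    """Binary mask with occluding bars of thickness `bar` every `period` pixels.

    A value of 0 marks an occluded pixel and 1 marks a retained pixel.
    """
    if period <= 0 or not 0 <= bar <= period:
        raise ValueError("need period > 0 and 0 <= bar <= period")
    return [
        [0 if (r % period < bar or c % period < bar) else 1 for c in range(width)]
        for r in range(height)
    ]


def apply_mask(image: List[List[float]], mask: List[List[int]]) -> List[List[float]]:
    """Zero out the occluded pixels of a single-channel image."""
    return [
        [pixel * keep for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def kept_fraction(mask: List[List[int]]) -> float:
    """Share of pixels that survive the mask (information retention)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Toy 4x4 single-channel image (made up) with every pixel set to 1.0.
image = [[1.0] * 4 for _ in range(4)]
mask = grid_mask(4, 4, period=2, bar=1)
augmented = apply_mask(image, mask)
```

Thinner bars or a longer period raise `kept_fraction`, which is one way to trade occlusion strength against information retention.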
High-resolution cameras produce a huge volume of high quality images every day. It is extremely challenging to store, share and especially search those images, for which an increasing number of cloud services are presented to support such functionalities. However, images tend to contain rich sensitive information (e.g., people, location and event), and people's privacy concerns hinder their ready participation in the services provided by untrusted third parties. In this work, we introduce PIC: a Privacy-preserving large-scale...
Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize modality-specific or modality-joint representation architectures for cross-modality pre-training. Different from previous methods, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses learnable intermediate modality representations as the bridge for the interaction between videos and...
Innovation is a complex process, involving a variety of factors at different levels. The innovation process and the factors that affect it should be coordinated and managed in a systematic way. However, a coherent framework for innovation management does not yet exist. In this paper, an attempt is made to develop a systems thinking framework that simultaneously addresses exploitative and exploratory innovation. By placing innovation in the larger context of systems thinking, the factors influencing its success or failure can be better recognized and understood. Drawing on the theory of knowledge...