Qi Wu

ORCID: 0000-0003-3631-256X
Research Areas
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Human Pose and Action Recognition
  • Topic Modeling
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Image Retrieval and Classification Techniques
  • Advanced Neural Network Applications
  • Video Surveillance and Tracking Methods
  • Blind Source Separation Techniques
  • EEG and Brain-Computer Interfaces
  • COVID-19 diagnosis using AI
  • Reinforcement Learning in Robotics
  • Handwritten Text Recognition Techniques
  • AI in cancer detection
  • Advanced Vision and Imaging
  • Radiomics and Machine Learning in Medical Imaging
  • Autonomous Vehicle Technology and Safety
  • Visual Attention and Saliency Detection
  • Speech and dialogue systems
  • Music and Audio Processing
  • Advanced Graph Neural Networks
  • Machine Learning and Data Classification
  • Brain Tumor Detection and Classification

Affiliations

The University of Adelaide
2016-2025

Australian Centre for Robotic Vision
2017-2025

First Affiliated Hospital of Jiangxi Medical College
2025

Jiangxi Provincial People's Hospital
2025

Guangdong Police College
2025

Lishui University
2025

Affiliated Hospital of Youjiang Medical University for Nationalities
2024-2025

Nanjing University
2025

Shanghai Stock Exchange
2024

Sichuan University
2024

Publications

A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded...

10.1109/cvpr.2018.00387 article EN 2018-06-01

Much recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we investigate whether this direct approach succeeds due to, or despite, the fact that it avoids the explicit representation of high-level information. We propose a method of incorporating high-level concepts into the successful CNN-RNN...

10.1109/cvpr.2016.29 article EN 2016-06-01
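
The high-level-concept idea above lends itself to a compact illustration: predict a vector of attribute probabilities for an image, then condition the RNN caption generator on that vector instead of raw CNN features. Below is a minimal PyTorch sketch of that interface; the module names, sizes, and the choice to feed concepts as the first LSTM input are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ConceptConditionedCaptioner(nn.Module):
    """Sketch: feed predicted attribute probabilities (not raw CNN
    features) into an LSTM caption generator, in the spirit of the
    CNN-RNN + high-level-concepts approach described above."""
    def __init__(self, num_concepts=256, vocab_size=10000, hidden=512):
        super().__init__()
        # Maps an attribute-probability vector into the LSTM input space.
        self.concept_proj = nn.Linear(num_concepts, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, concept_probs, captions):
        # concept_probs: (B, num_concepts) multi-label scores in [0, 1]
        # captions:      (B, T) token ids of the ground-truth caption
        start = self.concept_proj(concept_probs).unsqueeze(1)  # (B, 1, H)
        tokens = self.embed(captions)                          # (B, T, H)
        seq = torch.cat([start, tokens], dim=1)                # concepts first
        hidden_states, _ = self.lstm(seq)
        return self.out(hidden_states)  # (B, T+1, vocab) next-token logits

# Toy usage with random inputs.
model = ConceptConditionedCaptioner()
probs = torch.rand(2, 256)              # e.g. sigmoid outputs of a concept CNN
caps = torch.randint(0, 10000, (2, 12))
logits = model(probs, caps)
print(logits.shape)  # torch.Size([2, 13, 10000])
```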

Visual Question Answering (VQA) has attracted much attention in both the computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions which require common sense, or basic...

10.1109/tpami.2017.2754246 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2017-09-19

Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We...

10.1109/tpami.2017.2708709 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2017-05-26

We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with information sourced from a knowledge base,...

10.1109/cvpr.2016.500 preprint EN 2016-06-01

Image and sentence matching has made great progress recently, but it remains challenging due to the large visual-semantic discrepancy. This mainly arises from the fact that the representation of a pixel-level image usually lacks the high-level semantic information present in its matched sentence. In this work, we propose a semantic-enhanced image and sentence matching model, which can improve the image representation by learning semantic concepts and then organizing them in a correct semantic order. Given an image, we first use a multi-regional multi-label CNN to predict its semantic concepts, including objects,...

10.1109/cvpr.2018.00645 preprint EN 2018-06-01
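
The first stage described above (multi-regional, multi-label concept prediction) can be sketched simply: score concepts for the full image and several crops, then fuse region scores with element-wise max pooling so a concept counts as present if any region detects it. The backbone, sizes, and pooling choice below are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

num_concepts = 300

backbone = nn.Sequential(  # stand-in for a pretrained CNN
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(64, num_concepts)

def predict_concepts(regions):
    # regions: (R, 3, H, W) crops of one image (global view + sub-regions)
    scores = torch.sigmoid(classifier(backbone(regions)))  # (R, C)
    # A concept is present if any region predicts it strongly.
    return scores.max(dim=0).values                        # (C,)

regions = torch.rand(5, 3, 64, 64)    # 1 full image + 4 crops
image_concepts = predict_concepts(regions)
top = image_concepts.topk(5).indices  # the most confident concepts
print(top)
```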

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into...

10.1109/cvpr42600.2020.01065 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
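
A rough way to picture hierarchical matching of this kind: embed text and video at several semantic levels (e.g. events, actions, entities) and aggregate per-level cosine similarities into one retrieval score. The sketch below assumes equal level weights and precomputed embeddings; the actual HGR model uses graph reasoning and attention-based local matching.

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(text_levels, video_levels, weights=(1.0, 1.0, 1.0)):
    # text_levels / video_levels: lists of (D,) embeddings, one per level
    score = 0.0
    for w, t, v in zip(weights, text_levels, video_levels):
        score = score + w * F.cosine_similarity(t, v, dim=0)
    return score / sum(weights)

text = [torch.randn(128) for _ in range(3)]   # event, action, entity embeddings
video = [torch.randn(128) for _ in range(3)]
print(hierarchical_similarity(text, video))
```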

The task in referring expression comprehension is to localize the object instance in an image described by a referring expression phrased in natural language. As a language-to-vision matching task, the key to this problem is to learn a discriminative object feature that can adapt to the expression used. To avoid ambiguity, the expression normally tends to describe not only the properties of the referent itself, but also its relationships to its neighbourhood. To capture and exploit this important information we propose a graph-based, language-guided attention mechanism. Being composed of a node attention component...

10.1109/cvpr.2019.00206 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Humans are able to describe image contents with coarse or fine details as they wish. However, most image captioning models are intention-agnostic and cannot actively generate diverse descriptions according to different user intentions. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and to control what, and how detailed, the generated description should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute,...

10.1109/cvpr42600.2020.00998 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
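
Since the ASG is a typed directed graph, a toy representation makes the control mechanism concrete: nodes declare what to mention (objects, attributes of them, relationships between them) without fixing any labels, and graph density controls caption detail. The dictionary encoding below is an assumption for illustration only, not the paper's data structure.

```python
# Sketch of an Abstract Scene Graph: typed nodes, no concrete labels.
# The graph specifies *what to talk about*; the captioner fills in content.
asg = {
    "nodes": {
        "o1": "object",        # something grounded in an image region
        "o2": "object",
        "a1": "attribute",     # request one attribute of o1
        "r1": "relationship",  # request how o1 relates to o2
    },
    "edges": [("a1", "o1"), ("o1", "r1"), ("r1", "o2")],
}
# A denser graph (more attribute/relationship nodes) requests a more
# detailed caption; a sparser one requests a coarser caption.
print(len(asg["nodes"]), "nodes,", len(asg["edges"]), "edges")
```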

One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through language. Overcoming this challenge requires the ability to perform a wide variety of complex tasks in response to multifarious instructions from humans. In the hope that it might drive progress towards more flexible and powerful human interactions with robots, we propose a dataset of varied and complex robot tasks, described in natural language, in terms of objects visible in a large set of real images. Given an...

10.1109/cvpr42600.2020.01000 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Deep convolutional neural networks (CNNs) have demonstrated advanced performance on single-label image classification, and various progress has also been made in applying CNN methods to multilabel image classification, which requires annotating objects, attributes, scene categories, etc., in a single shot. Recent state-of-the-art approaches to multilabel image classification exploit the label dependencies in an image at the global level, largely improving the labeling capacity. However, predicting small objects and visual concepts is still challenging...

10.1109/tmm.2018.2812605 article EN IEEE Transactions on Multimedia 2018-03-09
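
One way to sketch regional label dependencies of this flavour: run an RNN over region-proposal features so each region's label scores are informed by previously seen regions, then max-pool scores over regions. The dimensions and the LSTM-plus-max-pooling design below are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

feat_dim, hidden, num_labels = 2048, 512, 80

rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
label_head = nn.Linear(hidden, num_labels)

def multilabel_scores(region_feats):
    # region_feats: (B, R, feat_dim) features of R region proposals
    states, _ = rnn(region_feats)                   # (B, R, hidden)
    per_region = torch.sigmoid(label_head(states))  # (B, R, num_labels)
    return per_region.max(dim=1).values             # pool over regions

feats = torch.randn(2, 10, feat_dim)
print(multilabel_scores(feats).shape)  # torch.Size([2, 80])
```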

Visual Grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: 1) what is the main focus in a query; 2) how to understand an image; 3) how to locate an object. Most existing methods combine all the information crudely, which may suffer from the problem of information redundancy (i.e. an ambiguous query, a complicated image, and a large number of objects). In this paper, we formulate these challenges as three attention...

10.1109/cvpr.2018.00808 article EN 2018-06-01
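
The accumulated-attention idea can be sketched as an iterative loop: attention over query words, image regions, and candidate objects is re-estimated for a few rounds, each round conditioned on summaries of the other two, so redundant information is progressively suppressed. The dot-product attention and fusion rule below are simplifying assumptions, not the paper's model.

```python
import torch
import torch.nn.functional as F

def attend(features, context):
    # features: (N, D), context: (D,) -> attention-weighted summary (D,)
    weights = F.softmax(features @ context, dim=0)  # (N,)
    return weights @ features

D = 64
query_words = torch.randn(8, D)     # word features of the query
image_regions = torch.randn(49, D)  # grid features of the image
objects = torch.randn(12, D)        # candidate object features

q, v, o = query_words.mean(0), image_regions.mean(0), objects.mean(0)
for _ in range(3):  # accumulate attention over a few rounds
    q = attend(query_words, v + o)
    v = attend(image_regions, q + o)
    o = attend(objects, q + v)
# Final grounding: score each candidate against the accumulated context.
scores = F.softmax(objects @ (q + v), dim=0)
print(scores.argmax())  # index of the grounded object
```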

Semantic segmentation aims to classify every pixel of an input image. Given the difficulty of acquiring dense labels, researchers have recently resorted to weak labels to alleviate the annotation burden of segmentation. However, existing works mainly concentrate on expanding the seed of pseudo labels within the image's salient region. In this work, we propose a non-salient region object mining approach for weakly supervised semantic segmentation. We introduce a graph-based global reasoning unit to strengthen the classification...

10.1109/cvpr46437.2021.00265 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

The accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application to the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the model with a recurrent function that maintains cross-modal state information for the agent....

10.1109/cvpr46437.2021.00169 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
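
The recurrent function can be pictured as a single state token carried across navigation steps and refreshed by cross-modal attention at each step, standing in for the history that a vanilla BERT cannot retain. The tiny transformer layer, sizes, and action-scoring rule below are placeholder assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

D = 64
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)

def vln_step(state, lang_tokens, vis_tokens):
    # state: (B, 1, D); lang/vis tokens: (B, L, D) / (B, V, D)
    seq = torch.cat([state, lang_tokens, vis_tokens], dim=1)
    out = layer(seq)
    new_state = out[:, :1, :]  # updated state token carries the history
    # Score each candidate view against the new state.
    action_logits = (out[:, -vis_tokens.size(1):, :] * new_state).sum(-1)
    return new_state, action_logits

state = torch.zeros(1, 1, D)
lang = torch.randn(1, 10, D)    # encoded instruction (fixed per episode)
for t in range(5):              # an episode of 5 navigation steps
    vis = torch.randn(1, 6, D)  # candidate views observed at this step
    state, logits = vln_step(state, lang, vis)
    action = logits.argmax(dim=-1)  # greedy action selection
print(state.shape, action)
```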

Due to the memorization effect in Deep Neural Networks (DNNs), training with noisy labels usually results in inferior model performance. Existing state-of-the-art methods primarily adopt a sample selection strategy, which selects small-loss samples for subsequent training. However, prior literature tends to perform sample selection within each mini-batch, neglecting the imbalance of noise ratios across different mini-batches. Moreover, the valuable knowledge within high-loss samples is wasted. To this end, we propose a noise-robust approach...

10.1109/cvpr46437.2021.00515 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
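
The mini-batch imbalance point is easy to demonstrate numerically: a per-batch small-loss criterion keeps a fixed fraction of every batch, while ranking losses globally lets the number of kept samples adapt to how noisy each batch actually is. The toy comparison below uses random losses and is not the paper's selection procedure.

```python
import torch

torch.manual_seed(0)
losses = torch.rand(1000)    # per-sample losses for one epoch
batches = losses.split(100)  # ten mini-batches
keep_ratio = 0.7

# Per-batch selection: exactly 70 samples kept from every batch,
# regardless of how noisy each batch really is.
per_batch = [b.topk(int(keep_ratio * len(b)), largest=False).indices
             for b in batches]

# Global selection: one loss threshold for the whole epoch, so the
# number kept per batch adapts to the batch's actual noise level.
threshold = losses.kthvalue(int(keep_ratio * len(losses))).values
kept_per_batch = [(b <= threshold).sum().item() for b in batches]
print(kept_per_batch)  # varies around 70 instead of being fixed
```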

Knowledge-based visual question answering requires the ability to associate external knowledge for open-ended cross-modal scene understanding. One limitation of existing solutions is that they capture relevant knowledge from text-only knowledge bases, which merely contain facts expressed by first-order predicates or language descriptions, while lacking the complex but indispensable multimodal knowledge for visual understanding. How to construct vision-relevant and explainable multimodal knowledge for the VQA scenario has been less studied. In this paper, we propose MuKEA to represent...

10.1109/cvpr52688.2022.00503 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

We describe a method for visual question answering which is capable of reasoning about an image on the basis of information extracted from a large-scale knowledge base. The method not only answers natural language questions using concepts not contained in the image, but can explain the reasoning by which it developed its answer. It is capable of answering far more complex questions than the predominant long short-term memory-based approach, and outperforms it significantly in testing. We also provide a dataset and a protocol by which to evaluate such general methods.

10.24963/ijcai.2017/179 preprint EN 2017-07-28

The visual dialog task requires an agent to engage in a conversation about an image with a human. It represents an extension of the visual question answering task in that the agent needs to answer questions about an image, but must do so in light of the previous dialog that has taken place. The key challenge is thus maintaining a consistent and natural dialogue while continuing to answer questions correctly. We present a novel approach that combines Reinforcement Learning and Generative Adversarial Networks (GANs) to generate more human-like responses to questions. The GAN helps overcome the relative paucity of training...

10.1109/cvpr.2018.00639 preprint EN 2018-06-01

Recognising objects according to a pre-defined, fixed set of class labels has been well studied in Computer Vision. There are a great many practical applications, however, where the subjects that may be of interest are not known beforehand, or are not so easily delineated. In these cases natural language dialog is a natural way to specify the subject of interest, and the task of achieving this capability (a.k.a. Referring Expression Comprehension) has recently attracted attention. To this end we propose a unified framework, the ParalleL AttentioN...

10.1109/cvpr.2018.00447 article EN 2018-06-01

In this paper, we exploit memory-augmented neural networks to predict accurate answers to visual questions, even when those answers rarely occur in the training set. The memory network incorporates both internal and external memory blocks and selectively pays attention to each training exemplar. We show that memory-augmented neural networks are able to maintain a relatively long-term memory of scarce training exemplars, which is important for visual question answering given the heavy-tailed distribution of answers in a general VQA setting. Experimental results on two large-scale benchmark datasets show favorable...

10.1109/cvpr.2018.00729 article EN 2018-06-01
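
A minimal view of the external-memory read described above: a fused question-image feature attends over stored exemplar slots, and the retrieved memory is concatenated with the query before answer classification. The memory size, dimensions, and single-hop read below are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn.functional as F

D, num_slots, num_answers = 128, 500, 1000

memory_keys = torch.randn(num_slots, D)  # stored exemplar features
memory_vals = torch.randn(num_slots, D)  # stored answer-side features
classifier = torch.nn.Linear(2 * D, num_answers)

def answer(query):
    # query: (B, D) fused question+image representation
    attn = F.softmax(query @ memory_keys.T, dim=-1)  # (B, num_slots)
    read = attn @ memory_vals                        # (B, D) memory readout
    return classifier(torch.cat([query, read], dim=-1))

q = torch.randn(4, D)
print(answer(q).shape)  # torch.Size([4, 1000])
```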

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning towards the final answer. How to capture question-oriented and information-complementary evidence remains a key challenge in solving the problem. In...

10.24963/ijcai.2020/153 preprint EN 2020-07-01

One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations, from detection and counting to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image, question, answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. Our method thus learns how to exploit a set of external off-the-shelf algorithms...

10.1109/cvpr.2017.416 preprint EN 2017-07-01
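
The off-the-shelf idea can be caricatured in a few lines: each external algorithm's output is encoded as a "fact", and the question attends over the facts to select relevant evidence. The fact strings and random encoders below are purely toy assumptions; a real system would learn the encoders and reason over many facts jointly.

```python
import torch
import torch.nn.functional as F

D = 64
facts = [
    "detector: dog 0.92 at (10, 20, 80, 90)",  # object detection output
    "counter: 2 dogs",                         # counting module output
    "scene: park",                             # scene classifier output
]
# Stand-in encoders; a real system would use learned text encoders.
fact_vecs = torch.randn(len(facts), D)
question_vec = torch.randn(D)

# Question-guided attention selects which external output to trust.
attn = F.softmax(fact_vecs @ question_vec, dim=0)
selected = attn.argmax().item()
print("most relevant fact:", facts[selected])
```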