- Face recognition and analysis
- Face and Expression Recognition
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Speech and Audio Processing
- Advanced Image Processing Techniques
- Emotion and Mood Recognition
- Generative Adversarial Networks and Image Synthesis
- Human Motion and Animation
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Video Surveillance and Tracking Methods
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Machine Learning and Data Classification
- Image Retrieval and Classification Techniques
- Image Processing Techniques and Applications
- Image and Signal Denoising Methods
- Image Enhancement Techniques
- Anomaly Detection Techniques and Applications
- Hand Gesture Recognition Systems
- Geotechnical Engineering and Underground Structures
- 3D Shape Modeling and Analysis
- Music and Audio Processing
University of Science and Technology of China
2016-2025
Jilin University of Chemical Technology
2025
Xi'an Technological University
2012-2024
Xinjiang University
2024
Guangxi University
2024
Central South University
2010-2023
Chongqing University of Science and Technology
2023
Liaoning Shihua University
2023
Tongji University
2009-2022
State Grid Corporation of China (China)
2021
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective "co-attention" model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a Modular Co-Attention Network (MCAN)...
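The co-attention idea above, associating question words with image regions, can be illustrated by a generic guided-attention step in which each word vector attends over region vectors. This is a minimal plain-Python sketch, assuming toy feature vectors and omitting learned projections, multiple heads, and layer stacking; it is not the MCAN implementation, whose details are not given here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(words: List[Vector], regions: List[Vector]) -> List[Vector]:
    """For each question-word vector, return an attention-weighted sum of image-region vectors.

    Scaled dot-product attention with the regions used as both keys and values
    (no learned projections: an illustrative simplification, not a real model).
    """
    d = len(regions[0])
    out = []
    for q in words:
        # Similarity of this word to every region, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in regions]
        weights = softmax(scores)
        # Weighted combination of region features, one output vector per word.
        out.append([sum(w * r[j] for w, r in zip(weights, regions)) for j in range(d)])
    return out
```

With `words=[[1.0, 0.0]]` and `regions=[[1.0, 0.0], [0.0, 1.0]]`, the single word attends more strongly to the first region because its dot product with that region is larger.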
The sample selection approach is popular in learning with noisy labels. State-of-the-art methods train two deep networks simultaneously for sample selection, which aims to exploit their different learning abilities. To prevent the two networks from converging to a consensus, their divergence should be maintained. Prior work shows that the divergence can be kept by locating the disagreement data, on which the prediction labels of the two networks are different. However, this procedure is sample-inefficient for generalization, which means only a few clean examples are utilized in training. In this paper,...
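The disagreement-based selection described above can be sketched in a few lines: find the samples on which the two networks' predicted labels differ, then keep only part of them for training. The small-loss criterion used here (treating low-loss samples as likely clean) is an assumption drawn from the common co-teaching-style heuristic, and `keep_ratio` is a hypothetical parameter; this is not the paper's proposed method.

```python
from typing import List, Sequence


def disagreement_indices(pred_a: Sequence[int], pred_b: Sequence[int]) -> List[int]:
    """Indices of samples on which the two networks predict different labels."""
    return [i for i, (a, b) in enumerate(zip(pred_a, pred_b)) if a != b]


def select_small_loss(losses: Sequence[float], candidates: List[int], keep_ratio: float) -> List[int]:
    """Keep the fraction `keep_ratio` of candidates with the smallest loss (assumed likely clean)."""
    n_keep = int(len(candidates) * keep_ratio)
    return sorted(candidates, key=lambda i: losses[i])[:n_keep]
```

For example, with predictions `[0, 1, 2, 1, 0]` and `[0, 2, 2, 0, 1]` the disagreement set is `[1, 3, 4]`; with losses `[0.1, 0.9, 0.2, 0.3, 0.5]` and `keep_ratio=0.67`, only samples `[3, 4]` survive. Two of five samples reach training, which illustrates the sample inefficiency the abstract points out.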
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work, we present an effective weakly-supervised model, named the Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage. The proposed method leverages the idea...
Given an arbitrary speech clip or text as input, the proposed work aims to generate a talking face video with accurate lip synchronization. Existing works mainly have three limitations. (1) Single-modal learning is adopted, with either audio or text as input, and hence it lacks the complementarity of multimodal inputs. (2) Each frame is generated independently, which ignores...
Accurate facial expression recognition is challenging because identity biases introduce large intraclass variations and high interclass similarities. Most existing approaches are devoted to alleviating the effects of identity. However, based on theories from cognitive science, psychology, and physiology, this article argues that identity information is important and can promote expression recognition. Motivated by our investigation of the influences of identity on expression recognition, this article proposes an identity–expression dual branch network (IE-DBN) for facial expression recognition. First,...
Multi-object tracking acquires target location and identity information through two subtasks, detection and re-identification (ReID). The commonly used one-shot framework has speed advantages, but the two subtasks have different feature requirements, which leads to competitive learning during training and thus weakens feature quality. We propose FDTrack, a multi-object tracker based on feature decoupling, to resolve these contradictory requirements. Through mutual inhibition, the features from the backbone network are decoupled. Then...
With the rapid advancement of AI technology, there has been a substantial surge in the need for computational resources. Particularly in deep learning, machine learning, and large-scale data analysis, processing extensive datasets necessitates exceptionally high levels of efficacy and speed. Conventional homogeneous computing platforms, predominantly reliant on Central Processing Units (CPUs), have encountered challenges in meeting the escalating demands of high-performance computing. Consequently, this study advocates...
At present, the YOLO algorithm has become an indispensable core technology for real-time object detection in areas such as autonomous driving, face detection, and robotics, and its versions are constantly being updated and upgraded. Herein, we analyze in depth the evolution of the YOLO algorithm and carefully investigate the innovations and contributions of its iterations from YOLOv1 to YOLOv5. We offer prospects for its future development and point out the feasibility and necessity of further research on the algorithm.
As a focal point of research in various fields, human body language understanding has long been a subject of intense interest. Within this realm, the exploration of emotion recognition through the analysis of facial expressions, voice patterns, and physiological signals holds significant practical value. Compared with unimodal approaches, multimodal models leverage complementary information from the vision, acoustic, and other modalities to robustly perceive sentiment and attitudes. However, the heterogeneity among modalities...
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf...
The generative adversarial network (GAN) is a powerful generative model. However, it suffers from several problems, such as convergence instability and mode collapse. To overcome these drawbacks, this paper presents a novel GAN architecture that consists of one generator and two different discriminators. Given that a GAN is analogous to a minimax game, the proposed architecture works as follows. The generator (G) aims to produce realistic-looking samples to fool both discriminators. The first discriminator (D1) rewards high scores for samples from the data distribution, while...
We address the issues of 3-D head pose estimation and face modeling from a depth image. Given a depth image, random forests are effective for estimating the location and orientation of a person's head. However, their accuracy is not high enough. We propose using corrected regression votes, which are obtained by considering the cooperation of all trees, leading to a significant improvement in accuracy. Based on this estimator, we present a face modeling system. In our system, a face model is generated by aligning a deformable face model with the depth image using an iterative closest point (ICP)...
Characterization of the pore throat size distribution (PTSD) in tight sandstones is of substantial significance for evaluating sandstone reservoirs. High-pressure mercury intrusion (HPMI) and nuclear magnetic resonance (NMR) are effective methods for characterizing the PTSD of such reservoirs. NMR T2 spectra are usually converted to capillary pressure curves for PTSD characterization. However, this conversion is challenging due to the tiny pore throat sizes. In this paper, a linear conversion method and a nonlinear conversion method are investigated, and an error minimization method based on least squares is proposed...
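A least-squares fit of a T2-to-pore-radius conversion can be sketched as below. The forms `r = C * T2` (linear) and `r = C * T2**n` (a power-law stand-in for a nonlinear method) are assumptions, as is the premise that T2 values and HPMI radii have already been paired at matching cumulative-frequency points; the paper's actual conversion and error function are not given here.

```python
import math
from typing import Sequence, Tuple


def fit_linear(t2: Sequence[float], radius: Sequence[float]) -> float:
    """Least-squares C for r = C * T2 (line through the origin): C = sum(t*r) / sum(t*t)."""
    return sum(t * r for t, r in zip(t2, radius)) / sum(t * t for t in t2)


def fit_power(t2: Sequence[float], radius: Sequence[float]) -> Tuple[float, float]:
    """Fit r = C * T2**n by ordinary least squares of ln r on ln T2.

    Note: this minimizes squared error in log space, not in the original units.
    """
    xs = [math.log(t) for t in t2]
    ys = [math.log(r) for r in radius]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    n = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - n * mx)
    return c, n
```

The linear coefficient has a closed form because minimizing the sum of (r_i - C * T2_i)^2 and setting the derivative with respect to C to zero gives C = sum(T2_i * r_i) / sum(T2_i^2).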
In recent years, the rapid development of artificial intelligence, especially deep learning technology, has brought machine learning into power system applications such as stability analysis, coordination, scheduling, and load forecasting. This paper designs an emotional deep learning programming controller (EDLPC) for automatic voltage control systems. The designed EDLPC contains an emotional deep neural network (EDNN) structure and a Q-learning algorithm. Besides, a specially defined proportional-integral-derivative (PID)...
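The Q-learning component mentioned above rests on the standard temporal-difference update Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). The sketch below shows only that generic tabular rule; the state and action names in the usage note (for example a hypothetical "raise"/"hold"/"lower" voltage-setpoint action set) are illustrative assumptions, and nothing here reproduces how the EDLPC couples Q-learning with its EDNN.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state: Hashable, action: Hashable, reward: float,
             next_state: Hashable, actions: Sequence[Hashable],
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one tabular Q-learning update in place and return the new Q(state, action)."""
    # Greedy value of the next state over all available actions.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]
```

Starting from an empty table (`q = defaultdict(float)`), one update with reward 1.0 moves Q("low_v", "raise") from 0 to 0.1, since all next-state values are still 0.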
Tracking by natural language specification in a video is a challenging task in computer vision. Distinct from initializing the target state only with a bounding box in the first frame, natural language specification has strong potential to help visual object trackers capture appearance variation and eliminate semantic ambiguity of the tracked object. In this paper, we carefully design a unified local-global-search framework from the perspective of cross-modal retrieval, including a local tracker, an adaptive retrieval switch module, and a target-specific module....