Jun Yu

ORCID: 0000-0002-3197-8103
Research Areas
  • Face recognition and analysis
  • Face and Expression Recognition
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Speech and Audio Processing
  • Advanced Image Processing Techniques
  • Emotion and Mood Recognition
  • Generative Adversarial Networks and Image Synthesis
  • Human Motion and Animation
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Domain Adaptation and Few-Shot Learning
  • Video Surveillance and Tracking Methods
  • Advanced Vision and Imaging
  • Video Analysis and Summarization
  • Machine Learning and Data Classification
  • Image Retrieval and Classification Techniques
  • Image Processing Techniques and Applications
  • Image and Signal Denoising Methods
  • Image Enhancement Techniques
  • Anomaly Detection Techniques and Applications
  • Hand Gesture Recognition Systems
  • Geotechnical Engineering and Underground Structures
  • 3D Shape Modeling and Analysis
  • Music and Audio Processing

University of Science and Technology of China
2016-2025

Jilin University of Chemical Technology
2025

Xi'an Technological University
2012-2024

Xinjiang University
2024

Guangxi University
2024

Central South University
2010-2023

Chongqing University of Science and Technology
2023

Liaoning Shihua University
2023

Tongji University
2009-2022

State Grid Corporation of China (China)
2021

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep models show little improvement over their shallow counterparts. In this paper, we propose a Modular Co-Attention Network (MCAN)...

10.1109/cvpr.2019.00644 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

The sample selection approach is popular in learning with noisy labels. State-of-the-art methods train two deep networks simultaneously for sample selection, which aims to employ their different learning abilities. To prevent the two networks from converging to a consensus, their divergence should be maintained. Prior work shows that the divergence can be kept by locating the disagreement data on which the prediction labels of the two networks are different. However, this procedure is sample-inefficient for generalization, which means that only a few clean examples can be utilized in training. In this paper,...

10.1109/iccv51070.2023.00176 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work we present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage. The proposed method leverages the idea...

10.48550/arxiv.2003.07048 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep models show little improvement over their shallow counterparts. In this paper, we propose a Modular Co-Attention Network (MCAN)...

10.48550/arxiv.1906.10770 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Given an arbitrary speech clip or text information as input, the proposed work aims to generate a talking face video with accurate lip synchronization. Existing works mainly have three limitations. (1) Single-modal learning is adopted, using either audio or text, hence it lacks the complementarity of multimodal inputs. (2) Each frame is generated independently, which ignores...

10.1109/tcsvt.2020.2973374 article EN IEEE Transactions on Circuits and Systems for Video Technology 2020-02-12

Accurate facial expression recognition is challenging because identity biases introduce large intraclass variations and high interclass similarities. Most existing approaches are devoted to alleviating the effects of identity. However, based on theories of cognitive science, psychology, and physiology, this article argues that identity information is important and can promote expression recognition. Motivated by our investigation of the influences of identity on expression recognition, this article proposes an identity–expression dual branch network (IE-DBN) for facial expression recognition. First,...

10.1109/tcds.2020.3034807 article EN IEEE Transactions on Cognitive and Developmental Systems 2020-10-29

Multi-object tracking acquires target location information and identity through two subtasks, detection and re-identification (ReID). The commonly used one-shot framework has speed advantages, but the two subtasks have different feature requirements, which leads to competitive learning in training and thus weakens quality. We propose a decoupling-based multi-object tracker, FDTrack, for the contradictory requirements. Through mutual inhibition, the features of the backbone network are decoupled. Then...

10.1109/tcsvt.2023.3249162 article EN IEEE Transactions on Circuits and Systems for Video Technology 2023-02-27

With the rapid advancement of AI technology, there has been a substantial surge in the need for computational resources. Particularly in deep learning, machine learning, and large-scale data analysis, processing extensive datasets necessitates exceptionally high levels of efficacy and speed. Conventional homogeneous computing platforms, predominantly reliant on Central Processing Units (CPUs), have encountered challenges in meeting the escalating demands of high-performance computing. Consequently, this study advocates...

10.1109/jiot.2025.3526662 article EN IEEE Internet of Things Journal 2025-01-01

At present, the YOLO algorithm has become an indispensable core real-time object detection technology in areas such as unmanned driving, face detection, and robot applications, and its versions are constantly being updated and upgraded. Herein, we deeply analyze the evolution process of the YOLO algorithm and carefully investigate the innovations and contributions arising from the iterations from YOLOv1 to YOLOv5. We outline prospects for the future development direction of YOLO and point out the feasibility and necessity of research on the YOLO algorithm.

10.1117/12.3055712 article EN 2025-01-09

As a focal point of research in various fields, human body language understanding has long been a subject of intense interest. Within this realm, the exploration of emotion recognition through the analysis of facial expressions, voice patterns, and physiological signals holds significant practical value. Compared with unimodal approaches, multimodal models leverage complementary information from vision, acoustic, and physiological modalities to robustly perceive sentiment attitudes. However, the heterogeneity among modality...

10.1145/3711865 article EN ACM Transactions on Multimedia Computing Communications and Applications 2025-01-10

Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf...

10.48550/arxiv.2502.12799 preprint EN arXiv (Cornell University) 2025-02-18

The generative adversarial network (GAN) is a powerful generative model. However, it suffers from several problems, such as convergence instability and mode collapse. To overcome these drawbacks, this paper presents a novel GAN architecture, which consists of one generator and two different discriminators. Given that a GAN is analogous to a minimax game, the proposed architecture is as follows. The generator (G) aims to produce realistic-looking samples to fool both discriminators. The first discriminator (D1) rewards high scores for samples from the data distribution, while...

10.1145/3283254.3283282 article EN 2018-11-30

We address the issues of 3-D head pose estimation and face modeling from a depth image. Given a depth image, random forests are effective for estimating the location and orientation of a person's head. However, the accuracy is not high enough. We propose using random forests with corrected regression votes. The corrected votes are obtained by considering the cooperation of all trees, leading to a significant improvement in accuracy. Based on the head pose estimator, we present a face modeling system. In our system, a face model is generated by aligning a deformable face model to the depth image using an iterative closest point (ICP)...

10.1109/tmm.2019.2903724 article EN IEEE Transactions on Multimedia 2019-03-07

Characterization of pore throat size distribution (PTSD) in tight sandstones is of substantial significance for tight sandstone reservoir evaluation. High-pressure mercury intrusion (HPMI) and nuclear magnetic resonance (NMR) are effective methods for characterizing the PTSD of reservoirs. NMR T2 spectra are usually converted to capillary pressure for PTSD characterization. However, the conversion is challenging in tight sandstones due to tiny pore throat sizes. In this paper, the linear conversion method and the nonlinear conversion method are investigated, and error minimization and least square methods are proposed...

10.3390/en12081528 article EN cc-by Energies 2019-04-23

In recent years, the rapid development of artificial intelligence, especially deep learning technology, has given machine learning application scenarios in the fields of power system stability analysis, coordination and scheduling, and load forecasting. This paper designs an emotional deep learning programming controller (EDLPC) for automatic voltage control of power systems. The designed EDLPC contains an emotional deep neural network (EDNN) structure and a Q-learning algorithm. Besides, a specially defined proportional-integral-derivative (PID)...

10.1109/access.2021.3060620 article EN cc-by IEEE Access 2021-01-01

Tracking by natural language specification in a video is a challenging task in computer vision. Distinct from initializing the target state only by the bounding box in the first frame, language specification has strong potential to assist visual object trackers in capturing appearance variation and eliminating semantic ambiguity of the tracked object. In this paper, we carefully design a unified local-global-search framework from the perspective of cross-modal retrieval, including a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module....

10.1109/cvprw56347.2022.00540 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2022-06-01