Yanbin Hao

ORCID: 0000-0002-0695-1566
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image and Video Retrieval Techniques
  • Video Surveillance and Tracking Methods
  • Video Analysis and Summarization
  • Image Retrieval and Classification Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Anomaly Detection Techniques and Applications
  • Topic Modeling
  • Gait Recognition and Analysis
  • Advanced Vision and Imaging
  • Advanced Neural Network Applications
  • Computer Graphics and Visualization Techniques
  • Hand Gesture Recognition Systems
  • Natural Language Processing Techniques
  • Robotics and Sensor-Based Localization
  • Advanced Text Analysis Techniques
  • Neural Networks and Applications
  • Face Recognition and Analysis
  • Human Motion and Animation
  • Recommender Systems and Techniques
  • Expert Finding and Q&A Systems
  • Cancer-Related Molecular Mechanisms Research
  • Digital Imaging for Blood Diseases

Hefei University of Technology
2014-2025

University of Science and Technology of China
2021-2025

City University of Hong Kong
2019-2021

Central China Normal University
2016-2018

Shanghai Maritime University
2009-2010

Northwestern Polytechnical University
2006

Recent transformer-based solutions have shown great success in 3D human pose estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost grows quadratically with the number of joints. This drawback becomes even worse for pose estimation in a video sequence, which necessitates spatio-temporal correlation spanning the entire video. In this paper, we address the issue by decomposing the learning into space and time, and present a novel Spatio-Temporal...

10.1109/cvpr52729.2023.00464 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
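
The quadratic growth mentioned in the abstract above can be made concrete with a small count of query-key pairs. The sketch below is illustrative only and is not the paper's method: the function name `attention_pair_counts` is hypothetical, and it assumes that "decomposing into space and time" means one attention pass within each frame plus one pass along each joint's trajectory.

```python
def attention_pair_counts(num_frames: int, num_joints: int) -> dict:
    """Count query-key pairs for full vs. space/time-decomposed attention.

    Assumption (not from the source): the decomposed variant runs one
    spatial pass per frame and one temporal pass per joint.
    """
    tokens = num_frames * num_joints
    # Full spatio-temporal attention: every token attends to every token.
    full = tokens * tokens
    # Spatial pass: within each frame, joints attend to each other.
    spatial = num_frames * num_joints ** 2
    # Temporal pass: each joint attends along its own trajectory over frames.
    temporal = num_joints * num_frames ** 2
    return {"full": full, "decomposed": spatial + temporal}


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print(attention_pair_counts(num_frames=4, num_joints=3))
```

Under these assumptions the full affinity matrix has (T·J)² entries, while the decomposed form has T·J² + J·T², so the gap widens as the sequence gets longer.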

Representing procedure text such as a recipe for cross-modal retrieval is inherently a difficult problem, not to mention generating an image from it for visualization. This paper studies a new version of GAN, named Recipe Retrieval Generative Adversarial Network (R2GAN), to explore the feasibility of the generation problem. The motivation for using GAN is twofold: learning compatible cross-modal features in an adversarial way, and explaining search results by showing images generated from recipes. The novelty of R2GAN comes...

10.1109/cvpr.2019.01174 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Near-duplicate video retrieval (NDVR) has been a significant research task in multimedia given its high-impact applications, such as search, recommendation, and copyright protection. In addition to accurate performance, the exponential growth of online videos has imposed heavy demands on the efficiency and scalability of existing systems. Aiming at improving both accuracy and speed, we propose a novel stochastic multiview hashing algorithm to facilitate the construction of a large-scale NDVR system. Reliable mapping...

10.1109/tmm.2016.2610324 article EN IEEE Transactions on Multimedia 2016-09-15

Transformer achieves remarkable successes in understanding 1- and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares the merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying-length inputs. However, its encoders naturally contain computationally intensive operations such as pair-wise self-attention, incurring a heavy burden when applied to the complex...

10.1145/3474085.3475272 preprint EN Proceedings of the 29th ACM International Conference on Multimedia 2021-10-17

Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts. However, current research on attention generally focuses on adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or global context) to refine features and neglects their underlying correlation when computing attentions. This leads to incomplete context utilization and hence to limited improvement. To tackle the problem, this paper proposes an...

10.1109/tcsvt.2022.3169842 article EN IEEE Transactions on Circuits and Systems for Video Technology 2022-04-22

Learning discriminative representation from the complex spatio-temporal dynamic space is essential for video recognition. On top of those stylized computational units, further refining the learnt feature with axial contexts has been demonstrated to be promising in achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire channels and can hardly deal with diverse activities. The problem can be tackled by using pair-wise attentions to recompute the response...

10.1109/cvpr52688.2022.00100 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01

Human motion prediction from a historical pose sequence is at the core of many applications in machine intelligence. However, in current state-of-the-art methods, the predicted future motion is confined within the same activity. One can neither generate predictions that differ from that activity, nor manipulate body parts to explore various possibilities. Undoubtedly, this greatly limits the usefulness and applicability of motion prediction. In this paper, we propose a generalization of the human motion prediction task in which control parameters can be readily incorporated...

10.1609/aaai.v35i3.16321 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Predicting human motion from a historical pose sequence is at the core of many applications in computer vision. Current state-of-the-art methods concentrate on learning motion contexts in the pose space; however, the high dimensionality and complex nature of human pose invoke inherent difficulties in extracting such contexts. In this paper, we instead advocate modeling the joint trajectory, which is smooth, vectorial, and gives sufficient information to the model. Moreover, most existing methods consider only the dependencies between skeletally connected joints,...

10.1109/iccv48922.2021.01305 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

Zero-shot learning (ZSL) suffers intensely from the domain shift issue, i.e., the mismatch (or misalignment) between the true and learned data distributions for classes without training data (unseen classes). By additionally using unlabelled data collected from the unseen classes, transductive ZSL (TZSL) could reduce the shift, but only to a certain extent. To improve TZSL, we propose a novel approach, Bi-VAEGAN, which strengthens the distribution alignment between the visual space and an auxiliary space. As a result, it can largely reduce the domain shift. The proposed key...

10.1109/cvpr52729.2023.01905 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

In this paper, a novel unsupervised hashing algorithm, referred to as t-USMVH, and its extension to deep hashing, t-UDH, are proposed to support large-scale video-to-video retrieval. To improve the robustness of the learning, t-USMVH combines multiple types of feature representations and effectively fuses them by examining a continuous relevance score based on a Gaussian estimation over pairwise distances, and also a discrete neighbor score based on the cardinality of reciprocal neighbors. To reduce sensitivity to scale changes for mapping...

10.1109/tip.2017.2737329 article EN IEEE Transactions on Image Processing 2017-08-07

Sentiment analysis is an important topic concerning the identification of feelings, attitudes, emotions and opinions from text. To automate such analysis, a large amount of example text needs to be manually annotated for model training. This is laborious and expensive, but the cross-domain technique is a key solution for reducing the cost by reusing reviews across domains. However, its success largely relies on learning a robust common representation space across domains. In recent years, significant effort has been invested to improve...

10.1109/tkde.2019.2913379 article EN IEEE Transactions on Knowledge and Data Engineering 2019-04-27

Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and to overlooking the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR), which aims to improve feature representativeness and discriminability. In our approach, we first calculate the relations between semantic labels...

10.1609/aaai.v38i7.28614 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

This paper offers an insightful examination of how currently top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology, including video generation, understanding, and streaming. It highlights the innovative use of these technologies in producing highly realistic videos, a significant leap in bridging the gap between real-world dynamics and digital creation. The study also delves into the advanced capabilities of LLMs...

10.36227/techrxiv.171172801.19993069/v1 preprint EN cc-by-sa 2024-03-29

10.1109/cvpr52733.2024.01630 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Young children are devoting increasing time to playing on handheld touchscreen devices (e.g., iPads). Though thousands of apps are claimed to be "educational," there is a lack of sufficient evidence examining the impact of touchscreens on children's learning outcomes. In the present study, the two questions we focused on were (a) whether using a touchscreen was helpful in teaching children to tell time, and (b) to what extent young children could transfer what they had learned to other media. A pre- and posttest design was adopted. After learning to read the time on the iPad for 10 minutes, three...

10.3389/fpsyg.2016.01800 article EN cc-by Frontiers in Psychology 2016-11-17

Various structural relations/dependencies exist among human body joints, which makes it possible to estimate 3D poses from 2D sources. The current research on 3D human pose estimation (3D-HPE for short) mainly focuses on information from a specific perspective. However, this cannot facilitate 2D-to-3D lifting. This paper presents a novel and efficient multi-layer perceptron with joint-coordinate gating (MLP-JCG) model, exploring and utilizing both local and global information to perform 3D estimations. Specifically, MLP-JCG...

10.1109/tmm.2023.3240455 article EN IEEE Transactions on Multimedia 2023-01-01

Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for 3D human pose estimation (3D-HPE) in video. Recent studies have shown the promising potential of modeling the relation with feature-mixing operations on the temporal domain. However, they seldom consider the interaction across the frequency domain. This paper proposes a Frequency-Temporal Collaborative Module (FTCM) to explore the feasibility of encoding correlations in both the frequency and temporal domains. FTCM aims to jointly capture global and local correlations with a more lightweight network...

10.1109/tcsvt.2023.3286402 article EN IEEE Transactions on Circuits and Systems for Video Technology 2023-06-23

The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or the huge storage requirements for text embeddings. Given that it is trivial to obtain images...

10.1145/3581783.3611891 preprint EN 2023-10-26

This study introduces an efficacious approach, Masked Collaborative Contrast (MCC), to highlight semantic regions in weakly supervised segmentation. MCC adroitly draws inspiration from masked image modeling and contrastive learning to devise a novel framework that induces keys to contract toward semantic regions. Unlike prevalent techniques that directly eradicate patch regions in the input image when generating masks, we scrutinize the neighborhood relations of patch tokens by exploring masks considering keys on the affinity matrix. Moreover,...

10.1109/wacv57701.2024.00091 article EN 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024-01-03

Few-shot learning (FSL) aims at recognizing a novel object under limited training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of the FSL model. However, training an effective backbone is a challenging issue since 1) designing and validating the structures of backbones are time-consuming and expensive processes, and 2) a backbone trained on the known (base) categories is more inclined to focus on the textures of the objects it learns, which makes it hard to describe novel objects. To solve these problems, we propose a mixture...

10.1109/tip.2024.3411452 article EN IEEE Transactions on Image Processing 2024-01-01

The practical use of Transformer-based methods for processing videos is constrained by their high computing complexity. Although previous approaches adopt the spatiotemporal decomposition of 3D attention to mitigate the issue, they suffer from the drawback of neglecting the majority of visual tokens. This paper presents a novel mixed attention operation that subtly fuses random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing the other attention methods....

10.1145/3712594 article EN ACM Transactions on Multimedia Computing Communications and Applications 2025-01-17

Recent advancements in text-to-image generation models have excelled in creating diverse and realistic images. This success extends to food imagery, where various conditional inputs like cooking styles, ingredients, and recipes are utilized. However, a yet-unexplored challenge is generating a sequence of procedural images based on the cooking steps from a recipe. This could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called...

10.48550/arxiv.2501.09042 preprint EN arXiv (Cornell University) 2025-01-15

10.1109/icassp49660.2025.10890443 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12