- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Video Surveillance and Tracking Methods
- Video Analysis and Summarization
- Image Retrieval and Classification Techniques
- Generative Adversarial Networks and Image Synthesis
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Gait Recognition and Analysis
- Advanced Vision and Imaging
- Advanced Neural Network Applications
- Computer Graphics and Visualization Techniques
- Hand Gesture Recognition Systems
- Natural Language Processing Techniques
- Robotics and Sensor-Based Localization
- Advanced Text Analysis Techniques
- Neural Networks and Applications
- Face recognition and analysis
- Human Motion and Animation
- Recommender Systems and Techniques
- Expert finding and Q&A systems
- Cancer-related molecular mechanisms research
- Digital Imaging for Blood Diseases
Hefei University of Technology
2014-2025
University of Science and Technology of China
2021-2025
City University of Hong Kong
2019-2021
Central China Normal University
2016-2018
Shanghai Maritime University
2009-2010
Northwestern Polytechnical University
2006
Recent transformer-based solutions have shown great success in 3D human pose estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost grows quadratically with the number of joints. This drawback becomes even worse for pose estimation in a video sequence, which necessitates spatio-temporal correlation spanning the entire video. In this paper, we address this issue by decomposing correlation learning into space and time, and present a novel Spatio-Temporal...
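To make the quadratic growth concrete, the following sketch counts the pairwise affinity entries needed when every joint in every frame attends to every other, and compares that with a pass split into space and time. It is only an illustration of the general idea of decomposition; the function names are hypothetical and the counting scheme is an assumption, not the method of the paper.

```python
def full_attention_pairs(num_frames: int, num_joints: int) -> int:
    """Affinity entries when all (frame, joint) tokens attend to each other."""
    tokens = num_frames * num_joints
    return tokens * tokens


def decomposed_attention_pairs(num_frames: int, num_joints: int) -> int:
    """Affinity entries when attention is split into a spatial and a temporal pass.

    Assumption: the spatial pass relates joints within each frame, and the
    temporal pass relates frames separately for each joint.
    """
    spatial = num_frames * num_joints * num_joints
    temporal = num_joints * num_frames * num_frames
    return spatial + temporal


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    frames, joints = 243, 17
    print(full_attention_pairs(frames, joints))
    print(decomposed_attention_pairs(frames, joints))
```

Under these assumptions the joint count grows with the square of the total token count, while the decomposed count grows with the square of each axis separately.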
Representing procedural text such as a recipe for cross-modal retrieval is inherently a difficult problem, not to mention generating an image from a recipe for visualization. This paper studies a new version of GAN, named Recipe Retrieval Generative Adversarial Network (R2GAN), to explore the feasibility of generating images from procedural text for the retrieval problem. The motivation for using GAN is twofold: learning compatible cross-modal features in an adversarial way, and explaining search results by showing the images generated from recipes. The novelty of R2GAN comes...
Near-duplicate video retrieval (NDVR) has been a significant research task in multimedia given its high-impact applications, such as video search, recommendation, and copyright protection. In addition to accurate retrieval performance, the exponential growth of online videos has imposed heavy demands on the efficiency and scalability of existing systems. Aiming at improving both retrieval accuracy and speed, we propose a novel stochastic multiview hashing algorithm to facilitate the construction of a large-scale NDVR system. Reliable mapping...
The Transformer achieves remarkable success in understanding 1- and 2-dimensional signals (e.g., NLP and image content understanding). As a potential alternative to convolutional neural networks, it shares the merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying-length inputs. However, its encoders naturally contain computationally intensive operations such as pair-wise self-attention, incurring a heavy computational burden when being applied to the complex...
Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts. However, current research on video attention generally focuses on adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or global context) to refine the features and neglects their underlying correlation when computing attentions. This leads to incomplete context utilization and hence bears the weakness of limited performance improvement. To tackle the problem, this paper proposes an...
Learning a discriminative representation from the complex spatio-temporal dynamic space is essential for video recognition. On top of those stylized spatio-temporal computational units, further refining the learnt feature with axial contexts has been demonstrated to be promising in achieving this goal. However, previous works generally focus on utilizing a single kind of context to calibrate entire feature channels and could hardly apply to diverse video activities. The problem can be tackled by using pair-wise spatio-temporal attentions to recompute feature response...
Human motion prediction from a historical pose sequence is at the core of many applications in machine intelligence. However, in current state-of-the-art methods, the predicted future motion is confined within the same activity. One can neither generate predictions that differ from the current activity, nor manipulate the body parts to explore various future possibilities. Undoubtedly, this greatly limits the usefulness and applicability of motion prediction. In this paper, we propose a generalization of the human motion prediction task in which control parameters can be readily incorporated...
Predicting human motion from a historical pose sequence is at the core of many applications in computer vision. Current state-of-the-art methods concentrate on learning motion contexts in the pose space; however, the high dimensionality and complex nature of human pose invoke inherent difficulties in extracting such contexts. In this paper, we instead advocate modeling motion contexts in the joint trajectory space, as the trajectory of a joint is smooth, vectorial, and gives sufficient information to the model. Moreover, most existing methods consider only the dependencies between skeletal connected joints,...
Zero-shot learning (ZSL) suffers intensely from the domain shift issue, i.e., the mismatch (or misalignment) between the true and learned data distributions for classes without training data (unseen classes). By additionally using unlabelled data collected for the unseen classes, transductive ZSL (TZSL) could reduce the shift but only to a certain extent. To improve TZSL, we propose a novel approach, Bi-VAEGAN, which strengthens the distribution alignment between the visual space and an auxiliary space. As a result, it can largely reduce the domain shift. The proposed key...
In this paper, a novel unsupervised hashing algorithm, referred to as t-USMVH, and its extension to unsupervised deep hashing, t-UDH, are proposed to support large-scale video-to-video retrieval. To improve the robustness of the unsupervised learning, t-USMVH combines multiple types of feature representations and effectively fuses them by examining a continuous relevance score based on a Gaussian estimation over pairwise distances, and also a discrete neighbor score based on the cardinality of reciprocal neighbors. To reduce sensitivity to scale changes for mapping...
Sentiment analysis is an important topic concerning the identification of feelings, attitudes, emotions and opinions from text. To automate such analysis, a large amount of example text needs to be manually annotated for model training. This is laborious and expensive, but the cross-domain technique is a key solution to reducing the cost by reusing annotated reviews across domains. However, its success largely relies on the learning of a robust common representation space across domains. In recent years, significant effort has been invested to improve...
Few-shot learning (FSL) based on manifold regularization aims to improve the recognition capacity of novel objects with limited training samples by mixing two samples from different categories with a blending factor. However, this mixing operation weakens the feature representation due to the linear interpolation and the overlooking of the importance of specific channels. To solve these issues, this paper proposes attentive feature regularization (AFR), which aims to improve feature representativeness and discriminability. In our approach, we first calculate the relations between semantic labels...
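The mixing step that this abstract describes as the baseline (blending two samples from different categories with a blending factor) can be sketched as a plain linear interpolation. This is a minimal illustration of that baseline only, not of AFR itself; the function name `mix_features` and the use of one-hot label vectors are assumptions made for the example.

```python
def mix_features(feat_a, feat_b, label_a, label_b, blend):
    """Linearly interpolate two feature vectors and their label vectors.

    `blend` is the blending factor in [0, 1]; a value of 1.0 returns sample A
    unchanged and 0.0 returns sample B unchanged.
    """
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    if len(feat_a) != len(feat_b) or len(label_a) != len(label_b):
        raise ValueError("inputs must have matching lengths")
    mixed_feat = [blend * a + (1.0 - blend) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [blend * a + (1.0 - blend) * b for a, b in zip(label_a, label_b)]
    return mixed_feat, mixed_label


if __name__ == "__main__":
    # Hypothetical two-dimensional features and two-class one-hot labels.
    feat, label = mix_features([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.25)
    print(feat, label)
```

Every channel is scaled by the same factor here, which is the property the abstract identifies as overlooking the importance of specific channels.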
This paper offers an insightful examination of how currently top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology, including video generation, understanding, and streaming. It highlights the innovative use of these technologies in producing highly realistic videos, a significant leap in bridging the gap between real-world dynamics and digital creation. The study also delves into the advanced capabilities of LLMs...
Young children are devoting increasing time to playing on handheld touchscreen devices (e.g., iPads). Though thousands of apps are claimed to be "educational," there is a lack of sufficient evidence examining the impact of touchscreens on children's learning outcomes. In the present study, the two questions we focused on were (a) whether using a touchscreen was helpful in teaching children to tell time, and (b) to what extent young children could transfer what they had learned on the touchscreen to other media. A pre- and posttest design was adopted. After learning to read the time on the iPad for 10 minutes, three...
Various structural relations/dependencies exist among human body joints, which makes it possible to estimate 3D poses from 2D sources. The current research on 3D human pose estimation (3D-HPE for short) mainly focuses on structural information from a specific perspective. However, this cannot facilitate 2D-to-3D lifting. This paper presents a novel and efficient multi-layer perceptron with joint-coordinate gating (MLP-JCG) model, exploring and utilizing both the local and global structural information to perform 3D pose estimations. Specifically, MLP-JCG...
Capturing cross-pose correlation from a sequence of frame-level 2D poses is essential for 3D human pose estimation (3D-HPE) in video. Recent studies have shown promising potential in modeling the cross-pose relation with feature-mixing operations on the temporal domain. However, they seldom consider the interaction across the frequency domain. This paper presents a Frequency-Temporal Collaborative Module (FTCM) to explore the feasibility of encoding cross-pose correlations in both the frequency and temporal domains. FTCM aims to jointly capture global and local correlations with a more lightweight network...
The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from the shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images...
This study introduces an efficacious approach, Masked Collaborative Contrast (MCC), to highlight semantic regions in weakly supervised semantic segmentation. MCC adroitly draws inspiration from masked image modeling and contrastive learning to devise a novel framework that induces keys to contract toward semantic regions. Unlike prevalent techniques that directly eradicate patch regions in the input image when generating masks, we scrutinize the neighborhood relations of patch tokens by exploring masks considering keys on the affinity matrix. Moreover,...
Few-shot learning (FSL) aims at recognizing a novel object under limited training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of the FSL model. However, training an effective backbone is a challenging issue since 1) designing and validating the structures of backbones are time-consuming and expensive processes, and 2) a backbone trained on the known (base) categories is more inclined to focus on the textures of the objects it learns, which makes it hard to describe novel samples. To solve these problems, we propose a mixture...
The practical use of Transformer-based methods for processing videos is constrained by their high computing complexity. Although previous approaches adopt the spatiotemporal decomposition of 3D attention to mitigate the issue, they suffer from the drawback of neglecting the majority of visual tokens. This paper presents a novel mixed attention operation that subtly fuses the random, spatial, and temporal attention mechanisms. The proposed random attention stochastically samples video tokens in a simple yet effective way, complementing other attention methods....
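As a rough illustration of attending to a stochastically sampled subset of tokens, the sketch below computes scaled dot-product attention from one query to a random sample of key/value tokens. It is a generic sketch under stated assumptions, not the mixed attention operation of the paper: the function name `random_attention`, the uniform sampling without replacement, and the single-query setup are all hypothetical choices.

```python
import math
import random


def random_attention(query, keys, values, num_samples, rng):
    """Attend from one query vector to a random subset of key/value tokens.

    Returns the attended output vector and the indices of the sampled tokens.
    Assumption: tokens are sampled uniformly without replacement.
    """
    if not 1 <= num_samples <= len(keys):
        raise ValueError("num_samples must be between 1 and the token count")
    picked = rng.sample(range(len(keys)), num_samples)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in picked]
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, picked)) / total
        for d in range(dim)
    ]
    return output, picked


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled subset is repeatable
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    out, picked = random_attention([1.0, 0.0], keys, values, 2, rng)
    print(out, picked)
```

The cost per query scales with `num_samples` rather than with the full token count, which is the reason such sampling is attractive for long video token sequences.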
Recent advancements in text-to-image generation models have excelled in creating diverse and realistic images. This success extends to food imagery, where various conditional inputs like cooking styles, ingredients, and recipes are utilized. However, a yet-unexplored challenge is generating a sequence of procedural images based on cooking steps from a recipe. This could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called...