- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Gait Recognition and Analysis
- Salivary Gland Tumors Diagnosis and Treatment
- Adversarial Robustness in Machine Learning
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Stroke Rehabilitation and Recovery
- Oral Health Pathology and Treatment
- Generative Adversarial Networks and Image Synthesis
- Topic Modeling
- Salivary Gland Disorders and Functions
- Natural Language Processing Techniques
- Hand Gesture Recognition Systems
- Video Analysis and Summarization
- Lipid metabolism and disorders
- Reproductive System and Pregnancy
- Human Motion and Animation
- Advanced Vision and Imaging
- AI in cancer detection
- Context-Aware Activity Recognition Systems
- Cancer-related molecular mechanisms research
Yangzhou University
2022-2024
University of Electronic Science and Technology of China
2016-2023
Human activity recognition in videos with convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Taking videos as a sequence of frames, a new record was recently set on several benchmark datasets by feeding frame-level CNN features to a long short-term memory (LSTM) model for video activity recognition. This recurrent model-based visual recognition pipeline is a natural choice for perceptual problems with time-varying visual input or sequential outputs. However, the above-mentioned pipeline takes frame-level CNN features as input to the LSTM, which...
3-D convolutional neural networks (3-D-convNets) have very recently been proposed for action recognition in videos, and promising results have been achieved. However, existing 3-D-convNets have two "artificial" requirements that may reduce the quality of video analysis: 1) they require a fixed-sized (e.g., 112 $\times$ 112) input video; and 2) most of them require fixed-length input (i.e., video shots with a fixed number of frames). To tackle these issues, we propose an end-to-end pipeline named Two-stream 3-D-convNet Fusion,...
The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck on common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", resulting in the loss of precise information and overall performance. If a model only uses "stone on road" rather than "blocking" to describe an image, it is easy to misunderstand the scene. We argue that this...
Video visual question answering (V-VQA) remains challenging at the intersection of vision and language, where it requires joint comprehension of a video and a natural language question. The image-question co-attention mechanism, which aims at generating a spatial map highlighting image regions relevant to the question and vice versa, has obtained impressive results. Despite this success, simply applying the co-attention mechanism to V-VQA results in unsatisfactory performance due to the complexity and temporal nature of videos. In this paper, we propose a novel architecture,...
Skeleton-based action recognition aims to project skeleton sequences to action categories, where skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods that focus on exploring single-form skeletons via Graph Convolutional Networks (GCNs), existing methods tend to improve GCNs by leveraging multi-form skeletons due to their complementary cues. However, these methods (either adapting the structure of GCNs or model ensemble) require the co-existence of all forms of skeletons during both training and inference stages, while a typical...
Graph convolutional networks (GCNs), which model human body skeletons as several spatial-temporal graphs, have been widely used and have become a key to representative feature extraction. However, existing methods have limitations in recognizing actions in the wild, where skeletons are captured from real-world scenes with diversified viewpoints, obvious motion blurs, complex interactions, and fast-varying resolutions of the human body. In this paper, we propose a Multi-modal Knowledge Embedded Graph Convolutional Network...
Human densepose estimation, which aims at establishing dense correspondences between 2D pixels of the human body and a 3D template, is a key technique in enabling machines to understand people in images. It still poses several challenges due to practical scenarios where real-world scenes are complex and only partial annotations are available, leading to incomplete or false estimations. In this work, we present a novel framework to detect the densepose of multiple people in an image. The proposed method, which we refer to as Knowledge Transfer...
To determine whether age at menarche (AAM), age at first live birth (AFB), and estradiol levels are causally correlated with the development of systemic lupus erythematosus (SLE). A two-sample Mendelian randomization (MR) analysis was performed after data were collected from a dataset of genome-wide association studies (GWASs) related to SLE (as outcome), and open-access databases were used to find statistics for AAM, AFB, and estradiol levels (as exposure). In our study, a negative causal correlation between AAM and SLE was confirmed by MR analysis (MR-Egger: beta = 0.116,...
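As a rough illustration of how a two-sample MR analysis combines per-variant GWAS summary statistics, the sketch below computes per-variant Wald ratios and the standard fixed-effect inverse-variance weighted (IVW) estimate. IVW is swapped in here because it is the simplest MR estimator; it is not the MR-Egger method reported in the abstract. The function names and the three toy variants are hypothetical and are not taken from the study.

```python
import math

def wald_ratios(beta_exposure, beta_outcome):
    """Per-variant causal estimates: outcome effect divided by exposure effect."""
    return [by / bx for bx, by in zip(beta_exposure, beta_outcome)]

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error.

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)

# Toy summary statistics for three hypothetical variants (illustrative only).
beta_x = [0.10, 0.20, 0.05]     # variant-exposure effects
beta_y = [-0.02, -0.04, -0.01]  # variant-outcome effects
se_y = [0.01, 0.02, 0.01]       # standard errors of the outcome effects

ratios = wald_ratios(beta_x, beta_y)
ivw_beta, ivw_se = ivw_estimate(beta_x, beta_y, se_y)
print(ratios, ivw_beta, ivw_se)
```

In this toy input every variant implies the same ratio, so the IVW estimate coincides with the individual Wald ratios; real analyses add sensitivity estimators such as MR-Egger to check for pleiotropy.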
Despite the recent great progress on multi-person pose estimation, existing solutions still struggle under the condition of "crowded scenes", where RGB images capture complex real-world scenes with highly overlapped people, severe occlusions, and diverse postures. In this work, we focus on two main problems: 1) how to design an effective pipeline for crowded scenes pose estimation; and 2) how to equip this pipeline with the ability of relation modeling for interference resolving. To tackle these problems, we propose a new pipeline named Relation based...
Multiple human parsing (MHP) is typically treated as two sub-tasks, i.e., instance separation and body part segmentation. Existing methods usually tackle the two sub-tasks by adopting a two-stage strategy, which regards MHP as an ROI-based (i.e., detect-then-segment) or grouping-based (i.e., segment-then-grouping) paradigm. However, the strong dependence between them limits the potential of the two-stage method, since it often requires qualified prior predictions. Besides, the isolated models responsible for the sub-tasks bring significant...
In this paper, we address the multi-person densepose estimation problem, which aims at learning dense correspondences between 2D pixels of the human body and the 3D surface. It still poses several challenges due to real-world scenes with scale variations, occlusion, and insufficient annotations. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; and 2) how to equip this pipeline with the ability to handle the issues of limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel...
Crowded scenes human pose estimation remains challenging, as it requires joint comprehension of multiple persons and their keypoints in a highly complex scenario. The top-down mechanism, which is a detect-then-estimate pipeline, has become the mainstream solution for general pose estimation and has obtained impressive progress. However, simply applying this mechanism to crowded scenes results in unsatisfactory performance due to several issues, in particular those involving misses in crowds and ambiguous labeling during training. To tackle the above two...
Part-level attribute parsing is a fundamental but challenging task, which requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding a regional convolutional neural network (RCNN) with an attribute prediction head to a two-stage detector, in which attributes of body parts are identified from local-wise part boxes. However, local-wise part boxes with limited visual clues (i.e., part appearance only) lead to unsatisfying parsing results, since attributes of body parts are highly dependent on comprehensive...
High-resolution representation is necessary for human pose estimation to achieve high performance, and the ensuing problem is high computational complexity. In particular, predominant pose estimation methods estimate human joints by 2D single-peak heatmaps. Each 2D heatmap can be horizontally and vertically projected to, and reconstructed from, a pair of 1D heat vectors. Inspired by this observation, we introduce a lightweight and powerful alternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise (1 x 1) convolution that is the main computational bottleneck...
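The projection observation in this abstract can be illustrated with a small sketch. It assumes an isotropic Gaussian single-peak heatmap and max-projection along each axis; the abstract does not specify the projection operator, so both choices, and all function names, are assumptions. This is not an implementation of SUSA itself; it only shows that a separable 2D heatmap is recoverable from two 1D vectors.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Single-peak 2D heatmap (list of rows) centred on joint (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def project(heatmap):
    """Max-project the heatmap onto the x axis and the y axis."""
    x_vec = [max(col) for col in zip(*heatmap)]  # one value per column
    y_vec = [max(row) for row in heatmap]        # one value per row
    return x_vec, y_vec

def reconstruct(x_vec, y_vec):
    """Rebuild the 2D heatmap as a rescaled outer product of the 1D vectors."""
    peak = max(x_vec)  # equals the 2D peak value under max-projection
    return [[yv * xv / peak for xv in x_vec] for yv in y_vec]

def decode_joint(x_vec, y_vec):
    """Joint location read directly from the two 1D vectors."""
    return x_vec.index(max(x_vec)), y_vec.index(max(y_vec))

heatmap = gaussian_heatmap(height=8, width=12, cx=5, cy=3)
x_vec, y_vec = project(heatmap)
rebuilt = reconstruct(x_vec, y_vec)
max_error = max(abs(a - b)
                for row_a, row_b in zip(heatmap, rebuilt)
                for a, b in zip(row_a, row_b))
joint = decode_joint(x_vec, y_vec)
print(joint, max_error)
```

The pair of vectors stores `height + width` values instead of `height * width`, which is the kind of saving that motivates working with 1D representations; the exact reconstruction here relies on the heatmap being separable.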
Learning human 2D-3D correspondences aims to map all human 2D pixels to a 3D human template, namely human densepose estimation, involving surface patch recognition (i.e., Index-to-Patch (I)) and regression of patch-specific UV coordinates. Despite recent progress, it remains challenging, especially under the condition of "in the wild", where RGB images capture real-world scenes with backgrounds, occlusions, scale variations, and postural diversity. In this paper, we address three vital problems in this task: 1) how to perceive...
Building multi-person pose estimation (MPPE) models that can handle complex foregrounds and uncommon scenes is an important challenge in computer vision. Aside from designing novel models, strengthening training data is a promising direction but remains largely unexploited for the MPPE task. In this article, we systematically identify the key deficiencies of existing pose datasets that prevent the power of well-designed models from being fully exploited, and propose the corresponding solutions. Specifically, we find that traditional...
Aim: To evaluate the utility of magnetic resonance imaging (MRI) and magnetic resonance sialography (MRS) for the diagnosis of primary Sjögren syndrome (pSS), singly or integrated with the 2016 American College of Rheumatology (ACR)/European League Against Rheumatic Diseases (EULAR) classification criteria. Methods: The diagnostic efficiencies of MRI, MRS, and labial salivary gland biopsy (LSGB) were evaluated. The prediction model was established by multivariate analysis. Finally, the performance of the ACR/EULAR criteria was evaluated after...
Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based proposal generation methods can generate proposals with precise boundaries but often suffer from the inferior quality of the confidence scores used for proposal retrieving. In this article, we propose an effective and end-to-end action proposal generation method, named ProposalVLAD with Proposal-Intra Exploring Network (PVPI-Net). We first propose a ProposalVLAD module to dynamically generate global features of the entire video, and then combine them with proposal local features into the final feature...
Existing methods of multiple human parsing (MHP) apply deep models to learn instance-level representations for segmenting each person into non-overlapped body parts. However, the learned representations often contain many spurious correlations that degrade model generalization, leaving the models vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causal property integrated parsing model, termed CPI-Parser, which is driven by fundamental causal principles...
Targeted adversarial attack, which aims to mislead a model into recognizing any image as a target object through imperceptible perturbations, has become a mainstream tool for vulnerability assessment of deep neural networks (DNNs). Since existing targeted attackers only learn to attack known target classes, they cannot generalize well to unknown classes. To tackle this issue, we propose the Generalized Adversarial attacKER (GAKer), which is able to construct adversarial examples for any target class. The core idea behind GAKer...
To guarantee the safety and reliability of autonomous vehicle (AV) systems, corner cases play a crucial role in exploring the system's behavior under rare and challenging conditions within simulation environments. However, current approaches often fall short in meeting diverse testing needs and struggle to generalize to novel, high-risk scenarios that closely mirror real-world conditions. To tackle this challenge, we present AutoScenario, a multimodal Large Language Model (LLM)-based framework for realistic corner case...