- Face Recognition and Analysis
- Face and Expression Recognition
- Advanced Image and Video Retrieval Techniques
- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Topic Modeling
- Human Pose and Action Recognition
- Video Surveillance and Tracking Methods
- Speech and Audio Processing
- Mobile Ad Hoc Networks
- Music and Audio Processing
- Image Retrieval and Classification Techniques
- Multimodal Machine Learning Applications
- Opportunistic and Delay-Tolerant Networks
- Generative Adversarial Networks and Image Synthesis
- Remote-Sensing Image Classification
- Energy Efficient Wireless Sensor Networks
- Anomaly Detection Techniques and Applications
- Advanced Computational Techniques and Applications
- Video Analysis and Summarization
- Gait Recognition and Analysis
- Rough Sets and Fuzzy Logic
- Speech and Dialogue Systems
- Network Security and Intrusion Detection
- Security in Wireless Sensor Networks
- Guangdong Academy of Sciences (2025)
- Nanjing University (2023)
- Cloud Computing Center (2018-2023)
- Shandong Jiaotong University (2018-2021)
- Shanghai Jiao Tong University (2018-2021)
- Jiyang College of Zhejiang A&F University (2021)
- Chongqing University of Posts and Telecommunications (2020)
- Beijing University of Posts and Telecommunications (2020)
- Xinjiang Technical Institute of Physics & Chemistry (2010-2020)
- Jimei University (2020)
The combination of global and partial features has been an essential solution to improve discriminative performance in person re-identification (Re-ID) tasks. Previous part-based methods mainly focus on locating regions with specific pre-defined semantics to learn local representations, which increases learning difficulty but is not efficient or robust in scenarios with large variances. In this paper, we propose an end-to-end feature learning strategy integrating discriminative information with various granularities. We carefully...
The latest work on language representations carefully integrates contextualized features into language model training, which enables a series of successes, especially in various machine reading comprehension and natural language inference tasks. However, the existing language representation models, including ELMo, GPT and BERT, only exploit plain context-sensitive features such as character or word embeddings. They rarely consider incorporating structured semantic information, which can provide rich semantics for language representation. To promote...
Regression-based facial landmark detection methods usually learn a series of regression functions to update the landmark positions from an initial estimation. Most existing approaches focus on learning effective mapping functions with robust image features to improve performance. The approach to dealing with the initialization issue, however, receives relatively less attention. In this paper, we present a deep regression architecture with two-stage re-initialization to explicitly deal with the initialization problem. At the global stage, given an image with a rough face detection result, the full...
In this paper, we present a patch-based regression framework for addressing the human age and head pose estimation problems. Firstly, each image is encoded as an ensemble of orderless coordinate patches, the global distribution of which is described by Gaussian Mixture Models (GMM), and then each image is further expressed as a specific distribution model by Maximum a Posteriori adaptation from the global GMM. Then the patch-kernel is designed for characterizing the Kullback-Leibler divergence between the derived models for any two images, and its discriminating power is further enhanced...
In this paper, we propose a new image representation to capture both the appearance and spatial information for image classification applications. First, we model the feature vectors, from the whole corpus, from each image, and at each individual patch, in a Bayesian hierarchical framework using mixtures of Gaussians. After such a hierarchical Gaussianization, each image is represented by a Gaussian mixture model (GMM) for its appearance, and several Gaussian maps for its spatial layout. Then we extract the appearance information from the GMM parameters, and the spatial information from global and local statistics over the Gaussian maps. Finally, we employ a supervised dimension reduction...
Semantic role labeling (SRL) aims to discover the predicate-argument structure of a sentence. End-to-end SRL without syntactic input has received great attention. However, most such work focuses on either the span-based or the dependency-based semantic representation form and only shows specific model optimization for each. Meanwhile, handling these two SRL tasks uniformly has been less successful. This paper presents an end-to-end model for both dependency and span SRL with a unified argument representation to deal with the two different types of argument annotations in...
Multi-choice reading comprehension is a challenging task that requires selecting an answer from a set of candidate options given a passage and a question. Previous approaches usually only calculate a question-aware passage representation and ignore the passage-aware question representation when modeling the relationship between passage and question, and so cannot effectively capture that relationship. In this work, we propose a dual co-matching network (DCMN), which models the relationship among passage, question and answer options bidirectionally. Besides, inspired by how humans solve multi-choice questions, we integrate two...
The ability to handle multi-view facial expressions is important for computers to understand affective behavior under less constrained environments. However, most existing methods for facial expression recognition are based on near-frontal view face data, and are likely to fail in non-frontal facial expression analysis. In this paper, we conduct an investigation on analyzing multi-view facial expressions. Three local patch descriptors (HoG, LBP, and SIFT) are used to extract facial features, which are the inputs to a nearest-neighbor indexing method that identifies facial expressions. We also...
In this work, we present a SIFT-Bag based generative-to-discriminative framework for addressing the problem of video event recognition in unconstrained news videos. In the generative stage, each video clip is encoded as a bag of SIFT feature vectors, the distribution of which is described by Gaussian Mixture Models (GMM). In the discriminative stage, the SIFT-Bag Kernel is designed for characterizing the property of the Kullback-Leibler divergence between the specialized GMMs of any two video clips, and then this kernel is utilized for supervised learning in two ways. On one hand, this kernel is further...
Beauty e-Experts, a fully automatic system for makeover recommendation and synthesis, is developed in this work. The makeover recommendation and synthesis system simultaneously considers many kinds of makeover items on hairstyle and makeup. Given a user-provided frontal face image with short/bound hair and no/light makeup, the Beauty e-Experts system not only recommends the most suitable hairdo and makeup, but also synthesizes the virtual hairdo and makeup effects. To acquire enough knowledge for beauty modeling, we built the Beauty e-Experts Database, which contains 1,505 female photos with a variety of attributes...
Accurate temporal action proposals play an important role in detecting actions from untrimmed videos. The existing approaches have difficulties in capturing global contextual information and simultaneously localizing actions with different durations. To this end, we propose a Relation-aware pyramid Network (RapNet) to generate highly accurate temporal action proposals. In RapNet, a novel relation-aware module is introduced to exploit bi-directional long-range relations between local features for context distilling. This...
Wireless sensor networks (WSNs) are becoming more and more common. WSNs have many existing and envisioned applications due to their ease of deployment, particularly in remote areas. But the security of WSNs is still an issue. Some existing approaches mainly rely on cryptography to ensure data authentication and integrity. These approaches address only part of the security problem in WSNs, and they are not sufficient for the unique characteristics and novel misbehaviors encountered in WSNs. Recently, the use of reputation systems has become an important mechanism in WSNs. In this paper...
We propose a straightforward method that simultaneously reconstructs the 3D facial structure and provides dense alignment. To achieve this, we design a 2D representation called the UV position map, which records the 3D shape of a complete face in UV space, then train a simple Convolutional Neural Network to regress it from a single 2D image. We also integrate a weight mask into the loss function during training to improve the performance of the network. Our method does not rely on any prior face model, and can reconstruct full facial geometry along with semantic...
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles. Thus utterance- and speaker-aware clues are supposed to be well captured in models. However, in the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) used as encoders represent the dialogues coarsely by taking the pairwise dialogue history and candidate response as a whole, so the hierarchical information on either utterance interrelation or speaker roles coupled in such representations is not well addressed. In this work, we propose...
Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from the predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual...
Video grounding aims at localizing the temporal moment related to the given language description, which is very helpful to many cross-modal content understanding applications like visual question answering and sentence-video search. Existing approaches usually directly regress the temporal boundaries of an event described by a query sentence in the video sequence. This direct regression manner often encounters a large decision space due to diverse target events and variable durations, leading to inaccurate localization as well...
Speech perceptual features, such as Mel-frequency Cepstral Coefficients (MFCC), have been widely used in acoustic event detection. However, the different spectral structures between speech and acoustic events degrade the performance of the speech feature sets. We propose quantifying the discriminative capability of each feature component according to the approximated Bayesian accuracy and deriving a discriminative feature set for acoustic event detection. Compared to MFCC, the feature sets derived using the proposed approaches achieve about 30% relative improvement.
This work details the authors' efforts to push the baseline of expression recognition performance on a realistic database. Both subject-dependent and subject-independent emotion recognition scenarios are addressed in this work. These two scenarios happen frequently in real-life settings. The approach towards solving this problem involves face detection, followed by key point identification, then feature generation and finally classification. An ensemble of features comprising Hierarchical Gaussianization (HG), Scale Invariant...
Face alignment is an important task in computer vision. Regression-based methods currently dominate the approach to solving this problem; they generally employ a series of mapping functions from the face appearance to iteratively update the face shape hypothesis. One key point here is thus how to perform the regression procedure. In this work, we formulate this regression procedure as a sparse coding problem. We learn two relational dictionaries, one for the face appearance and the other for the face shape, with coupled reconstruction coefficients to capture their underlying...