- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Handwritten Text Recognition Techniques
- Human Pose and Action Recognition
- Topic Modeling
- Video Surveillance and Tracking Methods
- Face recognition and analysis
- Advanced Graph Neural Networks
- Natural Language Processing Techniques
- Advanced Image Processing Techniques
- Image Retrieval and Classification Techniques
- Anomaly Detection Techniques and Applications
- Robotics and Sensor-Based Localization
- 3D Surveying and Cultural Heritage
- 3D Shape Modeling and Analysis
- Image Enhancement Techniques
- Advanced Vision and Imaging
- Digital Media Forensic Detection
- Gait Recognition and Analysis
- Biometric Identification and Security
- Text and Document Classification Technologies
- Image Processing and 3D Reconstruction
- COVID-19 diagnosis using AI
Hikvision (China)
2018-2024
InferVision (China)
2018-2024
Zhejiang University
2019-2023
Peking University
2023
Chongqing University
2023
South China University of Technology
2020
Cloud Computing Center
2020
Fudan University
2018
Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. The most crucial factors for this task lie in two aspects: the intra-frame representation of joint co-occurrences and the inter-frame representation of skeletons' temporal evolutions. In this paper, we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned with a hierarchical methodology, in which different levels of contextual information are aggregated...
Scene text recognition has been a hot research topic in computer vision due to its various applications. The state of the art is the attention-based encoder-decoder framework that learns the mapping between input images and output sequences in a purely data-driven way. However, we observe that existing methods perform poorly on complicated and/or low-quality images. One major reason is that existing methods cannot get accurate alignments between feature areas and targets for such images. We call this phenomenon "attention drift". To tackle this problem,...
Recognizing text from natural images is a hot research topic in computer vision due to its various applications. Despite several decades of research on optical character recognition (OCR), recognizing texts from natural images is still a challenging task. This is because scene texts are often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted) arrangements, which have not yet been well addressed in the literature. Existing methods mainly work with regular (horizontal and frontal) texts and cannot be trivially generalized...
Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to the test set with different QA distributions. To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of the design, current methods are unable to equip the ensemble-based...
Current state-of-the-art approaches to skeleton-based action recognition are mostly based on recurrent neural networks (RNN). In this paper, we propose a novel convolutional neural network (CNN) framework for both action classification and detection. Raw skeleton coordinates as well as skeleton motion are fed directly into the CNN for label prediction. A transformer module is designed to rearrange and select important joints automatically. With a simple 7-layer network, we obtain 89.3% accuracy on the validation set of the NTU RGB+D dataset. For action detection...
Graph Convolutional Networks (GCNs) have attracted increasing interest for the task of skeleton-based action recognition. The key lies in the design of the graph structure, which encodes skeleton topology information. In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. In particular, when learning the dependency between two joints, contextual features from the rest of the joints are incorporated in a global manner. CeN...
Existing enhancement methods are empirically expected to help the high-level end computer vision task; however, that is observed to not always be the case in practice. We focus on object or face detection in poor visibility enhancements caused by bad weather (haze, rain) and low light conditions. To provide a more thorough examination and fair comparison, we introduce three benchmark sets collected in real-world hazy, rainy, and low-light conditions, respectively, with annotated objects/faces. We launched the UG <sup...
Point clouds can be represented in many forms (views), typically point-based sets, voxel-based cells or range-based images (i.e., panoramic view). The point-based view is geometrically accurate, but it is disordered, which makes it difficult to find local neighbors efficiently. The voxel-based view is regular, but sparse, and computation grows cubically when voxel resolution increases. The range-based view is regular and generally dense; however, spherical projection makes physical dimensions distorted. Both voxel- and range-based views suffer from quantization loss, especially for...
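As a rough illustration of the range-based view mentioned above, the following sketch maps a single 3D point to a pixel of a range image by spherical projection. It is not taken from the paper: the function name `spherical_project`, the image size, and the vertical field of view are all assumptions chosen for the example.

```python
import math

def spherical_project(point, width=360, height=64,
                      fov_up_deg=15.0, fov_down_deg=-15.0):
    """Map an (x, y, z) point to (row, col, range) in a range image.

    All parameter defaults are hypothetical; real sensors differ.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the origin has no defined viewing direction")

    yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
    pitch = math.asin(z / r)    # vertical angle in [-pi/2, pi/2]

    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)

    # Continuous image coordinates before quantization.
    u = 0.5 * (1.0 - yaw / math.pi) * width
    v = (fov_up - pitch) / (fov_up - fov_down) * height

    # Quantize to integer pixel indices and clamp to the image bounds.
    col = min(width - 1, max(0, int(math.floor(u))))
    row = min(height - 1, max(0, int(math.floor(v))))
    return row, col, r
```

Under these assumed settings, the points `(1.0, 0.0, 0.0)` and `(10.0, -0.01, 0.0)` land in the same pixel even though they are distinct in 3D, which is one simple way to see the quantization loss the abstract refers to.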
We consider the scene text recognition problem under the attention-based encoder-decoder framework, which is the state of the art. The existing methods usually employ a frame-wise maximal likelihood loss to optimize the models. When we train the model, the misalignment between the ground truth strings and the attention's output sequences of probability distribution, which is caused by missing or superfluous characters, will confuse and mislead the training process, and consequently make the training costly and degrade the recognition accuracy. To handle this problem, we propose...
Recently, the convolutional neural network (CNN) has attracted tremendous attention and achieved great success in many image processing tasks. In this paper, we focus on CNN technology combined with image restoration to facilitate video coding performance and propose the content-aware CNN based in-loop filtering for high-efficiency video coding (HEVC). In particular, we quantitatively analyze the structure of the proposed CNN model from multiple dimensions to make the model interpretable and optimal for CNN-based loop filtering. More specifically, each coding tree...
Scene graphs, with objects as nodes and visual relationships as edges, describe the whereabouts and interactions of objects in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects. For example, "person" on "bike" can help to determine the relationship "ride", which in turn contributes to the confidence of the two objects. However, we argue that the visual context is not properly learned by using the prevailing cross-entropy based supervised...
Novel classes frequently arise in our dynamically changing world, e.g., new users in the authentication system, and a machine learning model should recognize new classes without forgetting old ones. This scenario becomes more challenging when new class instances are insufficient, which is called few-shot class-incremental learning (FSCIL). Current methods handle incremental learning retrospectively by making the updated model similar to the old one. By contrast, we suggest learning prospectively to prepare for future updates, and propose ForwArd Compatible...
Reconstruction-based methods play an important role in unsupervised anomaly detection in images. Ideally, we expect a perfect reconstruction for normal samples and a poor reconstruction for abnormal samples. Since the generalizability of deep neural networks is difficult to control, existing models such as the autoencoder do not work well. In this work, we interpret the reconstruction of an image as a divide-and-assemble procedure. Surprisingly, by varying the granularity of division on feature maps, we are able to modulate the reconstruction capability of the model for both normal and abnormal samples. That is, finer...
A deep neural network is difficult to train, and this predicament becomes worse as the depth increases. The essence of this problem exists in the magnitude of backpropagated errors, which will result in the gradient vanishing or exploding phenomenon. We show that a variant of regularizer which utilizes orthonormality among different filter banks can alleviate this problem. Moreover, we design a backward error modulation mechanism based on the quasi-isometry assumption between two consecutive parametric layers. Equipped with these...
Recent years have witnessed remarkable success of deep learning methods in quality enhancement for compressed video. To better explore temporal information, existing methods usually estimate optical flow for temporal motion compensation. However, since compressed video could be seriously distorted by various compression artifacts, the estimated optical flow tends to be inaccurate and unreliable, thereby resulting in ineffective quality enhancement. In addition, optical flow estimation for consecutive frames is generally conducted in a pairwise manner, which...
Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). In this paper, we propose a novel operation which encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters. In particular, we perform 2D convolution along three orthogonal views of volumetric video data, which learns...
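The idea of sharing one 2D kernel across three orthogonal views of a video volume can be sketched in plain Python as below. This is only an illustration under assumptions: the helper names `conv2d_same` and `three_view_conv`, the `(T, H, W)` nested-list layout, and the zero padding are hypothetical, and how the paper combines the three outputs is not shown in the truncated abstract.

```python
def conv2d_same(plane, kernel):
    """Zero-padded 2D cross-correlation that keeps the plane's size."""
    h, w = len(plane), len(plane[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * plane[yy][xx]
            out[y][x] = acc
    return out

def three_view_conv(volume, kernel):
    """Apply the SAME 2D kernel on the H-W, T-W and T-H views.

    volume is indexed as volume[t][h][w]; each result uses that layout too.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])

    # H-W view: one spatial plane per frame.
    hw = [conv2d_same(volume[t], kernel) for t in range(T)]

    # T-W view: one plane per image row h, indexed [t][w].
    tw_planes = [
        conv2d_same([[volume[t][h][w] for w in range(W)] for t in range(T)], kernel)
        for h in range(H)
    ]
    tw = [[[tw_planes[h][t][w] for w in range(W)] for h in range(H)] for t in range(T)]

    # T-H view: one plane per image column w, indexed [t][h].
    th_planes = [
        conv2d_same([[volume[t][h][w] for h in range(H)] for t in range(T)], kernel)
        for w in range(W)
    ]
    th = [[[th_planes[w][t][h] for w in range(W)] for h in range(H)] for t in range(T)]

    return hw, tw, th
```

Because all three views reuse the same `kernel`, the parameter count stays that of a single 2D filter, which is the weight-sharing constraint in miniature.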
Unsupervised domain adaptation (UDA) assumes that source and target domain data are freely available and usually trained together to reduce the domain gap. However, considering data privacy and the inefficiency of data transmission, this is impractical in real scenarios. Hence, it draws our eyes to optimizing the network in the target domain without accessing labeled source data. To explore this direction in object detection, for the first time, we propose a source data-free domain adaptive object detection (SFOD) framework via modeling it into a problem of learning with noisy labels. Generally,...
Many approaches have recently been proposed to detect irregular scene text and have achieved promising results. However, their localization results may not well satisfy the following text recognition part, mainly for two reasons: 1) recognizing arbitrarily shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition will lead to suboptimal performance. To handle this incompatibility problem, in this paper we propose an end-to-end trainable text spotting approach named...
Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated....
This paper proposes a segregated temporal assembly recurrent (STAR) network for weakly-supervised multiple action detection. The model learns from untrimmed videos with only the supervision of video-level labels and predicts the intervals of multiple actions. Specifically, we first assemble video clips according to class labels by an attention mechanism that learns class-variable attention weights and thus helps relieve the noise from background or other actions. Secondly, we build the temporal relationship between actions by feeding the assembled features into...
Although current face anti-spoofing methods achieve promising results under intra-dataset testing, they suffer from poor generalization to unseen attacks. Most existing works adopt domain adaptation (DA) or domain generalization (DG) techniques to address this problem. However, the target domain is often unknown during training, which limits the utilization of DA methods. DG methods can conquer this by learning domain invariant features without seeing any target data. However, they fail in utilizing the information of target data. In this paper, we propose a self-domain adaptation framework to leverage...
Shift operation is an efficient alternative to depthwise separable convolution. However, it is still bottlenecked by its implementation manner, namely memory movement. To put this direction forward, a new basic component named Sparse Shift Layer (SSL) is introduced in this paper to construct efficient convolutional neural networks. In this family of architectures, the basic block is only composed of 1x1 convolutional layers with only a few shift operations applied to the intermediate feature maps. To make this idea feasible, we introduce a penalty during...
Recently, end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such a framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper, we propose a novel Mask AttentioN Guided One-stage...
New classes arise frequently in our ever-changing world, e.g., emerging topics in social media and new types of products in e-commerce. A model should recognize new classes and meanwhile maintain discriminability over old classes. Under severe circumstances, only limited novel instances are available to incrementally update the model. The task of recognizing few-shot new classes without forgetting old classes is called few-shot class-incremental learning (FSCIL). In this work, we propose a new paradigm for FSCIL based on meta-learning by LearnIng...