- Advanced Image Processing Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Speech and Audio Processing
- Image and Signal Denoising Methods
- Human Pose and Action Recognition
- Speech Recognition and Synthesis
- Music and Audio Processing
- Advanced Image and Video Retrieval Techniques
- Anomaly Detection Techniques and Applications
- Image Processing Techniques and Applications
- Generative Adversarial Networks and Image Synthesis
- Video Surveillance and Tracking Methods
- Granular Flow and Fluidized Beds
- Microbial Natural Products and Biosynthesis
- Advanced Vision and Imaging
- Hand Gesture Recognition Systems
- Sparse and Compressive Sensing Techniques
- Thermochemical Biomass Conversion Processes
- Gait Recognition and Analysis
- Plant Biochemistry and Biosynthesis
- Machine Learning and ELM
- Adversarial Robustness in Machine Learning
- Face and Expression Recognition
- Computational Drug Discovery Methods
Zhejiang University
2021-2025
East China Normal University
2025
Google (United States)
2020-2024
DeepMind (United Kingdom)
2024
State Key Laboratory of Clean Energy Utilization
2021-2024
Shandong Institute of Business and Technology
2024
Chinese University of Hong Kong, Shenzhen
2021-2024
Nanjing University of Posts and Telecommunications
2024
North University of China
2024
Children's Hospital of Zhejiang University
2023-2024
Recent deep learning based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to the ineffectiveness of convolutional neural networks in explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are...
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose...
We present a generative image inpainting system to complete images with free-form mask and guidance. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution that treats all input pixels as valid ones, and generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed...
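The gating idea described above can be sketched in a few lines: each output position is a feature response modulated by a learned soft gate in [0, 1], so invalid (masked) regions can be suppressed instead of being treated as valid pixels. This is a hypothetical toy 1D version in plain Python, not the paper's 2D implementation; the weights `w_feat` and `w_gate` are made-up parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_conv1d(xs, w_feat, w_gate):
    """Toy gated convolution on a 1D signal.

    Each output blends a feature response with a learned soft gate,
    instead of treating every input position as valid."""
    k = len(w_feat)
    out = []
    for i in range(len(xs) - k + 1):
        window = xs[i:i + k]
        feat = math.tanh(sum(w * x for w, x in zip(w_feat, window)))
        gate = sigmoid(sum(w * x for w, x in zip(w_gate, window)))
        out.append(feat * gate)  # gate near 0 suppresses masked regions
    return out

signal = [0.0, 0.0, 1.0, 1.0, 0.0]  # zeros could mark a masked hole
y = gated_conv1d(signal, w_feat=[0.5, 0.5], w_gate=[1.0, 1.0])
print(len(y))  # one output per valid window -> 4
```

The per-channel, per-location gate is what distinguishes this from partial convolution, where the mask update is a hard, unlearned rule.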
This paper reviews the first challenge on single image super-resolution (restoration of rich details in a low resolution image) with focus on the proposed solutions and results. A new DIVerse 2K resolution dataset (DIV2K) was employed. The challenge had 6 competitions divided into 2 tracks with 3 magnification factors each. Track 1 employed the standard bicubic downscaling setup, while Track 2 had unknown downscaling operators (blur kernel and decimation) but learnable through pairs of low and high res train images. Each competition had ∼100 registered participants and 20 teams...
In present object detection systems, deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over traditional region proposal methods. However, existing CNN methods assume the bounding box to be four independent variables, which could be regressed by the $\ell_2$ loss separately. Such an oversimplified assumption is contrary to the well-received observation that those variables are correlated, resulting in less accurate localization. To...
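Because the four box coordinates are correlated, localization quality is naturally measured jointly, e.g. by intersection-over-union rather than by per-coordinate $\ell_2$ error. A minimal IoU helper (an illustrative function, not this paper's exact loss formulation):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Treats the four coordinates jointly, unlike an independent l2 loss
    on each coordinate."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> ~0.1429
```

A loss such as $1 - \mathrm{IoU}$ built on this quantity penalizes all four coordinates together, which is the observation the abstract motivates.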
Existing image inpainting methods typically fill holes by borrowing information from surrounding pixels. They often produce unsatisfactory results when the holes overlap with or touch foreground objects, due to lack of information about the actual extent of foreground and background regions within the holes. These scenarios, however, are very important in practice, especially for applications such as distracting object removal. To address the problem, we propose a foreground-aware image inpainting system that explicitly disentangles structure inference...
This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low resolution image) with focus on the proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating the camera acquisition pipeline. The operators were learnable through provided pairs of low and high resolution train images. The tracks had 145, 114, 101 and 113 registered participants, resp., and 31 teams competed in the final testing...
Slimmable networks are a family of neural networks that can instantly adjust the runtime width. The width can be chosen from a predefined widths set to adaptively optimize accuracy-efficiency trade-offs at runtime. In this work, we propose a systematic approach to train universally slimmable networks (US-Nets), extending slimmable networks to execute at arbitrary width, and generalizing to networks both with and without batch normalization layers. We further propose two improved training techniques for US-Nets, named the sandwich rule and inplace distillation, to enhance the training process...
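The sandwich rule mentioned above can be sketched as a width-sampling schedule: each training step always trains the smallest and largest widths plus a few randomly sampled intermediate ones. A toy sketch in plain Python (the function name and defaults are assumptions, not the paper's API):

```python
import random

def sandwich_widths(min_w, max_w, n_random=2, rng=None):
    """Sample per-step width multipliers under the sandwich rule:
    always include the smallest and largest width, plus a few random
    intermediate ones (a sketch of the schedule, not full US-Nets)."""
    rng = rng or random.Random(0)
    widths = [min_w, max_w]
    widths += [rng.uniform(min_w, max_w) for _ in range(n_random)]
    return widths

ws = sandwich_widths(0.25, 1.0)
print(sorted(ws)[0], sorted(ws)[-1])  # bounds are always present
```

Training the two extreme widths every step bounds the performance of all intermediate widths, which is why the rule pairs naturally with inplace distillation from the widest model.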
In this report we demonstrate that with the same parameters and computational budgets, models with wider features before ReLU activation have significantly better performance for single image super-resolution (SISR). The resulting SR residual network has a slim identity mapping pathway with wider (\(2\times\) to \(4\times\)) channels before activation in each residual block. To further widen activation (\(6\times\) to \(9\times\)) without computational overhead, we introduce linear low-rank convolution into SR networks and achieve even better accuracy-efficiency tradeoffs. In addition,...
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations, including clean image captions and regional labels, limits the scalability of existing approaches and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named...
Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales...
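A squeeze-and-excitation module of the kind described above can be sketched in plain Python: global-average-pool each channel (squeeze), pass the pooled vector through two small linear layers, and rescale each channel by a sigmoid gate (excite). The toy weights `w1`/`w2` below are made-up parameters, and real implementations use a bottleneck dimension; this is only a minimal sketch of the mechanism.

```python
import math

def squeeze_excite(channels, w1, w2):
    """Toy squeeze-and-excitation over per-channel feature sequences.

    Squeeze: global average pool each channel into one scalar.
    Excite: two small linear layers (ReLU then sigmoid) produce a
    per-channel gate; each channel is rescaled by its gate."""
    squeezed = [sum(ch) / len(ch) for ch in channels]  # global context
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    return [[g * v for v in ch] for g, ch in zip(gates, channels)]

feats = [[1.0, 3.0], [2.0, 2.0]]  # 2 channels, 2 time steps each
out = squeeze_excite(feats, w1=[[1, 0], [0, 1]], w2=[[0, 0], [0, 0]])
print(out)  # zero w2 -> every gate is sigmoid(0) = 0.5
```

Injecting this pooled global context into convolution layers is what lets a purely local encoder see sequence-level information.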
The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the creation of many new applications in content creation. In this article, we provide an overview of GANs with a special focus on algorithms and applications for visual synthesis. We cover several important...
We study how to set channel numbers in a neural network to achieve better accuracy under constrained resources (e.g., FLOPs, latency, memory footprint or model size). A simple and one-shot solution, named AutoSlim, is presented. Instead of training many network samples and searching with reinforcement learning, we train a single slimmable network to approximate the accuracy of different channel configurations. We then iteratively evaluate the trained slimmable network and greedily slim the layer with minimal accuracy drop. By this single pass, we can obtain optimized channel configurations under different resource...
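The greedy single-pass search described above can be sketched as a loop: while the configuration is over budget, try shrinking each layer by one step and keep the change that hurts accuracy least. Here `evaluate` is a hypothetical stand-in for querying the trained slimmable network at a given configuration, and layer widths are abstract integers rather than channel counts:

```python
def greedy_slim(widths, evaluate, budget):
    """Greedy one-pass slimming: repeatedly shrink whichever layer's
    width reduction costs the least accuracy, until a FLOP-like
    budget (here, the sum of widths) is met."""
    widths = list(widths)
    while sum(widths) > budget:
        best_i, best_acc = None, -1.0
        for i, w in enumerate(widths):
            if w <= 1:
                continue  # never remove a layer entirely
            trial = widths[:i] + [w - 1] + widths[i + 1:]
            acc = evaluate(trial)
            if acc > best_acc:
                best_i, best_acc = i, acc
        if best_i is None:
            break  # nothing left to shrink
        widths[best_i] -= 1
    return widths

# Toy accuracy model: layer 0 matters much more than layer 1,
# so the greedy pass should slim layer 1 first.
acc = lambda ws: 2.0 * ws[0] + 0.1 * ws[1]
print(greedy_slim([4, 4], acc, budget=5))  # -> [4, 1]
```

The point of the slimmable network is that `evaluate` is cheap: every trial configuration reuses the same shared weights instead of training a new model.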
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained on large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA)...
Self-similarity refers to the image prior widely used in image restoration algorithms that small but similar patterns tend to occur at different locations and scales. However, recent advanced deep convolutional neural network-based methods for image restoration do not take full advantage of self-similarities, relying on self-attention modules that only process information at the same scale. To solve this problem, we present a novel Pyramid Attention module for image restoration, which captures long-range feature correspondences...
We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve similar...
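The key trick, switchable batch normalization, keeps independent normalization statistics for each width setting, since feature statistics change when a layer runs with fewer channels. A toy sketch on 1D batches (the class and method names are assumptions, and the learnable scale/shift parameters are omitted):

```python
def batch_stats(values):
    """Mean and (population) variance of a batch of scalars."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

class SwitchableBN:
    """Independent normalization statistics per width setting: the
    same layer produces different feature statistics at different
    widths, so one shared set of BN stats would be wrong."""
    def __init__(self, widths):
        self.stats = {w: (0.0, 1.0) for w in widths}

    def update(self, width, batch):
        self.stats[width] = batch_stats(batch)

    def normalize(self, width, batch, eps=1e-5):
        mean, var = self.stats[width]
        return [(v - mean) / (var + eps) ** 0.5 for v in batch]

bn = SwitchableBN(widths=[0.5, 1.0])
bn.update(0.5, [1.0, 3.0])    # stats seen by the half-width network
bn.update(1.0, [10.0, 30.0])  # stats seen by the full-width network
print(bn.normalize(0.5, [1.0, 3.0]))
```

All convolution weights stay shared across widths; only these per-width statistics are duplicated, which is why the memory overhead is negligible.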
In this paper, balanced two-stage residual networks (BTSRN) are proposed for single image super-resolution. The deep residual design with constrained depth achieves the optimal balance between the accuracy and the speed of super-resolving images. The experiments show that the balanced two-stage structure, together with our lightweight two-layer PConv residual block design, achieves very promising results when considering both accuracy and speed. We evaluated our models on the New Trends in Image Restoration and Enhancement workshop challenge on image super-resolution (NTIRE SR 2017). Our final...
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose...
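The core discretization step in vector-quantized image modeling maps each continuous feature vector to the index of its nearest codebook entry; those indices are the discrete tokens the Transformer then predicts autoregressively. A minimal sketch with a made-up codebook (the real ViT-VQGAN codebook is learned jointly with the encoder):

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook
    entry (squared Euclidean distance) -- the token lookup at the
    heart of vector-quantized image modeling."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sqdist(v, codebook[i]))
            for v in vectors]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]  # 3 toy code vectors
tokens = quantize([[0.1, -0.1], [0.9, 1.2], [1.8, 0.2]], codebook)
print(tokens)  # discrete token ids -> [0, 1, 2]
```

Decoding reverses the lookup (`codebook[token]`), so an image becomes a short sequence of integers, which is what makes next-token pretraining from language modeling directly applicable.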
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay its predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on partial latency results in a degradation...