- Advanced Neural Network Applications
- Neural Networks and Applications
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
- 3D Shape Modeling and Analysis
- Chaos control and synchronization
- Neural dynamics and brain function
- Multimodal Machine Learning Applications
- GaN-based semiconductor devices and materials
- Semiconductor Quantum Structures and Devices
- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Wireless Communication Networks Research
- Generative Adversarial Networks and Image Synthesis
- Advanced Computational Techniques and Applications
- Cellular Automata and Applications
- Remote Sensing and LiDAR Applications
- COVID-19 diagnosis using AI
- Advanced Decision-Making Techniques
- Advanced Algorithms and Applications
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
Zhejiang University
2019-2025
Wuhan University
2025
Guangdong-Hongkong-Macau Joint Laboratory of Collaborative Innovation for Environmental Quality
2024
Jinan University
2024
Xiamen University
1988-2005
Army Medical University
2000
Daping Hospital
2000
Dartmouth College
1992-1993
While features of different scales are perceptually important to visual inputs, existing vision transformers do not yet take advantage of them explicitly. To this end, we first propose a cross-scale vision transformer, CrossFormer. It introduces a cross-scale embedding layer (CEL) and long-short distance attention (LSDA). On the one hand, CEL blends each token with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which only...
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We...
Despite the tremendous progress of Masked Autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to the inherent irregularity. In contrast to previous frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a Generative Decoder for MAE (GD-MAE) that automatically merges...
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of animatable humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by...
Achieving organic red/near infrared (NIR) phosphorescence at high temperatures is theoretically challenging because of the severe nonradiative transitions of excited triplet states with low energy gaps. This study realizes bright and persistent red/NIR afterglow with excellent high-temperature resistance up to 413 K via highly efficient (≈100%) phosphorescence resonance energy transfer (PRET) from rationally designed branched phosphorescence luminogens as donors to red/NIR dyes as acceptors, coupled with optimized aggregated structures. According...
With the advancement of generative artificial intelligence, previous studies have achieved the task of generating aesthetic images from hand-drawn sketches, fulfilling the public's needs for drawing. However, these methods are limited to static images and lack the ability to control video animation generation using hand-drawn sketches. To address this gap, we propose VidSketch, the first method capable of generating high-quality video animations directly from any number of hand-drawn sketches and simple text prompts, bridging the divide between ordinary users and professional...
With the rapid development of AIGC technology, significant progress has been made in diffusion model-based technologies for text-to-image (T2I) and text-to-video (T2V). In recent years, a few studies have introduced the strategy of Direct Preference Optimization (DPO) into T2I tasks, significantly enhancing human preferences in generated images. However, existing T2V generation methods lack a well-formed pipeline with an exact loss function to guide the alignment of generated videos with human preferences using DPO strategies. Additionally,...
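As a point of reference for the DPO strategy mentioned above, the following is a minimal sketch of the standard DPO objective as it is usually written for a single preference pair. It is not the loss proposed in this work, which the truncated abstract does not state. The function name `dpo_loss`, its arguments, and the default `beta` are hypothetical, and the log-probabilities are assumed to be given; for diffusion models they are not directly available and would have to be approximated.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-probability of each sample under the model being tuned.
    ref_logp_* : log-probability of the same samples under a frozen reference model.
    beta       : strength of the implicit constraint toward the reference (assumed value).
    """
    # Implicit reward margin between the preferred and the rejected sample.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical numbers: the tuned model raises the preferred sample's likelihood
# relative to the reference and lowers the rejected one's.
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss falls below log 2 when the tuned model favors the preferred sample more than the reference does, and rises above it in the opposite case.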
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling a specific local region according to user-defined image conditions, while the remaining...
Adversarial training is a powerful type of defense against adversarial examples. Previous empirical results suggest that adversarial training requires wider networks for better performances. However, it remains elusive how neural network width affects model robustness. In this paper, we carefully examine the relationship between network width and model robustness. Specifically, we show that model robustness is closely related to the tradeoff between natural accuracy and perturbation stability, which is controlled by the robust regularization parameter $λ$. With the same $λ$, wider networks can achieve...
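One common way to write such a tradeoff is as a regularized objective, sketched below under stated assumptions. The abstract only names the parameter $λ$; the network $f_\theta$, the natural loss $\mathcal{L}$, the stability measure $\mathcal{D}$, and the $\ell_\infty$ perturbation budget $\epsilon$ are illustrative symbols, not taken from the source.

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)} \Big[
  \underbrace{\mathcal{L}\big(f_\theta(x), y\big)}_{\text{natural accuracy}}
  \;+\; \lambda \,
  \underbrace{\max_{\|x' - x\|_\infty \le \epsilon}
  \mathcal{D}\big(f_\theta(x), f_\theta(x')\big)}_{\text{perturbation stability}}
\Big]
```

In this form, a larger $λ$ puts more weight on keeping the output stable under perturbation, at the expense of the natural-accuracy term.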
Adversarial training has been demonstrated to be one of the most effective remedies for defending against adversarial examples, yet it often suffers from a huge robustness generalization gap on unseen testing adversaries, deemed the adversarially robust generalization problem. Despite the preliminary understandings devoted to adversarially robust generalization, little is known from the architectural perspective. To bridge the gap, this paper for the first time systematically investigates the relationship between adversarially robust generalization and architectural design. In particular, we comprehensively evaluated...
Monocular 3D object detection is one of the most challenging tasks in 3D scene understanding. Due to the ill-posed nature of monocular imagery, existing monocular 3D detection methods rely heavily on training with manually annotated 3D box labels on LiDAR point clouds. This annotation process is very laborious and expensive. To dispense with the reliance on 3D box labels, in this paper we explore weakly supervised monocular 3D detection. Specifically, we first detect 2D boxes on the image. Then, we adopt the generated 2D boxes to select corresponding RoI LiDAR points as the weak supervision. Eventually, a...
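The RoI point selection step described above can be illustrated with a minimal sketch. It assumes the LiDAR points have already been projected into image pixel coordinates (the projection and calibration are not shown), and the function name `select_roi_points` and the sample data are hypothetical rather than the authors' implementation.

```python
def select_roi_points(pixels, box):
    """Return indices of points whose image projection falls inside a 2D box.

    pixels: list of (u, v) image coordinates, one per LiDAR point
            (assumed to be projected into the image beforehand).
    box:    (x_min, y_min, x_max, y_max) in the same pixel coordinates.
    """
    x_min, y_min, x_max, y_max = box
    return [
        i for i, (u, v) in enumerate(pixels)
        if x_min <= u <= x_max and y_min <= v <= y_max
    ]

# Hypothetical projected points and one detected 2D box.
pixels = [(10.0, 20.0), (55.0, 60.0), (120.0, 80.0), (70.0, 40.0)]
box = (50.0, 30.0, 100.0, 70.0)
roi_indices = select_roi_points(pixels, box)  # indices of the points kept as weak supervision
```

Points selected this way can still include background that happens to project inside the box, so a real pipeline would likely need further filtering.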
Most semantic segmentation models treat semantic segmentation as a pixel-wise classification task and use a pixel-wise classification error as their optimization criterion. However, the pixel-wise error ignores the strong dependencies among the pixels in an image, which limits the performance of the model. Several ways to incorporate the structure information of objects have been investigated, e.g., conditional random fields (CRF), image structure priors based methods, and generative adversarial networks (GAN). Nevertheless, these methods usually require extra model branches or additional...
Human body orientation estimation (HBOE) aims to estimate the orientation of a human body relative to the camera’s frontal view. Despite recent advancements in this field, there still exist limitations in achieving fine-grained results. We identify certain defects and propose corresponding approaches as follows: 1) Existing datasets suffer from non-uniform angle distributions, resulting in sparse image data for certain angles. To provide comprehensive and high-quality data, we introduce RMOS (Rendered Model Orientation Set),...
Transformer-based networks have achieved impressive performance in 3D point cloud understanding. However, most of them concentrate on aggregating local features, but neglect to directly model global dependencies, which results in a limited effective receptive field. Besides, how to effectively incorporate local and global components also remains challenging. To tackle these problems, we propose the Asymmetric Parallel Point Transformer (APPT). Specifically, we introduce Global Pivot Attention to extract global features and enlarge...
Logit based knowledge distillation has received less attention in recent years, since feature based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, which is a crucial hyper-parameter to soften the logit outputs. In most previous works, it was set as a fixed value for the entire distillation procedure. However, as the logits from different samples are distributed quite variously, it is not feasible to soften all of them to an equal degree by just a single temperature, which may make the previous work transfer the knowledge of each...
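The role of the temperature can be illustrated with a minimal sketch of temperature-scaled softmax, the usual way logits are softened in knowledge distillation. It shows only the fixed-temperature baseline that the abstract critiques, not the method proposed in this work; the function name `softened_probs` and the example logits are hypothetical.

```python
import math

def softened_probs(logits, temperature):
    """Softmax of logits / temperature; a larger temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]             # hypothetical teacher logits for one sample
sharp = softened_probs(logits, 1.0)  # plain softmax
soft = softened_probs(logits, 4.0)   # same logits, softened
```

With the same temperature, samples whose logits are spread widely stay much sharper than samples whose logits are close together, which is the mismatch the abstract points to.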
The configuration coordinate (CC) and momentum conservation (MC) models have been widely used to explain the phonon sidebands of impurity spectra in semiconductors. In this paper, the distinction between the CC and MC models is discussed. We conclude that the MC model only applies to shallow Coulombic impurities; in other cases, such as isoelectronic traps, the CC model is more appropriate. We show that the Huang-Rhys parameters for bulk modes coupling to a bound electron or exciton can be calculated from the bound-state wave function in k space if...