- Building Energy and Comfort Optimization
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- Energy Load and Power Forecasting
- Air Quality Monitoring and Forecasting
- Evacuation and Crowd Dynamics
- Wind and Air Flow Studies
- Coal Properties and Utilization
- Noise Effects and Management
- Disaster Management and Resilience
- Solar Thermal and Photovoltaic Systems
- Energy Efficiency and Management
- Spatial Cognition and Navigation
- Agriculture, Soil, Plant Science
- Advanced Vision and Imaging
- 3D Shape Modeling and Analysis
- Facility Location and Emergency Management
- Greenhouse Technology and Climate Control
- Natural Language Processing Techniques
- COVID-19 diagnosis using AI
- Plant Ecology and Soil Science
- Topic Modeling
University of Science and Technology Beijing (2020-2025)
National Health and Family Planning Commission (2022-2025)
China Medical University (2023-2025)
Tsinghua University (2019-2023)
Microsoft Research Asia (China) (2021-2023)
Trinity College Dublin (2023)
Henan University (2022)
China Three Gorges University (2022)
Beijing Normal University (2022)
Anyang Institute of Technology (2022)
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by...
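The shifted-window idea named in the abstract above can be illustrated with a minimal sketch. It assumes the shift is realized as a cyclic roll of the feature map before it is split into non-overlapping windows; the function names `window_partition` and `cyclic_shift` are hypothetical, and the sketch covers only the partitioning step, not the attention computation or any masking.

```python
def window_partition(grid, window_size):
    """Split a 2D grid (list of rows) into non-overlapping square windows."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            # Each window is a window_size x window_size block of cells.
            windows.append([row[left:left + window_size]
                            for row in grid[top:top + window_size]])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and left by `shift` cells, wrapping around the edges."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# Toy 8x8 "feature map" whose cell value encodes its position (row * 8 + col).
grid = [[r * 8 + c for c in range(8)] for r in range(8)]

# Regular layout: four 4x4 windows aligned with the grid.
regular_windows = window_partition(grid, 4)

# Shifted layout: roll by half a window, then partition again, so the new
# windows straddle the boundaries of the regular ones.
shifted_windows = window_partition(cyclic_shift(grid, 2), 4)
```

In this toy setup the first shifted window starts at the cell that was at row 2, column 2 of the original grid, so consecutive layers that alternate between the two layouts would let information cross the regular window boundaries.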
We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536x1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models...
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization....
We propose DeepHuman, an image-guided volume-to-volume translation CNN for 3D human reconstruction from a single RGB image. To reduce the ambiguities associated with the reconstruction of invisible areas, our method leverages a dense semantic representation generated from the SMPL model as an additional input. One key feature of our network is that it fuses different scales of image features into the 3D space through volumetric feature transformation, which helps to recover accurate surface geometry. The details are further refined through a normal...
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties,...
Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, unrealistic pose estimates are admitted due to the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens, with each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve...
Scaling properties have been one of the central issues in self-supervised pre-training, especially data scalability, which has successfully motivated large-scale pre-trained language models and endowed them with significant modeling capabilities. However, scaling properties seem to be unintentionally neglected in the recent trending studies on masked image modeling (MIM), and some arguments even suggest that MIM cannot benefit from large-scale data. In this work, we try to break down these preconceptions and systematically study the scaling behaviors...
In this paper, a novel robotic grasping system is established to automatically pick up objects in cluttered scenes. A composite robotic hand composed of a suction cup and a gripper is designed for grasping the object stably. The suction cup is used for lifting the object from the clutter first, and the gripper for grasping the object accordingly. We utilize the affordance map to provide pixel-wise lifting point candidates for the suction cup. To obtain a good affordance map, an active exploration mechanism is introduced to the system. An effective metric is designed to calculate the reward for the current affordance map, and a deep Q-Network (DQN) is employed to guide the robotic hand to actively explore the environment...
Workers in deep coal mines are often exposed to hyperthermal conditions, which is a health and safety hazard. Although many novel approaches have been proposed in recent years, air distribution optimization combined with thermal analysis based on existing cooling systems has rarely been conducted. Here, heat dissipation in the mine environment was estimated based on field measurements and numerical simulations. A pilot study was conducted to validate the performance of a multistage cooling system and to optimize ventilation. The results show...
In vision-language modeling, image token removal is an efficient augmentation technique to reduce the cost of encoding image features. The CLIP-style models, however, have been found to be negatively impacted by this technique. We hypothesize that removing a large portion of image tokens may inadvertently destroy the semantic information associated with a given text description, resulting in misaligned paired data in CLIP training. To address this issue, we propose an attentive token removal approach, which retains a small number of tokens that have a strong...