Yixuan Wei

ORCID: 0000-0003-1775-7301
Research Areas
  • Building Energy and Comfort Optimization
  • Advanced Neural Network Applications
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Energy Load and Power Forecasting
  • Air Quality Monitoring and Forecasting
  • Evacuation and Crowd Dynamics
  • Wind and Air Flow Studies
  • Coal Properties and Utilization
  • Noise Effects and Management
  • Disaster Management and Resilience
  • Solar Thermal and Photovoltaic Systems
  • Energy Efficiency and Management
  • Spatial Cognition and Navigation
  • Agriculture, Soil, Plant Science
  • Advanced Vision and Imaging
  • 3D Shape Modeling and Analysis
  • Facility Location and Emergency Management
  • Greenhouse Technology and Climate Control
  • Natural Language Processing Techniques
  • COVID-19 Diagnosis Using AI
  • Plant Ecology and Soil Science
  • Topic Modeling

University of Science and Technology Beijing
2020-2025

National Health and Family Planning Commission
2022-2025

China Medical University
2023-2025

Tsinghua University
2019-2023

Microsoft Research Asia (China)
2021-2023

Trinity College Dublin
2023

Henan University
2022

China Three Gorges University
2022

Beijing Normal University
2022

Anyang Institute of Technology
2022

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by...

10.1109/iccv48922.2021.00986 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
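As a rough illustration of the shifted-window scheme this abstract describes, the sketch below partitions a feature map into non-overlapping local windows and applies the cyclic shift between consecutive blocks; the window size, shift, and tensor layout are assumptions for illustration, not the paper's reference implementation.

```python
# Minimal sketch of Swin-style window partitioning plus the cyclic shift
# (illustrative only; window_size, shift, and layout are assumptions).
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # -> (num_windows * B, window_size * window_size, C): self-attention then
    # runs inside each window instead of over the full image.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 8, 8, 96)                           # toy feature map
windows = window_partition(x, window_size=4)           # regular windows
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))  # cyclic shift between blocks
shifted_windows = window_partition(shifted, window_size=4)
print(windows.shape, shifted_windows.shape)            # torch.Size([4, 16, 96]) twice
```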

We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536x1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. We tackle issues of training instability, and study how to effectively transfer models...

10.1109/cvpr52688.2022.01170 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
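One of the stabilization techniques behind this scaling work replaces dot-product attention scores with cosine similarity divided by a learnable temperature. A minimal sketch of that scaled cosine attention, with assumed shapes and temperature initialization:

```python
# Hedged sketch of scaled cosine attention, used in Swin Transformer V2 to
# stabilize training at scale; tau init and tensor shapes are assumptions.
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau):
    # Cosine similarity between queries and keys, divided by a learnable
    # temperature tau, replaces the usual dot-product / sqrt(d) scores.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) / tau   # (B, heads, N, N)
    return attn.softmax(dim=-1) @ v

B, h, N, d = 2, 4, 16, 32
q, k, v = (torch.randn(B, h, N, d) for _ in range(3))
tau = torch.full((h, 1, 1), 0.07)            # learnable per head in the real model
out = scaled_cosine_attention(q, k, v, tau)
print(out.shape)                             # torch.Size([2, 4, 16, 32])
```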

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization....

10.1109/cvpr52688.2022.00320 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
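To make the locality bias concrete, the sketch below extends window partitioning to 3D (temporal, height, width) video tokens with a cyclic shift along all three axes; the window and shift sizes are illustrative assumptions.

```python
# Sketch of shifted-window locality over video: 3D (T, H, W) windows with a
# cyclic shift along all three axes (window/shift sizes are assumptions).
import torch

x = torch.randn(1, 8, 56, 56, 96)            # (B, T, H, W, C) video tokens
wt, wh, ww = 2, 7, 7                         # local 3D attention window
shifted = torch.roll(x, shifts=(-1, -3, -3), dims=(1, 2, 3))
B, T, H, W, C = shifted.shape
windows = (shifted
           .view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
           .permute(0, 1, 3, 5, 2, 4, 6, 7)
           .reshape(-1, wt * wh * ww, C))    # token groups per 3D window
print(windows.shape)                         # torch.Size([256, 98, 96])
```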

We propose DeepHuman, an image-guided volume-to-volume translation CNN for 3D human reconstruction from a single RGB image. To reduce the ambiguities associated with the reconstruction of invisible areas, our method leverages a dense semantic representation generated from the SMPL model as an additional input. One key feature of our network is that it fuses different scales of image features into the 3D space through volumetric feature transformation, which helps to recover accurate surface geometry. The visible surface details are further refined through a normal...

10.1109/iccv.2019.00783 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
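A hedged sketch of the general idea of lifting 2D image features into a voxel volume, here via a simple orthographic projection and grid sampling; DeepHuman's actual volumetric feature transformation differs in detail, and all shapes below are assumptions.

```python
# Illustrative sketch: fuse 2D image features into a voxel volume by
# projecting voxel centers onto the image plane and sampling features.
import torch
import torch.nn.functional as F

def lift_features_to_volume(feat2d, grid_xyz):
    """feat2d: (B, C, H, W); grid_xyz: (B, D, Hv, Wv, 3) in [-1, 1]."""
    B, D, Hv, Wv, _ = grid_xyz.shape
    xy = grid_xyz[..., :2].reshape(B, D * Hv, Wv, 2)  # drop depth: orthographic
    sampled = F.grid_sample(feat2d, xy, align_corners=False)
    return sampled.reshape(B, -1, D, Hv, Wv)          # (B, C, D, Hv, Wv) volume

feat2d = torch.randn(1, 64, 128, 128)                 # toy image feature map
grid = torch.rand(1, 32, 32, 32, 3) * 2 - 1           # normalized voxel centers
volume = lift_features_to_volume(feat2d, grid)
print(volume.shape)                                   # torch.Size([1, 64, 32, 32, 32])
```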

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with \textbf{S}hifted \textbf{win}dows. The shifted windowing scheme brings greater...

10.48550/arxiv.2103.14030 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those produced by MIM. These properties,...

10.48550/arxiv.2205.14141 preprint EN cc-by arXiv (Cornell University) 2022-01-01
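The post-processing recipe is, at its core, training a student to regress whitened features of a frozen teacher. A minimal sketch under that reading, with placeholder linear models standing in for real encoders:

```python
# Minimal feature-distillation sketch: a student is trained to match
# whitened features of a frozen pre-trained teacher. The LayerNorm-based
# whitening target and smooth-L1 loss follow the paper's general recipe;
# the linear "encoders" here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(128, 256).eval()     # stands in for a frozen CLIP/DINO encoder
student = nn.Linear(128, 256)            # same architecture, trained from scratch
whiten = nn.LayerNorm(256, elementwise_affine=False)  # strips feature statistics

x = torch.randn(8, 128)
with torch.no_grad():
    target = whiten(teacher(x))          # distillation target
loss = F.smooth_l1_loss(student(x), target)
loss.backward()
print(float(loss))
```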

Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, unrealistic pose estimates are admitted due to the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens with each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve...

10.1109/cvpr52729.2023.00071 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
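The discrete-token representation can be pictured as a standard vector-quantization step: the pose is encoded into M latent vectors, each snapped to its nearest codebook entry. A toy sketch, with codebook size, token count, and the simple linear encoder all assumed:

```python
# Toy sketch of "pose as compositional tokens": encode a pose into M discrete
# tokens via nearest-neighbour lookup in a learned codebook (a standard VQ
# step; sizes and the linear encoder are assumptions, not the paper's model).
import torch
import torch.nn as nn

K, M, J = 512, 8, 17                      # codebook size, tokens, joints
encoder = nn.Linear(J * 2, M * 64)        # (x, y) per joint -> M latent vectors
codebook = nn.Embedding(K, 64)

pose = torch.randn(1, J * 2)
z = encoder(pose).view(1, M, 64)          # M sub-structure latents
dist = torch.cdist(z, codebook.weight.unsqueeze(0))  # (1, M, K) distances
tokens = dist.argmin(dim=-1)              # M discrete token indices
quantized = codebook(tokens)              # a decoder would recover joints from these
print(tokens.shape, quantized.shape)      # torch.Size([1, 8]) torch.Size([1, 8, 64])
```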

Scaling properties have been one of the central issues in self-supervised pre-training, especially data scalability, which has successfully motivated large-scale pre-trained language models and endowed them with significant modeling capabilities. However, scaling properties seem to be unintentionally neglected in the recent trending studies on masked image modeling (MIM), and some arguments even suggest that MIM cannot benefit from large-scale data. In this work, we try to break down these preconceptions and systematically study the scaling behaviors...

10.1109/cvpr52729.2023.00999 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

In this paper, a novel robotic grasping system is established to automatically pick up objects in cluttered scenes. A composite robotic hand composed of a suction cup and a gripper is designed for grasping the object stably. The suction cup is used for lifting the object from the clutter first, and the gripper then grasps it accordingly. We utilize the affordance map to provide pixel-wise lifting point candidates for the suction cup. To obtain a good affordance map, an active exploration mechanism is introduced to the system. An effective metric is designed to calculate the reward for the current affordance map, and a deep Q-Network (DQN) is employed to guide the robotic hand to actively explore the environment...

10.1109/iros40897.2019.8967899 article EN 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019-11-01
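The exploration loop can be sketched as a DQN scoring candidate actions over the current affordance map and executing the highest-value one; the tiny CNN, action space, and sizes below are placeholders, not the paper's network:

```python
# Hedged sketch of the active-exploration step: a DQN scores candidate
# actions from the current affordance map and the best one is executed.
import torch
import torch.nn as nn

q_net = nn.Sequential(                    # maps affordance map -> per-pixel Q values
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)

affordance = torch.rand(1, 1, 64, 64)     # pixel-wise suction-point confidences
q_values = q_net(affordance)              # Q value per candidate exploration action
flat = q_values.flatten(1)
action = flat.argmax(dim=1)               # greedy pick (epsilon-greedy in training)
y, x = divmod(int(action), 64)
print(f"explore at pixel ({y}, {x})")
```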

Workers in deep coal mines are often exposed to hyperthermal conditions, which is a health and safety hazard. Although many novel cooling approaches have been proposed in recent years, air distribution optimization combined with thermal analysis based on existing cooling systems has rarely been conducted. Here, the heat dissipation of the mine environment was estimated through field measurements and numerical simulations. A pilot study was conducted to validate the performance of the multistage cooling system and optimize the ventilation. The results show...

10.1016/j.csite.2023.102908 article EN cc-by-nc-nd Case Studies in Thermal Engineering 2023-03-15

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization....

10.48550/arxiv.2106.13230 preprint EN cc-by arXiv (Cornell University) 2021-01-01

In vision-language modeling, image token removal is an efficient augmentation technique to reduce the cost of encoding image features. The CLIP-style models, however, have been found to be negatively impacted by this technique. We hypothesize that removing a large portion of image tokens may inadvertently destroy the semantic information associated with a given text description, resulting in misaligned paired data in CLIP training. To address this issue, we propose an attentive token removal approach, which retains a small number of tokens with strong...

10.1109/iccv51070.2023.00260 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
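The proposed remedy amounts to keeping the most semantically salient tokens rather than a random subset. A minimal sketch that ranks image tokens by their [CLS] attention and retains the top fraction (keep ratio and shapes are illustrative assumptions):

```python
# Minimal sketch of attentive token removal: keep only the image tokens that
# receive the strongest [CLS] attention, instead of dropping tokens at random.
import torch

def attentive_token_removal(tokens, cls_attn, keep_ratio=0.25):
    """tokens: (B, N, C); cls_attn: (B, N) attention of [CLS] to each token."""
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices          # most semantically salient tokens
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, C))

tokens = torch.randn(2, 196, 768)                  # 14x14 ViT patch tokens
cls_attn = torch.rand(2, 196)
kept = attentive_token_removal(tokens, cls_attn)
print(kept.shape)                                  # torch.Size([2, 49, 768])
```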