- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Advanced Vision and Imaging
- Natural Language Processing Techniques
- Image Processing Techniques and Applications
- Autonomous Vehicle Technology and Safety
- 3D Shape Modeling and Analysis
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
- Image Retrieval and Classification Techniques
- Robotics and Sensor-Based Localization
- Generative Adversarial Networks and Image Synthesis
- CCD and CMOS Imaging Sensors
- Visual perception and processing mechanisms
- Remote Sensing and LiDAR Applications
- Anomaly Detection Techniques and Applications
- Human Pose and Action Recognition
- Advanced Image Processing Techniques
- Industrial Vision Systems and Defect Detection
- Adversarial Robustness in Machine Learning
- 3D Surveying and Cultural Heritage
- Advanced Memory and Neural Computing
Group Sense (China), 2023-2024
The Sense Innovation and Research Center, 2023
Microsoft Research (United Kingdom), 2019
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less...
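The sampling idea described above can be sketched in a few lines. This is a toy single-query version with hypothetical names, using nearest-neighbor lookup instead of the bilinear interpolation of the actual operator: instead of attending over every position of an H x W feature map, the query aggregates only K sampled points around its reference point.

```python
import numpy as np

def deformable_attention(value, ref_point, offsets, weights):
    """Toy deformable attention for one query (illustrative sketch).

    value:     H x W x C feature map
    ref_point: (y, x) reference location of the query
    offsets:   K x 2 learned sampling offsets around the reference
    weights:   K attention weights (sum to 1)
    """
    H, W, C = value.shape
    out = np.zeros(C)
    for k in range(len(offsets)):
        # sample location = reference + learned offset (nearest-neighbor
        # here; the real operator uses bilinear interpolation)
        y = int(np.clip(round(ref_point[0] + offsets[k, 0]), 0, H - 1))
        x = int(np.clip(round(ref_point[1] + offsets[k, 1]), 0, W - 1))
        out += weights[k] * value[y, x]
    return out

rng = np.random.default_rng(0)
value = rng.standard_normal((32, 32, 8))   # H x W x C feature map
offsets = rng.standard_normal((4, 2))      # K = 4 sampling offsets
weights = np.full(4, 0.25)                 # uniform attention weights
out = deformable_attention(value, (16.0, 16.0), offsets, weights)
print(out.shape)  # (8,)
```

The cost per query is O(K) rather than O(HW), which is where the speed-up over dense attention on image feature maps comes from.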
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions...
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for...
A modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized...
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective view supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final...
Transformer, as a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when used for video inpainting that requires fine-grained representation, existing methods still suffer from yielding blurry edges in detail due to hard patch splitting. Here we aim to tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. The soft split divides the feature map into many...
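The contrast with hard patch splitting can be illustrated with a small sketch (hypothetical helper names, a simplified 2D single-channel version, not the paper's implementation): because the stride is smaller than the patch size, neighboring patches overlap, and the composition step averages the overlapping contributions so patch borders blend instead of producing hard seams.

```python
import numpy as np

def soft_split(feat, patch=4, stride=2):
    """Split a feature map into overlapping patches; stride < patch,
    so neighboring patches share pixels (unlike hard splitting)."""
    H, W = feat.shape
    patches = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            patches.append(feat[y:y + patch, x:x + patch].ravel())
    return np.stack(patches)

def soft_composition(tokens, shape, patch=4, stride=2):
    """Fold overlapping patches back into a map, averaging the
    contributions at pixels covered by several patches."""
    H, W = shape
    out = np.zeros((H, W))
    cnt = np.zeros((H, W))
    i = 0
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out[y:y + patch, x:x + patch] += tokens[i].reshape(patch, patch)
            cnt[y:y + patch, x:x + patch] += 1
            i += 1
    return out / np.maximum(cnt, 1)

feat = np.arange(64, dtype=float).reshape(8, 8)
tokens = soft_split(feat)            # (9, 16): 3 x 3 overlapping patches
recon = soft_composition(tokens, (8, 8))
print(np.allclose(recon, feat))      # split/compose round-trips: True
```

In a real model the tokens would pass through Transformer blocks between the split and the composition; the round-trip here just checks that the two operations are consistent inverses when nothing modifies the tokens.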
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view comes to be of vital importance. BEV perception inherits several advantages, as representing the surrounding...
A human driver can easily describe the complex traffic scene through the visual system. Such an ability of precise perception is essential for a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed as Occupancy, would be desirable. Compared to the form of bounding box, a key insight behind occupancy is that it could capture the fine-grained details of critical obstacles in the scene, and thereby facilitate subsequent tasks...
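The grid representation described above can be sketched as a simple voxelization of labeled points (an illustrative sketch with assumed names and shapes, not any paper's pipeline): each cell stores the majority semantic label of the points falling into it, or -1 when empty.

```python
import numpy as np

def voxelize(points, labels, grid_min, voxel_size, grid_shape, n_classes):
    """Quantize labeled 3D points into a semantic occupancy grid.

    points:  N x 3 array of (x, y, z) coordinates
    labels:  N integer semantic labels in [0, n_classes)
    Returns a grid where each cell holds the majority label, or -1.
    """
    votes = np.zeros(grid_shape + (n_classes,), dtype=int)
    idx = ((points - grid_min) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    for (i, j, k), c in zip(idx[keep], labels[keep]):
        votes[i, j, k, c] += 1      # per-cell label voting
    occ = np.where(votes.sum(-1) > 0, votes.argmax(-1), -1)
    return occ

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.15, 0.1], [1.6, 0.1, 0.1]])
lab = np.array([2, 2, 0])  # hypothetical classes, e.g. 2 = vehicle, 0 = road
occ = voxelize(pts, lab, grid_min=np.zeros(3), voxel_size=0.5,
               grid_shape=(4, 4, 4), n_classes=3)
print(occ[0, 0, 0], occ[3, 0, 0], occ[1, 1, 1])  # 2 0 -1
```

Unlike a bounding box, the grid can represent obstacles of arbitrary shape at the resolution of the voxel size, which is the fine-grained property the abstract refers to.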
To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources are proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been proved that combining multiple pre-training strategies and data from multiple modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of pre-training. It is thus desirable that these strategies be integrated in...
Video inpainting aims to fill the given spatiotemporal holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, it still suffers from synthesizing blurry texture as well as huge computational cost. Towards this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency. Our proposed DSTT disentangles the task of...
The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, the current landscape of research predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of Reinforcement Learning...
Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale feature synchronizer module,...
Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field, with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively,...
Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates...
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing world models mainly build on video prediction models. Although these models can produce high-fidelity video sequences with an advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining the generation loss with MAE-style feature-level context learning. In particular,...
Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform close-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between the language decisions and the vehicle control commands by standardizing the decision states according to the off-the-shelf motion planning module. (2) We employ a...
Multi-camera 3D object detection has blossomed in recent years, and most of the state-of-the-art methods are built upon bird's-eye-view (BEV) representations. Albeit remarkable performance, these works suffer from low efficiency. Typically, knowledge distillation can be used for model compression. However, due to unclear 3D geometry reasoning, expert features usually contain some noisy and confusing areas. In this work, we investigate how to distill the knowledge from an imperfect expert. We propose FD3D, a Focal Distiller...
The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact...
This article introduces the solutions of team lvisTraveler for the LVIS Challenge 2020. In this work, two characteristics of the LVIS dataset are mainly considered: the long-tailed distribution and high-quality instance segmentation masks. We adopt a two-stage training pipeline. In the first stage, we incorporate EQL and self-training to learn a generalized representation. In the second stage, we utilize Balanced GroupSoftmax to promote the classifier, and propose a novel proposal assignment strategy and a new balanced mask loss for the mask head to get more precise...
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power, and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase...
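The first enhancement can be illustrated with a toy aggregation step (a simplified sketch with assumed names, not the actual operator): with softmax, the K sampled values can only be mixed as a convex combination with weights in [0, 1] summing to 1; without it, the weights are unbounded, like a regular convolution's kernel weights, so the output can amplify or negate individual samples.

```python
import numpy as np

def aggregate(values, weights, use_softmax):
    """Aggregate K sampled values with per-sample weights.

    use_softmax=True  : convex combination (softmax-bounded weights)
    use_softmax=False : unbounded weights, convolution-like mixing
    """
    if use_softmax:
        w = np.exp(weights - weights.max())
        w = w / w.sum()             # weights forced into [0, 1], sum to 1
    else:
        w = weights                 # weights may be negative or exceed 1
    return (w[:, None] * values).sum(axis=0)

vals = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])       # K = 3 sampled feature vectors
w = np.array([2.0, -1.0, 0.5])
print(aggregate(vals, w, use_softmax=False))  # [ 2.5 -0.5]
```

With `use_softmax=True` the same weights would be squashed to positive values summing to 1, so the output stays inside the convex hull of the samples; removing the normalization is what the abstract calls enhancing the operator's dynamic property.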