Bu Jin

ORCID: 0000-0001-7577-2177
Research Areas
  • Multimodal Machine Learning Applications
  • Advanced Vision and Imaging
  • Human Pose and Action Recognition
  • Natural Language Processing Techniques
  • Video Analysis and Summarization
  • Robotics and Automated Systems
  • Image Enhancement Techniques
  • Image Processing Techniques and Applications
  • Generative Adversarial Networks and Image Synthesis
  • Explainable Artificial Intelligence (XAI)
  • Topic Modeling
  • Machine Learning in Healthcare
  • Advanced Neural Network Applications
  • Advanced Queuing Theory Analysis
  • 3D Shape Modeling and Analysis
  • Stochastic processes and financial applications
  • Anomaly Detection Techniques and Applications
  • Simulation Techniques and Applications
  • Autonomous Vehicle Technology and Safety
  • Video Surveillance and Tracking Methods
  • Semantic Web and Ontologies
  • Neural dynamics and brain function
  • Adversarial Robustness in Machine Learning
  • EEG and Brain-Computer Interfaces
  • Robot Manipulation and Learning

Chinese Academy of Sciences
2023-2024

Institute of Automation
2023-2024

Tsinghua University
2023-2024

University of Chinese Academy of Sciences
2022-2023

Beijing Academy of Artificial Intelligence
2023

End-to-end autonomous driving has great potential in the transportation industry. However, the lack of transparency and interpretability of the automatic decision-making process hinders its industrial adoption in practice. There have been some early attempts to use attention maps or cost volume for better model explainability, but these are difficult for ordinary passengers to understand. To bridge the gap, we propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides...

10.1109/icra48891.2023.10160326 article EN 2023-05-29

Self-supervised depth estimation has drawn a lot of attention recently, as it can promote the 3D sensing capabilities of self-driving vehicles. However, it intrinsically relies upon the photometric consistency assumption, which hardly holds during nighttime. Although various supervised nighttime image enhancement methods have been proposed, their generalization performance in challenging driving scenarios is not satisfactory. To this end, we propose the first method that jointly learns a nighttime image enhancer and...

10.1109/icra48891.2023.10160708 article EN 2023-05-29
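The photometric consistency assumption mentioned above is usually enforced with a per-pixel loss mixing SSIM and L1 between a target frame and a source frame warped through the predicted depth and pose. A minimal numpy sketch of that standard loss (a global SSIM is used in place of a local windowed SSIM for brevity; this is the common formulation, not necessarily the paper's exact one):

```python
import numpy as np

def photometric_loss(target, warped, alpha=0.85, eps=1e-7):
    """Standard self-supervised photometric loss:
    alpha * (1 - SSIM) / 2 + (1 - alpha) * L1.

    target, warped: float arrays in [0, 1] of shape (H, W, 3).
    SSIM is computed over the whole image here, purely for illustration.
    """
    l1 = np.abs(target - warped).mean()
    mu_x, mu_y = target.mean(), warped.mean()
    var_x, var_y = target.var(), warped.var()
    cov = ((target - mu_x) * (warped - mu_y)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2) + eps)
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1

rng = np.random.default_rng(0)
day = rng.random((4, 4, 3))
print(photometric_loss(day, day))            # near zero: assumption holds
print(photometric_loss(day, day * 0.2) > 0)  # True: a brightness change breaks it
```

The second call mimics the nighttime failure mode: a global illumination change inflates the loss even when the geometry is perfectly aligned, which is why the assumption "hardly holds during nighttime".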

Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and responses. Nonetheless, prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures, due to the paucity of training data specific to accident scenarios. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Description), a novel framework that enhances accident scene understanding by generating accident videos aligned with...

10.48550/arxiv.2502.14801 preprint EN arXiv (Cornell University) 2025-02-20

We address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. Firstly, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Secondly, colored 3D meshes are differentiably rendered into 2D images, via a viewpoint sampling strategy tailored for indoor scenes. Thirdly, rendered 2D images are compared to the phrases via pre-trained vision-language models. Lastly, errors are back-propagated to the multi-layer perceptron to update the colors of corresponding...

10.1145/3552482.3556555 article EN 2022-09-28
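The four-step pipeline above can be sketched in miniature. Below, a tiny numpy MLP (hypothetical layer sizes; the paper's architecture may differ) maps 3D vertex coordinates to bounded RGB residues that are added to the mesh's base vertex colors; the differentiable rendering and the vision-language comparison are omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-layer MLP: xyz -> 16 hidden units -> RGB residue in [-1, 1].
W1 = rng.normal(0, 0.5, (3, 16))
b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 3))
b2 = np.zeros(3)

def rgb_residue(xyz):
    """Map vertex coordinates (N, 3) to per-vertex RGB residues (N, 3)."""
    h = np.tanh(xyz @ W1 + b1)
    return np.tanh(h @ W2 + b2)

verts = rng.random((5, 3))        # 5 vertices of a toy mesh
base_colors = rng.random((5, 3))  # original per-vertex RGB in [0, 1]

# Add the predicted residues and clamp back to a valid color range.
styled = np.clip(base_colors + rgb_residue(verts), 0.0, 1.0)
print(styled.shape)  # (5, 3)
```

In the full pipeline, the styled mesh would be differentiably rendered from sampled viewpoints, scored against the target phrases with a pre-trained vision-language model, and the error back-propagated to update `W1` and `W2`.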

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to achieve superior performance over rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). The MLLM is used as a cognitive agent to introduce...

10.48550/arxiv.2406.01587 preprint EN arXiv (Cornell University) 2024-06-03

Monocular Semantic Occupancy Prediction aims to infer the complete 3D geometry and semantic information of scenes from only 2D images. It has garnered significant attention, particularly due to its potential to enhance the perception of autonomous vehicles. However, existing methods rely on a complex cascaded framework with relatively limited information to restore 3D scenes, including a dependency on supervision solely on the whole network's output, single-frame input, and the utilization of a small backbone. These challenges, in turn, hinder...

10.48550/arxiv.2403.08766 preprint EN arXiv (Cornell University) 2024-03-13

Constructing a 3D scene capable of accommodating open-ended language queries is a pivotal pursuit, particularly within the domain of robotics. Such technology facilitates robots in executing object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g. NeRF) encounter limitations due to the necessity of processing a large number of input views for reconstruction, coupled with their...

10.48550/arxiv.2403.09637 preprint EN arXiv (Cornell University) 2024-03-14

3D dense captioning stands as a cornerstone in achieving a comprehensive understanding of 3D scenes through natural language. It has recently witnessed remarkable achievements, particularly in indoor settings. However, the exploration of outdoor scenes is hindered by two major challenges: 1) the domain gap between indoor and outdoor scenes, such as dynamics and sparse visual inputs, makes it difficult to directly adapt existing indoor methods; 2) the lack of data with box-caption pair annotations specifically tailored for outdoor scenes....

10.48550/arxiv.2403.19589 preprint EN arXiv (Cornell University) 2024-03-28

We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared with 2D video-based world models, DOME utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete...

10.48550/arxiv.2410.10429 preprint EN arXiv (Cornell University) 2024-10-14

End-to-end architectures in autonomous driving (AD) face a significant challenge in interpretability, impeding human-AI trust. Human-friendly natural language has been explored for tasks such as driving explanation and 3D dense captioning. However, previous works primarily focused on the paradigm of declarative interpretability, where interpretations are not grounded in the intermediate outputs of AD systems, making them only declarative. In contrast, aligned interpretability establishes a connection between language and the intermediate outputs of AD systems. Here we introduce...

10.48550/arxiv.2409.06702 preprint EN arXiv (Cornell University) 2024-09-10

EEG-based brain-computer interfaces (BCIs) have the potential to decode visual information. Recently, artificial neural networks (ANNs) have been used to classify EEG signals evoked by visual stimuli. However, methods using ANNs to extract features from raw EEG signals still perform worse than those using traditional frequency-domain features, and they are typically evaluated on small-scale datasets at a low sample rate, which can hinder the capabilities of deep-learning models. To overcome these limitations, we propose a hybrid local-global...

10.1038/s41598-024-77923-4 article EN cc-by-nc-nd Scientific Reports 2024-11-08
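The traditional frequency-domain features referred to above are typically band powers (theta, alpha, beta, etc.) derived from a power spectrum of each channel. A minimal numpy sketch (plain periodogram, conventional band edges; the paper's baselines may use Welch averaging or different bands):

```python
import numpy as np

def band_powers(eeg, fs,
                bands={"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}):
    """Per-channel band-power features from a raw EEG segment.

    eeg: (channels, samples) float array; fs: sample rate in Hz.
    Returns a dict band name -> (channels,) array of mean power in that band.
    """
    freqs = np.fft.rfftfreq(eeg.shape[1], d=1.0 / fs)
    psd = np.abs(np.fft.rfft(eeg, axis=1)) ** 2 / eeg.shape[1]
    return {name: psd[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
            for name, (lo, hi) in bands.items()}

# A synthetic 1-channel "EEG" dominated by a 10 Hz (alpha-band) oscillation.
fs = 250
t = np.arange(fs * 2) / fs
sig = np.sin(2 * np.pi * 10 * t)[None, :]
feats = band_powers(sig, fs)
print(feats["alpha"][0] > feats["beta"][0])  # True: the alpha band dominates
```

Concatenating such per-band, per-channel powers yields the fixed-length frequency-domain feature vector that raw-signal ANNs are compared against.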

The end-to-end autonomous driving paradigm has recently attracted lots of attention due to its scalability. However, existing methods are constrained by the limited scale of real-world data, which hinders a comprehensive exploration of the scaling laws associated with end-to-end autonomous driving. To address this issue, we collected substantial data from various scenarios and behaviors and conducted an extensive study on the scaling laws of imitation learning-based paradigms. Specifically, approximately 4 million demonstrations from 23 different...

10.48550/arxiv.2412.02689 preprint EN arXiv (Cornell University) 2024-12-03
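Scaling laws of the kind studied above are usually summarized by fitting a power law L(N) = a * N^(-b) to validation loss versus demonstration count, i.e. a straight line in log-log space. A sketch on synthetic numbers (illustrative only; these are not the paper's measurements):

```python
import numpy as np

# Hypothetical (dataset size, validation loss) pairs following
# L = 2.0 * N^-0.25, perturbed by a few percent of noise.
n = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 4e6])
noise = np.array([0.01, -0.02, 0.015, 0.0, -0.01, 0.005])
loss = 2.0 * n ** -0.25 * np.exp(noise)

# Fit log L = log a - b * log N with least squares.
slope, log_a = np.polyfit(np.log(n), np.log(loss), 1)
a_fit, exponent = np.exp(log_a), -slope
print(round(exponent, 2))  # 0.25, recovering the true exponent

# Extrapolate the fitted law to a larger dataset.
predict = lambda N: a_fit * N ** -exponent
print(predict(1e7) < predict(1e6))  # True: loss keeps decreasing with scale
```

The fitted exponent is what makes scaling studies actionable: it predicts how much additional demonstration data is needed for a target loss before collecting it.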

Self-supervised depth estimation has drawn a lot of attention recently, as it can promote the 3D sensing capabilities of self-driving vehicles. However, it intrinsically relies upon the photometric consistency assumption, which hardly holds during nighttime. Although various supervised nighttime image enhancement methods have been proposed, their generalization performance in challenging driving scenarios is not satisfactory. To this end, we propose the first method that jointly learns a nighttime image enhancer and a depth estimator,...

10.48550/arxiv.2302.01334 preprint EN other-oa arXiv (Cornell University) 2023-01-01

End-to-end autonomous driving has great potential in the transportation industry. However, the lack of transparency and interpretability of the automatic decision-making process hinders its industrial adoption in practice. There have been some early attempts to use attention maps or cost volume for better model explainability, but these are difficult for ordinary passengers to understand. To bridge the gap, we propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides...

10.48550/arxiv.2302.00673 preprint EN cc-by-sa arXiv (Cornell University) 2023-01-01

We address the new problem of language-guided semantic style transfer of 3D indoor scenes. The input is a 3D indoor scene mesh and several phrases that describe the target scene. Firstly, 3D vertex coordinates are mapped to RGB residues by a multi-layer perceptron. Secondly, colored 3D meshes are differentiably rendered into 2D images, via a viewpoint sampling strategy tailored for indoor scenes. Thirdly, rendered 2D images are compared to the phrases via pre-trained vision-language models. Lastly, errors are back-propagated to the multi-layer perceptron to update the colors of corresponding...

10.48550/arxiv.2208.07870 preprint EN cc-by-sa arXiv (Cornell University) 2022-01-01