Mingyu Ding

ORCID: 0000-0001-6556-8359
Research Areas
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Advanced Vision and Imaging
  • Robotics and Sensor-Based Localization
  • Human Pose and Action Recognition
  • Advanced Image and Video Retrieval Techniques
  • Reinforcement Learning in Robotics
  • Robot Manipulation and Learning
  • Remote Sensing and LiDAR Applications
  • Topic Modeling
  • Autonomous Vehicle Technology and Safety
  • Visual Attention and Saliency Detection
  • 3D Surveying and Cultural Heritage
  • Natural Language Processing Techniques
  • Anomaly Detection Techniques and Applications
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Explainable Artificial Intelligence (XAI)
  • COVID-19 diagnosis using AI
  • Image Enhancement Techniques
  • Robotic Path Planning Algorithms
  • Optical measurement and interference techniques
  • Video Surveillance and Tracking Methods
  • Human Motion and Animation

University of North Carolina at Chapel Hill
2025

University of California, Berkeley
2023-2025

Technical University of Munich
2024

University of Hong Kong
2019-2023

Chinese University of Hong Kong
2020-2023

Berkeley College
2023

Renmin University of China
2018-2020

HKU-Pasteur Research Pole
2020

3D object detection from a single image without LiDAR is a challenging task due to the lack of accurate depth information. Conventional 2D convolutions are unsuitable for this task because they fail to capture local object structure and its scale information, which are vital for 3D detection. To better represent 3D structure, prior arts typically transform depth maps estimated from 2D images into a pseudo-LiDAR representation, and then apply existing point-cloud based detectors. However, their results depend heavily on the accuracy of the estimated depth maps, resulting in...
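
The pseudo-LiDAR conversion mentioned above amounts to back-projecting each pixel of the estimated depth map through the camera intrinsics. A minimal sketch of that transform (not taken from the paper; the intrinsics and depth map below are placeholders):

```python
# Illustrative sketch: back-project an estimated depth map into a pseudo-LiDAR
# point cloud using a pinhole camera model.
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Convert an HxW depth map (meters) into an Nx3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only points with valid (positive) depth

# Example with hypothetical KITTI-like intrinsics and a synthetic depth map.
depth = np.random.uniform(1.0, 50.0, size=(375, 1242)).astype(np.float32)
cloud = depth_to_pseudo_lidar(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
print(cloud.shape)  # (N, 3)
```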

10.1109/cvpr42600.2020.01169 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Camera re-localization is an important but challenging task in applications like robotics and autonomous driving. Recently, retrieval-based methods have been considered a promising direction as they can be easily generalized to novel scenes. Despite the significant progress that has been made, we observe that the performance bottleneck of previous methods actually lies in the retrieval module. These methods use the same features for both image retrieval and relative pose regression tasks, which leads to potential conflicts in learning. To this end, here we present...
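
For context, the retrieval step in such pipelines is typically a nearest-neighbor search over image embeddings. A generic sketch under that assumption (names and data layout are illustrative, not from the paper):

```python
# Retrieve the closest database images to a query by cosine similarity; a relative
# pose regressor would then estimate the query pose against each retrieved frame.
import numpy as np

def retrieve_nearest(query_feat, db_feats, db_poses, k=5):
    """query_feat: (D,), db_feats: (N, D), db_poses: (N, ...). Returns top-k indices and poses."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q
    idx = np.argsort(-sims)[:k]
    return idx, db_poses[idx]

# The absolute query pose follows by composing each estimated relative pose with the
# known pose of the corresponding retrieved database image.
```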

10.1109/iccv.2019.00296 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01

Existing few-shot learning (FSL) methods make the implicit assumption that the few target class samples are from the same domain as the source class samples. However, in practice, this assumption is often invalid: the target classes could come from a different domain. This poses an additional challenge of domain adaptation (DA) with few training samples. In this paper, the problem of domain-adaptive few-shot learning (DA-FSL) is tackled, which is expected to have wide use in real-world scenarios and requires solving FSL and DA in a unified framework. To this end, we propose a novel domain-adversarial...
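
A standard building block of domain-adversarial training is the gradient reversal layer. A minimal PyTorch sketch of that idea (generic, with placeholder dimensions; not the paper's architecture):

```python
# Gradient reversal: the domain classifier learns to separate source/target, while the
# reversed gradients push the feature extractor toward domain-confusing features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip and scale the gradient

class DomainClassifier(nn.Module):
    def __init__(self, feat_dim=512, lam=1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, features):
        return self.head(GradReverse.apply(features, self.lam))

# Usage: add a cross-entropy domain loss on source vs. target batches alongside the
# few-shot classification loss; only the gradient flowing into the backbone is reversed.
```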

10.1109/wacv48630.2021.00143 article EN 2021-01-01

Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, robot, and even environment. Can we instead train a generalist X-robot policy that can be adapted to new robots,...

10.48550/arxiv.2310.08864 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability. To address these problems, this work employs Large Language Models (LLMs) as a decision-making component for complex AD scenarios that require human commonsense understanding. We devise cognitive pathways to enable comprehensive reasoning with LLMs, and develop algorithms for translating LLM decisions into actionable commands....

10.48550/arxiv.2310.03026 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale dataset, termed EgoCOT. The dataset consists of carefully selected videos from Ego4D along...

10.48550/arxiv.2305.15021 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Optimization in multi-task learning (MTL) is more challenging than in single-task learning (STL), as the gradients from different tasks can be contradictory. When tasks are related, it is beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and...
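
To make the "groups of experts" idea concrete, here is a generic mixture-of-experts layer with top-k routing in PyTorch. It is a simplified illustration under common MoE conventions; Mod-Squad's actual routing and loss design differ.

```python
# Each input is routed to its top-k experts; outputs are combined with renormalized
# routing weights. Sizes and routing are placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim=256, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, dim)
        scores = F.softmax(self.router(x), dim=-1)         # (batch, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # pick k experts per input
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out
```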

10.1109/cvpr52729.2023.01138 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame per clip is annotated, which makes supervised methods fail to utilize information from the rest of the frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flows, which encode temporal consistency to improve segmentation. However, video segmentation and optical flow estimation are still considered as two separate tasks. In this paper, we propose a novel framework for their joint estimation. Semantic...
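
The temporal-consistency mechanism the paragraph alludes to is usually implemented by warping the previous frame's features or predictions along the optical flow. A generic PyTorch sketch of such a warp (not the paper's exact formulation):

```python
# Warp features/logits from the previous frame into the current frame using flow,
# so a consistency loss can compare them with the current prediction.
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """feat: (B,C,H,W); flow: (B,2,H,W) in pixels, (dx, dy) order."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                              # shift each pixel by flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B,H,W,2)
    return F.grid_sample(feat, sample_grid, align_corners=True)
```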

10.1609/aaai.v34i07.6699 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03

Reducing the complexity of the instance segmentation pipeline is crucial for real-world applications. This work addresses this problem by introducing an anchor-box free and single-shot framework, termed PolarMask++, which reformulates instance segmentation as predicting the contours of objects in polar coordinates, leading to several appealing benefits. (1) The polar representation unifies instance segmentation (masks) and object detection (bounding boxes) into a single framework, reducing design and computational complexity. (2) We carefully design two modules (soft centerness...
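
The core encoding idea, a contour described by distances from the object center along fixed angles, can be sketched in a few lines. This is a simplified numpy illustration; PolarMask++'s exact ray sampling and training targets differ.

```python
# Encode an instance mask as ray lengths from its mass center at fixed angular bins.
import numpy as np

def mask_to_polar(mask, num_rays=36):
    """mask: HxW binary array. Returns (center_y, center_x, ray_lengths[num_rays])."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                      # mass center of the instance
    angles = np.arctan2(ys - cy, xs - cx)              # angle of each foreground pixel
    dists = np.hypot(ys - cy, xs - cx)
    bins = ((angles + np.pi) / (2 * np.pi) * num_rays).astype(int) % num_rays
    ray_len = np.zeros(num_rays)
    for b, d in zip(bins, dists):
        ray_len[b] = max(ray_len[b], d)                # farthest foreground pixel per angle bin
    return cy, cx, ray_len
```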

10.1109/tpami.2021.3080324 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2021-01-01

A deep facial attribute editing model strives to meet two requirements: (1) attribute correctness, i.e., the target attribute should correctly appear on the edited face image; (2) irrelevance preservation, i.e., any irrelevant information (e.g., identity) should not be changed after editing. Meeting both requirements challenges the state-of-the-art works, which resort to either spatial attention or latent space factorization. Specifically, the former assume that each attribute has well-defined local support regions; they are often more effective for a...

10.1109/cvpr46437.2021.00297 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

High-resolution representations (HR) are essential for dense prediction tasks such as segmentation, detection, and pose estimation. Learning HR representations is typically ignored in previous Neural Architecture Search (NAS) methods that focus on image classification. This work proposes a novel NAS method, called HR-NAS, which is able to find efficient and accurate networks for different tasks, by effectively encoding multiscale contextual information while maintaining high-resolution representations. In HR-NAS, we renovate...

10.1109/cvpr46437.2021.00300 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Automated deception detection (ADD) from real-life videos is a challenging task. It specifically needs to address two problems: (1) Both face and body contain useful cues regarding whether a subject is deceptive. How to effectively fuse the two is thus key to the effectiveness of an ADD model. (2) Real-life deceptive samples are hard to collect; learning with limited training data thus challenges most deep learning based models. In this work, both problems are addressed. Specifically, for face-body multimodal learning, a novel...

10.1109/cvpr.2019.00799 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

3D vehicle detection based on point clouds is a challenging task in real-world applications such as autonomous driving. Despite the significant progress that has been made, we observe two aspects to be further improved. First, the semantic context information in LiDAR is seldom explored in previous works, though it may help identify ambiguous vehicles. Second, the distribution of point clouds on vehicles varies continuously with increasing depths, which may not be well modeled by a single model. In this work, we propose a unified model SegVoxelNet...

10.1109/icra40945.2020.9196556 article EN 2020-05-01

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes unsustainable due to heavy computational and storage costs. This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on pre-trained vision-language models. Specifically, adapters are distributed to different modalities and their interactions, with the total tunable...
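
For readers unfamiliar with adapters: the generic recipe is a small bottleneck module with a residual connection, inserted into a frozen backbone so only a few parameters are trained. A minimal PyTorch sketch of that recipe (UniAdapter's exact placement and weight sharing differ):

```python
# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
import torch.nn as nn

class Adapter(nn.Module):
    """Inserted after a frozen transformer sub-layer; only these parameters are tuned."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual connection preserves the frozen backbone's behavior at initialization.
        return x + self.up(self.act(self.down(x)))
```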

10.48550/arxiv.2302.06605 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Recent developments in intelligent robot systems, especially autonomous vehicles, put forward higher requirements for safety and comfort. Road conditions are crucial factors affecting the comprehensive performance of ground vehicles. Nonetheless, existing environment perception datasets for autonomous driving lack attention to road surface areas. In this paper, we introduce a road surface reconstruction dataset, providing multi-modal, high-resolution, high-precision data collected by a real-vehicle platform...

10.1038/s41597-024-03261-9 article EN cc-by Scientific Data 2024-05-06

10.1109/tits.2024.3431671 article EN IEEE Transactions on Intelligent Transportation Systems 2024-07-31

We present GeoManip, a framework to enable generalist robots to leverage essential conditions derived from object and part relationships, expressed as geometric constraints, for robot manipulation. For example, cutting a carrot requires adhering to a geometric constraint: the blade of the knife should be perpendicular to the carrot's direction. By interpreting these constraints through symbolic language representations and translating them into low-level actions, GeoManip bridges the gap between natural language and robotic execution, enabling greater...
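
The perpendicularity example above can be expressed as a simple geometric cost on direction vectors. The following sketch is purely illustrative (the part directions and the cost form are assumptions, not GeoManip's implementation):

```python
# Score how well two part directions satisfy a perpendicularity constraint:
# 0 when perpendicular, 1 when parallel.
import numpy as np

def perpendicularity_cost(dir_a, dir_b):
    a = dir_a / np.linalg.norm(dir_a)
    b = dir_b / np.linalg.norm(dir_b)
    return abs(float(np.dot(a, b)))

# Hypothetical part directions, e.g. estimated by a perception module.
blade_dir = np.array([0.0, 0.0, 1.0])
carrot_axis = np.array([1.0, 0.0, 0.0])
print(perpendicularity_cost(blade_dir, carrot_axis))  # 0.0 -> constraint satisfied
# A planner could minimize such costs (plus distance/contact terms) to produce low-level actions.
```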

10.48550/arxiv.2501.09783 preprint EN arXiv (Cornell University) 2025-01-16

Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This...

10.48550/arxiv.2502.01719 preprint EN arXiv (Cornell University) 2025-02-03

The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid...
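
Action space masking generally means assigning invalid actions a probability of zero before sampling from the policy. A generic PyTorch sketch under that assumption (the validity check itself is a placeholder, not the paper's feasibility model):

```python
# Mask invalid placements with -inf logits so the policy never samples them.
import torch

def masked_policy_distribution(logits, valid_mask):
    """logits: (num_actions,); valid_mask: bool tensor of the same shape."""
    masked = logits.masked_fill(~valid_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

logits = torch.randn(100)                     # e.g., 100 candidate placement poses
valid = torch.zeros(100, dtype=torch.bool)
valid[:10] = True                             # pretend only the first 10 are physically valid
action = masked_policy_distribution(logits, valid).sample()
print(int(action))                            # always one of the valid indices
```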

10.48550/arxiv.2502.13443 preprint EN arXiv (Cornell University) 2025-02-19