Yudong Yang

ORCID: 0000-0002-0120-6831
Research Areas
  • Multimodal Machine Learning Applications
  • Video Analysis and Summarization
  • Soft Robotics and Applications
  • Robot Manipulation and Learning
  • Railway Engineering and Dynamics
  • Structural Load-Bearing Analysis
  • Advanced Image and Video Retrieval Techniques
  • Robotic Path Planning Algorithms
  • Structural Health Monitoring Techniques
  • Geotechnical Engineering and Soil Stabilization
  • Modular Robots and Swarm Intelligence
  • Geotechnical Engineering and Analysis
  • Teleoperation and Haptic Systems
  • Structural Engineering and Vibration Analysis
  • Robotics and Sensor-Based Localization
  • Natural Language Processing Techniques
  • Music and Audio Processing

Nanjing University of Posts and Telecommunications
2019

Xi'an University of Architecture and Technology
2010

Tsinghua University
1995

While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset...

10.48550/arxiv.2502.11775 preprint EN arXiv (Cornell University) 2025-02-17

Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates...

10.48550/arxiv.2503.19951 preprint EN arXiv (Cornell University) 2025-03-25

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training,...

10.48550/arxiv.2410.06682 preprint EN arXiv (Cornell University) 2024-10-09

In such applications of VR to teleoperation, the most important thing is to keep the virtual environment consistent with the real environment during operation. If this can be ensured, a good part of practical teleoperation can be performed in an automatic way. Following this idea, we have developed a model-based dynamic calibration algorithm for the consistency of the two environments. First, the model of the environment is created by moving a camera through the environment. We use a multi-position-based stereo vision technique. In the process of rehearsal, the path planned by the operator...

10.1117/12.227947 article EN Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE 1995-12-01