- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Soft Robotics and Applications
- Robot Manipulation and Learning
- Railway Engineering and Dynamics
- Structural Load-Bearing Analysis
- Advanced Image and Video Retrieval Techniques
- Robotic Path Planning Algorithms
- Structural Health Monitoring Techniques
- Geotechnical Engineering and Soil Stabilization
- Modular Robots and Swarm Intelligence
- Geotechnical Engineering and Analysis
- Teleoperation and Haptic Systems
- Structural Engineering and Vibration Analysis
- Robotics and Sensor-Based Localization
- Natural Language Processing Techniques
- Music and Audio Processing
Nanjing University of Posts and Telecommunications
2019
Xi'an University of Architecture and Technology
2010
Tsinghua University
1995
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset...
Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates...
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training,...
In such applications of VR to teleoperation, the most important thing is to keep the virtual environment consistent with the real environment over the course of operation. If this can be ensured, a good part of practical teleoperation can be performed in an automatic way. Following this idea, we have developed a model-based dynamic calibration algorithm for the consistency of the two environments. First, a model of the real environment is created by moving the camera through the environment. We use a multi-position-based stereo vision technique. In the process of rehearsal, the path is planned by the operator...