- Multimodal Machine Learning Applications
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- Vehicle Noise and Vibration Control
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Machine Fault Diagnosis Techniques
- Advanced Vision and Imaging
- 3D Surveying and Cultural Heritage
- Video Analysis and Summarization
- Video Surveillance and Tracking Methods
- Advanced Measurement and Detection Methods
- Advanced Algorithms and Applications
- Acoustic Wave Phenomena Research
- Cancer-related molecular mechanisms research
- Visual Attention and Saliency Detection
- Aerodynamics and Acoustics in Jet Flows
- Turbomachinery Performance and Optimization
- Structural Health Monitoring Techniques
- IoT-based Smart Home Systems
- IoT and GPS-based Vehicle Safety Systems
- Infrared Target Detection Methodologies
- Bayesian Modeling and Causal Inference
- Gaze Tracking and Assistive Technology
Aero Engine Corporation of China (China)
2024
Hangzhou Dianzi University
2024
Fudan University
2020-2023
Rochester Institute of Technology
2022
Westlake University
2020-2022
Hong Kong Metropolitan University
2020
Beijing University of Posts and Telecommunications
2017
Video object detection is a challenging task because isolated video frames may encounter appearance deterioration, which introduces great confusion for detection. One of the popular solutions is to exploit temporal information and enhance per-frame representation through aggregating features from neighboring frames. Despite achieving improvements in detection, existing methods focus on the selection of higher-level aggregation rather than on modeling lower-level relations to increase feature...
In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser maps, our method can produce more keypoint features and increase retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged pairs....
Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local vision representation for sentence generation, leaving plenty of room for improvement. In this work, we approach video captioning from a new perspective and propose the GLR framework, namely global-local representation granularity. Our framework demonstrates three advantages over prior efforts. First, we propose a simple solution, which exploits extensive...
Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation, namely STC-Seg. Concretely, STC-Seg demonstrates four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce...
Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the task from a new perspective and propose the GL-RG framework for video captioning, namely Global-Local Representation Granularity. Our framework demonstrates three advantages over prior efforts: 1)...
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches have been introduced to learn fixed textual or visual prompts while freezing the pre-trained model in downstream tasks. Despite the effectiveness of prompt tuning, what those learnable prompts learn remains unexplained. In this work, we explore whether fine-tuning can learn knowledge-aware prompts from pre-training, by designing two different sets of prompts for the pre-training and fine-tuning phases respectively. Specifically, we present a Video-Language Prompt (VL-Prompt)...
Geo-localization is a critical task in computer vision. In this work, we cast geo-localization as a 2D image retrieval task. Current state-of-the-art methods for geo-localization are not robust enough to locate a scene with drastic scale variations because they only exploit features from one semantic level for image representations. To address this limitation, we introduce a hierarchical attention fusion network using multi-scale features for geo-localization. We extract feature maps from a convolutional neural network (CNN) and organically fuse the extracted features. Our...
Vision and voice are two vital keys for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information of visual observation in order to enhance robots' environment understanding. We make use of single RGB images taken by a first-view monocular camera. We also apply a self-attention mechanism to keep the agent focusing on key areas. Memory is important to avoid repeating certain...
Noise source identification of gas turbines can provide the basis and guidance for vibration and noise reduction of gas turbines. Independent component analysis (ICA) is one of the most popular techniques for blind source separation (BSS) and is widely used in mechanical systems. ICA is suitable for independent signals. However, in order to identify dependent sources in gas turbines, a convolutive BSS method in the frequency domain based on bounded component analysis (BCA) is proposed. First, the basic theory of BCA is introduced in detail. The time-domain mixing is transformed into an...
In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser maps, our method can produce more keypoint features and increase retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged pairs....
Video captioning is a challenging task as it needs to accurately transform visual understanding into natural language description. To date, state-of-the-art methods inadequately model global-local representation across video frames for caption generation, leaving plenty of room for improvement. In this work, we approach the task from a new perspective and propose the GL-RG framework for video captioning, namely Global-Local Representation Granularity. Our framework demonstrates three...
Planar reconstruction detects planar segments and deduces their 3D parameters (normals and offsets) from the input image; this has significant potential in the fields of digital preservation of cultural heritage, architectural design, robot navigation, intelligent transportation, and security monitoring. Existing methods mainly employ multiple-view images with limited overlap for reconstruction but lack utilization of the relative position and rotation information between images. To fill this gap, this paper uses two views and their relative camera pose to...
Driver attention prediction plays a crucial role in the development of intelligent driving and assisted driving systems. However, this task presents several challenges to researchers, including the difficulty of effectively utilizing scene information and the lack of driver attention models that can accurately predict the driver’s multiple regions of fixation. To address these challenges, this work proposes a novel multi-scale feature fusion network (MSFFDAP) for driver attention prediction. MSFFDAP uses a convolutional neural network to extract...
Conventional model-driven operational transfer path analysis (OTPA) cannot update and optimize itself based on data characteristics, which weakens its accuracy and reliability. Inspired by the data-driven thinking of learning from data, this paper develops a statistically data-driven OTPA. First, considering the statistical distribution characteristics of potential errors in the data according to the central limit theorem, the factors affecting the error in calculating transmissibility are analyzed and summarized. Then, by constructing an objective...
Recently, instance segmentation models with complex architectures and large parameter sets have shown impressive levels of precision. Nonetheless, from a practical perspective, balancing precision and speed is more desirable. Real-time instance segmentation faces efficiency and quality challenges in urban street scenes. In the present research, we propose a YOLOv8-seg based model named LAtt-Yolov8-seg. A pivotal advancement lies in the introduction of a mechanism called Focused Linear Attention, which effectively...
With the continuous development and popularization of drone technology, drones are widely used in various fields, especially in video applications. We propose DroneGPT, a neural-symbolic method that learns from VISPROG, which does not require any task-specific training. It leverages the contextual learning ability of large language models to generate and execute modular programs, solving complex and compositional vision tasks given natural language instructions. The modules in the program can call several ready-made computer...
Inspection of pipelines is particularly important for the drainage industry, and the automation of this process has received a lot of attention. We propose Mixture of Experts for Sewer Defect Classification (Sewer-MoE), an innovative model for identifying pipe defects, in which we train multiple expert models and then merge them into a single multiclassification model. During the training process, we produced an attention mechanism structure that allows each expert to refer to the other expert models, while weighting the classification...
Object trajectory prediction is a hot research issue with wide applications in video surveillance and autonomous driving. Previous studies consider interaction sparsity mainly among pedestrians instead of among multi-type objects, which bring new types of interactions and consequently superfluous ones. This paper proposes a Multi-type Object Trajectory Prediction (MOTP) method with a Sparse Multi-relational Graph Convolutional Network (SMGCN) and a novel multi-round Global Temporal Aggregation (GTA). MOTP introduces...