- Human Pose and Action Recognition
- Video Analysis and Summarization
- Vehicle License Plate Recognition
- Advanced Neural Network Applications
- Handwritten Text Recognition Techniques
- Infrastructure Maintenance and Monitoring
- Advanced Image and Video Retrieval Techniques
- Hand Gesture Recognition Systems
- Image Processing and 3D Reconstruction
- Advanced Vision and Imaging
- Music and Audio Processing
Hokkaido University
2021-2024
Traffic sign recognition is a complex and challenging yet popular problem that can assist drivers on the road and reduce traffic accidents. Most existing methods use convolutional neural networks (CNNs) and achieve high accuracy. However, these methods first require a large number of carefully crafted datasets for the training process. Moreover, since signs differ between countries and come in a wide variety, the methods need to be fine-tuned when recognizing new categories. To address these issues, we propose a matching method for zero-shot...
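A minimal sketch of the zero-shot matching idea: a query sign is assigned to the template whose feature embedding is most similar, so no fine-tuning is needed for new categories. The feature dimensions and values here are toy placeholders, not the paper's actual embeddings.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize feature vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def match_sign(query_feat: np.ndarray, template_feats: np.ndarray) -> int:
    """Return the index of the template sign whose embedding has the
    highest cosine similarity to the query sign embedding."""
    q = l2_normalize(query_feat)
    t = l2_normalize(template_feats)
    scores = t @ q  # cosine similarities, shape (num_templates,)
    return int(np.argmax(scores))

# toy features: 3 template signs, 4-dim embeddings
templates = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.1, 0.9, 0.05, 0.0])
print(match_sign(query, templates))  # → 1
```

Because matching only compares embeddings, adding a new sign category amounts to adding one template row.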
Recent multimodal large language models (MLLMs) such as GPT-4o and GPT-4v have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on an MLLM for enhancing traffic sign recognition (TSR). We first construct a detection network based on the Vision Transformer Adapter and an extraction module to extract traffic signs from original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce the MLLM....
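The in-context learning step can be sketched as assembling a chat-style prompt: a handful of labeled sign examples from the source domain precede the query sign, and the MLLM answers the final turn. This is a hypothetical prompt layout, not the paper's exact template or API call.

```python
def build_icl_messages(examples, query_description):
    """Assemble a few-shot in-context prompt as a chat message list.
    `examples` is a list of (sign_description, label) pairs from the
    source-country data; the last user turn asks about the query sign."""
    messages = [{"role": "system",
                 "content": "You are a traffic sign recognition assistant."}]
    for description, label in examples:
        messages.append({"role": "user", "content": f"Sign: {description}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Sign: {query_description}"})
    return messages

msgs = build_icl_messages(
    [("red octagon with white border", "stop"),
     ("blue circle with white arrow", "mandatory direction")],
    "red-bordered triangle with an exclamation mark")
print(len(msgs))  # system + 2 shots x 2 turns + 1 query = 6
```

Because the shots are supplied at inference time, cross-country adaptation needs no gradient updates.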
We propose a new strategy called "think twice before recognizing" to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult due to complex road conditions, and existing approaches particularly struggle in cross-country settings when training data are lacking. Our strategy achieves effective recognition by stimulating the multiple-thinking capability of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context...
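The three description types can be pictured as successive prompt stages that the LMM answers before committing to a label. The stage wording below is illustrative only; the paper's actual templates are not reproduced here.

```python
def think_twice_stages(sign_region_desc):
    """Build the three description-driven thinking stages as an ordered
    list of prompts (context -> characteristic -> differential)."""
    return [
        # context: what surrounds the sign in the road scene
        f"Describe the road context around this sign: {sign_region_desc}.",
        # characteristic: the sign's own visual attributes
        "Describe the sign's shape, color, and central symbol.",
        # differential: rule out visually similar fine-grained categories
        "Contrast the sign with visually similar categories before naming it.",
    ]

stages = think_twice_stages("sign cropped from a highway scene")
print(len(stages))  # → 3
```

Running the stages in order forces the model to reason about a sign before recognizing it, which is the "think twice" idea.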
This paper presents a transformer-based multimodal soccer scene recognition method that uses both visual and audio modalities. Our approach directly uses the original video frames and the spectrogram extracted from the audio as inputs to the transformer models, which can capture the spatial information of the action at each moment and the contextual temporal relationships between different actions in videos. We fuse the outputs of the models in order to better identify scenes that occur in real matches. The late fusion performs a weighted average of the estimation results to obtain the complete scene....
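The late-fusion step reduces to a weighted average of the per-class probabilities from the two branches. The class names and the 0.6/0.4 weights below are assumptions for illustration, not the paper's reported values.

```python
import numpy as np

def late_fusion(p_visual, p_audio, w_visual=0.6):
    """Weighted average of per-class probabilities from the visual and
    audio branches, renormalized to sum to 1."""
    p_visual = np.asarray(p_visual, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    fused = w_visual * p_visual + (1.0 - w_visual) * p_audio
    return fused / fused.sum()

p_v = [0.7, 0.2, 0.1]  # e.g. goal / foul / other from the frame branch
p_a = [0.4, 0.5, 0.1]  # same classes from the spectrogram branch
fused = late_fusion(p_v, p_a)
print(int(np.argmax(fused)))  # → 0 (visual evidence dominates here)
```

Keeping the branches separate until this step lets each transformer be trained on its own modality.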
This paper presents a scene retrieval method for soccer videos with the video vision Transformer (ViViT). In coaching, it is difficult for the training staff to find the required scenes efficiently from a large number of videos. We tackle this problem with a simple yet effective method. We train the ViViT and obtain output token features from the pre-trained model. The tokens contain spatio-temporal information about the scenes. We then transform the query and candidate scenes into token features and calculate the similarity between them using cosine similarity. We conducted...
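The retrieval step can be sketched as ranking candidate scenes by the cosine similarity of their token features to the query's. The random 8-dim features stand in for real ViViT tokens and are purely illustrative.

```python
import numpy as np

def retrieve_topk(query_token, candidate_tokens, k=3):
    """Rank candidate scenes by cosine similarity of their token
    features to the query token; return the top-k scene indices."""
    q = query_token / np.linalg.norm(query_token)
    c = candidate_tokens / np.linalg.norm(candidate_tokens,
                                          axis=1, keepdims=True)
    sims = c @ q  # one cosine score per candidate scene
    return np.argsort(-sims)[:k].tolist()

rng = np.random.default_rng(0)
cands = rng.normal(size=(5, 8))              # 5 candidate scenes
query = cands[2] + 0.01 * rng.normal(size=8)  # near-duplicate of scene 2
print(retrieve_topk(query, cands)[0])  # → 2
```

Since scoring is a single matrix-vector product over precomputed features, the staff can search a large video library interactively.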
Similar scene retrieval in soccer videos has been drawing a lot of attention in recent years. In previous studies, long and unified frame sequences extracted from videos are used to represent a scene. However, this causes confusion that affects the retrieval performance. In this paper, we propose a frame sequence reduction method based on a combination of short sequences for similar scene retrieval in soccer videos. Our method preserves both the complete contextual information and the immediate state of the action represented by the sequences. The experimental results show that MAP@10 achieves 0.587 with our approach.
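For reference, the MAP@10 metric reported above averages the per-query average precision over the top 10 retrieved scenes. A minimal sketch with toy queries (the data below are placeholders, not the paper's results):

```python
def average_precision_at_k(relevant, ranked, k=10):
    """AP@k for one query: `ranked` is the retrieved scene list,
    `relevant` the set of ground-truth similar scenes."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at each hit position
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(queries, k=10):
    """Mean of AP@k over all (relevant_set, ranked_list) queries."""
    return sum(average_precision_at_k(rel, ranked, k)
               for rel, ranked in queries) / len(queries)

# toy example: two queries with known relevant scenes
queries = [({"a", "b"}, ["a", "x", "b", "y"]),
           ({"c"},      ["z", "c", "w"])]
print(round(map_at_k(queries), 3))  # → 0.667
```

A MAP@10 of 0.587 thus means that, averaged over queries, relevant scenes tend to appear near the top of the 10 returned results.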