- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Video Surveillance and Tracking Methods
- Hand Gesture Recognition Systems
- Video Analysis and Summarization
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Advanced Neural Network Applications
- Subtitles and Audiovisual Media
- Visual Attention and Saliency Detection
- Advanced Image and Video Retrieval Techniques
- Tactile and Sensory Interactions
- Robot Manipulation and Learning
- Context-Aware Activity Recognition Systems
- Usability and User Interface Design
- Virtual Reality Applications and Impacts
- Advanced Vision and Imaging
- Digital Games and Media
- Human-Automation Interaction and Safety
- Time Series Analysis and Forecasting
National Institute of Advanced Industrial Science and Technology
2024-2025
The University of Tokyo
2018-2024
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume...
We present a new task that predicts future locations of people observed in first-person videos. Consider a video stream continuously recorded by a wearable camera. Given a short clip of a person that is extracted from the complete stream, we aim to predict that person's location in future frames. To facilitate this localization ability, we make the following three key observations: (a) first-person videos typically involve significant ego-motion, which greatly affects the location of the target person in future frames; (b) scales of the target person act as a salient cue to estimate...
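As a rough illustration of the future-person-localization setup described above, the sketch below feeds past locations, person scales, and ego-motion cues into a small encoder-decoder that regresses future (x, y) positions. The architecture, dimensions, and names are assumptions for illustration, not the paper's actual model.

```python
# Minimal sketch: predict future 2-D locations of a person from past
# locations, scales, and camera ego-motion (illustrative architecture).
import torch
import torch.nn as nn

class FutureLocalizer(nn.Module):
    def __init__(self, t_in=10, t_out=10, hidden=64):
        super().__init__()
        # Per past frame: (x, y) location, 1-D scale, 2-D ego-motion cue.
        self.encoder = nn.Sequential(
            nn.Conv1d(2 + 1 + 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden * t_in, t_out * 2),  # (x, y) per future frame
        )
        self.t_out = t_out

    def forward(self, locations, scales, ego_motion):
        # locations: (B, T_in, 2), scales: (B, T_in, 1), ego_motion: (B, T_in, 2)
        x = torch.cat([locations, scales, ego_motion], dim=-1).transpose(1, 2)
        return self.decoder(self.encoder(x)).view(-1, self.t_out, 2)

model = FutureLocalizer()
pred = model(torch.rand(4, 10, 2), torch.rand(4, 10, 1), torch.rand(4, 10, 2))
print(pred.shape)  # torch.Size([4, 10, 2])
```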
Object affordance is an important concept in hand-object interaction, providing information on action possibilities based on human motor capacity and objects' physical properties, thus benefiting tasks such as action anticipation and robot imitation learning. However, the definitions in existing datasets often: 1) mix up affordance with object functionality; 2) confuse affordance with goal-related action; and 3) ignore human motor capacity. This paper proposes an efficient annotation scheme to address these issues by combining goal-irrelevant actions...
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of such objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. This requirement imposes a cumbersome burden on users, and the system cannot help them find objects lost unexpectedly. We propose GO-Finder ("Generic Object Finder"), a registration-free, wearable-camera-based system for assisting people in finding an arbitrary number of objects based on two...
Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in hand appearance among different environments, adapting a pre-trained model to a new domain is required for hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-aware image stylization and (ii) consensus pseudo-labeling for domain adaptation of hand segmentation. We stylize source images independently for the foreground and background, using target images as style. To resolve the domain shift that the stylization has not addressed, we apply careful pseudo-labeling by...
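The consensus pseudo-labeling step could look roughly like the following sketch, which keeps a pixel's pseudo-label only where two models agree and both are confident; the threshold and model interface are assumptions, not values from the paper.

```python
# Sketch of consensus pseudo-labeling for hand-segmentation domain
# adaptation: keep a pixel's label only when two models agree and both
# are confident. Threshold and interface are illustrative.
import numpy as np

def consensus_pseudo_labels(prob_a, prob_b, conf_thresh=0.8):
    """prob_a, prob_b: (H, W) hand-probability maps from two models
    (e.g., trained on source vs. stylized-source images).
    Returns labels in {1: hand, 0: background, -1: ignored}."""
    label_a = (prob_a >= 0.5).astype(np.int64)
    label_b = (prob_b >= 0.5).astype(np.int64)
    confident = (np.maximum(prob_a, 1 - prob_a) >= conf_thresh) & \
                (np.maximum(prob_b, 1 - prob_b) >= conf_thresh)
    agree = label_a == label_b
    return np.where(agree & confident, label_a, -1)

# Example with random probability maps standing in for model outputs.
rng = np.random.default_rng(0)
print(consensus_pseudo_labels(rng.random((4, 4)), rng.random((4, 4))))
```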
Formula-driven supervised learning (FDSL) is a growing research topic for finding simple mathematical formulas that generate synthetic data and labels for pre-training neural networks. The main advantage of FDSL is that there is no risk of generating data with ethical implications such as gender or racial bias, because it does not rely on real data, as discussed in previous studies that use fractals and polygons to pre-train image encoders. While FDSL has been proposed for image encoders, it has not been considered for temporal trajectory data. In this paper, we introduce PolarDB,...
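A minimal sketch of the FDSL idea applied to trajectories: synthesize 2-D trajectories from a parametric formula and use the formula parameter as the pre-training label. The rose-curve formula and class definition here are illustrative assumptions, not the actual construction of PolarDB.

```python
# Sketch of formula-driven supervised learning (FDSL) for trajectories:
# generate synthetic 2-D trajectories from a polar-curve formula and use
# the curve parameter as the classification label for pre-training.
import numpy as np

def synth_trajectory(k, noise=0.01, n_points=64, rng=None):
    """Rose curve r = cos(k * theta), sampled as an (x, y) trajectory."""
    rng = rng or np.random.default_rng()
    theta = np.linspace(0, 2 * np.pi, n_points)
    r = np.cos(k * theta)
    xy = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)
    return xy + rng.normal(scale=noise, size=xy.shape)

def make_pretraining_set(n_classes=8, per_class=100, seed=0):
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for k in range(1, n_classes + 1):        # class index = curve parameter k
        for _ in range(per_class):
            data.append(synth_trajectory(k, rng=rng))
            labels.append(k - 1)
    return np.stack(data), np.array(labels)  # (N, 64, 2), (N,)

X, y = make_pretraining_set()
print(X.shape, y.shape)  # (800, 64, 2) (800,)
```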
Temporally localizing the presence of object states in videos is crucial for understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety. To avoid exhaustive annotation, learning from the transcribed narrations of instructional videos would be intriguing. However, object states are less frequently described in narrations than actions, making them less effective as supervision. In this work, we propose to extract object state information from the action information included in narrations, using large language...
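A minimal sketch of mining object-state supervision from narrated actions with a large language model; the prompt wording, output schema, and the `query_llm` callable are placeholders for whichever LLM interface is actually used, not the paper's pipeline.

```python
# Sketch: ask an LLM which object states are implied by a narrated action,
# then parse the reply as structured supervision. Prompt and interface are
# illustrative placeholders.
import json

PROMPT = (
    "A narration from an instructional video says: \"{narration}\".\n"
    "List the object state implied before and after this action as JSON "
    'like {{"object": ..., "state_before": ..., "state_after": ...}}.'
)

def extract_states(narration, query_llm):
    """query_llm: callable mapping a prompt string to the model's reply."""
    reply = query_llm(PROMPT.format(narration=narration))
    return json.loads(reply)

# Example with a canned reply standing in for a real model call.
fake_llm = lambda prompt: '{"object": "egg", "state_before": "whole", "state_after": "cracked"}'
print(extract_states("crack the egg into the bowl", fake_llm))
```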
Every hand-object interaction begins with contact. Although predicting the contact state between hands and objects is useful for understanding interactions, prior methods on hand-object analysis have assumed that the interacting hands and objects are already known, and contact states were not studied in detail. In this study, we introduce a video-based method for predicting contact between a hand and an object. Specifically, given a video and a pair of hand and object tracks, we predict a binary contact state (contact or no-contact) for each frame. However, annotating a large number of tracks with contact labels is costly. To overcome this difficulty,...
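The per-frame contact prediction task could be instantiated roughly as below: per-frame hand and object track features pass through a small temporal model that outputs a contact logit for each frame. The feature dimensions and the use of a GRU are illustrative assumptions, not the paper's architecture.

```python
# Sketch of per-frame hand-object contact prediction: given per-frame
# features and boxes of a hand track and an object track, output a
# binary contact / no-contact score for every frame.
import torch
import torch.nn as nn

class ContactPredictor(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # Per-frame input: hand feature, object feature, and both boxes (4 + 4).
        self.temporal = nn.GRU(2 * feat_dim + 8, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, hand_feat, obj_feat, hand_box, obj_box):
        # all inputs: (B, T, .); output: (B, T) contact logits
        x = torch.cat([hand_feat, obj_feat, hand_box, obj_box], dim=-1)
        h, _ = self.temporal(x)
        return self.head(h).squeeze(-1)

model = ContactPredictor()
logits = model(torch.rand(2, 30, 256), torch.rand(2, 30, 256),
               torch.rand(2, 30, 4), torch.rand(2, 30, 4))
print(logits.shape)  # torch.Size([2, 30])
```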
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), egocentric benchmarks are restricted due to data scarcity. To overcome the limited availability, transferring knowledge from abundant exocentric videos is demanded as a practical approach. However, learning the correspondence between the two views is difficult due to their dynamic view changes. The...