- Advanced Vision and Imaging
- Advanced Image and Video Retrieval Techniques
- Video Analysis and Summarization
- Image Retrieval and Classification Techniques
- Image Processing Techniques and Applications
- CCD and CMOS Imaging Sensors
- Advanced Image Processing Techniques
- Computer Graphics and Visualization Techniques
- Video Surveillance and Tracking Methods
- Advanced Data Compression Techniques
- Infrared Target Detection Methodologies
- Nutritional Studies and Diet
- Image Enhancement Techniques
- 3D Shape Modeling and Analysis
- Human Pose and Action Recognition
- Image and Signal Denoising Methods
- Multimodal Machine Learning Applications
- Handwritten Text Recognition Techniques
- Robotics and Sensor-Based Localization
- Visual Attention and Saliency Detection
- Human Motion and Animation
- Music and Audio Processing
- Video Coding and Compression Technologies
- Face Recognition and Analysis
- Advanced Chemical Sensor Technologies
The University of Tokyo
2016-2025
Bunkyo University
2002-2025
Tokyo University of Information Sciences
2014-2024
Universidad Europea
2024
University of Tokyo Hospital
2023
Hitachi (Japan)
2020
University of Liverpool
2015
National Institute of Informatics
2015
Ube Frontier University
2002-2008
Shinshu University
2005
Deep neural networks (DNNs) trained on large-scale datasets have exhibited significant performance in image classification. Many large-scale datasets are collected from websites; however, they tend to contain inaccurate labels, termed noisy labels. Training on such noisy labeled datasets causes performance degradation because DNNs easily overfit to noisy labels. To overcome this problem, we propose a joint optimization framework of learning DNN parameters and estimating true labels. Our framework can correct labels during training by alternating updates of network parameters and labels. We conduct...
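The alternating scheme in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the DNN is replaced by a hypothetical nearest-centroid classifier on 1-D points, and the names `fit_centroids`, `predict`, `joint_optimize`, and the `sharpness` parameter are assumptions made only for illustration. The point is the loop structure: fit the model with labels fixed, then re-estimate the labels with the model fixed.

```python
import math


def fit_centroids(xs, soft_labels):
    """Model step: with labels fixed, fit one centroid per class (weighted mean)."""
    n_classes = len(soft_labels[0])
    centroids = []
    for k in range(n_classes):
        weight = sum(p[k] for p in soft_labels)
        centroids.append(sum(x * p[k] for x, p in zip(xs, soft_labels)) / weight)
    return centroids


def predict(x, centroids, sharpness=10.0):
    """Class probabilities from squared distance to each centroid (toy stand-in for a DNN)."""
    scores = [math.exp(-sharpness * (x - c) ** 2) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


def joint_optimize(xs, noisy_labels, n_classes=2, rounds=3):
    """Alternate between fitting the model and re-estimating the labels."""
    # Start from the (possibly wrong) one-hot labels.
    soft = [[1.0 if k == y else 0.0 for k in range(n_classes)] for y in noisy_labels]
    for _ in range(rounds):
        centroids = fit_centroids(xs, soft)           # update model, labels fixed
        soft = [predict(x, centroids) for x in xs]    # update labels, model fixed
    return [max(range(n_classes), key=lambda k: p[k]) for p in soft]


xs = [0.0, 0.1, 0.2, 0.15, 1.0, 1.1, 0.9, 1.05]
noisy = [0, 0, 0, 1, 1, 1, 1, 1]  # the fourth label is deliberately flipped
corrected = joint_optimize(xs, noisy)
```

In this toy run the flipped label on the point at 0.15 is pulled back to class 0, because the model fitted on the mostly clean labels disagrees with it.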
Can we detect common objects in a variety of image domains without instance-level annotations? In this paper, we present a framework for a novel task, cross-domain weakly supervised object detection, which addresses this question. For this task, we have access to images with instance-level annotations in a source domain (e.g., natural image) and images with image-level annotations in a target domain (e.g., watercolor). In addition, the classes to be detected in the target domain are all or a subset of those in the source domain. Starting from a fully supervised object detector, which is pre-trained on the source domain, we propose a two-step progressive domain adaptation...
In this paper, we apply a convolutional neural network (CNN) to the tasks of detecting and recognizing food images. Because of the wide diversity of types of food, image recognition of food items is generally very difficult. However, deep learning has been shown recently to be a very powerful image recognition technique, and CNN is a state-of-the-art approach to deep learning. We applied CNN to the tasks of food detection and recognition through parameter optimization. We constructed a dataset of the most frequent food items in a publicly available food-logging system, and used it to evaluate recognition performance. CNN showed significantly...
The paper gives an overview of model-based approaches applied to image coding, by looking at image source models. In these model-based schemes, which are different from the various conventional waveform coding methods, 3-D properties of the scenes are taken into consideration. They can achieve very low bit rate image transmission. The 2-D model and 3-D model based approaches are explained. Among them, a 3-D model based method using a 3-D facial model and a 2-D model based method utilizing 2-D deformable triangular patches are described. Works related to 3-D model-based coding of facial images and some of the remaining problems are also described...
This paper presents a robust photometric stereo method that effectively compensates for various non-Lambertian corruptions such as specularities, shadows, and image noise. We construct a constrained sparse regression problem that enforces both Lambertian, rank-3 structure and sparse, additive corruptions. A solution is derived using a hierarchical Bayesian approximation to accurately estimate the surface normals while simultaneously separating the non-Lambertian corruptions. Extensive evaluations are performed that show state-of-the-art...
We have created Manga109, a dataset of a variety of 109 Japanese comic books publicly available for use for academic purposes. This dataset provides numerous comic images but lacks the annotations of elements in the comics that are necessary for use by machine learning algorithms or evaluation of methods. In this paper, we present our ongoing project to build metadata for Manga109. We first define the metadata in terms of frames, texts and characters. We then present our web-based software for efficiently creating the ground truth for these images. In addition, we provide a guideline...
We have investigated the "FoodLog" multimedia food-recording tool, whereby users upload photographs of their meals and a food diary is constructed using image-processing functions such as food-image detection and food-balance estimation. In this paper, following a brief introduction to FoodLog, we propose a Bayesian framework that makes use of personal dietary tendencies to improve both food-image detection and food-balance estimation. The Bayesian framework facilitates incremental learning. It incorporates three personal dietary tendencies that influence food analysis: likelihood, prior distribution, and mealtime...
Since deep learning models have been implemented in many commercial applications, it is important to detect out-of-distribution (OOD) inputs correctly to maintain the performance of the models, ensure the quality of the collected data, and prevent the applications from being used for other-than-intended purposes. In this work, we propose a two-head convolutional neural network (CNN) and maximize the discrepancy between the two classifiers to detect OOD inputs. We train a two-head CNN consisting of one common feature extractor and two classifiers which have different...
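As a rough illustration of the idea in the abstract above, the following sketch scores an input by how much two classifier heads disagree. The abstract does not say how discrepancy is measured, so the L1 distance between the two softmax outputs is used here as an assumption, and `softmax`, `l1_discrepancy`, `is_ood`, and the `threshold` value are hypothetical names and settings chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l1_discrepancy(logits_a, logits_b):
    """L1 distance between the two heads' class distributions (assumed measure)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return sum(abs(a - b) for a, b in zip(pa, pb))


def is_ood(logits_a, logits_b, threshold=1.0):
    """Flag an input as OOD when the heads disagree by more than a chosen threshold."""
    return l1_discrepancy(logits_a, logits_b) > threshold


# Heads that agree give a low score; heads that disagree give a high one.
agree = l1_discrepancy([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
disagree = l1_discrepancy([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])
```

The L1 distance between two probability vectors lies in [0, 2], so in practice a threshold in that range would be tuned on held-out data.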
In this work, travel destinations and business locations are taken as venues. Discovering a venue by a photograph is very important for visual context-aware applications. Unfortunately, few efforts have paid attention to complicated real images such as venue photographs generated by users. Our goal is fine-grained venue discovery from heterogeneous social multimodal data. To this end, we propose a novel deep learning model, category-based deep canonical correlation analysis. Given a photograph as input, this model performs: 1) exact venue search (find the...
This paper presents a photometric stereo method that is purely pixelwise and handles general isotropic surfaces in a stable manner. Following the recently proposed sum-of-lobes representation of the isotropic reflectance function, we constructed a constrained bivariate regression problem where the regression function is approximated by smooth, bivariate Bernstein polynomials. The unknown normal vector was separated from the unknown reflectance function by considering the inverse representation of the image formation process, and then we could accurately compute the unknown surface normals by solving a simple and efficient...
Currently, food image recognition tasks are evaluated against fixed datasets. However, in real-world conditions, there are cases in which the number of samples in each class continues to increase and samples from novel classes appear. In particular, dynamic datasets in which each individual user creates samples and continues the updating process often have content that varies considerably between different users, and the number of samples per person is very limited. A single classifier common to all users cannot handle such dynamic data. Bridging the gap between the laboratory environment and the real world...
Manga, or comics, which are a type of multimodal artwork, have been left behind in the recent trend of deep learning applications because of the lack of a proper dataset. Hence, we built Manga109, a dataset consisting of a variety of 109 Japanese comic books (94 authors and 21,142 pages) and made it publicly available by obtaining author permissions for academic use. We carefully annotated the frames, speech texts, character faces, and character bodies; the total number of annotations exceeds 500 k. This dataset provides numerous manga images...
The scene text recognition (STR) task has a common practice: all state-of-the-art STR models are trained on large synthetic data. In contrast to this practice, training STR models only on fewer real labels (STR with fewer labels) is important when we have to train STR models without synthetic data: for handwritten or artistic texts that are difficult to generate synthetically and for languages other than English for which we do not always have synthetic data. However, there has been implicit common knowledge that training STR models on real data is nearly impossible because real data is insufficient. We consider that this knowledge has obstructed the study of...
This paper proposes new methods for analyzing image sequences and updating textures of the three-dimensional (3-D) facial model. It also describes a method for synthesizing various facial expressions. These three methods are the key technologies for the model-based image coding system. The input image analysis technique directly and robustly estimates the 3-D head motions and the facial expressions without any two-dimensional (2-D) entity correspondences. This technique resolves the 2-D correspondence mismatch errors and provides quality reproduction of the original images by fully...
In this paper, we present continuous capture of our life log with various sensors plus additional data and propose effective retrieval methods using this context and content. Our life log system contains video, audio, acceleration sensor, gyro, GPS, annotations, documents, web pages, and emails. In our previous studies, we showed our retrieval methodology [8], [9], which mainly depends on context information from sensor data. In this paper, we extend our methodology with additional functions. They are (1) spatio-temporal sampling for extraction of key frames for summarization; (2) conversation scene...
In this paper, we propose a novel method that combines monocular visual simultaneous localization and mapping (vSLAM) and deep-learning-based semantic segmentation. For stable operation, vSLAM requires feature points on static objects. In conventional vSLAM, random sample consensus (RANSAC) [5] is used to select those feature points. However, if a major portion of the view is occupied by moving objects, many feature points become inappropriate and RANSAC does not perform well. Based on our empirical studies, feature points in the sky and on cars often cause...
FoodLog is a multimedia food-recording tool that offers a novel method for recording daily food intake, primarily for healthcare purposes. Its use of image-processing techniques presents significant potential for the development of new monitoring apps.
This work presents methods to automatically find optimal parameter settings for convolutional neural networks (CNNs) by using an evolutionary algorithm called particle swarm optimization (PSO). Even though the parameter space is extremely large (> 10^20), we experimentally show that a better parameter setting can be found for the Alexnet configuration for five different image datasets. We have also developed two candidate...
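For readers unfamiliar with PSO, a minimal textbook-style sketch follows. It is not the paper's method: the CNN validation error is replaced by a hypothetical two-parameter quadratic (`toy_validation_error`), and the swarm size, inertia weight `w`, and acceleration coefficients `c1` and `c2` are common default-style values assumed here for illustration.

```python
import random


def pso(objective, bounds, n_particles=20, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_error(params):
    """Hypothetical stand-in for a CNN's validation error; minimum at (0.3, 0.7)."""
    a, b = params
    return (a - 0.3) ** 2 + (b - 0.7) ** 2


best_params, best_error = pso(toy_validation_error, bounds=[(0.0, 1.0), (0.0, 1.0)])
```

In a real hyperparameter search each call to the objective would train and evaluate a network, so the swarm size and iteration count would be chosen with that cost in mind.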
We introduce Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for local deep features encoding. The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, by using the similarity and spatio-temporal information. For each assignment we build a specific encoding, focused on the nature of deep features, with the goal to capture the highest...
End-to-end distance metric learning (DML) has been applied to obtain features useful in many computer vision tasks. However, these DML studies have not provided equitable comparisons between features extracted from DML-based networks and softmax-based networks. In this paper, we present objective comparisons between these two approaches under the same network architecture.
The Japanese comic format known as Manga is popular all over the world. It is traditionally produced in black and white, and colorization is time consuming and costly. Automatic colorization methods generally rely on greyscale values, which are not present in manga. Furthermore, due to copyright protection, colorized manga available for training is scarce. We propose a manga colorization method based on conditional Generative Adversarial Networks (cGAN). Unlike previous cGAN approaches that use many hundreds or thousands of training images, our method requires...
Weakly supervised object detection (WSOD), where a detector is trained with only image-level annotations, is attracting more and more attention. As a method to obtain a well-performing detector, the detector and the instance labels are updated iteratively. In this study, for more efficient iterative updating, we focus on the instance labeling problem, a problem of which label should be annotated to each region based on the last localization result. Instead of simply labeling the top-scoring region and its highly overlapping regions as positive and others as negative, we propose...