- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Digital Media Forensic Detection
- Biomedical Text Mining and Ontologies
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Autonomous Vehicle Technology and Safety
- Advanced Graph Neural Networks
- Digital Imaging for Blood Diseases
- AI in Cancer Detection
- Radiomics and Machine Learning in Medical Imaging
- Machine Learning in Bioinformatics
- Medical Imaging and Analysis
- Automated Road and Building Extraction
- Biometric Identification and Security
- Infrastructure Maintenance and Monitoring
- Cell Image Analysis Techniques
- Speech Recognition and Synthesis
- Hand Gesture Recognition Systems
- Speech and Audio Processing
- Bioinformatics and Genomic Networks
Nanyang Technological University
2023-2024
National University of Singapore
2022-2023
Indiana University Bloomington
2017-2022
Indiana University
2017-2021
Preferred Networks (Japan)
2017
Yokohama City University
2017
Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency on downstream tasks. In this work, we for the first time introduce an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified...
Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology to richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical...
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic inputs. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring the question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a model recently published in Science by Vong...
Concept Bottleneck Models (CBMs) aim to enhance interpretability by predicting human-understandable concepts as intermediates for decision-making. However, these models often face challenges in ensuring reliable concept representations, which can propagate to downstream tasks and undermine robustness, especially under distribution shifts. Two inherent issues contribute to concept unreliability: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency for the same...
A key problem in the automatic analysis and understanding of scientific papers is extracting semantic information from non-textual paper components like figures, diagrams, and tables. Much of this work requires a crucial first preprocessing step: decomposing compound multi-part figures into individual sub-figures. Previous work on figure separation has been based on manually designed features and rules, which often fail for less common figure types and layouts. Moreover, few implementations of compound figure decomposition are publicly...
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual datasets are mainly focused on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD into...
One-shot fine-grained visual recognition often suffers from the problem of training data scarcity for new classes. To alleviate this problem, an off-the-shelf image generator can be applied to synthesize additional training images, but these synthesized images are not necessarily helpful for actually improving the accuracy of one-shot recognition. This paper proposes a meta-learning framework to combine generated images with original images, so that the resulting "hybrid" training images improve one-shot learning. Specifically, the generic image generator is updated by a few training instances...
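A minimal sketch of the "hybrid" image idea described above, assuming a simple per-image mixing weight (the paper's meta-learned weighting is more involved; `hybrid_images` and its signature are illustrative, not the authors' API):

```python
import numpy as np

# Hedged sketch: fuse a generated image with its original via a mixing weight w,
# so that the resulting "hybrid" image can be used as extra training data.
def hybrid_images(original, generated, w):
    # w in [0, 1]: weight on the generated image; in the paper this kind of
    # combination is tuned by meta-learning so the mix actually helps training.
    return w * generated + (1.0 - w) * original

orig = np.full((2, 2), 0.8)  # toy "original" image
gen = np.full((2, 2), 0.2)   # toy "generated" image
mixed = hybrid_images(orig, gen, w=0.25)  # stays closer to the original
```

The key design point is that raw generated images are used only after being blended back toward real data, rather than being added to the training set as-is.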
Identifying "free-space," or safely driveable regions in the scene ahead, is a fundamental task for autonomous navigation. While this task can be addressed using semantic segmentation, the manual labor involved in creating pixel-wise annotations to train the segmentation model is very costly. Although weakly supervised segmentation addresses this issue, most methods are not designed for free-space. In this paper, we observe that homogeneous texture and location are two key characteristics of free-space, and develop a novel, practical framework...
Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and Japanese captions, and show that a typical neural...
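The artificial-token mechanism above can be sketched in a few lines: a single shared decoder is steered toward a target language by prepending a special token to its input sequence, instead of training one model per language. The token names `<2en>`/`<2ja>` and the helper below are assumptions for illustration, not the paper's exact vocabulary:

```python
# Hypothetical sketch of language control via artificial tokens: the same
# captioning model emits English or Japanese depending on a prepended token.
LANG_TOKENS = {"en": "<2en>", "ja": "<2ja>"}

def make_decoder_input(caption_tokens, language):
    # Prepend the language-control token; the shared model learns to condition
    # its output language on this single extra symbol.
    return [LANG_TOKENS[language]] + caption_tokens

english_input = make_decoder_input(["a", "dog", "runs"], "en")
japanese_input = make_decoder_input(["a", "dog", "runs"], "ja")
```

Because the control signal lives in the vocabulary, no per-language parameters or separate decoders are needed, which is what makes the unified model compact.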
One-shot fine-grained visual recognition often suffers from the problem of having few training examples for new classes. To alleviate this problem, off-the-shelf image generation techniques based on Generative Adversarial Networks (GANs) can potentially create additional training images. However, these GAN-generated images are not always helpful for actually improving the accuracy of one-shot recognition. In this paper, we propose a meta-learning framework to combine generated images with original images, so that the resulting...
Deepfake videos are becoming increasingly realistic, showing subtle tampering traces on facial areas that vary between frames. Consequently, existing detection methods struggle to detect unknown-domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel model that can both recognize and localize Deepfake videos. Our method consists of two stages, named recovering and localization. In the recovering stage, the model randomly masks regions of interest (ROIs) and reconstructs real faces without...
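The random ROI masking in the recovering stage can be illustrated with a toy sketch. The region names and `mask_rois` helper below are assumptions for illustration, not Delocate's actual interface:

```python
import random

# Illustrative sketch: hide a random subset of facial regions so that a
# reconstruction network must inpaint them from the surrounding real face.
FACE_ROIS = ["left_eye", "right_eye", "nose", "mouth"]

def mask_rois(frame_rois, rng, n_mask=2):
    # Pick n_mask regions at random and replace their content with None
    # (standing in for a masked-out patch).
    hidden = set(rng.sample(FACE_ROIS, n_mask))
    return {name: (None if name in hidden else patch)
            for name, patch in frame_rois.items()}

rng = random.Random(0)
frame = {name: f"{name}_pixels" for name in FACE_ROIS}
masked = mask_rois(frame, rng)
```

The intuition is that a model trained to restore masked real faces learns what consistent, untampered facial regions look like, so tampered regions stand out at localization time.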
We present an approach for road segmentation that only requires image-level annotations at training time. We leverage distant supervision, which allows us to train our model using images that are different from the target domain. Using large publicly available image databases as distant supervisors, we develop a simple method to automatically generate weak pixel-wise masks. These are used to iteratively train a fully convolutional neural network, which produces our final model. We evaluate our method on the Cityscapes dataset, where we compare it with...
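A toy sketch of the weak-mask idea under stated assumptions: here the weak supervision is approximated by a per-pixel location prior averaged over a database of rough road masks (the helper names and the thresholding step are illustrative, not the paper's implementation):

```python
import numpy as np

# Toy sketch of distant supervision: road pixels tend to occupy a consistent
# image region, so averaging rough masks from a database yields a per-pixel
# "road frequency" prior that can be thresholded into a weak pixel-wise mask.
def location_prior(masks):
    # masks: (N, H, W) binary array of rough road annotations
    return masks.mean(axis=0)

def weak_mask(prior, threshold=0.5):
    return (prior >= threshold).astype(np.uint8)

db = np.zeros((4, 6, 6))
db[:, 4:, 1:5] = 1  # toy database: road in the lower-central part of each image
mask = weak_mask(location_prior(db))
```

Such weak masks are noisy, which is why the abstract describes training the segmentation network iteratively rather than in a single pass.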
Deepfake techniques have been widely used for malicious purposes, prompting extensive research interest in developing detection methods. Deepfake manipulations typically involve tampering with facial parts, which can result in inconsistencies across different parts of the face. For instance, a manipulation may change smiling lips to an upset lip, while the eyes remain smiling. Existing detection methods depend on specific indicators of forgery, which tend to disappear as forgery patterns are improved. To address this limitation, we propose Mover, a...
Recognizing the types of white blood cells (WBCs) in microscopic images of human blood smears is a fundamental task in the fields of pathology and hematology. Although previous studies have made significant contributions to the development of methods and datasets, few papers have investigated benchmarks or baselines that others can easily refer to. For instance, we observed notable variations in the reported accuracies of the same Convolutional Neural Network (CNN) model across different studies, yet no public implementation exists...
Human infants have the remarkable ability to learn associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the abilities of those models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study uses egocentric video and gaze data collected from infant learners during...
Rendering scenes with a high-quality human face from arbitrary viewpoints is a practical and useful technique for many real-world applications. Recently, Neural Radiance Fields (NeRF), a rendering technique that uses neural networks to approximate classical ray tracing, has been considered one of the most promising approaches for synthesizing novel views from a sparse set of images. We find that NeRF can render new views while maintaining geometric consistency, but it does not properly maintain skin details, such as moles and pores....
Text-based Visual Question Answering (Text-VQA) is a question-answering task to understand scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, text from OCR systems often includes spelling errors, such as "pepsi" being recognized as "peosi". These errors are one of the major challenges for Text-VQA. To address this, we propose a novel method to alleviate them via token evolution. First, we artificially create misspelled tokens at training time, to make the system more robust...
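The first step described above, artificially creating misspelled tokens at training time, can be sketched as a simple character-substitution augmentation. The function name, substitution rule, and probability below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def perturb_token(token, rng, p=0.3):
    """Hypothetical OCR-noise augmentation: with probability p, substitute one
    character to mimic recognition errors like "pepsi" -> "peosi"."""
    if len(token) < 2 or rng.random() > p:
        return token
    i = rng.randrange(len(token))
    return token[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + token[i + 1:]

rng = random.Random(0)
augmented = [perturb_token(t, rng) for t in ["pepsi", "coca", "cola"]]
```

Training on such perturbed tokens exposes the answering model to OCR-like noise, so spelling errors at test time are less likely to derail it.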
Recognizing people by their faces and other biometrics has been extensively studied in computer vision. But these techniques do not work for identifying the wearer of an egocentric (first-person) camera, because that person rarely (if ever) appears in their own first-person view. But while one's own face is rarely visible, one's hands are: in fact, hands are among the most common objects in one's own field of view. It is thus natural to ask whether the appearance and motion patterns of people's hands are distinctive enough to recognize them. In this paper, we...
Inspired by the remarkable ability of the infant visual learning system, a recent study collected first-person images from children to analyze the 'training data' that they receive. We conduct a follow-up study that investigates two additional directions. First, given that infants can quickly learn to recognize a new object without much supervision (i.e. few-shot learning), we limit the number of training images. Second, we investigate how children control the training signals they receive during learning, based on hand manipulation of objects. Our experimental results...