- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Digital Media Forensic Detection
- Biomedical Text Mining and Ontologies
- Video Surveillance and Tracking Methods
- Anomaly Detection Techniques and Applications
- Autonomous Vehicle Technology and Safety
- Advanced Graph Neural Networks
- Digital Imaging for Blood Diseases
- AI in Cancer Detection
- Radiomics and Machine Learning in Medical Imaging
- Machine Learning in Bioinformatics
- Medical Imaging and Analysis
- Automated Road and Building Extraction
- Biometric Identification and Security
- Infrastructure Maintenance and Monitoring
- Cell Image Analysis Techniques
- Speech Recognition and Synthesis
- Hand Gesture Recognition Systems
- Speech and Audio Processing
- Bioinformatics and Genomic Networks
Nanyang Technological University
2023-2024
National University of Singapore
2022-2023
Indiana University Bloomington
2017-2022
Indiana University
2017-2021
Preferred Networks (Japan)
2017
Yokohama City University
2017
Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency on downstream tasks. In this work, we for the first time introduce an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified...
Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology to richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical...
Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic inputs. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring the question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a model recently published in Science by Vong...
Concept Bottleneck Models (CBMs) aim to enhance interpretability by predicting human-understandable concepts as intermediates for decision-making. However, these models often face challenges in ensuring reliable concept representations, which can propagate to downstream tasks and undermine robustness, especially under distribution shifts. Two inherent issues contribute to concept unreliability: sensitivity to concept-irrelevant features (e.g., background variations) and lack of semantic consistency for the same...
A key problem in the automatic analysis and understanding of scientific papers is extracting semantic information from non-textual paper components like figures, diagrams, and tables. Much of this work requires a crucial first preprocessing step: decomposing compound multi-part figures into individual sub-figures. Previous work on figure separation has been based on manually designed features and rules, which often fail for less common figure types and layouts. Moreover, few implementations of compound figure decomposition are publicly...
Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual datasets are mainly focused on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD into...
One-shot fine-grained visual recognition often suffers from the problem of training data scarcity for new classes. To alleviate this problem, an off-the-shelf image generator can be applied to synthesize additional training images, but these synthesized images are not necessarily helpful for actually improving the accuracy of one-shot recognition. This paper proposes a meta-learning framework to combine generated images with original images, so that the resulting "hybrid" training images improve one-shot learning. Specifically, the generic image generator is updated by a few training instances...
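A minimal sketch of the "hybrid" image idea described above, assuming a simple per-image mixing weight (the paper's meta-learned weighting is more involved; `hybrid_images` and its signature are illustrative, not the authors' API):

```python
import numpy as np

# Hedged sketch: fuse a generated image with its original via a mixing weight w,
# so that the resulting "hybrid" image can be used as extra training data.
def hybrid_images(original, generated, w):
    # w in [0, 1]: weight on the generated image; in the paper this kind of
    # combination is tuned by meta-learning so the mix actually helps training.
    return w * generated + (1.0 - w) * original

orig = np.full((2, 2), 0.8)  # toy "original" image
gen = np.full((2, 2), 0.2)   # toy "generated" image
mixed = hybrid_images(orig, gen, w=0.25)  # stays closer to the original
```

The key design point is that raw generated images are used only after being blended back toward real data, rather than being added to the training set as-is.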
Identifying "free-space," or safely driveable regions in the scene ahead, is a fundamental task for autonomous navigation. While this task can be addressed using semantic segmentation, the manual labor involved in creating pixel-wise annotations to train the segmentation model is very costly. Although weakly supervised segmentation addresses this issue, most methods are not designed for free-space. In this paper, we observe that homogeneous texture and location are two key characteristics of free-space, and develop a novel, practical framework...
Recent work in computer vision has yielded impressive results in automatically describing images with natural language. Most of these systems generate captions in a single language, requiring multiple language-specific models to build a multilingual captioning system. We propose a very simple technique to build a single unified model across languages, using artificial tokens to control the language, making the captioning system more compact. We evaluate our approach on generating English and Japanese captions, and show that a typical neural...
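The artificial-token mechanism above can be sketched in a few lines: a single shared decoder is steered toward a target language by prepending a special token to its input sequence, instead of training one model per language. The token names `<2en>`/`<2ja>` and the helper below are assumptions for illustration, not the paper's exact vocabulary:

```python
# Hypothetical sketch of language control via artificial tokens: the same
# captioning model emits English or Japanese depending on a prepended token.
LANG_TOKENS = {"en": "<2en>", "ja": "<2ja>"}

def make_decoder_input(caption_tokens, language):
    # Prepend the language-control token; the shared model learns to condition
    # its output language on this single extra symbol.
    return [LANG_TOKENS[language]] + caption_tokens

english_input = make_decoder_input(["a", "dog", "runs"], "en")
japanese_input = make_decoder_input(["a", "dog", "runs"], "ja")
```

Because the control signal lives in the vocabulary, no per-language parameters or separate decoders are needed, which is what makes the unified model compact.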
One-shot fine-grained visual recognition often suffers from the problem of having few training examples for new classes. To alleviate this problem, off-the-shelf image generation techniques based on Generative Adversarial Networks (GANs) can potentially create additional training images. However, these GAN-generated images are not always helpful for actually improving the accuracy of one-shot recognition. In this paper, we propose a meta-learning framework to combine generated images with original images, so that the resulting...
Deepfake videos are becoming increasingly realistic, showing subtle tampering traces on facial areas that vary between frames. Consequently, existing detection methods struggle to detect unknown-domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel model that can both recognize and localize Deepfake videos. Our method consists of two stages, named recovering and localization. In the recovering stage, the model randomly masks regions of interest (ROIs) and reconstructs real faces without...
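The random ROI masking in the recovering stage can be illustrated with a toy sketch. The region names and `mask_rois` helper below are assumptions for illustration, not Delocate's actual interface:

```python
import random

# Illustrative sketch: hide a random subset of facial regions so that a
# reconstruction network must inpaint them from the surrounding real face.
FACE_ROIS = ["left_eye", "right_eye", "nose", "mouth"]

def mask_rois(frame_rois, rng, n_mask=2):
    # Pick n_mask regions at random and replace their content with None
    # (standing in for a masked-out patch).
    hidden = set(rng.sample(FACE_ROIS, n_mask))
    return {name: (None if name in hidden else patch)
            for name, patch in frame_rois.items()}

rng = random.Random(0)
frame = {name: f"{name}_pixels" for name in FACE_ROIS}
masked = mask_rois(frame, rng)
```

The intuition is that a model trained to restore masked real faces learns what consistent, untampered facial regions look like, so tampered regions stand out at localization time.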
We present an approach for road segmentation that only requires image-level annotations at training time. We leverage distant supervision, which allows us to train our model using images that are different from the target domain. Using large publicly available image databases as distant supervisors, we develop a simple method to automatically generate weak pixel-wise masks. These are used to iteratively train a fully convolutional neural network, which produces our final model. We evaluate our method on the Cityscapes dataset, where we compare it with...
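A toy sketch of the weak-mask idea under stated assumptions: here the weak supervision is approximated by a per-pixel location prior averaged over a database of rough road masks (the helper names and the thresholding step are illustrative, not the paper's implementation):

```python
import numpy as np

# Toy sketch of distant supervision: road pixels tend to occupy a consistent
# image region, so averaging rough masks from a database yields a per-pixel
# "road frequency" prior that can be thresholded into a weak pixel-wise mask.
def location_prior(masks):
    # masks: (N, H, W) binary array of rough road annotations
    return masks.mean(axis=0)

def weak_mask(prior, threshold=0.5):
    return (prior >= threshold).astype(np.uint8)

db = np.zeros((4, 6, 6))
db[:, 4:, 1:5] = 1  # toy database: road in the lower-central part of each image
mask = weak_mask(location_prior(db))
```

Such weak masks are noisy, which is why the abstract describes training the segmentation network iteratively rather than in a single pass.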
Deepfake techniques have been widely used for malicious purposes, prompting extensive research interest in developing detection methods. Deepfake manipulations typically involve tampering with facial parts, which can result in inconsistencies across different parts of the face. For instance, a manipulation may change smiling lips to an upset lip, while the eyes remain smiling. Existing detection methods depend on specific indicators of forgery, which tend to disappear as forgery patterns are improved. To address this limitation, we propose Mover, a...
Recognizing the types of white blood cells (WBCs) in microscopic images of human blood smears is a fundamental task in the fields of pathology and hematology. Although previous studies have made significant contributions to the development of methods and datasets, few papers have investigated benchmarks or baselines that others can easily refer to. For instance, we observed notable variations in the reported accuracies of the same Convolutional Neural Network (CNN) model across different studies, yet no public implementation exists...
Human infants have the remarkable ability to learn associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the abilities of those models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study uses egocentric video and gaze data collected from infant learners during...
Rendering scenes with a high-quality human face from arbitrary viewpoints is a practical and useful technique for many real-world applications. Recently, Neural Radiance Fields (NeRF), a rendering technique that uses neural networks to approximate classical ray tracing, has been considered one of the most promising approaches for synthesizing novel views from a sparse set of images. We find that NeRF can render new views while maintaining geometric consistency, but it does not properly maintain skin details, such as moles and pores....
Text-based Visual Question Answering (Text-VQA) is a question-answering task to understand scene text, where the text is usually recognized by Optical Character Recognition (OCR) systems. However, text from OCR systems often includes spelling errors, such as "pepsi" being recognized as "peosi". These errors are one of the major challenges for Text-VQA. To address this, we propose a novel method to alleviate them via token evolution. First, we artificially create misspelled tokens at training time, to make the system more robust...
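The first step described above, artificially creating misspelled tokens at training time, can be sketched as a simple character-substitution augmentation. The function name, substitution rule, and probability below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def perturb_token(token, rng, p=0.3):
    """Hypothetical OCR-noise augmentation: with probability p, substitute one
    character to mimic recognition errors like "pepsi" -> "peosi"."""
    if len(token) < 2 or rng.random() > p:
        return token
    i = rng.randrange(len(token))
    return token[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + token[i + 1:]

rng = random.Random(0)
augmented = [perturb_token(t, rng) for t in ["pepsi", "coca", "cola"]]
```

Training on such perturbed tokens exposes the answering model to OCR-like noise, so spelling errors at test time are less likely to derail it.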
Recognizing people by their faces and other biometrics has been extensively studied in computer vision. But these techniques do not work for identifying the wearer of an egocentric (first-person) camera, because that person rarely (if ever) appears in their own first-person view. But while one's own face is rarely visible, one's hands are: in fact, hands are among the most common objects in one's own field of view. It is thus natural to ask whether the appearance and motion patterns of people's hands are distinctive enough to recognize them. In this paper, we...
Inspired by the remarkable ability of the infant visual learning system, a recent study collected first-person images from children to analyze the 'training data' that they receive. We conduct a follow-up study that investigates two additional directions. First, given that infants can quickly learn to recognize a new object without much supervision (i.e. few-shot learning), we limit the number of training images. Second, we investigate how children control the training signals they receive during learning, based on hand manipulation of objects. Our experimental results...