- Domain Adaptation and Few-Shot Learning
- Advanced Image Processing Techniques
- Image and Signal Denoising Methods
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Image Enhancement Techniques
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Sensor and Energy Harvesting Materials
- Image Processing Techniques and Applications
- Speech and Audio Processing
- Tactile and Sensory Interactions
- Image and Video Quality Assessment
- Music Technology and Sound Studies
- Smart Grid Energy Management
- Video Surveillance and Tracking Methods
- Vestibular and Auditory Disorders
- Advanced Vision and Imaging
- Adversarial Robustness in Machine Learning
- Music and Audio Processing
- IoT-based Smart Home Systems
- Integrated Circuits and Semiconductor Failure Analysis
- Bluetooth and Wireless Communication Technologies
- Photoacoustic and Ultrasonic Imaging
- Robot Manipulation and Learning
Korea Advanced Institute of Science and Technology
2018-2024
Kootenay Association for Science & Technology
2022
Dong-A University
2016
This paper considers a network referred to as Modality Shifting Attention Network (MSAN) for the Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of the temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The modality required for temporal localization may be different from that for answer prediction, and this ability to shift modality is essential for performing the task. To this end, MSAN is based on a moment proposal network (MPN) that attempts to locate the most appropriate temporal moment from each of the modalities, and also...
Diffusion models have gained significant popularity for image-to-image translation tasks. Previous efforts applying diffusion to image super-resolution demonstrated that iteratively refining pure Gaussian noise using a U-Net architecture trained on denoising at various levels can yield satisfactory high-resolution images from low-resolution inputs. However, this iterative refinement process comes with the drawback of low inference speed, which strongly limits its applications. To speed up...
This paper considers an architecture referred to as Cascade Region Proposal Network (Cascade RPN) for improving the region-proposal quality and detection performance by systematically addressing the limitation of the conventional RPN, which heuristically defines the anchors and aligns the features to the anchors. First, instead of using multiple anchors with predefined scales and aspect ratios, Cascade RPN relies on a single anchor per location and performs multi-stage refinement. Each stage is progressively more...
Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation to generate high-quality images. Prior attempts at applying the DPM to image super-resolution (SR) have shown that iteratively refining pure Gaussian noise with a conditional image, using a U-Net trained on denoising at various noise levels, can help obtain a satisfactory high-resolution image for a low-resolution one. To further improve the performance and simplify current DPM-based methods, we propose a simple but non-trivial post-process...
Contrastive learning (CL) is widely known to require many negative samples, 65536 in MoCo for instance, for which the performance of a dictionary-free framework is often inferior because the negative sample size (NSS) is limited by its mini-batch size (MBS). To decouple the NSS from the MBS, a dynamic dictionary has been adopted in a large volume of CL frameworks, among which arguably the most popular one is the MoCo family. In essence, MoCo adopts a momentum-based queue dictionary, for which we perform a fine-grained analysis of its size and consistency. We point out that InfoNCE loss...
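The InfoNCE loss named in the abstract above can be illustrated with a minimal sketch. This is the standard textbook form for a single anchor, not the paper's analysis or proposed variant; the function name `info_nce`, the default temperature, and the assumption that all vectors are already L2-normalized are illustrative choices.

```python
import math


def info_nce(query, positive, negatives, temperature=0.2):
    """Standard InfoNCE loss for one anchor (illustrative sketch).

    All vectors are assumed to be L2-normalized lists of floats, so the
    dot product equals the cosine similarity. The temperature default is
    an assumption, not a value taken from the abstract.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Logit 0 is the positive pair; the rest are negatives (e.g. a queue).
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))

    # Cross-entropy with the positive as the target class.
    return log_sum - logits[0]
```

In a MoCo-style setup the `negatives` list would be drawn from the queue dictionary, which is how the negative sample size is decoupled from the mini-batch size.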
This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that...
Actual image super-resolution is an extremely challenging task due to the complex degradations existing in the image. To solve this problem, two dominant methodologies have emerged: degradation-estimation-based and blind-based methods. The former often struggle to accurately estimate the degradation, limiting their effectiveness on real low-resolution images. Conversely, blind-based methods rely on a single perceptual perspective,...
We propose E-MD3C (Efficient Masked Diffusion Transformer with Disentangled Conditions and Compact Collector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates...
To avoid collapse in self-supervised learning (SSL), a contrastive loss is widely used but often requires a large number of negative samples. Without negative samples yet achieving competitive performance, a recent work has attracted significant attention for providing a minimalist simple Siamese (SimSiam) method to avoid collapse. However, the reason why it avoids collapse without negative samples remains not fully clear, and our investigation starts by revisiting the explanatory claims in the original SimSiam. After refuting their claims, we...
Data augmentation can impact the generalization performance of an image classification model in a significant way. However, it is currently conducted on the basis of trial and error, and its impact on the generalization performance cannot be predicted during training. This paper considers an influence function that predicts how the generalization performance, in terms of validation loss, is affected by a particular augmented training sample. The influence function provides an approximation of the change in validation loss without actually comparing the performances that include and exclude the sample in the training process. Based on this function,...
Self-supervised learning (SSL) has gained remarkable success, for which contrastive learning (CL) plays a key role. However, the recent development of new non-CL frameworks has achieved comparable or better performance with high improvement potential, prompting researchers to enhance these frameworks further. Assimilating CL into non-CL frameworks has been thought to be beneficial, but empirical evidence indicates no visible improvements. In view of that, this paper proposes a strategy of performing CL along the dimensional direction instead of along the batch direction as...
Model agnostic meta-learning (MAML) is a popular state-of-the-art meta-learning algorithm that provides a good weight initialization of a model given a variety of learning tasks. The model initialized by the provided weight can be fine-tuned to an unseen task despite only using a small amount of samples and within a few adaptation steps. MAML is simple and versatile but requires costly learning rate tuning and careful design of the task distribution, which affects its scalability and generalization. This paper proposes a more robust MAML based on an adaptive learning scheme and a prioritization...
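The inner adaptation step and outer meta-update that plain MAML performs can be sketched on a toy problem. This is a minimal illustration of vanilla MAML only, not the robust variant the abstract proposes; the one-parameter quadratic task family, the function name `maml_1d`, and the learning rates are all assumptions chosen so the gradients can be written by hand.

```python
def maml_1d(task_targets, inner_lr=0.1, outer_lr=0.1, steps=200, w=0.0):
    """Vanilla MAML on toy tasks with loss_a(w) = (w - a) ** 2.

    Each task is defined by its target a (hypothetical toy setup).
    One inner gradient step adapts w to the task; the outer step
    differentiates the post-adaptation loss with respect to the
    initial w, including the second-order term.
    """
    for _ in range(steps):
        meta_grad = 0.0
        for a in task_targets:
            # Inner loop: one gradient step on the task loss.
            adapted = w - inner_lr * 2.0 * (w - a)
            # Outer gradient via the chain rule:
            # d loss_a(adapted) / dw = 2 (adapted - a) * d(adapted)/dw,
            # where d(adapted)/dw = 1 - 2 * inner_lr for this loss.
            meta_grad += 2.0 * (adapted - a) * (1.0 - 2.0 * inner_lr)
        # Meta-update of the shared initialization.
        w -= outer_lr * meta_grad / len(task_targets)
    return w
```

For this symmetric toy family the learned initialization settles at the mean of the task targets, the point from which one inner step reaches every task equally well. Both learning rates are fixed by hand here, which is the kind of tuning cost the abstract refers to.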
In article number 1904020, Keon Jae Lee and co-workers review recent developments in speech recognition in terms of flexible piezoelectric materials, self-powered sensors, machine-learning algorithms, and speaker recognition. Such systems will act as an innovative interface for artificial intelligence services.
Herein, we introduce "Look and Diagnose" (LAD), a hybrid deep learning-based system that aims to support doctors in the medical field in effectively diagnosing Benign Paroxysmal Positional Vertigo (BPPV) disorder. Given the body postures of the patient in the Dix-Hallpike and lateral head turn tests, visual information of both eyes is captured and fed into LAD for analyzing and classifying into one of six possible disorders which...
Self-supervised learning (SSL) has emerged as a promising approach for learning representations from unlabeled data. Momentum-based contrastive frameworks such as MoCo-v3 have shown remarkable success among the many SSL methods proposed in recent years. However, a significant gap in encoder representation exists between the online encoder (student) and the momentum encoder (teacher) in these frameworks, limiting the performance on downstream tasks. We identify this gap as a bottleneck often overlooked in existing frameworks and propose "residual momentum" that...
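For context, the student and teacher encoders in momentum-based frameworks are usually linked by an exponential moving average of the weights. The sketch below shows only that standard update, not the "residual momentum" method the abstract proposes; the function name `momentum_update`, the flat-list parameter representation, and the default coefficient are illustrative assumptions.

```python
def momentum_update(teacher, student, m=0.99):
    """Standard EMA update of momentum-encoder (teacher) weights.

    `teacher` and `student` are flat lists of parameter values; this
    flat representation is a simplification for illustration. Returns
    the new teacher parameters and leaves both inputs unchanged.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    if not 0.0 <= m <= 1.0:
        raise ValueError("momentum coefficient must lie in [0, 1]")
    # teacher <- m * teacher + (1 - m) * student, element by element.
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]
```

Because the teacher only moves a fraction `1 - m` toward the student at each step, its weights lag behind the student's while the student keeps training.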
Non-intrusive load monitoring (NILM) is a method to estimate and disaggregate information about the power consumption of individual electric appliances in a building or home from aggregate measurements of voltage and current. NILM is a key component of an energy monitoring system and helps reduce the overall energy consumption of a building. This paper proposes to extract the characteristic fingerprint of the root-mean-square (RMS) current of household appliances and discusses the feasibility of classifying as well as identifying the appliance with an expert system. The proposed...
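The RMS current on which the fingerprint above is based can be computed from sampled current as follows. This is a minimal sketch of the RMS definition only; the function names, the per-cycle windowing, and the sampling setup are assumptions, and the paper's actual feature extraction and expert system are not reproduced here.

```python
import math


def rms(samples):
    """Root-mean-square value of a non-empty sequence of current samples."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_per_cycle(samples, samples_per_cycle):
    """RMS of each complete mains cycle (hypothetical windowing choice).

    Trailing samples that do not fill a whole cycle are ignored.
    """
    if samples_per_cycle <= 0:
        raise ValueError("samples_per_cycle must be positive")
    full_cycles = len(samples) // samples_per_cycle
    return [
        rms(samples[i * samples_per_cycle:(i + 1) * samples_per_cycle])
        for i in range(full_cycles)
    ]
```

For a pure sine wave of amplitude A, each per-cycle value equals A divided by the square root of 2, so a change in the per-cycle RMS sequence indicates a change in the aggregate load.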
This paper proposes a combination of a cognitive agent architecture named Soar (State, operator, and result) and ROS (Robot Operating System), which can be a basic framework for a robot to interact and cope with its environment more intelligently and appropriately. The proposed Soar-ROS human-robot interaction (HRI) agent understands a set of human commands by voice recognition and chooses how to properly react to the command according to the symbol detected by image recognition, implemented on a humanoid robot. The robotic agent is allowed to refuse...
We present X-MDPT (Cross-view Masked Diffusion Prediction Transformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly-used Unet structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation...
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers,...