- Recommender Systems and Techniques
- Multimodal Machine Learning Applications
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Advanced Vision and Imaging
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Advanced Graph Neural Networks
- Music and Audio Processing
- Text and Document Classification Technologies
- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Advanced Bandit Algorithms Research
- Music Technology and Sound Studies
- Interactive and Immersive Displays
- Advanced Neural Network Applications
- Computer Graphics and Visualization Techniques
- Anomaly Detection Techniques and Applications
- Advanced Image Processing Techniques
- Web Data Mining and Analysis
- Advanced Text Analysis Techniques
- Multimedia Communication and Technology
- Diverse Musicological Studies
- Medical Image Segmentation Techniques
Google (United States)
2016-2025
University of Toronto
2025
Seoul National University
2022-2024
Samsung SDS (South Korea)
2022
Samsung (South Korea)
2022
National University
2021-2022
Korea Post
2022
Pohang University of Science and Technology
2022
Gwangju Institute of Science and Technology
2021
Alphabet (United States)
2019
Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, there is no dataset of comparable size for video classification. In this paper, we introduce YouTube-8M, the largest multi-label video classification dataset,...
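A common way to use a dataset like this is to train a multi-label classifier over precomputed video-level features. The sketch below is not tied to the paper; the feature and label dimensions are illustrative placeholders in the rough range of YouTube-8M-style features.

```python
import torch
import torch.nn as nn

# Placeholder sizes: video-level feature dimension and label vocabulary.
FEATURE_DIM, NUM_CLASSES = 1152, 3862

# A single linear layer with per-class sigmoid outputs is a common
# multi-label baseline over precomputed video-level features.
model = nn.Linear(FEATURE_DIM, NUM_CLASSES)
criterion = nn.BCEWithLogitsLoss()  # independent binary decision per label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(32, FEATURE_DIM)                   # stand-in batch of video features
labels = torch.randint(0, 2, (32, NUM_CLASSES)).float()   # multi-hot targets

logits = model(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```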
Although essential to revealing biased performance, well-intentioned attempts at algorithmic auditing can have effects that may harm the very populations these measures are meant to protect. This concern is even more salient when auditing biometric systems such as facial recognition, where the data is sensitive and the technology is often used in ethically questionable manners. We demonstrate a set of five ethical concerns in the particular case of auditing commercial facial processing technology, highlighting additional design considerations...
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches typically assume that the rating matrix (e.g., movie ratings by viewers) is low-rank. In this paper, we examine an alternative approach in which the rating matrix is only locally low-rank. Concretely, we assume it is low-rank within certain neighborhoods of the metric space defined over (user, item) pairs. We combine a recent approach for local low-rank approximation based on the Frobenius norm with...
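To illustrate the local low-rank idea (not the paper's exact estimator), the sketch below contrasts one global truncated SVD with several kernel-weighted local factorizations around randomly chosen anchor (user, item) pairs. The kernel, bandwidth, and anchor selection are simplified assumptions, and the dense synthetic matrix stands in for sparse rating data.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((100, 80))          # stand-in dense rating matrix (real data is sparse)
rank, n_anchors, bandwidth = 5, 3, 0.8

def truncated_svd(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r]

# Global low-rank baseline: one factorization for the whole matrix.
R_global = truncated_svd(R, rank)

# Local low-rank sketch: each anchor (user, item) pair defines a smoothing
# kernel over rows/columns; local models are averaged with those weights.
R_local, weight_sum = np.zeros_like(R), np.zeros_like(R)
for _ in range(n_anchors):
    u, i = rng.integers(R.shape[0]), rng.integers(R.shape[1])
    row_w = np.exp(-np.abs(R - R[u]).mean(axis=1) / bandwidth)       # similarity to anchor user
    col_w = np.exp(-np.abs(R - R[:, [i]]).mean(axis=0) / bandwidth)  # similarity to anchor item
    K = np.outer(row_w, col_w)                  # entry-wise kernel weights
    R_local += K * truncated_svd(K * R, rank)   # weighted local factorization
    weight_sum += K
R_local /= weight_sum
```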
Tracking and predicting extreme events in large-scale spatio-temporal climate data are long-standing challenges in climate science. In this paper, we propose Convolutional LSTM (ConvLSTM)-based models to track and predict hurricane trajectories from large-scale climate data; namely, the pixel-level history of tropical cyclones. To address the tracking problem, we model time-sequential density maps of hurricane trajectories, enabling the model to capture not only the temporal dynamics but also the spatial distribution of the trajectories. Furthermore, we introduce a new...
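A minimal ConvLSTM cell, roughly the building block such models rely on: the LSTM gates are computed with convolutions, so the recurrent state retains the spatial layout of the density maps. Input sizes and channel counts below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gates computed with convolutions so that
    the hidden state keeps the spatial layout of the input density maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Stand-in sequence of single-channel 64x64 cyclone density maps (batch=2, T=8).
cell = ConvLSTMCell(in_ch=1, hid_ch=16)
x_seq = torch.randn(2, 8, 1, 64, 64)
h = c = torch.zeros(2, 16, 64, 64)
for t in range(x_seq.size(1)):
    h, c = cell(x_seq[:, t], (h, c))
next_map = nn.Conv2d(16, 1, 1)(h)  # predict the next density map from the last hidden state
```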
For cold-start recommendation, it is important to rapidly profile new users and generate a good initial set of recommendations through an interview process --- users should be queried adaptively in a sequential fashion, with multiple items offered for opinion solicitation at each trial. In this work, we propose a novel algorithm that learns to conduct the interview process guided by a decision tree with multiple questions at each split. The splits, represented as sparse weight vectors, are learned in an L_1-constrained optimization framework. The directed...
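A loose sketch of the split-learning idea, using an off-the-shelf L1-penalized regression in place of the paper's constrained objective: the sparse weight vector selects a handful of interview items at a node, and users are routed to children by the resulting score. All data and the per-user target below are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_users, n_items = 500, 200
R = rng.integers(0, 2, size=(n_users, n_items)).astype(float)   # stand-in binary feedback
profile = R @ rng.normal(size=n_items)                          # hypothetical per-user target score

# At one tree node: an L1-penalized fit pushes most item weights toward zero;
# the items with the largest remaining weights become the questions at this split.
split_model = Lasso(alpha=0.05).fit(R, profile)
question_items = np.argsort(-np.abs(split_model.coef_))[:5]

# Users are routed to children by the sign of the sparse score,
# and the same procedure is repeated recursively on each child.
scores = R[:, question_items] @ split_model.coef_[question_items]
left, right = np.where(scores <= 0)[0], np.where(scores > 0)[0]
```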
The goal of video understanding is to develop algorithms that enable machines to understand videos at the level of human experts. Researchers have tackled various domains including video classification, search, personalized recommendation, and more. However, there is a research gap in combining these domains in one unified learning framework. Towards that end, we propose a deep network that embeds videos, using their audio-visual content, onto a metric space which preserves video-to-video relationships. Then, we use the trained embedding to tackle...
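One standard way to learn such a metric space is a triplet objective over audio-visual features, where positives are related videos and negatives are unrelated ones. The sketch below assumes 1152-d input features and a 256-d embedding purely for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small projection network from audio-visual features into the metric space.
embed = nn.Sequential(nn.Linear(1152, 512), nn.ReLU(), nn.Linear(512, 256))
triplet = nn.TripletMarginLoss(margin=0.3)

# anchor/positive are related videos (e.g., frequently co-consumed); negative is unrelated.
anchor, positive, negative = (torch.randn(16, 1152) for _ in range(3))
loss = triplet(F.normalize(embed(anchor), dim=-1),
               F.normalize(embed(positive), dim=-1),
               F.normalize(embed(negative), dim=-1))
loss.backward()
```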
The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only the observed past but also the potential future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict minutes-long sequences of future actions. Unlike...
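The sketch below captures only the general encoder-decoder attention pattern (observed frame features attended by learned future-action queries), not the published FUTR architecture; the feature dimension, number of classes, and anticipation horizon are placeholders.

```python
import torch
import torch.nn as nn

class ActionAnticipator(nn.Module):
    """Encoder-decoder attention sketch: the encoder attends over all observed
    frame features, and learned decoder queries attend globally to them before
    being classified into a sequence of future action labels."""
    def __init__(self, feat_dim=2048, d_model=256, n_classes=48, n_future=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.queries = nn.Parameter(torch.randn(n_future, d_model))
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=2,
                                          num_decoder_layers=2, batch_first=True)
        self.classify = nn.Linear(d_model, n_classes)

    def forward(self, frames):                       # frames: (batch, T, feat_dim)
        tgt = self.queries.expand(frames.size(0), -1, -1)
        out = self.transformer(self.proj(frames), tgt)
        return self.classify(out)                    # (batch, n_future, n_classes)

logits = ActionAnticipator()(torch.randn(2, 100, 2048))
```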
Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained on 44 million recordings (370K hours) with weakly-associated, free-form text annotations. Through its compatibility...
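The two-tower, contrastive training recipe can be sketched with a CLIP-style symmetric loss; the linear towers, feature sizes, and temperature below are stand-ins for real audio and text encoders, not MuLan's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder towers: real systems would use full audio and text encoders.
audio_tower = nn.Linear(512, 128)
text_tower = nn.Linear(768, 128)

audio_feat, text_feat = torch.randn(32, 512), torch.randn(32, 768)
a = F.normalize(audio_tower(audio_feat), dim=-1)
t = F.normalize(text_tower(text_feat), dim=-1)

# Symmetric contrastive loss: matched (audio, text) pairs sit on the diagonal.
logits = a @ t.T / 0.07                      # 0.07 is an assumed temperature
labels = torch.arange(32)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```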
Traditional recommendation systems using collaborative filtering (CF) approaches work relatively well when the candidate videos are sufficiently popular. With the increase of user-created videos, however, recommending fresh videos gets more and more important, yet pure CF-based approaches may not perform well in such cold-start situations. In this paper, we model this as a video content-based similarity learning problem, and learn deep video embeddings trained to predict relationships identified by a co-watch-based system, using only visual and audial...
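One simple way to train such content embeddings is a pairwise objective supervised by co-watch labels; the single content tower, cosine similarity, and fixed logit scale below are assumptions for illustration rather than the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Content tower maps audio-visual features to an embedding; co-watch labels supervise it.
content_tower = nn.Sequential(nn.Linear(1152, 512), nn.ReLU(), nn.Linear(512, 128))

feats_a, feats_b = torch.randn(64, 1152), torch.randn(64, 1152)
co_watched = torch.randint(0, 2, (64,)).float()     # 1 if the CF system relates the pair

emb_a = F.normalize(content_tower(feats_a), dim=-1)
emb_b = F.normalize(content_tower(feats_b), dim=-1)
similarity = (emb_a * emb_b).sum(dim=-1)            # cosine similarity per pair
loss = F.binary_cross_entropy_with_logits(similarity * 5.0, co_watched)  # 5.0: assumed scale
```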
Graph Convolutional Networks (GCNs) have shown significant improvements in semi-supervised learning on graph-structured data. Concurrently, unsupervised learning of graph embeddings has benefited from the information contained in random walks. In this paper, we propose a model, Network of GCNs (N-GCN), which marries these two lines of work. At its core, N-GCN trains multiple instances of GCNs over node pairs discovered at different distances in random walks, and learns a combination of the instance outputs that optimizes the classification...
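A compressed sketch of the idea: run one small GCN per power of the normalized adjacency (approximating different random-walk distances) and learn a weighted combination of their predictions. The toy graph, one-layer GCN instances, and softmax mixing below are simplifications of the full model.

```python
import torch
import torch.nn as nn

class TinyGCN(nn.Module):
    """One-layer GCN instance: propagate features with a power of the normalized adjacency."""
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.lin = nn.Linear(in_dim, n_classes)

    def forward(self, A_hat_k, X):
        return self.lin(A_hat_k @ X)

n_nodes, feat_dim, n_classes, K = 50, 16, 4, 3
A = (torch.rand(n_nodes, n_nodes) < 0.1).float()
A = ((A + A.T) > 0).float()                               # symmetric toy adjacency
A_hat = A + torch.eye(n_nodes)                            # add self-loops
d = A_hat.sum(1).pow(-0.5)
A_hat = d[:, None] * A_hat * d[None, :]                   # symmetric normalization
X = torch.randn(n_nodes, feat_dim)

# One GCN instance per random-walk scale A_hat^k, mixed by learned softmax weights.
instances = nn.ModuleList(TinyGCN(feat_dim, n_classes) for _ in range(K))
mix = nn.Parameter(torch.ones(K))
powers = [torch.matrix_power(A_hat, k + 1) for k in range(K)]
logits = sum(w * gcn(P, X)
             for w, gcn, P in zip(torch.softmax(mix, dim=0), instances, powers))
```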
Zero-shot learning offers an efficient solution for a machine learning model to handle unseen categories, avoiding exhaustive data collection. Zero-shot Sketch-based Image Retrieval (ZS-SBIR) simulates real-world scenarios where it is hard and costly to collect paired sketch-photo samples. We propose a novel framework that indirectly aligns sketches and photos by contrasting them through texts, removing the necessity of accessing sketch-photo pairs. With an explicit modality encoding learned from data, our approach disentangles...
Session-based recommendation aims at predicting the next item given a sequence of previous items consumed in a session, e.g., on e-commerce or multimedia streaming services. Session data exhibits some unique characteristics, i.e., consistency and sequential dependency over items within a session, repeated item consumption, and timeliness of sessions. In this paper, we propose simple-yet-effective linear models that consider these holistic aspects of sessions. The comprehensive nature of our models helps improve the quality of session-based...
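As a concrete example of a linear item-item model in this family (an EASE-style closed form, not necessarily the paper's exact variant), next-item scores come from a learned item-to-item weight matrix constrained to have a zero diagonal. The session-item matrix and regularization strength below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((1000, 300)) < 0.05).astype(float)   # stand-in session-item binary matrix
lam = 50.0                                           # assumed L2 regularization strength

# Closed-form item-item linear model (EASE-style):
G = X.T @ X + lam * np.eye(X.shape[1])
P = np.linalg.inv(G)
B = -P / np.diag(P)              # column-wise scaling; zero diagonal forbids self-recommendation
np.fill_diagonal(B, 0.0)

# Next-item scores for a session: multiply its item-occurrence vector by B.
session = X[0]
scores = session @ B
top_items = np.argsort(-scores)[:10]
```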
X-ray computed tomography (CT) is one of the most common imaging techniques used to diagnose various diseases in the medical field. Its high contrast sensitivity and spatial resolution allow the physician to observe details of body parts such as bones, soft tissue, blood vessels, etc. As it involves potentially harmful radiation exposure for patients and surgeons, however, reconstructing a 3D CT volume from perpendicular 2D X-ray images is considered a promising alternative, thanks to its lower risk and better accessibility. This...
Video-to-music generation demands both a temporally localized, high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow,...