- Music and Audio Processing
- Speech and Audio Processing
- Music Technology and Sound Studies
- Video Analysis and Summarization
- Diverse Musicological Studies
- Speech Recognition and Synthesis
- Time Series Analysis and Forecasting
- Speech and dialogue systems
- Digital Media Forensic Detection
- Visual perception and processing mechanisms
- Human Motion and Animation
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Color perception and design
- Social Robot Interaction and HRI
- Advanced Steganography and Watermarking Techniques
- Multimodal Machine Learning Applications
- Phonetics and Phonology Research
- Multisensory perception and integration
- Natural Language Processing Techniques
Harvey Mudd College
2016-2025
University of California, Berkeley
2013-2016
International Computer Science Institute
2015
Microsoft (United States)
2015
Chiba University
2009
Speech activity detection (SAD) on channel transmissions is a critical preprocessing task for speech, speaker and language recognition or further human analysis. This paper presents a feature combination approach to improve SAD on highly degraded speech as part of the Defense Advanced Research Projects Agency’s (DARPA) Robust Automatic Transcription of Speech (RATS) program. The key contribution is the exploration of different novel features based on pitch and spectro-temporal processing and the standard Mel Frequency Cepstral...
The goal of addressee detection is to answer the question, "Are you talking to me?" When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information. Using data from a multiparty dialogue system, we quantify the benefits of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities, as well as of key individual features, in estimating...
Dynamic time warping estimates the alignment between two sequences and is designed to handle a variable amount of time warping. In many contexts, it performs poorly when confronted with two sequences of different scale, in which the average slope of the true alignment path in the pairwise cost matrix deviates significantly from one. This paper investigates ways to improve the robustness of DTW to such global time warping conditions, using an audio–audio alignment task as a motivating scenario of interest. We modify a dataset commonly used for studying synchronization in order to construct...
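To make the alignment path and its slope concrete, here is a minimal sketch of textbook DTW on 1-D sequences. It is not the paper's method: the function name `dtw`, the absolute-difference cost, and the (1,1)/(1,0)/(0,1) step set are assumptions chosen for illustration.

```python
def dtw(x, y):
    """Return (total_cost, path) for basic DTW between 1-D sequences x and y.

    Assumptions for illustration: absolute-difference local cost and the
    standard step set {(1, 1), (1, 0), (0, 1)} with equal weights.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] = minimum cumulative cost of aligning x[:i+1] with y[:j+1]
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(x[i] - y[j])
            if i == 0 and j == 0:
                D[i][j] = cost
                continue
            best = min(
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                D[i - 1][j] if i > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
            )
            D[i][j] = cost + best

    # Backtrace from the end of both sequences to the start.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((D[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((D[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return D[n - 1][m - 1], path


# y is x played at half speed, so the path advances two steps in y
# for every step in x (a global slope far from one).
x = [0, 1, 2, 3]
y = [0, 0, 1, 1, 2, 2, 3, 3]
total_cost, path = dtw(x, y)
```

In this toy case the step set happens to accommodate the 2x stretch exactly; the abstract's concern is the general setting in which such global scale differences degrade alignment.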
TED talks are the pinnacle of public speaking. They combine compelling content with flawless delivery, and their popularity is attested by the millions of views they attract. In this work, we compare the prosodic voice characteristics of TED speakers and university professors. Our aim is to identify the characteristics that separate TED speakers from other speakers. Based on a simple set of features derived from pitch and energy, we train a discriminative classifier to predict whether a 5 minute audio sample is a TED talk or a university lecture. We are able to achieve < 10% equal error rate. We then...
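For readers unfamiliar with the metric, the equal error rate is the operating point at which the false-accept and false-reject rates coincide. The sketch below is one simple way to estimate it by sweeping thresholds over observed scores; the function name `equal_error_rate` and the toy scores are hypothetical and not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Estimate the equal error rate by sweeping thresholds over the scores.

    scores: higher means "more likely positive"; labels: 1 = positive, 0 = negative.
    Returns the mean of the false-accept and false-reject rates at the threshold
    where the two rates are closest (an approximation on finite data).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        far = sum(s >= t for s in neg) / len(neg)  # negatives accepted
        frr = sum(s < t for s in pos) / len(pos)   # positives rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy example: one positive scores below one negative, so half of each
# class is misclassified at the balanced threshold.
toy_eer = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```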
Addressee detection answers the question, "Are you talking to me?" When multiple users interact with a dialogue system, it is important to know when a user is speaking to the computer and when he or she is speaking to another person. We approach this problem from a multimodal perspective, using lexical, acoustic, visual, dialog state, and beam-forming information. Using data from a multiparty dialogue system, we demonstrate the benefit of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities in predicting the addressee. In our experiments, we find...
This article studies a composer style classification task based on raw sheet music images. While previous works on composer style recognition have relied exclusively on supervised learning, we explore the use of self-supervised pretraining methods that have been recently developed for natural language processing. We first convert sheet music images to sequences of musical words, train a language model on a large set of unlabeled musical “sentences”, initialize a classifier with the pretrained language model weights, and then finetune the classifier on a small set of labeled data. We conduct extensive...
This paper proposes a way to generate a single high-quality audio recording of a meeting using no equipment other than participants' personal devices. Each participant in the meeting uses their mobile device as a local recording node, and they begin recording whenever they arrive in an unsynchronized fashion. The main problem in generating a single summary recording is to temporally align the various audio recordings in a robust and efficient manner. We propose a way to do this using an adaptive audio fingerprint based on spectrotemporal eigenfilters, where the fingerprint design is learned on-the-fly in a totally...
This paper investigates an ordered partial matching alignment problem, in which the goal is to align two sequences in the presence of potentially non-matching regions. We propose a novel parameter-free dynamic programming alignment method called hidden state time warping that allows an alignment path to switch between two different planes: a “visible” plane corresponding to matching sections and a “hidden” plane corresponding to non-matching sections. By defining two distinct planes, we can allow different types of warping in each plane (e.g., imposing a maximum warping factor in matching regions while allowing completely...
This paper studies the problem of automatically generating piano score following videos given an audio recording and raw sheet music images. Whereas previous works focus on synthetic sheet music where the data has been cleaned and preprocessed, we instead focus on developing a system that can cope with the messiness of raw, unprocessed sheet music PDFs from IMSLP. We investigate how well existing systems cope with real scanned sheet music, filler pages and unrelated pieces or movements, and discontinuities due to jumps and repeats. We find that a significant bottleneck in...
This article introduces a method for large-scale retrieval of piano sheet music images. We study this problem in two different scenarios: camera-based sheet music identification and MIDI-sheet image retrieval. Our proposed method combines bootleg score features with a novel hashing scheme called dynamic N-gram fingerprinting. This method ensures that every fingerprint is discriminative enough to warrant a table lookup, which improves both retrieval accuracy and runtime. On experiments using all piano sheet music images in the IMSLP database, the proposed method achieves >0.8 mean...
In this work we explore parallelizable alternatives to DTW for globally aligning two feature sequences. One of the main practical limitations of DTW is its quadratic computation and memory cost. Previous works have sought to reduce the computational cost in various ways, such as imposing bands in the cost matrix or using a multiresolution approach. In this work, we utilize the fact that computation is an abundant resource and focus instead on exploring alternatives that approximate the inherently sequential DTW algorithm with one that is parallelizable. We describe variations of an algorithm called...
This paper studies the problem of identifying piano music in various modalities using a single, unified approach called marketplace fingerprinting. The key defining characteristic of marketplace fingerprinting is choice: we consider a broad range of fingerprint designs based on a generalization of standard n-grams, and then select at runtime the fingerprint designs that are best for a specific query. We show that large-scale retrieval can be framed as an economics problem in which a consumer and a store interact. In our analogy, a search is like shopping in a store, items...
This article studies the problem of generating a piano score following video from an audio recording in a fully automated manner. This problem contains two components: identifying the piece and aligning the audio with raw sheet music images. Unlike previous work, we focus primarily on working with raw, unprocessed sheet music from IMSLP, which may contain filler pages, other unrelated pieces or movements, and repeats and jumps whose locations are unknown a priori. To solve this problem, we combine state-of-the-art methods with a novel alignment algorithm called...
This article motivates, describes, and presents the PBSCSR dataset for studying composer style recognition of piano sheet music. Our overarching goal was to create a dataset that is "as accessible as MNIST and as challenging as ImageNet." To achieve this goal, we sample fixed-length bootleg score fragments from piano sheet music images on IMSLP. The dataset itself contains 40,000 62x64 bootleg score images for a 9-way classification task, 100,000 62x64 bootleg score images for a 100-way classification task, and 29,310 unlabeled variable-length bootleg score images for pretraining. The labeled data is presented in a form that mirrors MNIST images, in order...
This paper studies composer style classification of piano sheet music images. Previous approaches to the composer classification task have been limited by a scarcity of data. We address this issue in two ways: (1) we recast the problem to be based on raw sheet music images rather than a symbolic music format, and (2) we propose an approach that can be trained on unlabeled data. Our approach first converts the sheet music image into a sequence of musical "words" based on the bootleg feature representation, and then feeds the sequence into a text classifier. We show that it is possible to significantly improve classifier performance...
This paper tackles the problem of verifying the authenticity of speech recordings from world leaders. Whereas previous work on detecting deep fake or tampered audio focuses on scrutinizing an audio recording in isolation, we instead reframe the problem and focus on cross-verifying a questionable recording against trusted references. We present a method for cross-verifying a speech recording against a reference that consists of two steps: aligning the recordings and then classifying each query frame as matching or non-matching. We propose a subsequence alignment method based on the Needleman-Wunsch algorithm and show that it significantly...
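As background for the alignment step, the following is a minimal sketch of the classic global Needleman-Wunsch scoring recursion on symbol sequences. It is the textbook algorithm, not the subsequence variant the paper proposes; the function name and the match, mismatch, and gap values are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score between sequences a and b.

    Textbook formulation with a linear gap penalty; the scoring values
    are illustrative defaults, not parameters from the paper.
    """
    n, m = len(a), len(b)
    # S[i][j] = best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        S[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]


# "AB" vs "AXB": two matches and one gap under the default scores.
score = needleman_wunsch("AB", "AXB")
```

A subsequence variant would additionally let the query begin and end anywhere in the reference without penalty, which this global version does not do.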
Audio fingerprinting refers to the process of extracting a robust, compact representation of audio which can be used to uniquely identify an audio segment. Works in the audio fingerprinting literature generally report results using system-level metrics. Because these systems are usually very complex, the overall system-level performance depends on many different factors. So, while these system-level metrics are useful in understanding how well the entire system performs, they are not very useful in knowing how good or bad the fingerprint design is. In this work, we propose a metric of fingerprint effectiveness that...
This paper proposes a robust and efficient way to temporally align a set of unsynchronized meeting recordings, such as might be collected by participants’ cell phones. We propose an adaptive audio fingerprint which is learned on-the-fly in a completely unsupervised manner to adapt to the characteristics of a given set of unaligned recordings. The design of the adaptive audio fingerprint is formulated as a series of optimization problems which can be solved very efficiently using eigenvector routines. We also propose a method of aligning sets of files which uses the cumulative evidence from...
This paper studies instrument classification of solo sheet music. Whereas previous work has focused on instrument recognition in audio data, we instead approach the instrument classification problem using raw sheet music images. Our approach first converts the sheet music image into a sequence of musical words based on the bootleg score representation, and then treats the problem as a text classification task. We show that it is possible to significantly improve classifier performance by training a language model on unlabeled data, initializing a classifier with the pretrained language model weights, and finetuning the classifier on labeled data. In this work,...