NFDI4DS | UHH-SEMS - Publication Details

T. J. Tsai

ORCID: 0000-0003-3832-496X

About

Contact & Profiles

OpenAlex author ID: A5034550382

Research Areas

Music and Audio Processing
Speech and Audio Processing
Music Technology and Sound Studies
Video Analysis and Summarization
Diverse Musicological Studies
Speech Recognition and Synthesis
Time Series Analysis and Forecasting
Speech and dialogue systems
Digital Media Forensic Detection
Visual perception and processing mechanisms
Human Motion and Animation
Anomaly Detection Techniques and Applications
Topic Modeling
Color perception and design
Social Robot Interaction and HRI
Advanced Steganography and Watermarking Techniques
Multimodal Machine Learning Applications
Phonetics and Phonology Research
Multisensory perception and integration
Natural Language Processing Techniques

Harvey Mudd College
2016-2025

University of California, Berkeley
2013-2016

International Computer Science Institute
2015

Microsoft (United States)
2015

Chiba University
2009

All for one: feature combination for highly channel-degraded speech activity detection

OPENALEX - Publications

Martin Graciarena, Abeer Alwan, Dan Ellis, Horacio Franco, Luciana Ferrer, and 11 more

Speech activity detection (SAD) on channel transmissions is a critical preprocessing task for speech, speaker and language recognition or for further human analysis. This paper presents a feature combination approach to improve SAD on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency’s (DARPA) Robust Automatic Transcription of Speech (RATS) program. The key contribution is the exploration of different novel features based on pitch and spectro-temporal processing and the standard Mel Frequency Cepstral...

10.21437/interspeech.2013-199 article EN Interspeech 2013-08-25

A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction

T. J. Tsai, Andreas Stolcke, Malcolm Slaney

The goal of addressee detection is to answer the question, "Are you talking to me?" When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information. Using data from a multiparty dialogue system, we quantify the benefits of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities, as well as of key individual features, in estimating...

10.1109/tmm.2015.2454332 article EN IEEE Transactions on Multimedia 2015-07-09

Dense-Sparse Dynamic Time Warping for Customizing Piano Concerto Accompaniments

T. J. Tsai, K. Dey, Yigitcan Özer, Meinard Müller

10.1109/icassp49660.2025.10890080 article EN IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Improving the Robustness of DTW to Global Time Warping Conditions in Audio Synchronization

Jittisa Kraprayoon, Austin Pham, T. J. Tsai

Dynamic time warping (DTW) estimates the alignment between two sequences and is designed to handle a variable amount of time warping. In many contexts, it performs poorly when confronted with two sequences of different scale, in which the average slope of the true alignment path in the pairwise cost matrix deviates significantly from one. This paper investigates ways to improve the robustness of DTW to such global time warping conditions, using an audio–audio alignment task as a motivating scenario of interest. We modify a dataset commonly used for studying audio synchronization in order to construct...

10.3390/app14041459 article EN cc-by Applied Sciences 2024-02-10

Are you TED talk material? Comparing prosody in professors and TED speakers

T. J. Tsai

TED talks are the pinnacle of public speaking. They combine compelling content with flawless delivery, and their popularity is attested by the millions of views they attract. In this work, we compare the prosodic voice characteristics of TED speakers and university professors. Our aim is to identify the characteristics that separate TED speakers from other public speakers. Based on a simple set of features derived from pitch and energy, we train a discriminative classifier to predict whether a 5-minute audio sample is from a TED talk or a university lecture. We are able to achieve < 10% equal error rate. We then...

10.21437/interspeech.2015-546 article EN Interspeech 2015-09-06

Multimodal addressee detection in multiparty dialogue systems

T. J. Tsai, Andreas Stolcke, Malcolm Slaney

Addressee detection answers the question, "Are you talking to me?" When multiple users interact with a dialogue system, it is important to know when a user is speaking to the computer and when he or she is speaking to another person. We approach this problem from a multimodal perspective, using lexical, acoustic, visual, dialog state, and beam-forming information. Using data from a multiparty dialogue system, we demonstrate the benefit of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities in predicting the addressee. In our experiments, we find...

10.1109/icassp.2015.7178384 article EN 2015-04-01

Longer features: they do a speech detector good

T. J. Tsai, Nelson Morgan

10.21437/interspeech.2012-391 article EN Interspeech 2012-09-09

A Deeper Look at Sheet Music Composer Classification Using Self-Supervised Pretraining

Daniel Yang, Kevin Ji, T. J. Tsai

This article studies a composer style classification task based on raw sheet music images. While previous works on composer recognition have relied exclusively on supervised learning, we explore the use of self-supervised pretraining methods that have been recently developed for natural language processing. We first convert sheet music images to sequences of musical words, train a language model on a large set of unlabeled musical “sentences”, initialize a classifier with the pretrained language model weights, and then finetune the classifier on a small set of labeled data. We conduct extensive...

10.3390/app11041387 article EN cc-by Applied Sciences 2021-02-04

Robust and Efficient Multiple Alignment of Unsynchronized Meeting Recordings

T. J. Tsai, Andreas Stolcke

This paper proposes a way to generate a single high-quality audio recording of a meeting using no equipment other than participants' personal devices. Each participant in the meeting uses their mobile device as a local recording node, and they begin recording whenever they arrive, in an unsynchronized fashion. The main problem in generating a single summary recording is to temporally align the various audio recordings in a robust and efficient manner. We propose a way to do this using an adaptive audio fingerprint based on spectrotemporal eigenfilters, where the fingerprint design is learned on-the-fly in a totally...

10.1109/taslp.2016.2526787 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-02-08

Parameter-Free Ordered Partial Match Alignment with Hidden State Time Warping

Claire S. Chang, Thaxter Shaw, Arya Goutam, Christina L. Lau, Mengyi Shan, and 1 more

This paper investigates an ordered partial matching alignment problem, in which the goal is to align two sequences in the presence of potentially non-matching regions. We propose a novel parameter-free dynamic programming alignment method called hidden state time warping that allows an alignment path to switch between two different planes: a “visible” state plane corresponding to matching sections and a “hidden” state plane corresponding to non-matching sections. By defining two distinct planes, we can allow different types of time warping in each plane (e.g., imposing a maximum warping factor in matching regions while allowing completely...

10.3390/app12083783 article EN cc-by Applied Sciences 2022-04-08

Improved Handling of Repeats and Jumps in Audio-Sheet Image Synchronization

Mengyi Shan, T. J. Tsai

This paper studies the problem of automatically generating piano score following videos given an audio recording and raw sheet music images. Whereas previous works focus on synthetic sheet music where the data has been cleaned and preprocessed, we instead focus on developing a system that can cope with the messiness of raw, unprocessed sheet music PDFs from IMSLP. We investigate how well existing systems cope with real scanned sheet music, filler pages and unrelated pieces or movements, and discontinuities due to jumps and repeats. We find that a significant bottleneck in...

10.48550/arxiv.2007.14580 preprint EN cc-by arXiv (Cornell University) 2020-01-01

Piano Sheet Music Identification Using Dynamic N-gram Fingerprinting

Daniel Yang, T. J. Tsai

This article introduces a method for large-scale retrieval of piano sheet music images. We study this problem in two different scenarios: camera-based sheet music identification and MIDI-sheet image retrieval. Our proposed method combines bootleg score features with a novel hashing scheme called dynamic N-gram fingerprinting. This scheme ensures that every fingerprint is discriminative enough to warrant a table lookup, which improves both retrieval accuracy and runtime. On experiments using all piano sheet music images in the IMSLP database, the proposed method achieves >0.8 mean...

10.5334/tismir.70 article EN cc-by Transactions of the International Society for Music Information Retrieval 2021-01-01

Segmental DTW: A Parallelizable Alternative to Dynamic Time Warping

T. J. Tsai

In this work we explore parallelizable alternatives to DTW for globally aligning two feature sequences. One of the main practical limitations of DTW is its quadratic computation and memory cost. Previous works have sought to reduce the computational cost in various ways, such as imposing bands in the cost matrix or using a multiresolution approach. In this work, we utilize the fact that computation is an abundant resource and focus instead on exploring alternatives that approximate the inherently sequential DTW algorithm with one that is parallelizable. We describe variations of an algorithm called...

10.1109/icassp39728.2021.9413827 article EN IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Large-Scale Multimodal Piano Music Identification Using Marketplace Fingerprinting

Daniel Yang, Arya Goutam, Kevin Ji, T. J. Tsai

This paper studies the problem of identifying piano music in various modalities using a single, unified approach called marketplace fingerprinting. The key defining characteristic of marketplace fingerprinting is choice: we consider a broad range of fingerprint designs based on a generalization of standard n-grams, and then select at runtime the fingerprint designs that are best for a specific query. We show that the large-scale retrieval problem can be framed as an economics problem in which a consumer and a store interact. In our analogy, a search is like shopping in the store, items...

10.3390/a15050146 article EN cc-by Algorithms 2022-04-26

Automatic Generation of Piano Score Following Videos

Mengyi Shan, T. J. Tsai

This article studies the problem of generating a piano score following video from an audio recording in a fully automated manner. This problem contains two components: identifying the piece and aligning the audio with raw sheet music images. Unlike previous work, we focus primarily on working with raw, unprocessed sheet music from IMSLP, which may contain filler pages, other unrelated pieces or movements, and repeats and jumps whose locations are unknown a priori. To solve this problem, we combine state-of-the-art methods with a novel alignment algorithm called...

10.5334/tismir.69 article EN cc-by Transactions of the International Society for Music Information Retrieval 2021-01-01

PBSCSR: The Piano Bootleg Score Composer Style Recognition Dataset

Arhan Jain, Alec Bunn, T. J. Tsai

This article motivates, describes, and presents the PBSCSR dataset for studying composer style recognition of piano sheet music. Our overarching goal was to create a dataset that is "as accessible as MNIST and as challenging as ImageNet." To achieve this goal, we sample fixed-length bootleg score fragments from piano sheet music images on IMSLP. The dataset itself contains 40,000 62x64 bootleg score images for a 9-way classification task, 100,000 62x64 bootleg score images for a 100-way classification task, and 29,310 unlabeled variable-length bootleg score images for pretraining. The labeled data is presented in a form that mirrors MNIST images, in order...

10.48550/arxiv.2401.16803 preprint EN arXiv (Cornell University) 2024-01-30

PBSCR: The Piano Bootleg Score Composer Recognition Dataset

Arhan Jain, Alec Bunn, Austin Pham, T. J. Tsai

10.5334/tismir.185 article EN cc-by Transactions of the International Society for Music Information Retrieval 2024-01-01

Composer Style Classification of Piano Sheet Music Images Using Language Model Pretraining

T. J. Tsai, Kevin Ji

This paper studies composer style classification of piano sheet music images. Previous approaches to the composer classification task have been limited by a scarcity of data. We address this issue in two ways: (1) we recast the problem to be based on raw sheet music images rather than a symbolic music format, and (2) we propose an approach that can be trained on unlabeled data. Our approach first converts the sheet music image into a sequence of musical "words" based on the bootleg feature representation, and then feeds the sequence into a text classifier. We show that it is possible to significantly improve classifier performance...

10.48550/arxiv.2007.14587 preprint EN cc-by arXiv (Cornell University) 2020-01-01

A Cross-Verification Approach for Protecting World Leaders from Fake and Tampered Audio

Mengyi Shan, T. J. Tsai

This paper tackles the problem of verifying the authenticity of speech recordings from world leaders. Whereas previous work on detecting deep fake or tampered audio focuses on scrutinizing an audio recording in isolation, we instead reframe the problem and focus on cross-verifying a questionable recording against trusted references. We present a method for cross-verifying a speech recording against a reference that consists of two steps: aligning the two recordings and then classifying each query frame as matching or non-matching. We propose a subsequence alignment method based on the Needleman-Wunsch algorithm and show that it significantly...

10.48550/arxiv.2010.12173 preprint EN other-oa arXiv (Cornell University) 2020-01-01

An information-theoretic metric of fingerprint effectiveness

T. J. Tsai, Gerald Friedland, Xavier Anguera

Audio fingerprinting refers to the process of extracting a robust, compact representation of audio which can be used to uniquely identify an audio segment. Works in the audio fingerprinting literature generally report results using system-level metrics. Because these systems are usually very complex, the overall system-level performance depends on many different factors. So, while these system-level metrics are useful in understanding how well the entire system performs, they are not very useful in knowing how good or bad the fingerprint design is. In this work, we propose a metric of fingerprint effectiveness that...

10.1109/icassp.2015.7177987 article EN 2015-04-01

Aligning meeting recordings via adaptive fingerprinting

T. J. Tsai, Andreas Stolcke

This paper proposes a robust and efficient way to temporally align a set of unsynchronized meeting recordings, such as might be collected by participants’ cell phones. We propose an adaptive audio fingerprint which is learned on-the-fly in a completely unsupervised manner to adapt to the characteristics of a given set of unaligned recordings. The fingerprint design is formulated as a series of optimization problems which can be solved very efficiently using eigenvector routines. We also propose a method of aligning sets of files which uses the cumulative evidence from...

10.21437/interspeech.2015-224 article EN Interspeech 2015-09-06

Instrument Classification of Solo Sheet Music Images

Kevin Ji, Daniel Yang, T. J. Tsai

This paper studies instrument classification of solo sheet music. Whereas previous work has focused on instrument recognition in audio data, we instead approach the instrument classification problem using raw sheet music images. Our approach first converts the sheet music image into a sequence of musical words based on the bootleg score representation, and then treats the problem as a text classification task. We show that it is possible to significantly improve classifier performance by training a language model on unlabeled data, initializing a classifier with the pretrained language model weights, and then finetuning the classifier on labeled data. In this work,...

10.1109/icassp39728.2021.9413732 article EN IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
