- Music and Audio Processing
- Speech and Audio Processing
- Music Technology and Sound Studies
- Speech Recognition and Synthesis
- Neuroscience and Music Perception
- Diverse Musicological Studies
- Neural Networks and Applications
- Model-Driven Software Engineering Techniques
- Advanced Data Compression Techniques
- Blind Source Separation Techniques
- Digital Filter Design and Implementation
- Image and Signal Denoising Methods
- Network Traffic and Congestion Control
- Computer Graphics and Visualization Techniques
- Emotion and Mood Recognition
- Advanced Adaptive Filtering Techniques
- Caching and Content Delivery
- Peer-to-Peer Network Technologies
Sony Corporation (United States)
2024-2025
Dexerials (Japan)
2025
Agency for Science, Technology and Research
2020-2023
Singapore University of Technology and Design
2020-2023
Institute of High Performance Computing
2019-2021
Hong Kong University of Science and Technology
2004
University of Hong Kong
2004
Music source separation has been intensively studied in the last decade, and tremendous progress has been observed with the advent of deep learning. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models with their corresponding papers, which can help researchers integrate best practices into their own models. In recent years, the widely used MUSDB18 dataset has played an important role in measuring the performance of music source separation. While it has made a considerable contribution to the advancement of the field, it is also...
In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional networks to perform time domain to frequency domain conversion. Its fast speed allows on-the-fly spectrogram extraction, without the need to store any spectrograms on disk. Moreover, the approach also supports back-propagation through the waveform-to-spectrogram transformation layer, and hence the conversion process can be made trainable, further optimizing the waveform-to-spectrogram...
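The core idea, computing the short-time Fourier transform as a 1D convolution so that it runs on the GPU and stays differentiable, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the nnAudio implementation itself; nnAudio's kernel construction, windowing options, and normalization are more elaborate.

```python
import numpy as np
import torch
import torch.nn.functional as F

def stft_conv1d(waveform, n_fft=2048, hop=512):
    """Magnitude spectrogram via 1D convolution with Fourier-basis kernels.

    waveform: (batch, samples) float tensor. Returns (batch, n_fft//2 + 1, frames).
    """
    n_bins = n_fft // 2 + 1
    t = np.arange(n_fft)
    window = np.hanning(n_fft)
    # One convolution kernel per frequency bin: real and imaginary parts.
    real = np.array([window * np.cos(2 * np.pi * k * t / n_fft) for k in range(n_bins)])
    imag = np.array([window * -np.sin(2 * np.pi * k * t / n_fft) for k in range(n_bins)])
    kernels = torch.tensor(np.concatenate([real, imag]), dtype=torch.float32).unsqueeze(1)

    x = waveform.unsqueeze(1)                 # (batch, 1, samples)
    y = F.conv1d(x, kernels, stride=hop)      # (batch, 2 * n_bins, frames)
    re, im = y[:, :n_bins], y[:, n_bins:]
    return torch.sqrt(re ** 2 + im ** 2)      # magnitude; differentiable end to end

spec = stft_conv1d(torch.randn(4, 16000))     # e.g. four one-second clips at 16 kHz
```

Because the kernels are ordinary convolution weights, gradients can flow through them, which is what makes the transformation layer trainable.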
In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not regression, we propose a mechanism that allows them to provide meaningful low-dimensional representations for regression tasks. We then use these new representations as the input to algorithms such as support vector machines and gradient boosting machines. To demonstrate the TNNs' effectiveness at creating such representations, we compare different dimensionality reduction methods on...
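A minimal sketch of the idea: an embedding network trained with a triplet margin loss, whose low-dimensional outputs are then fed to a downstream regressor. The network shape, margin, and the use of a gradient boosting regressor below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Small embedding network mapping input features to a low-dimensional space.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8))
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

def training_step(anchor, positive, negative):
    """Anchor/positive share a similar emotion rating; negative differs."""
    loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# After training, the 8-dimensional embeddings become inputs to a regressor,
# e.g. a gradient boosting machine predicting valence/arousal:
#   from sklearn.ensemble import GradientBoostingRegressor
#   reg = GradientBoostingRegressor().fit(embed(X).detach().numpy(), y)
```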
Music is capable of conveying many emotions. The level and type of emotion in the music perceived by a listener, however, is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamic valence and arousal ratings of 54 selected full-length songs. It contains features of the songs, as well as user profile information of the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triplet Neural Network...
This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU-based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, a log-frequency spectrogram, a Mel spectrogram, and the constant-Q transform (CQT). Our results show that an 8.33% increase in transcription accuracy and a 9.39% reduction in error can be obtained by choosing the appropriate input representation (a log-frequency spectrogram with an STFT window length of 4,096 and 2,048...
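For reference, the families of input representations being compared can be computed as follows. This is shown with librosa for brevity and with illustrative parameters; the paper itself extracts the representations on the GPU with nnAudio.

```python
import librosa
import numpy as np

# Two seconds of a 440 Hz test tone as a stand-in for a real recording.
sr = 44100
y = librosa.tone(440.0, sr=sr, duration=2.0)

# Linear-frequency spectrogram (plain STFT magnitude).
linear = np.abs(librosa.stft(y, n_fft=4096, hop_length=512))

# Mel spectrogram (perceptually motivated frequency scale).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=4096, hop_length=512, n_mels=256)

# Constant-Q transform (log-frequency bins aligned with musical pitch).
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, bins_per_octave=24, n_bins=176))
```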
Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses a reconstruction loss and virtual adversarial training. When combined with existing...
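Virtual adversarial training adds a smoothness term on unlabelled data: the model's prediction should not change under a small, worst-case perturbation of the input. Below is a condensed PyTorch sketch of that term with a single power iteration and a binary cross-entropy distance, both assumptions for illustration; ReconVAT's full objective also includes the spectrogram reconstruction loss.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=1.0):
    """Virtual adversarial loss on an (unlabelled) batch of spectrograms x."""
    with torch.no_grad():
        pred = torch.sigmoid(model(x))            # reference frame-level activations

    # Power iteration: find the direction that changes the prediction the most.
    d = torch.randn_like(x, requires_grad=True)
    pred_hat = torch.sigmoid(model(x + xi * d))
    adv_dist = F.binary_cross_entropy(pred_hat, pred)
    grad = torch.autograd.grad(adv_dist, d)[0]
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)

    # Penalise the change in prediction under the adversarial perturbation.
    pred_adv = torch.sigmoid(model(x + r_adv))
    return F.binary_cross_entropy(pred_adv, pred)
```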
Most of the state-of-the-art automatic music transcription (AMT) models break down the main task into sub-tasks such as onset prediction and offset prediction and train them with dedicated labels. These predictions are then concatenated together and used as the input to another model trained with pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art accuracy; instead, we explore the effect that spectrogram reconstruction has on...
In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same issues. We propose Timbre-Trap, a novel framework which unifies music transcription and audio reconstruction by exploiting the strong...
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic-looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new formulation enables DiffRoll to transcribe, generate, and even inpaint music. Due to its classifier-free nature, it is also able to be trained on unpaired datasets where only piano rolls are...
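At its core this is a standard denoising-diffusion training step, except that the quantity being noised and denoised is the piano roll, while the spectrogram enters only as conditioning. The sketch below uses an epsilon-prediction objective, a linear noise schedule, and a hypothetical `model(noisy_roll, t, spec)` signature; the actual DiffRoll architecture and schedule differ.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_step(model, roll, spec):
    """One training step: noise the piano roll, predict the noise given the spectrogram.

    roll: (batch, 1, pitch, time) piano roll; spec: matching spectrogram conditioning.
    """
    b = roll.shape[0]
    t = torch.randint(0, T, (b,))
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(roll)
    noisy_roll = a.sqrt() * roll + (1 - a).sqrt() * noise   # forward (noising) process
    # For classifier-free training on unpaired data, spec can be replaced by a
    # learned "null" conditioning with some probability.
    pred_noise = model(noisy_roll, t, spec)                 # conditioned on the spectrogram
    return F.mse_loss(pred_noise, noise)
```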
Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on leveraging deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination of the Onsets-and-Frames AMT model and pinpoint the essential components contributing to its strong performance. This is achieved through...
Converting time domain waveforms to frequency domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk space to store different frequency domain representations. This is especially true during the development and tuning process, when exploring various types of spectrograms for optimal performance. Second, if another dataset is used, one must process all the audio clips again before the network can be retrained. In this paper, we integrate...
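Integrating the conversion means making the waveform-to-spectrogram transformation the first layer of the network, so raw audio goes in and no spectrograms ever touch the disk. The sketch below uses torchaudio's MelSpectrogram as a stand-in front-end (the paper uses nnAudio's trainable layers), and the backbone layer sizes are placeholders.

```python
import torch
import torch.nn as nn
import torchaudio

class End2EndTagger(nn.Module):
    """Raw waveform in, prediction out: the spectrogram layer is part of the model."""
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        # Differentiable waveform-to-spectrogram front-end, computed on the fly (GPU-capable).
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=256, n_mels=n_mels)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, waveform):              # waveform: (batch, samples)
        spec = self.frontend(waveform)        # (batch, n_mels, frames), no disk I/O
        return self.backbone(spec.unsqueeze(1).log1p())

model = End2EndTagger()
logits = model(torch.randn(2, 16000))
```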
We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the i-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to its high classification accuracy and relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to build a latent space for different classifiers on the Multi-Target Speaker Detection and Identification Challenge Evaluation...
Disentangling factors of variation aims to uncover latent variables that underlie the process of data generation. In this paper, we propose a framework that achieves unsupervised pitch and timbre disentanglement for isolated musical instrument sounds without relying on annotations or pre-trained neural networks. Our framework, based on variational auto-encoders, takes a spectral frame as input and encodes pitch and timbre into categorical and continuous latent variables, respectively. The input is then reconstructed by combining those variables....
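A minimal sketch of such an architecture, assuming a Gumbel-softmax relaxation for the categorical (pitch) latent and the usual Gaussian reparameterisation for the continuous (timbre) latent; layer sizes and latent dimensions are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PitchTimbreVAE(nn.Module):
    """Sketch: categorical latent for pitch, continuous Gaussian latent for timbre."""
    def __init__(self, n_freq=513, n_pitch=88, timbre_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU())
        self.to_pitch = nn.Linear(256, n_pitch)          # categorical logits
        self.to_mu = nn.Linear(256, timbre_dim)          # continuous latent mean
        self.to_logvar = nn.Linear(256, timbre_dim)      # continuous latent log-variance
        self.dec = nn.Sequential(nn.Linear(n_pitch + timbre_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq))

    def forward(self, frame, tau=0.5):
        h = self.enc(frame)
        pitch = F.gumbel_softmax(self.to_pitch(h), tau=tau)        # differentiable one-hot
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        timbre = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        recon = self.dec(torch.cat([pitch, timbre], dim=-1))       # rebuild the spectral frame
        return recon, pitch, mu, logvar
```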
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes the instrument information and transcription results. The conditioning is designed for explicit functionality, while the connection between modules is for better performance. Our challenging...
This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite its SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced...
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work done by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based...
Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding the cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and whose collection forms a set of per-instrument latents underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of the constituent...
Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of the rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows more efficient coding by adapting the number of codebooks used per frame. Furthermore, we introduce a gradient estimation method...
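The essential mechanism can be sketched as a residual VQ loop with a per-frame budget mask: frames that need few codebooks (e.g. silence) skip the later quantization stages. This toy version uses plain nearest-neighbour codebook lookup and omits the paper's gradient estimation for the non-differentiable masking, as well as commitment losses.

```python
import torch
import torch.nn as nn

class VariableRVQ(nn.Module):
    """Residual VQ where each frame may use a different number of codebooks (sketch)."""
    def __init__(self, dim=64, codebook_size=1024, n_codebooks=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, z, n_active):
        """z: (batch, frames, dim); n_active: (batch, frames) codebooks to use per frame."""
        residual, quantized = z, torch.zeros_like(z)
        for i, cb in enumerate(self.codebooks):
            # Nearest codebook entry for every frame at this stage.
            book = cb.weight.unsqueeze(0).expand(z.size(0), -1, -1)
            q = cb(torch.cdist(residual, book).argmin(dim=-1))
            # Frames whose codebook budget is exhausted skip the remaining stages.
            mask = (n_active > i).unsqueeze(-1).float()
            quantized = quantized + mask * q
            residual = residual - mask * q
        return quantized
```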
Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, one model is designated as the source model to map the input audio to its corresponding prior, and another as the target model to reconstruct the audio from that prior, thereby...
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging these hierarchical features, SoniDo constrains the information granularity, leading to improved performance across tasks, including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, transcription, source separation, and mixing. Our...