Kin Wai Cheuk

ORCID: 0000-0003-3213-8242
Research Areas
  • Music and Audio Processing
  • Speech and Audio Processing
  • Music Technology and Sound Studies
  • Speech Recognition and Synthesis
  • Neuroscience and Music Perception
  • Diverse Musicological Studies
  • Neural Networks and Applications
  • Model-Driven Software Engineering Techniques
  • Advanced Data Compression Techniques
  • Blind Source Separation Techniques
  • Digital Filter Design and Implementation
  • Image and Signal Denoising Methods
  • Network Traffic and Congestion Control
  • Computer Graphics and Visualization Techniques
  • Emotion and Mood Recognition
  • Advanced Adaptive Filtering Techniques
  • Caching and Content Delivery
  • Peer-to-Peer Network Technologies

Sony Corporation (United States)
2024-2025

Dexerials (Japan)
2025

Agency for Science, Technology and Research
2020-2023

Singapore University of Technology and Design
2020-2023

Institute of High Performance Computing
2019-2021

Hong Kong University of Science and Technology
2004

University of Hong Kong
2004

Music source separation has been intensively studied in the last decade, and tremendous progress with the advent of deep learning could be observed. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models and corresponding papers, which can help researchers integrate the best practices into their models. In recent years, the widely used MUSDB18 dataset played an important role in measuring the performance of music source separation. While it made a considerable contribution to the advancement of the field, it is also...

10.3389/frsip.2021.808395 article EN cc-by Frontiers in Signal Processing 2022-01-28

In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveforms-to-spectrograms transformation layer, and hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram...

10.1109/access.2020.3019084 article EN cc-by IEEE Access 2020-01-01
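
As a rough illustration of the idea behind nnAudio (not its actual API), the sketch below implements an STFT as a 1D convolution whose kernels are windowed sine and cosine bases, so the waveform-to-spectrogram step runs on the GPU inside the model and can optionally be made trainable; the class name and parameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ConvSTFT(nn.Module):
    """STFT expressed as a Conv1d with windowed (co)sine kernels."""

    def __init__(self, n_fft=2048, hop=512, trainable=False):
        super().__init__()
        n = torch.arange(n_fft).float()
        k = torch.arange(n_fft // 2 + 1).float()
        window = torch.hann_window(n_fft)
        # Real and imaginary Fourier kernels, each shaped (freq_bins, 1, n_fft).
        cos_kernel = (torch.cos(2 * torch.pi * k[:, None] * n[None, :] / n_fft) * window)[:, None, :]
        sin_kernel = (torch.sin(2 * torch.pi * k[:, None] * n[None, :] / n_fft) * window)[:, None, :]
        self.cos = nn.Parameter(cos_kernel, requires_grad=trainable)
        self.sin = nn.Parameter(sin_kernel, requires_grad=trainable)
        self.hop = hop

    def forward(self, wav):                # wav: (batch, samples)
        x = wav.unsqueeze(1)               # (batch, 1, samples)
        real = nn.functional.conv1d(x, self.cos, stride=self.hop)
        imag = nn.functional.conv1d(x, self.sin, stride=self.hop)
        return torch.sqrt(real ** 2 + imag ** 2 + 1e-8)   # magnitude spectrogram


spec_layer = ConvSTFT(trainable=True)              # gradients flow into the kernels
spectrogram = spec_layer(torch.randn(4, 16000))    # e.g. four one-second clips at 16 kHz
```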

In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not for regression, we propose a mechanism that allows them to provide meaningful low-dimensional representations for regression tasks. We then use these new representations as the input for regression algorithms such as support vector machines and gradient boosting machines. To demonstrate the TNNs' effectiveness at creating representations, we compare them to different dimensionality reduction methods on...

10.1109/ijcnn48605.2020.9207212 article EN 2020 International Joint Conference on Neural Networks (IJCNN) 2020-07-01
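
One way to adapt a triplet loss to a regression target, in the spirit described above, is to pick positives and negatives by their distance in label space (e.g. valence ratings). The sketch below is an illustrative assumption, not the paper's exact mechanism; the margin, batching scheme, and function name are placeholders.

```python
import torch
import torch.nn.functional as F


def regression_triplet_loss(emb, labels, margin=0.2):
    """Triplet loss where 'positive' means closer in label (e.g. valence) space.

    emb:    (batch, dim) embeddings from the network; batch size divisible by 3
    labels: (batch,) continuous targets such as valence ratings
    """
    anchor, cand_a, cand_b = emb[0::3], emb[1::3], emb[2::3]
    la, lb_a, lb_b = labels[0::3], labels[1::3], labels[2::3]
    # Whichever candidate is closer to the anchor's label becomes the positive.
    a_is_pos = (la - lb_a).abs() < (la - lb_b).abs()
    pos = torch.where(a_is_pos.unsqueeze(1), cand_a, cand_b)
    neg = torch.where(a_is_pos.unsqueeze(1), cand_b, cand_a)
    d_pos = F.pairwise_distance(anchor, pos)
    d_neg = F.pairwise_distance(anchor, neg)
    return F.relu(d_pos - d_neg + margin).mean()
```

The resulting low-dimensional embeddings can then be fed to downstream regressors such as support vector machines or gradient boosting machines, as the abstract describes.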

Music is capable of conveying many emotions. The level and type of emotion of the music perceived by a listener, however, is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamical valence and arousal ratings of 54 selected full-length songs. It contains music features, as well as user profile information of the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triplet Neural Network...

10.3390/s23010382 article EN cc-by Sensors 2022-12-29

This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU-based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, log-frequency spectrogram, Mel spectrogram, and constant-Q transform (CQT). Our results show that a 8.33% increase in transcription accuracy and a 9.39% reduction in error can be obtained by choosing the appropriate input representation (log-frequency spectrogram with STFT window length 4,096 and 2,048...

10.1109/ijcnn48605.2020.9207605 article EN 2020 International Joint Conference on Neural Networks (IJCNN) 2020-07-01
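
For readers who want to reproduce the comparison informally, the snippet below extracts three of the representations discussed (linear-frequency STFT, Mel spectrogram, CQT) using librosa rather than the paper's GPU-based nnAudio; the log-frequency variant corresponds to remapping STFT bins onto a logarithmic frequency axis, which nnAudio supports directly. The test signal, sample rate, and bin counts are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np
import librosa

sr = 16000
y = librosa.tone(440, sr=sr, duration=5.0)   # placeholder signal instead of a real recording

# Linear-frequency spectrogram (plain STFT magnitude)
linear = np.abs(librosa.stft(y, n_fft=4096, hop_length=512))

# Mel spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=4096, hop_length=512, n_mels=256)

# Constant-Q transform covering roughly the piano range (4 bins per semitone, assumed)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         fmin=librosa.note_to_hz("A0"),
                         n_bins=88 * 4, bins_per_octave=48))
```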

Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not presented in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When combined with existing...

10.1145/3474085.3475405 article EN Proceedings of the 29th ACM International Conference on Multimedia 2021-10-17
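
Virtual adversarial training, one of the two ingredients named above, penalizes the change in a model's prediction caused by the worst-case small perturbation of an (unlabelled) input. A minimal sketch under generic assumptions (a frame-level transcription model returning logits, illustrative hyperparameters) might look like this:

```python
import torch
import torch.nn.functional as F


def vat_loss(model, spec, xi=1e-6, eps=2.0, n_iters=1):
    """Virtual adversarial training loss on an (unlabelled) spectrogram batch."""
    with torch.no_grad():
        pred = torch.sigmoid(model(spec))              # reference frame-level prediction

    d = torch.randn_like(spec)                         # random initial direction
    for _ in range(n_iters):
        d = xi * F.normalize(d.flatten(1), dim=1).view_as(spec)
        d.requires_grad_(True)
        adv_pred = torch.sigmoid(model(spec + d))
        dist = F.binary_cross_entropy(adv_pred, pred)
        d = torch.autograd.grad(dist, d)[0].detach()   # direction of largest change

    r_adv = eps * F.normalize(d.flatten(1), dim=1).view_as(spec)
    adv_pred = torch.sigmoid(model(spec + r_adv))
    return F.binary_cross_entropy(adv_pred, pred)      # smoothness penalty
```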

Most of the state-of-the-art automatic music transcription (AMT) models break down the main transcription task into sub-tasks such as onset prediction and offset prediction and train them with onset and offset labels. These predictions are then concatenated together and used as the input to another model with pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this model can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art accuracy; instead, we explore the effect that spectrogram reconstruction has on...

10.1109/icpr48806.2021.9412155 article EN 2020 25th International Conference on Pattern Recognition (ICPR) 2021-01-10

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same data availability issues. We propose Timbre-Trap, a novel framework which unifies music transcription and audio reconstruction by exploiting the strong...

10.1109/icassp48485.2024.10446141 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic-looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new formulation enables DiffRoll to transcribe, generate, and even inpaint music. Due to its classifier-free nature, DiffRoll is also able to be trained on unpaired datasets where only piano rolls are...

10.1109/icassp49357.2023.10095935 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
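
The classifier-free aspect mentioned above typically amounts to randomly replacing the conditioning spectrogram with a "null" condition during training, so the same network learns both conditional transcription and unconditional piano-roll generation. The following is a schematic DDPM-style training step under that assumption; the noise schedule, masking value, tensor shapes, and model signature are placeholders rather than DiffRoll's actual settings.

```python
import torch
import torch.nn.functional as F


def diffusion_training_step(model, piano_roll, spec, p_uncond=0.1):
    """One denoising-diffusion training step with classifier-free condition dropout."""
    batch = piano_roll.size(0)
    t = torch.randint(0, 1000, (batch,), device=piano_roll.device)     # diffusion step
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2        # toy cosine schedule
    alpha_bar = alpha_bar.view(-1, 1, 1)

    noise = torch.randn_like(piano_roll)
    noisy_roll = alpha_bar.sqrt() * piano_roll + (1 - alpha_bar).sqrt() * noise

    # Classifier-free trick: drop the spectrogram condition for a fraction of samples.
    drop = (torch.rand(batch, device=spec.device) < p_uncond).view(-1, 1, 1)
    cond = torch.where(drop, torch.full_like(spec, -1.0), spec)

    pred_noise = model(noisy_roll, t, cond)        # model predicts the added noise
    return F.mse_loss(pred_noise, noise)
```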

Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on the leverage of deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination of the Onsets-and-Frames AMT model, and pinpoint the essential components contributing to its strong performance. This is done through...

10.1109/ijcnn52387.2021.9533407 article EN 2021 International Joint Conference on Neural Networks (IJCNN) 2021-07-18

Converting time domain waveforms to frequency domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk space to store the different frequency domain representations. This is especially true during the development and tuning process, when exploring various types of spectrograms for optimal performance. Second, if another dataset is used, one must process all the audio clips again before the network can be retrained. In this paper, we integrate...

10.48550/arxiv.1912.12055 preprint EN other-oa arXiv (Cornell University) 2019-01-01
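
The integration described here means the spectrogram becomes the first layer of the network rather than a cached file on disk. A minimal sketch of that pattern, using torchaudio's Mel front-end as a stand-in for the paper's nnAudio layer and arbitrary layer sizes:

```python
import torch
import torch.nn as nn
import torchaudio


class WaveformClassifier(nn.Module):
    """Consumes raw waveforms; spectrograms are computed on the fly, per batch."""

    def __init__(self, n_classes=10, sr=16000):
        super().__init__()
        # Front-end layer: no spectrograms are ever written to disk.
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=1024, hop_length=256, n_mels=128)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, wav):                       # wav: (batch, samples)
        spec = self.frontend(wav).unsqueeze(1)    # (batch, 1, mels, frames)
        return self.backbone(spec)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = WaveformClassifier().to(device)
logits = model(torch.randn(8, 16000, device=device))   # one-second clips at 16 kHz
```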

We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the i-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to its high classification accuracy and relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to build a latent space for different classifiers to solve the Multi-Target Speaker Detection and Identification Challenge Evaluation...

10.1109/asru46091.2019.9003922 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01
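
Once a triplet network has produced utterance embeddings, any off-the-shelf classifier, or a simple cosine score against speaker centroids, can operate in that latent space. The snippet below illustrates the pattern with random placeholder embeddings; the specific classifier and scoring rule are assumptions, since the paper evaluates several options.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

# Placeholder embeddings standing in for trained triplet-network outputs (one row per utterance).
train_emb = normalize(np.random.randn(200, 128))
train_spk = np.random.randint(0, 10, size=200)     # speaker labels
test_emb = normalize(np.random.randn(20, 128))

# Any off-the-shelf classifier can operate in the learned latent space.
clf = LogisticRegression(max_iter=1000).fit(train_emb, train_spk)
pred = clf.predict(test_emb)

# Alternatively, open-set verification by cosine scoring against speaker centroids.
centroids = np.stack([train_emb[train_spk == s].mean(0) for s in range(10)])
scores = test_emb @ normalize(centroids).T          # higher score = more likely same speaker
```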

Disentangling factors of variation aims to uncover latent variables that underlie the process of data generation. In this paper, we propose a framework that achieves unsupervised pitch and timbre disentanglement for isolated musical instrument sounds without relying on annotations or pre-trained neural networks. Our framework, based on variational auto-encoders, takes a spectral frame as input, and encodes pitch and timbre into categorical and continuous latent variables, respectively. The input is then reconstructed by combining those latent variables....

10.5281/zenodo.4245532 article EN International Symposium/Conference on Music Information Retrieval 2020-10-11
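
A compact way to realize "categorical latent for pitch, continuous latent for timbre" in a VAE is a Gumbel-softmax branch next to a Gaussian reparameterized branch. The sketch below is a generic illustration of that structure with arbitrary layer sizes, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PitchTimbreVAE(nn.Module):
    """Sketch: categorical latent for pitch, Gaussian latent for timbre."""

    def __init__(self, n_bins=1025, n_pitches=88, timbre_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, 512), nn.ReLU())
        self.pitch_logits = nn.Linear(512, n_pitches)     # categorical latent
        self.timbre_mu = nn.Linear(512, timbre_dim)       # continuous latent
        self.timbre_logvar = nn.Linear(512, timbre_dim)
        self.dec = nn.Sequential(nn.Linear(n_pitches + timbre_dim, 512),
                                 nn.ReLU(), nn.Linear(512, n_bins))

    def forward(self, frame, tau=0.5):
        h = self.enc(frame)
        pitch = F.gumbel_softmax(self.pitch_logits(h), tau=tau, hard=False)
        mu, logvar = self.timbre_mu(h), self.timbre_logvar(h)
        timbre = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        recon = self.dec(torch.cat([pitch, timbre], dim=-1))
        return recon, pitch, mu, logvar


model = PitchTimbreVAE()
recon, pitch, mu, logvar = model(torch.randn(32, 1025))   # a batch of spectral frames
```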

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for explicit multi-instrument functionality, while the connection between the modules is for better performance. Our challenging...

10.48550/arxiv.2206.10805 preprint EN cc-by arXiv (Cornell University) 2022-01-01
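
One common way to let an instrument-recognition output condition a transcription module is feature-wise modulation (FiLM-style scale and shift). The sketch below shows that pattern with made-up dimensions; it is not a claim about Jointist's exact conditioning mechanism.

```python
import torch
import torch.nn as nn


class InstrumentConditionedTranscriber(nn.Module):
    """Sketch: transcription features modulated by an instrument-condition vector."""

    def __init__(self, n_instruments=39, feat_dim=256, n_pitches=88):
        super().__init__()
        self.film = nn.Linear(n_instruments, 2 * feat_dim)    # produces scale and shift
        self.frame_encoder = nn.GRU(229, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, n_pitches)

    def forward(self, spec_frames, instrument_onehot):
        # spec_frames: (batch, time, 229), instrument_onehot: (batch, n_instruments)
        h, _ = self.frame_encoder(spec_frames)
        scale, shift = self.film(instrument_onehot).chunk(2, dim=-1)
        h = h * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)   # per-instrument modulation
        return torch.sigmoid(self.head(h))                      # instrument-specific piano roll


roll = InstrumentConditionedTranscriber()(torch.randn(2, 100, 229), torch.eye(39)[:2])
```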

This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite its SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced...

10.48550/arxiv.2403.10024 preprint EN arXiv (Cornell University) 2024-03-15

Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work done by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based...

10.48550/arxiv.2406.15751 preprint EN arXiv (Cornell University) 2024-06-22
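
The unpaired setting referenced above generally reduces to an adversarial objective: a generator renders clean guitar, and a discriminator separates its output from real processed recordings, with no aligned pairs required. A toy training step under that assumption (tiny networks and random tensors standing in for audio) could look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy generator (clean guitar -> "amped" audio) and discriminator (real vs. generated).
G = nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.Tanh(),
                  nn.Conv1d(16, 1, 15, padding=7))
D = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.LeakyReLU(0.2),
                  nn.Conv1d(16, 1, 15, stride=4))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

clean = torch.randn(4, 1, 16384)   # unprocessed guitar clips
amped = torch.randn(4, 1, 16384)   # real processed recordings, NOT aligned with `clean`

# Discriminator step: real processed audio vs. generator output.
real_logits, fake_logits = D(amped), D(G(clean).detach())
d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
          + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator.
fake_logits = D(G(clean))
g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```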

Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are presented. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of the constituent...

10.48550/arxiv.2408.10807 preprint EN arXiv (Cornell University) 2024-08-20

Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of the rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows more efficient coding by adapting the number of codebooks used per frame. Furthermore, we propose a gradient estimation method...

10.48550/arxiv.2410.06016 preprint EN arXiv (Cornell University) 2024-10-08
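
The core idea of adapting the number of codebooks per frame can be sketched as a residual VQ loop in which a per-frame mask switches later quantization stages on or off. The straight-through trick below is a generic stand-in for the gradient estimation method the paper proposes, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VariableRVQ(nn.Module):
    """Sketch: residual VQ where each frame may use a different number of codebooks."""

    def __init__(self, n_codebooks=8, codebook_size=1024, dim=64):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, z, n_active):
        # z: (batch, time, dim); n_active: (batch, time) number of codebooks per frame
        residual, quantized = z, torch.zeros_like(z)
        for i, cb in enumerate(self.codebooks):
            dists = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(-1)   # (b, t, codebook)
            codes = cb(dists.argmin(-1))                                  # nearest entries
            mask = (n_active > i).unsqueeze(-1).float()                   # use this level or not
            quantized = quantized + codes * mask
            residual = residual - codes * mask
        # Straight-through estimator so gradients reach the encoder.
        return z + (quantized - z).detach()


rvq = VariableRVQ()
z = torch.randn(2, 50, 64)
n_active = torch.randint(1, 9, (2, 50))   # e.g. predicted from an importance map
out = rvq(z, n_active)
```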

Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, one model is designated as the source model to map the input audio to its corresponding Gaussian prior, and another as the target model to reconstruct the target audio from that prior, thereby...

10.48550/arxiv.2409.06096 preprint EN arXiv (Cornell University) 2024-09-09

We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging these hierarchical features, SoniDo constrains the information granularity, leading to improved performance across tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, transcription, source separation, and mixing. Our...

10.48550/arxiv.2411.01135 preprint EN arXiv (Cornell University) 2024-11-02
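
A typical way to exploit such intermediate representations downstream is to freeze the foundation model and train a lightweight probe on its features. The sketch below uses a random tensor as a hypothetical stand-in for frozen SoniDo activations and a simple music-tagging head; feature dimensions and pooling are assumptions.

```python
import torch
import torch.nn as nn


class Probe(nn.Module):
    """Linear probe over intermediate foundation-model features for a downstream task."""

    def __init__(self, feat_dim=768, n_tags=50):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_tags)

    def forward(self, features):              # features: (batch, time, feat_dim)
        pooled = features.mean(dim=1)          # temporal average pooling
        return self.head(pooled)               # e.g. music-tagging logits


with torch.no_grad():
    feats = torch.randn(4, 200, 768)           # placeholder for frozen foundation-model features
logits = Probe()(feats)
```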