- Music and Audio Processing
- Speech and Audio Processing
- Music Technology and Sound Studies
- Speech Recognition and Synthesis
- Neuroscience and Music Perception
- Diverse Musicological Studies
- Neural Networks and Applications
- Model-Driven Software Engineering Techniques
- Advanced Data Compression Techniques
- Blind Source Separation Techniques
- Digital Filter Design and Implementation
- Image and Signal Denoising Methods
- Network Traffic and Congestion Control
- Computer Graphics and Visualization Techniques
- Emotion and Mood Recognition
- Advanced Adaptive Filtering Techniques
- Caching and Content Delivery
- Peer-to-Peer Network Technologies
Sony Corporation (United States)
2024-2025
Dexerials (Japan)
2025
Agency for Science, Technology and Research
2020-2023
Singapore University of Technology and Design
2020-2023
Institute of High Performance Computing
2019-2021
Hong Kong University of Science and Technology
2004
University of Hong Kong
2004
Music source separation has been intensively studied in the last decade, and tremendous progress has been observed with the advent of deep learning. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models with their corresponding papers, which can help researchers integrate best practices into their own models. In recent years, the widely used MUSDB18 dataset has played an important role in measuring the performance of music source separation. While it has made a considerable contribution to the advancement of the field, it is also...
In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional networks to perform time domain to frequency domain conversion. Its fast speed allows on-the-fly spectrogram extraction, without the need to store any spectrograms on disk. Moreover, the approach also supports back-propagation through the waveform-to-spectrogram transformation layer, and hence the conversion process can be made trainable, further optimizing the waveform-to-spectrogram...
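The core idea, computing the short-time Fourier transform as a 1D convolution so that it runs on the GPU and stays differentiable, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the nnAudio implementation itself; nnAudio's kernel construction, windowing options, and normalization are more elaborate.

```python
import numpy as np
import torch
import torch.nn.functional as F

def stft_conv1d(waveform, n_fft=2048, hop=512):
    """Magnitude spectrogram via 1D convolution with Fourier-basis kernels.

    waveform: (batch, samples) float tensor. Returns (batch, n_fft//2 + 1, frames).
    """
    n_bins = n_fft // 2 + 1
    t = np.arange(n_fft)
    window = np.hanning(n_fft)
    # One convolution kernel per frequency bin: real and imaginary parts.
    real = np.array([window * np.cos(2 * np.pi * k * t / n_fft) for k in range(n_bins)])
    imag = np.array([window * -np.sin(2 * np.pi * k * t / n_fft) for k in range(n_bins)])
    kernels = torch.tensor(np.concatenate([real, imag]), dtype=torch.float32).unsqueeze(1)

    x = waveform.unsqueeze(1)                 # (batch, 1, samples)
    y = F.conv1d(x, kernels, stride=hop)      # (batch, 2 * n_bins, frames)
    re, im = y[:, :n_bins], y[:, n_bins:]
    return torch.sqrt(re ** 2 + im ** 2)      # magnitude; differentiable end to end

spec = stft_conv1d(torch.randn(4, 16000))     # e.g. four one-second clips at 16 kHz
```

Because the kernels are ordinary convolution weights, gradients can flow through them, which is what makes the transformation layer trainable.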
In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not regression, we propose a mechanism that allows them to provide meaningful low-dimensional representations for regression tasks. We then use these new representations as the input to algorithms such as support vector machines and gradient boosting machines. To demonstrate the TNNs' effectiveness at creating such representations, we compare different dimensionality reduction methods on...
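A minimal sketch of the idea: an embedding network trained with a triplet margin loss, whose low-dimensional outputs are then fed to a downstream regressor. The network shape, margin, and the use of a gradient boosting regressor below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Small embedding network mapping input features to a low-dimensional space.
embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8))
triplet_loss = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

def training_step(anchor, positive, negative):
    """Anchor/positive share a similar emotion rating; negative differs."""
    loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# After training, the 8-dimensional embeddings become inputs to a regressor,
# e.g. a gradient boosting machine predicting valence/arousal:
#   from sklearn.ensemble import GradientBoostingRegressor
#   reg = GradientBoostingRegressor().fit(embed(X).detach().numpy(), y)
```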
Music is capable of conveying many emotions. The level and type of emotion in the music perceived by a listener, however, is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamic valence and arousal ratings of 54 selected full-length songs. It contains features of the songs, as well as user profile information of the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triplet Neural Network...
This paper thoroughly analyses the effect of different input representations on polyphonic multi-instrument music transcription. We use our own GPU-based spectrogram extraction tool, nnAudio, to investigate the influence of using a linear-frequency spectrogram, a log-frequency spectrogram, a Mel spectrogram, and the constant-Q transform (CQT). Our results show that an 8.33% increase in transcription accuracy and a 9.39% reduction in error can be obtained by choosing the appropriate input representation (a log-frequency spectrogram with an STFT window length of 4,096 and 2,048...
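For reference, the families of input representations being compared can be computed as follows. This is shown with librosa for brevity and with illustrative parameters; the paper itself extracts the representations on the GPU with nnAudio.

```python
import librosa
import numpy as np

# Two seconds of a 440 Hz test tone as a stand-in for a real recording.
sr = 44100
y = librosa.tone(440.0, sr=sr, duration=2.0)

# Linear-frequency spectrogram (plain STFT magnitude).
linear = np.abs(librosa.stft(y, n_fft=4096, hop_length=512))

# Mel spectrogram (perceptually motivated frequency scale).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=4096, hop_length=512, n_mels=256)

# Constant-Q transform (log-frequency bins aligned with musical pitch).
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, bins_per_octave=24, n_bins=176))
```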
Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses a reconstruction loss and virtual adversarial training. When combined with existing...
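Virtual adversarial training adds a smoothness term on unlabelled data: the model's prediction should not change under a small, worst-case perturbation of the input. Below is a condensed PyTorch sketch of that term with a single power iteration and a binary cross-entropy distance, both assumptions for illustration; ReconVAT's full objective also includes the spectrogram reconstruction loss.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=1.0):
    """Virtual adversarial loss on an (unlabelled) batch of spectrograms x."""
    with torch.no_grad():
        pred = torch.sigmoid(model(x))            # reference frame-level activations

    # Power iteration: find the direction that changes the prediction the most.
    d = torch.randn_like(x, requires_grad=True)
    pred_hat = torch.sigmoid(model(x + xi * d))
    adv_dist = F.binary_cross_entropy(pred_hat, pred)
    grad = torch.autograd.grad(adv_dist, d)[0]
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)

    # Penalise the change in prediction under the adversarial perturbation.
    pred_adv = torch.sigmoid(model(x + r_adv))
    return F.binary_cross_entropy(pred_adv, pred)
```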
Most of the state-of-the-art automatic music transcription (AMT) models break down the main task into sub-tasks such as onset prediction and offset prediction and train them with dedicated labels. These predictions are then concatenated together and used as the input to another model trained with pitch labels to obtain the final transcription. We attempt to use only the pitch labels (together with a spectrogram reconstruction loss) and explore how far this can go without introducing supervised sub-tasks. In this paper, we do not aim at achieving state-of-the-art accuracy; instead, we explore the effect that spectrogram reconstruction has on...
In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. With the lack of availability of diverse datasets, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same issues. We propose Timbre-Trap, a novel framework which unifies music transcription and audio reconstruction by exploiting the strong...
In this paper we propose a novel generative approach, DiffRoll, to tackle automatic music transcription (AMT). Instead of treating AMT as a discriminative task in which the model is trained to convert spectrograms into piano rolls, we think of it as a conditional generative task where we train our model to generate realistic-looking piano rolls from pure Gaussian noise conditioned on spectrograms. This new formulation enables DiffRoll to transcribe, generate, and even inpaint music. Due to its classifier-free nature, it is also able to be trained on unpaired datasets where only piano rolls are...
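At its core this is a standard denoising-diffusion training step, except that the quantity being noised and denoised is the piano roll, while the spectrogram enters only as conditioning. The sketch below uses an epsilon-prediction objective, a linear noise schedule, and a hypothetical `model(noisy_roll, t, spec)` signature; the actual DiffRoll architecture and schedule differ.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_step(model, roll, spec):
    """One training step: noise the piano roll, predict the noise given the spectrogram.

    roll: (batch, 1, pitch, time) piano roll; spec: matching spectrogram conditioning.
    """
    b = roll.shape[0]
    t = torch.randint(0, T, (b,))
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(roll)
    noisy_roll = a.sqrt() * roll + (1 - a).sqrt() * noise   # forward (noising) process
    # For classifier-free training on unpaired data, spec can be replaced by a
    # learned "null" conditioning with some probability.
    pred_noise = model(noisy_roll, t, spec)                 # conditioned on the spectrogram
    return F.mse_loss(pred_noise, noise)
```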
Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on leveraging deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In this paper, we conduct a comprehensive examination of the Onsets-and-Frames AMT model and pinpoint the essential components contributing to its strong performance. This is achieved through...
Converting time domain waveforms to frequency domain spectrograms is typically considered to be a preprocessing step done before model training. This approach, however, has several drawbacks. First, it takes a lot of hard disk space to store different frequency domain representations. This is especially true during the development and tuning process, when exploring various types of spectrograms for optimal performance. Second, if another dataset is used, one must process all the audio clips again before the network can be retrained. In this paper, we integrate...
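Integrating the conversion means making the waveform-to-spectrogram transformation the first layer of the network, so raw audio goes in and no spectrograms ever touch the disk. The sketch below uses torchaudio's MelSpectrogram as a stand-in front-end (the paper uses nnAudio's trainable layers), and the backbone layer sizes are placeholders.

```python
import torch
import torch.nn as nn
import torchaudio

class End2EndTagger(nn.Module):
    """Raw waveform in, prediction out: the spectrogram layer is part of the model."""
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        # Differentiable waveform-to-spectrogram front-end, computed on the fly (GPU-capable).
        self.frontend = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=256, n_mels=n_mels)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, waveform):              # waveform: (batch, samples)
        spec = self.frontend(waveform)        # (batch, n_mels, frames), no disk I/O
        return self.backbone(spec.unsqueeze(1).log1p())

model = End2EndTagger()
logits = model(torch.randn(2, 16000))
```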
We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the i-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to its high classification accuracy and relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to build a latent space for different classifiers on the Multi-Target Speaker Detection and Identification Challenge Evaluation...
Disentangling factors of variation aims to uncover latent variables that underlie the process of data generation. In this paper, we propose a framework that achieves unsupervised pitch and timbre disentanglement for isolated musical instrument sounds without relying on annotations or pre-trained neural networks. Our framework, based on variational auto-encoders, takes a spectral frame as input and encodes pitch and timbre into categorical and continuous latent variables, respectively. The input is then reconstructed by combining those variables....
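A minimal sketch of such an architecture, assuming a Gumbel-softmax relaxation for the categorical (pitch) latent and the usual Gaussian reparameterisation for the continuous (timbre) latent; layer sizes and latent dimensions are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PitchTimbreVAE(nn.Module):
    """Sketch: categorical latent for pitch, continuous Gaussian latent for timbre."""
    def __init__(self, n_freq=513, n_pitch=88, timbre_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU())
        self.to_pitch = nn.Linear(256, n_pitch)          # categorical logits
        self.to_mu = nn.Linear(256, timbre_dim)          # continuous latent mean
        self.to_logvar = nn.Linear(256, timbre_dim)      # continuous latent log-variance
        self.dec = nn.Sequential(nn.Linear(n_pitch + timbre_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq))

    def forward(self, frame, tau=0.5):
        h = self.enc(frame)
        pitch = F.gumbel_softmax(self.to_pitch(h), tau=tau)        # differentiable one-hot
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        timbre = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        recon = self.dec(torch.cat([pitch, timbre], dim=-1))       # rebuild the spectral frame
        return recon, pitch, mu, logvar
```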
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes the instrument information and transcription results. The conditioning is designed for explicit functionality, while the connection between modules is for better performance. Our challenging...
This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite its SOTA performance, MT3 has the issue of instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced...
Recent years have seen increasing interest in applying deep learning methods to the modeling of guitar amplifiers or effect pedals. Existing methods are mainly based on the supervised approach, requiring temporally-aligned data pairs of unprocessed and rendered audio. However, this approach does not scale well, due to the complicated process involved in creating the data pairs. A very recent work done by Wright et al. has explored the potential of leveraging unpaired data for training, using a generative adversarial network (GAN)-based...
Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding the cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and whose collection forms a set of per-instrument latents underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of the constituent...
Recent state-of-the-art neural audio compression models have progressively adopted residual vector quantization (RVQ). Despite this success, these models employ a fixed number of codebooks per frame, which can be suboptimal in terms of the rate-distortion tradeoff, particularly in scenarios with simple input audio, such as silence. To address this limitation, we propose variable bitrate RVQ (VRVQ) for audio codecs, which allows more efficient coding by adapting the number of codebooks used per frame. Furthermore, we introduce a gradient estimation method...
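The essential mechanism can be sketched as a residual VQ loop with a per-frame budget mask: frames that need few codebooks (e.g. silence) skip the later quantization stages. This toy version uses plain nearest-neighbour codebook lookup and omits the paper's gradient estimation for the non-differentiable masking, as well as commitment losses.

```python
import torch
import torch.nn as nn

class VariableRVQ(nn.Module):
    """Residual VQ where each frame may use a different number of codebooks (sketch)."""
    def __init__(self, dim=64, codebook_size=1024, n_codebooks=8):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, z, n_active):
        """z: (batch, frames, dim); n_active: (batch, frames) codebooks to use per frame."""
        residual, quantized = z, torch.zeros_like(z)
        for i, cb in enumerate(self.codebooks):
            # Nearest codebook entry for every frame at this stage.
            book = cb.weight.unsqueeze(0).expand(z.size(0), -1, -1)
            q = cb(torch.cdist(residual, book).argmin(dim=-1))
            # Frames whose codebook budget is exhausted skip the remaining stages.
            mask = (n_active > i).unsqueeze(-1).float()
            quantized = quantized + mask * q
            residual = residual - mask * q
        return quantized
```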
Music timbre transfer is a challenging task that involves modifying the timbral characteristics of an audio signal while preserving its melodic structure. In this paper, we propose a novel method based on dual diffusion bridges, trained using the CocoChorales Dataset, which consists of unpaired monophonic single-instrument audio data. Each diffusion model is trained on a specific instrument with a Gaussian prior. During inference, one model is designated as the source model to map the input audio to its corresponding prior, and another as the target model to reconstruct the audio from that prior, thereby...
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging these hierarchical features, SoniDo constrains the information granularity, leading to improved performance across tasks, including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, transcription, source separation, and mixing. Our...