Gus Xia

ORCID: 0000-0003-3629-906X
Research Areas
  • Music and Audio Processing
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception
  • Speech Recognition and Synthesis
  • Diverse Musicological Studies
  • Generative Adversarial Networks and Image Synthesis
  • Speech and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Tactile and Sensory Interactions
  • Teleoperation and Haptic Systems
  • Human Motion and Animation
  • Aesthetic Perception and Analysis
  • Scientific Computing and Data Management
  • Machine Learning in Materials Science
  • Data Visualization and Analytics
  • Explainable Artificial Intelligence (XAI)
  • Computer Graphics and Visualization Techniques
  • Model Reduction and Neural Networks
  • Social Robot Interaction and HRI
  • Modular Robots and Swarm Intelligence
  • Multi-Agent Systems and Negotiation
  • Advanced Image and Video Retrieval Techniques
  • Hand Gesture Recognition Systems
  • Multimodal Machine Learning Applications

New York University Shanghai
2019-2025

Mohamed bin Zayed University of Artificial Intelligence
2023-2024

New York University
2018-2024

Carnegie Mellon University
2019

Structure awareness and interpretability are two of the most desired properties of music generation algorithms. Structure-aware models generate more natural and coherent music with long-term dependencies, while interpretable models are more friendly for human-computer interaction and co-creation. To achieve these goals simultaneously, we designed Transformer Variational AutoEncoder, a hierarchical model that unifies the efforts of two recent breakthroughs in deep music generation: 1) the Music Transformer and 2) Deep Music Analogy. The former learns long-term dependencies...

10.1109/icassp40776.2020.9054554 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
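
A minimal sketch of the hierarchical Transformer-VAE idea described above, written in PyTorch; the layer sizes, pooling strategy, and vocabulary are illustrative assumptions rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TinyTransformerVAE(nn.Module):
    def __init__(self, vocab_size=130, d_model=128, z_dim=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_mu = nn.Linear(d_model, z_dim)
        self.to_logvar = nn.Linear(d_model, z_dim)
        self.from_z = nn.Linear(z_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq)
        h = self.encoder(self.embed(tokens))       # contextualised note states
        pooled = h.mean(dim=1)                     # one summary vector per bar
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        dec_in = self.embed(tokens) + self.from_z(z).unsqueeze(1)
        return self.out(self.decoder(dec_in)), mu, logvar

x = torch.randint(0, 130, (2, 32))                 # two dummy 32-token bars
logits, mu, logvar = TinyTransformerVAE()(x)
print(logits.shape)                                # torch.Size([2, 32, 130])
```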

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding...

10.48550/arxiv.2306.00107 preprint EN cc-by-sa arXiv (Cornell University) 2023-01-01
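
A rough illustration of the masked-prediction objective that acoustic SSL typically relies on; the feature dimensions, masking ratio, and chroma-style target below are assumptions for illustration, not the MERT recipe:

```python
import torch
import torch.nn as nn

frames, feat_dim, chroma_dim = 100, 64, 12
features = torch.randn(1, frames, feat_dim)        # stand-in acoustic features
chroma_target = torch.rand(1, frames, chroma_dim)  # stand-in tonal target

mask = torch.rand(1, frames) < 0.3                 # mask ~30% of the frames
masked = features.clone()
masked[mask] = 0.0                                 # zero out masked frames

encoder = nn.Sequential(                           # tiny frame-wise encoder
    nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, chroma_dim)
)
pred = encoder(masked)
loss = nn.functional.mse_loss(pred[mask], chroma_target[mask])  # only masked frames
loss.backward()
print(float(loss))
```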

Music arrangement generation is a subtask of automatic music generation, which involves reconstructing and re-conceptualizing a piece with new compositional techniques. Such a process inevitably requires reference from the original melody, chord progression, or other structural information. Despite some promising models for arrangement, they lack more refined data to achieve better evaluations and practical results. In this paper, we propose POP909, a dataset which contains multiple versions of piano...

10.48550/arxiv.2008.07142 preprint EN other-oa arXiv (Cornell University) 2020-01-01

This first workshop on explainable AI for the Arts (XAIxArts) brings together a community of researchers and creative practitioners in Human-Computer Interaction (HCI), Design, AI, explainable AI (XAI), and Digital Arts to explore the role of XAI for the Arts. Explainability is a core concern of Human-Centred AI and relies heavily on HCI techniques to explore how complex and difficult-to-understand models such as deep learning can be made more understandable to people. However, this research has primarily focused on work-oriented and task-oriented explanations, and there has been little work in creative domains. The workshop will:...

10.1145/3591196.3593517 article EN Creativity and Cognition 2023-06-18

With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. In particular, we explore the effect of explicit architectural encoding of musical structure via comparing...

10.1109/mmrp.2019.8665362 article EN 2019-01-01

Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic...

10.48550/arxiv.1906.03626 preprint EN other-oa arXiv (Cornell University) 2019-01-01

10.1109/icassp49660.2025.10888267 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

This second workshop on explainable AI for the Arts (XAIxArts) brings together a community of researchers and creative practitioners in Human-Computer Interaction (HCI), Design, AI, explainable AI (XAI), and Digital Arts to explore the role of XAI for the Arts. Explainability is a core concern of Human-Centred AI and relies heavily on HCI techniques to explore how to make complex and difficult-to-understand models more understandable to people. Our first workshop explored the landscape of XAIxArts and identified emergent themes. To move the discourse forward and contribute more broadly, this workshop will: i) bring together and expand...

10.1145/3635636.3660763 article EN Creativity and Cognition 2024-06-22

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented...

10.48550/arxiv.2502.10362 preprint EN arXiv (Cornell University) 2025-02-14
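
A small sketch of the contrastive alignment idea (a CLIP-style symmetric InfoNCE loss over a shared embedding space); the encoders, dimensions, and temperature below are placeholder assumptions, not CLaMP 3's components:

```python
import torch
import torch.nn.functional as F

batch, music_dim, text_dim, joint_dim = 8, 256, 384, 128
music_feats = torch.randn(batch, music_dim)       # e.g. pooled audio features
text_feats = torch.randn(batch, text_dim)         # e.g. pooled multilingual text

music_proj = torch.nn.Linear(music_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

m = F.normalize(music_proj(music_feats), dim=-1)  # unit-norm joint embeddings
t = F.normalize(text_proj(text_feats), dim=-1)
logits = m @ t.T / 0.07                           # cosine similarity / temperature
labels = torch.arange(batch)                      # i-th music matches i-th text
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()
print(float(loss))
```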

Led by the success of neural style transfer on visual arts, there has been a rising trend very recently in the effort of music style transfer. However, "music style" is not yet a well-defined concept from a scientific point of view. The difficulty lies in the intrinsic multi-level and multi-modal character of music representation (which is different from image representation). As a result, depending on their interpretation of "music style", current studies under the category of "music style transfer" are actually solving completely different problems that belong to a variety...

10.48550/arxiv.1803.06841 preprint EN cc-by arXiv (Cornell University) 2018-01-01

With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. In particular, we explore the effect of explicit architectural encoding of musical structure via comparing...

10.1109/mmrp.2019.00022 preprint EN 2019-01-01

The dominant approach for music representation learning involves the deep unsupervised model family variational autoencoder (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. Normally composed of richer modality and more complex musical structures, the polyphonic counterpart has yet to be addressed in the context of representation learning. In this work, we propose PianoTree VAE, a novel tree-structure extension upon VAE aiming to fit polyphonic music learning. The experiments prove the validity via...

10.48550/arxiv.2008.07118 preprint EN other-oa arXiv (Cornell University) 2020-01-01
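
A toy sketch of tree-structured encoding for polyphonic music, summarising the simultaneous notes within each time step and then the sequence of step summaries; the dimensions and note features are assumptions, not PianoTree VAE's design:

```python
import torch
import torch.nn as nn

steps, max_notes, note_dim, hid = 16, 6, 8, 32
segment = torch.randn(steps, max_notes, note_dim)  # notes grouped by time step

note_gru = nn.GRU(note_dim, hid, batch_first=True)  # summarise notes in a step
step_gru = nn.GRU(hid, hid, batch_first=True)        # summarise steps in a segment

_, step_summaries = note_gru(segment)              # (1, steps, hid) per-step codes
_, segment_summary = step_gru(step_summaries)      # single code for the segment
print(segment_summary.shape)                       # torch.Size([1, 1, 32])
```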

While deep generative models have become the leading methods for algorithmic composition, it remains a challenging problem to control the generation process because the latent variables of most deep-learning models lack good interpretability. Inspired by the content-style disentanglement idea, we design a novel architecture, under the VAE framework, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. The current model focuses on learning 8-beat long piano composition segments. We...

10.48550/arxiv.2008.07122 preprint EN other-oa arXiv (Cornell University) 2020-01-01
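
A minimal sketch of the two-factor disentanglement-and-recombination idea: separate encoders produce a "chord" latent and a "texture" latent, and a single decoder reconstructs from their concatenation, so swapping one latent between segments recombines content and style. The linear encoders and sizes are illustrative assumptions, not the paper's VAE:

```python
import torch
import torch.nn as nn

seg_dim, z_dim = 128 * 8, 64                      # flattened 8-beat piano-roll
enc_chord = nn.Linear(seg_dim, z_dim)             # stand-in chord encoder
enc_texture = nn.Linear(seg_dim, z_dim)           # stand-in texture encoder
decoder = nn.Linear(2 * z_dim, seg_dim)

a, b = torch.randn(1, seg_dim), torch.randn(1, seg_dim)
recombined = decoder(torch.cat([enc_chord(a), enc_texture(b)], dim=-1))
print(recombined.shape)                           # chords of a, texture of b
```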

In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are...

10.48550/arxiv.2408.14340 preprint EN arXiv (Cornell University) 2024-08-26

Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic...

10.5281/zenodo.3527880 article EN arXiv (Cornell University) 2019-11-04

In this paper, we propose Calliffusion, a system for generating high-quality Chinese calligraphy using diffusion models. Our model architecture is based on DDPM (Denoising Diffusion Probabilistic Models), and it is capable of generating common characters in five different scripts and mimicking the styles of famous calligraphers. Experiments demonstrate that our model can generate calligraphy that is difficult to distinguish from real artworks and that the controls of characters, scripts, and styles are effective. Moreover, with one-shot transfer learning, LoRA...

10.48550/arxiv.2305.19124 preprint EN cc-by-nc-nd arXiv (Cornell University) 2023-01-01
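
A compact sketch of a DDPM training step (forward noising plus noise prediction); the noise schedule and the toy denoiser are illustrative assumptions, not Calliffusion's configuration:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal retention

x0 = torch.randn(1, 1, 32, 32)                     # stand-in glyph image
t = torch.randint(0, T, (1,))                      # random diffusion timestep
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion

denoiser = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # toy epsilon-predictor
loss = nn.functional.mse_loss(denoiser(x_t), noise)    # predict the added noise
loss.backward()
print(float(loss))
```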

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and instrument, while maintaining other aspects unchanged. Our method transforms text editing to latent space manipulation while adding an extra constraint to enforce...

10.48550/arxiv.2402.06178 preprint EN arXiv (Cornell University) 2024-02-08
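
A simple sketch of editing as latent-space manipulation: the latent of the original generation is shifted along the direction between source and target text embeddings, with a norm bound keeping the edit close to the original. The embeddings and bound are illustrative assumptions; the paper's diffusion-latent procedure is more involved:

```python
import torch
import torch.nn.functional as F

latent = torch.randn(64)                           # latent of the generated music
src_text = torch.randn(64)                         # e.g. embedding of "piano ballad"
tgt_text = torch.randn(64)                         # e.g. embedding of "guitar ballad"

direction = F.normalize(tgt_text - src_text, dim=0)
strength, max_shift = 1.5, 2.0
shift = direction * min(strength, max_shift)       # constrain the edit magnitude
edited = latent + shift                            # other aspects stay nearby
print(torch.norm(edited - latent))                 # edit size respects the bound
```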

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, the task of editing the generated music remains a significant challenge. This paper introduces a novel approach to edit music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while maintaining other aspects unchanged. Our method transforms text editing to latent space manipulation, and adds an additional constraint to enforce consistency. It seamlessly integrates with...

10.24963/ijcai.2024/864 article EN 2024-07-26

10.5281/zenodo.4245366 article EN International Symposium/Conference on Music Information Retrieval 2020-10-11

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on a text-compatible music representation, ABC notation, with the music treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal...

10.48550/arxiv.2402.16153 preprint EN arXiv (Cornell University) 2024-02-25
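
A hedged sketch of prompting a text-only LLM with ABC notation, the idea behind treating music as a second language; the Hugging Face model id below is an assumption about the released checkpoint, and any causal LM checkpoint would run the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/ChatMusician"                    # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Continue this folk tune in ABC notation:\n"
    "X:1\nM:4/4\nK:G\n|: G2 B2 d2 B2 | c2 A2 F2 D2 |"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)   # plain text tokenizer only
print(tokenizer.decode(output[0], skip_special_tokens=True))
```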

This paper presents Herrmann-1, a multimodal framework to generate background music tailored to movie scenes, by integrating state-of-the-art vision, language, music, and speech processing models. Our pipeline begins by extracting visual information from the scene and performing emotional analysis on it, converting these into descriptive texts. Then, GPT-4 translates the high-level descriptions into low-level...

10.1109/icassp48485.2024.10447950 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
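
A hedged skeleton of the scene-to-music pipeline; every helper below is a hypothetical stand-in for the vision, emotion, language, and music models the paper chains together, and only the ordering of the stages follows the abstract:

```python
def describe_frames(frames):                       # stand-in for a vision model
    return "rain-soaked street at night, slow camera pan"

def classify_emotion(frames):                      # stand-in for emotional analysis
    return "melancholic"

def text_to_music_prompt(caption, emotion):        # stand-in for the LLM step
    return f"{emotion} ambient score, sparse piano, 60 bpm, for: {caption}"

def generate_music(prompt):                        # stand-in for text-to-music
    return b""                                     # would return audio bytes

def score_scene(frames):
    caption = describe_frames(frames)
    emotion = classify_emotion(frames)
    return generate_music(text_to_music_prompt(caption, emotion))

print(text_to_music_prompt(describe_frames([]), classify_emotion([])))
```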

This paper introduces M2M Gen, a multimodal framework for generating background music tailored to Japanese manga. The key challenges in this task are the lack of an available dataset or baseline. To address these challenges, we propose an automated music generation pipeline that produces background music for an input manga book. Initially, we use the dialogues in a scene to detect scene boundaries and perform emotion classification using the characters' faces within a scene. Then, we use GPT4o to translate this low-level scene information into a high-level music directive. Conditioned...

10.48550/arxiv.2410.09928 preprint EN arXiv (Cornell University) 2024-10-13

Traditional instrument learning is time-consuming. It begins with learning music notation and necessitates layers of sophistication and abstraction. Haptic interfaces open another door to the music world for the vast majority of beginners when traditional training methods are not effective. However, existing haptic interfaces can only deal with specially designed pieces with great restrictions on performance duration and pitch range, due to the fact that not all motions can be guided haptically on most instruments. Our system breaks such restrictions using a...

10.48550/arxiv.1803.06625 preprint EN other-oa arXiv (Cornell University) 2018-01-01