Gus Xia

ORCID: 0000-0003-3629-906X
Research Areas
  • Music and Audio Processing
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception
  • Speech Recognition and Synthesis
  • Diverse Musicological Studies
  • Generative Adversarial Networks and Image Synthesis
  • Speech and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Tactile and Sensory Interactions
  • Teleoperation and Haptic Systems
  • Human Motion and Animation
  • Aesthetic Perception and Analysis
  • Scientific Computing and Data Management
  • Machine Learning in Materials Science
  • Data Visualization and Analytics
  • Explainable Artificial Intelligence (XAI)
  • Computer Graphics and Visualization Techniques
  • Model Reduction and Neural Networks
  • Social Robot Interaction and HRI
  • Modular Robots and Swarm Intelligence
  • Multi-Agent Systems and Negotiation
  • Advanced Image and Video Retrieval Techniques
  • Hand Gesture Recognition Systems
  • Multimodal Machine Learning Applications

New York University Shanghai
2019-2025

Mohamed bin Zayed University of Artificial Intelligence
2023-2024

New York University
2018-2024

Carnegie Mellon University
2019

Structure awareness and interpretability are two of the most desired properties of music generation algorithms. Structure-aware models generate more natural and coherent music with long-term dependencies, while interpretable models are more friendly for human-computer interaction and co-creation. To achieve these goals simultaneously, we designed Transformer Variational AutoEncoder, a hierarchical model that unifies the efforts of two recent breakthroughs in deep music generation: 1) the Music Transformer and 2) Deep Music Analogy. The former learns long-term dependencies...

10.1109/icassp40776.2020.9054554 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
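
A minimal sketch of the hierarchical Transformer-VAE idea described above, written in PyTorch; the layer sizes, pooling strategy, and vocabulary are illustrative assumptions rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TinyTransformerVAE(nn.Module):
    def __init__(self, vocab_size=130, d_model=128, z_dim=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.to_mu = nn.Linear(d_model, z_dim)
        self.to_logvar = nn.Linear(d_model, z_dim)
        self.from_z = nn.Linear(z_dim, d_model)
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                     # tokens: (batch, seq)
        h = self.encoder(self.embed(tokens))       # contextualised note states
        pooled = h.mean(dim=1)                     # one summary vector per bar
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        dec_in = self.embed(tokens) + self.from_z(z).unsqueeze(1)
        return self.out(self.decoder(dec_in)), mu, logvar

x = torch.randint(0, 130, (2, 32))                 # two dummy 32-token bars
logits, mu, logvar = TinyTransformerVAE()(x)
print(logits.shape)                                # torch.Size([2, 32, 130])
```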

Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is partially due to the distinctive challenges associated with modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding...

10.48550/arxiv.2306.00107 preprint EN cc-by-sa arXiv (Cornell University) 2023-01-01
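
A rough illustration of the masked-prediction objective that acoustic SSL typically relies on; the feature dimensions, masking ratio, and chroma-style target below are assumptions for illustration, not the MERT recipe:

```python
import torch
import torch.nn as nn

frames, feat_dim, chroma_dim = 100, 64, 12
features = torch.randn(1, frames, feat_dim)        # stand-in acoustic features
chroma_target = torch.rand(1, frames, chroma_dim)  # stand-in tonal target

mask = torch.rand(1, frames) < 0.3                 # mask ~30% of the frames
masked = features.clone()
masked[mask] = 0.0                                 # zero out masked frames

encoder = nn.Sequential(                           # tiny frame-wise encoder
    nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, chroma_dim)
)
pred = encoder(masked)
loss = nn.functional.mse_loss(pred[mask], chroma_target[mask])  # only masked frames
loss.backward()
print(float(loss))
```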

Music arrangement generation is a subtask of automatic music generation, which involves reconstructing and re-conceptualizing a piece with new compositional techniques. Such a process inevitably requires reference from the original melody, chord progression, or other structural information. Despite some promising models for arrangement, they lack more refined data to achieve better evaluations and practical results. In this paper, we propose POP909, a dataset which contains multiple versions of piano...

10.48550/arxiv.2008.07142 preprint EN other-oa arXiv (Cornell University) 2020-01-01

This first workshop on explainable AI for the Arts (XAIxArts) brings together a community of researchers and creative practitioners in Human-Computer Interaction (HCI), Design, AI, explainable AI (XAI), and Digital Arts to explore the role of XAI for the Arts. Explainability is a core concern of Human-Centred AI and relies heavily on HCI techniques to explore how complex and difficult-to-understand models such as deep learning can be made more understandable to people. However, this research has primarily focused on work-oriented and task-oriented explanations, and there has been little work in creative domains. The workshop will:...

10.1145/3591196.3593517 article EN Creativity and Cognition 2023-06-18

With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. In particular, we explore the effect of explicit architectural encoding of musical structure via comparing...

10.1109/mmrp.2019.8665362 article EN 2019-01-01

Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic...

10.48550/arxiv.1906.03626 preprint EN other-oa arXiv (Cornell University) 2019-01-01

10.1109/icassp49660.2025.10888267 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

This second workshop on explainable AI for the Arts (XAIxArts) brings together a community of researchers and creative practitioners in Human-Computer Interaction (HCI), Design, AI, explainable AI (XAI), and Digital Arts to explore the role of XAI for the Arts. Explainability is a core concern of Human-Centred AI and relies heavily on HCI techniques to explore how to make complex and difficult-to-understand models more understandable to people. Our first workshop explored the landscape of XAIxArts and identified emergent themes. To move the discourse forward and contribute more broadly, this workshop will: i) bring together and expand...

10.1145/3635636.3660763 article EN Creativity and Cognition 2024-06-22

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities--including sheet music, performance signals, and audio recordings--with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented...

10.48550/arxiv.2502.10362 preprint EN arXiv (Cornell University) 2025-02-14
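
A small sketch of the contrastive alignment idea (a CLIP-style symmetric InfoNCE loss over a shared embedding space); the encoders, dimensions, and temperature below are placeholder assumptions, not CLaMP 3's components:

```python
import torch
import torch.nn.functional as F

batch, music_dim, text_dim, joint_dim = 8, 256, 384, 128
music_feats = torch.randn(batch, music_dim)       # e.g. pooled audio features
text_feats = torch.randn(batch, text_dim)         # e.g. pooled multilingual text

music_proj = torch.nn.Linear(music_dim, joint_dim)
text_proj = torch.nn.Linear(text_dim, joint_dim)

m = F.normalize(music_proj(music_feats), dim=-1)  # unit-norm joint embeddings
t = F.normalize(text_proj(text_feats), dim=-1)
logits = m @ t.T / 0.07                           # cosine similarity / temperature
labels = torch.arange(batch)                      # i-th music matches i-th text
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()
print(float(loss))
```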

Led by the success of neural style transfer on visual arts, there has been a rising trend very recently in the effort of music style transfer. However, "music style" is not yet a well-defined concept from a scientific point of view. The difficulty lies in the intrinsic multi-level and multi-modal character of music representation (which is different from image representation). As a result, depending on their interpretation of "music style", current studies under the category of "music style transfer" are actually solving completely different problems that belong to a variety...

10.48550/arxiv.1803.06841 preprint EN cc-by arXiv (Cornell University) 2018-01-01

With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. In particular, we explore the effect of explicit architectural encoding of musical structure via comparing...

10.1109/mmrp.2019.00022 preprint EN 2019-01-01

The dominant approach for music representation learning involves the deep unsupervised model family variational autoencoder (VAE). However, most, if not all, viable attempts on this problem have largely been limited to monophonic music. Normally composed of richer modality and more complex musical structures, the polyphonic counterpart has yet to be addressed in the context of representation learning. In this work, we propose PianoTree VAE, a novel tree-structure extension upon VAE aiming to fit polyphonic music learning. The experiments prove the validity via...

10.48550/arxiv.2008.07118 preprint EN other-oa arXiv (Cornell University) 2020-01-01
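
A toy sketch of tree-structured encoding for polyphonic music, summarising the simultaneous notes within each time step and then the sequence of step summaries; the dimensions and note features are assumptions, not PianoTree VAE's design:

```python
import torch
import torch.nn as nn

steps, max_notes, note_dim, hid = 16, 6, 8, 32
segment = torch.randn(steps, max_notes, note_dim)  # notes grouped by time step

note_gru = nn.GRU(note_dim, hid, batch_first=True)  # summarise notes in a step
step_gru = nn.GRU(hid, hid, batch_first=True)        # summarise steps in a segment

_, step_summaries = note_gru(segment)              # (1, steps, hid) per-step codes
_, segment_summary = step_gru(step_summaries)      # single code for the segment
print(segment_summary.shape)                       # torch.Size([1, 1, 32])
```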

While deep generative models have become the leading methods for algorithmic composition, it remains a challenging problem to control the generation process because the latent variables of most deep-learning models lack good interpretability. Inspired by the content-style disentanglement idea, we design a novel architecture, under the VAE framework, that effectively learns two interpretable latent factors of polyphonic music: chord and texture. The current model focuses on learning 8-beat long piano composition segments. We...

10.48550/arxiv.2008.07122 preprint EN other-oa arXiv (Cornell University) 2020-01-01
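
A minimal sketch of the two-factor disentanglement-and-recombination idea: separate encoders produce a "chord" latent and a "texture" latent, and a single decoder reconstructs from their concatenation, so swapping one latent between segments recombines content and style. The linear encoders and sizes are illustrative assumptions, not the paper's VAE:

```python
import torch
import torch.nn as nn

seg_dim, z_dim = 128 * 8, 64                      # flattened 8-beat piano-roll
enc_chord = nn.Linear(seg_dim, z_dim)             # stand-in chord encoder
enc_texture = nn.Linear(seg_dim, z_dim)           # stand-in texture encoder
decoder = nn.Linear(2 * z_dim, seg_dim)

a, b = torch.randn(1, seg_dim), torch.randn(1, seg_dim)
recombined = decoder(torch.cat([enc_chord(a), enc_texture(b)], dim=-1))
print(recombined.shape)                           # chords of a, texture of b
```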

In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained and foundation models in music, spanning representation learning, generative learning and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we discover many of the music representations are...

10.48550/arxiv.2408.14340 preprint EN arXiv (Cornell University) 2024-08-26

Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic...

10.5281/zenodo.3527880 article EN arXiv (Cornell University) 2019-11-04

In this paper, we propose Calliffusion, a system for generating high-quality Chinese calligraphy using diffusion models. Our model architecture is based on DDPM (Denoising Diffusion Probabilistic Models), and it is capable of generating common characters in five different scripts and mimicking the styles of famous calligraphers. Experiments demonstrate that our model can generate calligraphy that is difficult to distinguish from real artworks and that the controls of characters, scripts, and styles are effective. Moreover, with one-shot transfer learning, LoRA...

10.48550/arxiv.2305.19124 preprint EN cc-by-nc-nd arXiv (Cornell University) 2023-01-01
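
A compact sketch of a DDPM training step (forward noising plus noise prediction); the noise schedule and the toy denoiser are illustrative assumptions, not Calliffusion's configuration:

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal retention

x0 = torch.randn(1, 1, 32, 32)                     # stand-in glyph image
t = torch.randint(0, T, (1,))                      # random diffusion timestep
noise = torch.randn_like(x0)
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion

denoiser = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # toy epsilon-predictor
loss = nn.functional.mse_loss(denoiser(x_t), noise)    # predict the added noise
loss.backward()
print(float(loss))
```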

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, music generation usually involves iterative refinements, and how to edit the generated music remains a significant challenge. This paper introduces a novel approach to the editing of music generated by such models, enabling the modification of specific attributes, such as genre, mood and instrument, while maintaining other aspects unchanged. Our method transforms text editing to latent space manipulation while adding an extra constraint to enforce...

10.48550/arxiv.2402.06178 preprint EN arXiv (Cornell University) 2024-02-08
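
A simple sketch of editing as latent-space manipulation: the latent of the original generation is shifted along the direction between source and target text embeddings, with a norm bound keeping the edit close to the original. The embeddings and bound are illustrative assumptions; the paper's diffusion-latent procedure is more involved:

```python
import torch
import torch.nn.functional as F

latent = torch.randn(64)                           # latent of the generated music
src_text = torch.randn(64)                         # e.g. embedding of "piano ballad"
tgt_text = torch.randn(64)                         # e.g. embedding of "guitar ballad"

direction = F.normalize(tgt_text - src_text, dim=0)
strength, max_shift = 1.5, 2.0
shift = direction * min(strength, max_shift)       # constrain the edit magnitude
edited = latent + shift                            # other aspects stay nearby
print(torch.norm(edited - latent))                 # edit size respects the bound
```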

Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, the task of editing the generated music remains a significant challenge. This paper introduces a novel approach to edit music generated by such models, enabling the modification of specific attributes, such as genre, mood, and instrument, while maintaining other aspects unchanged. Our method transforms text editing to latent space manipulation, and adds an additional constraint to enforce consistency. It seamlessly integrates with...

10.24963/ijcai.2024/864 article EN 2024-07-26

10.5281/zenodo.4245366 article EN International Symposium/Conference on Music Information Retrieval 2020-10-11

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on a text-compatible music representation, ABC notation, with the music treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal...

10.48550/arxiv.2402.16153 preprint EN arXiv (Cornell University) 2024-02-25
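
A hedged sketch of prompting a text-only LLM with ABC notation, the idea behind treating music as a second language; the Hugging Face model id below is an assumption about the released checkpoint, and any causal LM checkpoint would run the same way:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "m-a-p/ChatMusician"                    # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Continue this folk tune in ABC notation:\n"
    "X:1\nM:4/4\nK:G\n|: G2 B2 d2 B2 | c2 A2 F2 D2 |"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)   # plain text tokenizer only
print(tokenizer.decode(output[0], skip_special_tokens=True))
```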

This paper presents Herrmann-1, a multimodal framework to generate background music tailored to movie scenes, by integrating state-of-the-art vision, language, music, and speech processing models. Our pipeline begins by extracting visual information from the scene and performing emotional analysis on it, converting these into descriptive texts. Then, GPT-4 translates the high-level descriptions into low-level...

10.1109/icassp48485.2024.10447950 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
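
A hedged skeleton of the scene-to-music pipeline; every helper below is a hypothetical stand-in for the vision, emotion, language, and music models the paper chains together, and only the ordering of the stages follows the abstract:

```python
def describe_frames(frames):                       # stand-in for a vision model
    return "rain-soaked street at night, slow camera pan"

def classify_emotion(frames):                      # stand-in for emotional analysis
    return "melancholic"

def text_to_music_prompt(caption, emotion):        # stand-in for the LLM step
    return f"{emotion} ambient score, sparse piano, 60 bpm, for: {caption}"

def generate_music(prompt):                        # stand-in for text-to-music
    return b""                                     # would return audio bytes

def score_scene(frames):
    caption = describe_frames(frames)
    emotion = classify_emotion(frames)
    return generate_music(text_to_music_prompt(caption, emotion))

print(text_to_music_prompt(describe_frames([]), classify_emotion([])))
```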

This paper introduces M2M Gen, a multimodal framework for generating background music tailored to Japanese manga. The key challenges in this task are the lack of an available dataset or baseline. To address these challenges, we propose an automated music generation pipeline that produces background music for an input manga book. Initially, we use the dialogues in a scene to detect scene boundaries and perform emotion classification using the characters' faces within a scene. Then, we use GPT4o to translate this low-level scene information into a high-level music directive. Conditioned...

10.48550/arxiv.2410.09928 preprint EN arXiv (Cornell University) 2024-10-13

Traditional instrument learning is time-consuming. It begins with learning music notation and necessitates layers of sophistication and abstraction. Haptic interfaces open another door to the music world for the vast majority of beginners when traditional training methods are not effective. However, existing haptic interfaces can only deal with specially designed pieces with great restrictions on performance duration and pitch range, due to the fact that not all motions can be guided haptically on most instruments. Our system breaks such restrictions using a...

10.48550/arxiv.1803.06625 preprint EN other-oa arXiv (Cornell University) 2018-01-01