Xiangming Gu

ORCID: 0000-0003-0637-8664
Research Areas
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • Speech and Audio Processing
  • Natural Language Processing Techniques
  • Diverse Musicological Studies
  • Soft Robotics and Applications
  • Advanced Vision and Imaging
  • Artificial Intelligence in Law
  • Micro and Nano Robotics
  • Privacy-Preserving Technologies in Data
  • Fuzzy Logic and Control Systems
  • Machine Learning and Data Classification
  • Generative Adversarial Networks and Image Synthesis
  • Vehicle License Plate Recognition
  • Surgical Simulation and Training
  • Video Surveillance and Tracking Methods
  • Hydraulic and Pneumatic Systems
  • Model Reduction and Neural Networks
  • Neural Networks and Applications
  • Voice and Speech Disorders
  • Human Pose and Action Recognition
  • Robot Manipulation and Learning
  • Topic Modeling

National University of Singapore
2022-2024

Tsinghua University
2020-2021

Monocular 3D human pose estimation is challenging due to depth ambiguity. Convolution-based and graph-convolution-based methods have been developed to extract information from temporal cues in motion videos. Among the lifting-based methods, most recent works adopt transformers to model the relationship of 2D keypoint sequences. These previous works usually consider all the joints of a skeleton as a whole and then calculate attention based on the overall characteristics of the skeleton. Nevertheless, the human skeleton exhibits obvious part-wise...

10.1109/tip.2022.3182269 article EN IEEE Transactions on Image Processing 2022-01-01
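The contrast the abstract draws, attention over the whole skeleton versus attention restricted to body parts, can be sketched in a few lines. This is a minimal illustration with identity query/key/value projections and a hypothetical two-part grouping, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention with identity projections (for brevity)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (J, J) pairwise joint affinities
    return softmax(scores, axis=-1) @ x    # attended joint features

# Toy 2D keypoints for a 6-joint skeleton, split into part groups
# (hypothetical grouping; the paper's actual partition may differ).
keypoints = np.random.default_rng(0).normal(size=(6, 2))
parts = {"arm": [0, 1, 2], "leg": [3, 4, 5]}

# Whole-skeleton attention mixes all joints together ...
whole = self_attention(keypoints)

# ... whereas part-wise attention restricts mixing to within each part.
partwise = np.zeros_like(keypoints)
for idx in parts.values():
    partwise[idx] = self_attention(keypoints[idx])

print(whole.shape, partwise.shape)
```

The part-wise variant makes the inductive bias explicit: a joint's attended feature depends only on joints in the same part group.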

Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task due to instrumental accompaniment and musical constraints, resulting in degradation of both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Lyric Transcription system (MM-ALT), together with a new dataset,...

10.1145/3503161.3548411 article EN public-domain Proceedings of the 30th ACM International Conference on Multimedia 2022-10-10

Flexible manipulators and related robotic systems have been widely used in minimally invasive surgery (MIS) for enhancing intraoperative inspection and surgical operation. Although a variety of manipulators using different mechanisms have been developed, most of them are rigid mechanical structures or serve a single function, lacking the softness and robustness needed for in-situ diagnosis and treatment. To enrich the flexibility and therapeutic function of manipulators, we developed a laser endoscopic manipulator with a soft bendable tip. The tip is...

10.1109/lra.2021.3100617 article EN IEEE Robotics and Automation Letters 2021-07-29

Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing them into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of note events. To address this challenge, we...

10.1145/3651310 article EN public-domain ACM Transactions on Multimedia Computing Communications and Applications 2024-03-12

Compared to rigid-structure robots, soft robots possess higher degrees of freedom and stronger environmental adaptability, which has aroused increasing attention in the robotics field. Among them, pneumatic soft robots have shown excellent performance in various practical applications. However, the nonlinearity and instability of the pressure response of soft actuators, caused by lateral expansion, have become a great challenge. To address this problem, we proposed to embed a spring constraint layer around each single air chamber...

10.1088/1361-665x/ad74bf article EN Smart Materials and Structures 2024-08-28

Deep neural networks (DNNs) demonstrate great success in classification tasks. However, they act as black boxes, and we do not know how they make decisions for a particular task. To this end, we propose to distill the knowledge from a DNN into a fuzzy inference system (FIS), which is of Takagi-Sugeno-Kang (TSK) type in this paper. The model has the capability to express the knowledge acquired by the DNN based on fuzzy rules, thus explaining its decisions much more easily. Knowledge distillation (KD) is applied to create a TSK-type FIS that generalizes better than one...

10.48550/arxiv.2010.04974 preprint EN other-oa arXiv (Cornell University) 2020-01-01
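The student model named here, a TSK-type fuzzy inference system, can be sketched concretely: Gaussian antecedents give per-rule firing strengths, each rule has an affine consequent, and the output is the firing-strength-weighted average. This is an illustrative first-order TSK inference with made-up parameters; the paper's exact parameterization and the KD training loop (fitting the FIS to the DNN's soft outputs) are not shown:

```python
import numpy as np

def tsk_infer(x, centers, sigmas, consequents):
    """First-order TSK inference: Gaussian antecedents, affine consequents.
    (Sketch; the paper's FIS parameterization may differ.)"""
    # Rule firing strengths from Gaussian membership functions.
    w = np.exp(-((x - centers) ** 2).sum(axis=1) / (2 * sigmas ** 2))
    w = w / w.sum()
    # Each rule's output is an affine function of the input.
    y = consequents[:, :-1] @ x + consequents[:, -1]   # (R,) rule outputs
    return w @ y                                       # weighted average

# Two hypothetical rules over a 2-D input.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.array([0.5, 0.5])
consequents = np.array([[1.0, 0.0, 0.0],   # rule 1: y = x0
                        [0.0, 1.0, 1.0]])  # rule 2: y = x1 + 1

y = tsk_infer(np.array([0.0, 0.0]), centers, sigmas, consequents)
print(round(y, 4))
```

Because the output is a weighted average of interpretable per-rule affine models, inspecting the firing strengths explains which rule drove a given prediction, which is the interpretability argument of the abstract.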

Phonation mode detection predicts phonation modes and their temporal boundaries in singing and speech, holding promise for characterizing voice quality and vocal health. However, it is very challenging due to the domain disparities between training data and unannotated real-world recordings. To tackle this problem, we develop a disentangled adversarial adaptation network, which adapts a model with the structure of a convolutional recurrent neural network, pre-trained on the source domain, to the target domain without labels. Based on our...

10.1109/taslp.2023.3317568 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01
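Adversarial domain adaptation of the kind described is commonly implemented with a gradient reversal layer: identity on the forward pass, negated (scaled) gradient on the backward pass, so the feature extractor learns to confuse a domain classifier. This is a minimal generic sketch of that mechanism, not the paper's full disentangled network:

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: forward(x) = x, backward(g) = -lam * g.
    Minimal sketch of the standard adversarial-adaptation building block."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # features pass through unchanged

    def backward(self, grad_out):
        return -self.lam * grad_out   # reversed gradient reaches the extractor

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0])
g = np.array([0.3, 0.7])
print(grl.forward(x), grl.backward(g))
```

In a full pipeline the domain classifier minimizes its loss while the reversed gradient pushes the shared features toward domain invariance, letting labels from the source domain transfer to unlabeled target recordings.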

It is widely known that males and females typically possess different sound characteristics when singing, such as timbre and pitch, but it has never been explored whether these gender-based characteristics lead to a performance disparity in singing voice transcription (SVT), whose target includes pitch. Such a disparity could cause fairness issues and severely affect the user experience of downstream SVT applications. Motivated by this, we first demonstrate the female superiority of SVT systems, which is observed across models and datasets. We...

10.1145/3581783.3612272 article EN 2023-10-26

A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention...

10.48550/arxiv.2402.08567 preprint EN arXiv (Cornell University) 2024-02-13

Singing voice transcription converts recorded singing audio to musical notation. Sound contamination (such as accompaniment) and a lack of annotated data make it an extremely difficult task. We take two approaches to tackle the above challenges: 1) introducing multimodal learning for singing voice transcription together with a new dataset, N20EMv2, enhancing noise robustness by utilizing video information (lip movements to predict the onset/offset of notes), and 2) adapting self-supervised models from the speech domain to this task, significantly...

10.48550/arxiv.2304.12082 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Due to their capacity to generate novel and high-quality samples, diffusion models have attracted significant research interest in recent years. Notably, the typical training objective of diffusion models, i.e., denoising score matching, has a closed-form optimal solution that can only generate samples replicating the training data. This indicates that memorization behavior is theoretically expected, which contradicts the common generalization ability of state-of-the-art diffusion models and thus calls for a deeper understanding. Looking into this, we first...

10.48550/arxiv.2310.02664 preprint EN other-oa arXiv (Cornell University) 2023-01-01
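The closed-form optimum the abstract refers to can be made concrete on a finite training set: the optimal denoiser is the posterior mean E[x0 | x_t], a softmax-weighted convex combination of training points, so as the noise level shrinks it collapses onto the nearest training sample and can only replicate the data. The sketch below uses a variance-exploding parameterization for simplicity; DDPM-style scaling differs by constants:

```python
import numpy as np

def optimal_denoiser(x_t, data, sigma):
    """Closed-form minimizer of denoising score matching on a finite
    training set: E[x0 | x_t] under the empirical data distribution
    smoothed with N(0, sigma^2 I). (Variance-exploding sketch.)"""
    # Posterior weights over training points given the noisy sample.
    logits = -((x_t - data) ** 2).sum(axis=1) / (2 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ data   # convex combination of training samples

data = np.array([[0.0, 0.0], [4.0, 4.0]])  # toy 2-point training set
x_t = np.array([0.5, 0.5])

# As sigma -> 0 the weights collapse onto the nearest training point:
# the theoretically optimal model memorizes rather than generalizes.
print(optimal_denoiser(x_t, data, sigma=0.1))
```

At large sigma the weights flatten and the denoiser returns roughly the data mean; at small sigma it snaps to the nearest training example, which is exactly the memorization behavior the abstract says is theoretically expected.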

Automatic speech recognition (ASR) has progressed significantly in recent years due to the emergence of large-scale datasets and the self-supervised learning (SSL) paradigm. However, as its counterpart problem in the singing domain, the development of automatic lyric transcription (ALT) suffers from limited data and the degraded intelligibility of sung lyrics. To fill the performance gap between ALT and ASR, we attempt to exploit the similarities between speech and singing. In this work, we propose a transfer-learning-based solution that takes advantage...

10.48550/arxiv.2207.09747 preprint EN other-oa arXiv (Cornell University) 2022-01-01
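The transfer-learning recipe implied here, reuse a speech-pretrained encoder and fit a new head on the smaller singing dataset, can be caricatured in numpy. Everything below is a stand-in: the frozen random projection plays the role of a speech-pretrained SSL encoder (e.g. a wav2vec 2.0-style model in practice), and a least-squares head replaces gradient-based fine-tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a speech-pretrained encoder: a fixed
# (frozen) random projection followed by a ReLU.
W_enc = rng.normal(size=(16, 8))
def encoder(x):
    return np.maximum(x @ W_enc, 0.0)   # frozen features

# Tiny synthetic "singing" dataset: inputs and 4-dim soft targets.
X = rng.normal(size=(32, 16))
Y = rng.normal(size=(32, 4))

# Transfer learning: keep the encoder frozen, fit only a new linear
# head on top of the frozen features (closed-form least squares).
H = encoder(X)
W_head, *_ = np.linalg.lstsq(H, Y, rcond=None)
pred = H @ W_head
print(pred.shape)
```

Freezing the encoder is the limited-data regime's safeguard: only the small head is estimated from the scarce singing annotations, while the speech-derived representation is reused as-is.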

It is widely known that males and females typically possess different sound characteristics when singing, such as timbre and pitch, but it has never been explored whether these gender-based characteristics lead to a performance disparity in singing voice transcription (SVT), whose target includes pitch. Such a disparity could cause fairness issues and severely affect the user experience of downstream SVT applications. Motivated by this, we first demonstrate the female superiority of SVT systems, which is observed across models and datasets. We...

10.48550/arxiv.2308.02898 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, thus leading to difficulty in locating such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential...

10.48550/arxiv.2205.02670 preprint EN cc-by arXiv (Cornell University) 2022-01-01
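For context on the limitation the abstract points out, here is the classic alignment baseline, dynamic time warping. Plain DTW assumes every element of one sequence corresponds to some element of the other, which is exactly the perfect-content-match assumption that breaks under mispronunciations; the paper's algorithm goes beyond this, and is not shown here:

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D sequences.
    Every element must be matched, so content mismatch inflates the cost
    rather than being localized."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Identical content at different speeds aligns perfectly (cost 0) ...
print(dtw([1, 2, 3], [1, 2, 2, 3]))
# ... but a content mismatch (9 in place of 2) must still be matched.
print(dtw([1, 2, 3], [1, 9, 3]))
```

Because DTW has no notion of "insert an unmatched element", the mismatched frame is forcibly paired with some text unit, smearing its error over the alignment instead of flagging its location.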