Shih-Lun Wu

ORCID: 0000-0002-8315-0762
Research Areas
  • Music and Audio Processing
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Speech and Dialogue Systems
  • Remote Sensing and LiDAR Applications
  • Multimodal Machine Learning Applications
  • Topic Modeling
  • Tensor Decomposition and Applications
  • Generative Adversarial Networks and Image Synthesis
  • Smart Agriculture and AI
  • Industrial Vision Systems and Defect Detection
  • Advanced Image and Video Retrieval Techniques
  • Parallel Computing and Optimization Techniques
  • Speech and Audio Processing
  • Neural Networks and Applications

Adobe Systems (United States)
2024

Carnegie Mellon University
2023-2024

National Taiwan University
2020-2022

Research Center for Information Technology Innovation, Academia Sinica
2021

Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in...

10.1109/tmm.2022.3161851 article EN IEEE Transactions on Multimedia 2022-03-23
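The prompt-based conditioning criticized in the abstract above can be sketched in a few lines: the conditioning sequence is copied verbatim as a prompt and the decoder merely extends it token by token, which is exactly why nothing forces the continuation to develop the prompt's material. The bigram table below is a toy stand-in for a Transformer decoder (hypothetical, for illustration only; not the paper's model).

```python
# Prompt-based conditioning, minimally: copy the conditioning
# sequence, then sample a continuation autoregressively.
from collections import defaultdict
import random

def train_bigram(corpus):
    # Toy "decoder": a bigram table mapping each token to observed successors.
    table = defaultdict(list)
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            table[a].append(b)
    return table

def generate(table, prompt, length, seed=0):
    rng = random.Random(seed)
    out = list(prompt)          # the prompt is taken as-is ...
    for _ in range(length):     # ... and only extended, never revisited
        successors = table.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return out

corpus = [["C", "E", "G", "C"], ["C", "G", "E", "C"]]
table = train_bigram(corpus)
cont = generate(table, ["C", "E"], 4)
```

The sketch makes the limitation visible: the prompt constrains only the first generated token's context, so later tokens may drift arbitrarily far from the conditioning material.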

Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip Transformer decoders with the ability to accept...

10.1109/taslp.2023.3270726 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2023-01-01

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less precise for time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a...

10.1109/taslp.2024.3399026 article EN cc-by-nc-nd IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

This paper presents the Jazz Transformer, a generative model that utilizes a neural sequence model called the Transformer-XL for modeling lead sheets of Jazz music. Moreover, the model endeavors to incorporate structural events present in the Weimar Jazz Database (WJazzD) for inducing structures in the generated music. While we are able to reduce the training loss to a low value, our listening test suggests however a clear gap between the average ratings of the generated and real compositions. We therefore go one step further and conduct a series of computational analyses of the generated compositions...

10.48550/arxiv.2008.01307 preprint EN cc-by arXiv (Cornell University) 2020-01-01

The quality grading of mangoes is a crucial task for mango growers as it vastly affects their profit. However, until today, this process still relies on the laborious efforts of humans, who are prone to fatigue and errors. To remedy this, the paper approaches the task with various convolutional neural networks (CNN), a tried-and-tested deep learning technology in computer vision. The models involved include Mask R-CNN (for background removal) and numerous past winners of the ImageNet challenge, namely AlexNet, VGGs,...

10.1109/icmla51294.2020.00076 article EN 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA) 2020-12-01

Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers; it consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present...

10.48550/arxiv.2105.08399 preprint EN other-oa arXiv (Cornell University) 2021-01-01
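The incompatibility described above can be made concrete with a small sketch: in standard (quadratic) attention, a learned bias indexed by the lag j - i is added to every attention logit, so the full logit matrix must be materialized; a linear-attention variant never forms that matrix, hence cannot add the bias directly. The tiny sizes, vectors, and bias table below are illustrative, not from the paper.

```python
# Relative positional encoding in standard attention: each logit
# gets a bias b[j - i] that depends only on the lag between the
# query position i and the key position j.
import math

def rpe_attention_logits(q, k, bias):
    # q, k: lists of equal-length vectors; bias: dict mapping lag -> float.
    n, d = len(q), len(q[0])
    logits = []
    for i in range(n):
        row = []
        for j in range(n):
            dot = sum(q[i][t] * k[j][t] for t in range(d))
            # The bias lookup requires the explicit (i, j) logit --
            # exactly what linearized attention avoids computing.
            row.append(dot / math.sqrt(d) + bias.get(j - i, 0.0))
        logits.append(row)
    return logits

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
bias = {0: 0.5, 1: -0.25, -1: -0.25}
L = rpe_attention_logits(q, k, bias)
```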

This paper describes our system for the low-resource domain adaptation track (Track 3) in the Spoken Language Understanding Grand Challenge, which is a part of the ICASSP Signal Processing Grand Challenge 2023. In this track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track 3 data and then on the low-resource domain data. We apply masked LM (MLM)-based data augmentation, where some input tokens and their corresponding target labels are replaced using an MLM. We also apply a retrieval-based approach, where the model...

10.1109/icassp49357.2023.10096049 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
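The MLM-based augmentation mentioned in the abstract above can be sketched as follows: randomly mask some input tokens and substitute MLM predictions, keeping each slot label aligned with its replacement token. A real system would query a masked language model such as BERT; the lookup table, token sequence, and labels below are toy stand-ins for illustration only.

```python
# Masked-LM data augmentation sketch: replace some tokens with
# MLM-style alternatives while preserving the label alignment.
import random

# Toy stand-in for MLM top-k predictions at a masked position.
TOY_MLM = {"play": ["start", "queue"], "song": ["track", "tune"]}

def mlm_augment(tokens, labels, p=0.3, seed=0):
    rng = random.Random(seed)
    new_tokens, new_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok in TOY_MLM and rng.random() < p:
            # Swap in a predicted token; the label stays aligned.
            new_tokens.append(rng.choice(TOY_MLM[tok]))
        else:
            new_tokens.append(tok)
        new_labels.append(lab)
    return new_tokens, new_labels

aug, labs = mlm_augment(["play", "a", "song"], ["O", "O", "B-media"])
```

Because only tokens are substituted and labels are carried over one-for-one, the augmented pairs remain valid supervised examples.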

Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, which explicitly trains the Transformer to treat...

10.48550/arxiv.2111.04093 preprint EN cc-by arXiv (Cornell University) 2021-01-01


10.48550/arxiv.2011.11378 preprint EN cc-by arXiv (Cornell University) 2020-01-01


10.48550/arxiv.2105.04090 preprint EN cc-by arXiv (Cornell University) 2021-01-01


10.48550/arxiv.2305.01194 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing. In this paper, we describe our proposed semantic parsing system for the quality track (Track 1) in the Spoken Language Understanding Grand Challenge, which is part of the ICASSP Signal Processing Grand Challenge 2023. We experiment with both end-to-end and pipeline systems for this task. Strong automatic speech recognition (ASR) models like Whisper and pretrained language models (LM) like BART are utilized inside our SLU framework to boost...

10.48550/arxiv.2305.01620 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Spoken Language Understanding (SLU) is a critical speech recognition application and is often deployed on edge devices. Consequently, on-device processing plays a significant role in the practical implementation of SLU. This paper focuses on the end-to-end (E2E) SLU model due to its small latency property, unlike a cascade system, and aims to minimize the computational cost. We reduce the model size by applying tensor decomposition to the Conformer and E-Branchformer architectures used in our E2E SLU models. We propose to apply singular value decomposition to linear...

10.48550/arxiv.2306.01247 preprint EN other-oa arXiv (Cornell University) 2023-01-01
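The singular-value-decomposition idea mentioned in the abstract above can be sketched generically: a linear layer's weight matrix W is factorized via SVD and only the top-r singular components are kept, replacing one d_out x d_in matrix with two smaller factors. The shapes and rank below are illustrative, not the paper's configuration.

```python
# Low-rank compression of a linear layer via truncated SVD:
# W (d_out x d_in)  ~=  A (d_out x r)  @  B (r x d_in).
import numpy as np

def svd_compress(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# A genuinely rank-2 matrix, so rank-2 truncation is (near-)exact.
W = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
A, B = svd_compress(W, rank=2)

err = np.abs(W - A @ B).max()
params_before = W.size            # 8 * 6 = 48 weights
params_after = A.size + B.size    # 8 * 2 + 2 * 6 = 28 weights
```

The parameter saving (48 vs. 28 here) grows with layer width, which is what makes the technique attractive for on-device deployment.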

PhotoBook is a collaborative dialogue game where two players receive private, partially-overlapping sets of images and resolve which of them they have in common. It presents machines with a great challenge to learn how people build common ground around the multimodal context to communicate effectively. Methods developed in the literature, however, cannot be deployed to real gameplay since they only tackle some subtasks of the game, and require additional reference chains as inputs, whose extraction process is imperfect. Therefore, we...

10.48550/arxiv.2306.09607 preprint EN cc-by arXiv (Cornell University) 2023-01-01


10.18653/v1/2023.acl-short.121 article EN cc-by 2023-01-01

Even with strong sequence models like Transformers, generating expressive piano performances with long-range musical structures remains challenging. Meanwhile, methods to compose well-structured melodies or lead sheets (melody + chords), i.e., simpler forms of music, have gained more success. Observing the above, we devise a two-stage Transformer-based framework that Composes a lead sheet first, and then Embellishes it with accompaniment and expressive touches. Such a factorization also enables pretraining on non-piano data. Our...

10.48550/arxiv.2209.08212 preprint EN cc-by-sa arXiv (Cornell University) 2022-01-01