Alexander H. Liu

ORCID: 0000-0003-1628-0855
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Speech and dialogue systems
  • Adversarial Robustness in Machine Learning
  • Digital Media Forensic Detection
  • Image Processing Techniques and Applications
  • Covalent Organic Framework Applications
  • Advanced Image and Video Retrieval Techniques
  • Network Packet Processing and Optimization
  • Generative Adversarial Networks and Image Synthesis
  • Neural Networks and Applications
  • Robotics and Sensor-Based Localization
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • Color Science and Applications
  • Ionic liquids properties and applications
  • Blind Source Separation Techniques
  • CO2 Reduction Techniques and Catalysts
  • Domain Adaptation and Few-Shot Learning
  • Video Analysis and Summarization

Massachusetts Institute of Technology
1994-2024

Wuhan Institute of Technology
2024

National Taiwan University
2018-2022

Monocular depth estimation is a challenging task in scene understanding, with the goal of acquiring the geometric properties of 3D space from 2D images. Due to the lack of RGB-depth image pairs, unsupervised learning methods aim at deriving depth information from alternative supervision such as stereo pairs. However, most existing works fail to model the structure of objects, which generally results from considering only pixel-level objective functions during training. In this paper, we propose SceneNet to overcome this limitation with the aid of semantic...

10.1109/cvpr.2019.00273 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We present a novel and unified deep learning framework which is capable of learning domain-invariant representations from data across multiple domains. Realized by adversarial training with an additional ability to exploit domain-specific information, the proposed network is able to perform continuous cross-domain image translation and manipulation, and produces desirable output images accordingly. In addition, the resulting feature representation exhibits superior performance in unsupervised domain adaptation, which also verifies the effectiveness...

10.48550/arxiv.1809.01361 preprint EN other-oa arXiv (Cornell University) 2018-01-01
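
A common way to realize this kind of adversarial training is a gradient reversal layer feeding a domain classifier: the classifier learns to tell domains apart while the reversed gradient pushes the encoder toward domain-invariant features. The PyTorch sketch below is illustrative only; `DomainAdversarialHead`, its layer sizes, and the scaling factor `lam` are assumptions, not the paper's architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainAdversarialHead(nn.Module):
    """Domain classifier trained through gradient reversal, so minimizing
    its loss makes the upstream encoder's features domain-invariant."""
    def __init__(self, feat_dim, n_domains, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_domains))

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lam))
```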

Self-supervised speech representations have been shown to be effective in a variety of applications. However, existing representation learning methods generally rely on autoregressive models and/or observed global dependencies while generating the representation. In this work, we propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method to learn speech representations in a non-autoregressive manner, relying only on the local dependencies of speech. NPC has a conceptually simple objective and can be implemented easily with...

10.21437/interspeech.2021-349 article EN Interspeech 2021 2021-08-27
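
The key idea is that each frame is predicted only from nearby context, with the target frame excluded from its own receptive field. A minimal sketch of that principle is below; the real NPC model uses specific masked convolution blocks, so `MaskedLocalModel`, the kernel and mask sizes, and the L1 objective here are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MaskedLocalModel(nn.Module):
    """Predict frame t from a small local window whose center is zeroed,
    so the model can never copy the target frame itself."""
    def __init__(self, dim, kernel=15, mask_size=5):
        super().__init__()
        assert kernel % 2 == 1 and mask_size % 2 == 1
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        mask = torch.ones(1, 1, kernel)
        c = kernel // 2
        mask[..., c - mask_size // 2 : c + mask_size // 2 + 1] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, x):                   # x: (batch, dim, time)
        w = self.conv.weight * self.mask    # zero the kernel center
        pred = F.conv1d(x, w, self.conv.bias, padding=self.conv.padding[0])
        return F.l1_loss(pred, x)           # reconstruct each frame locally
```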

The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability not only to classify sounds into general categories, but also to listen to the finer details of the sounds, explain the reason for the predictions, think about what the sound infers, and understand the scene and what action...

10.48550/arxiv.2305.10790 preprint EN cc-by arXiv (Cornell University) 2023-01-01

In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to the phoneme sequences of the speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering to have the total number of distinct representations close to the number of distinct phonemes. The mapping between representations and phonemes is learned from a small amount of annotated paired data. Preliminary experiments on LJSpeech demonstrated that the learned representations for vowels have relative locations...

10.1109/icassp40776.2020.9053571 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
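
The quantization step at the heart of such models maps each encoder output to its nearest codebook vector, with a straight-through estimator to pass gradients back to the encoder. The sketch below is a generic VQ layer under those assumptions, not SeqRQ-AE's exact quantizer.

```python
import torch
from torch import nn

class VectorQuantizer(nn.Module):
    """Map each frame representation to its nearest codebook vector,
    using a straight-through estimator for the encoder gradient."""
    def __init__(self, n_codes, dim):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):                    # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)  # pairwise L2
        idx = dist.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        q = z + (q - z).detach()             # straight-through trick
        return q, idx.view(z.shape[:-1])     # quantized frames, code ids
```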

In this paper we propose a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM). In this way the CLM and the automatic speech recognition (ASR) model can challenge and learn from each other iteratively to improve performance. Since the CLM only takes text as input, huge quantities of unpaired text data can be utilized within end-to-end training. Moreover, AT can be applied to any end-to-end ASR model using any deep-learning-based language modeling framework, and is compatible with any existing decoding method. Initial results with an...

10.1109/icassp.2019.8683602 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
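
Conceptually, the CLM plays the role of a GAN discriminator over text: it learns to score real transcripts high and ASR outputs low, while the ASR model learns to fool it. A toy sketch of that training signal follows; all module names, shapes, and the GRU-based scorer are assumptions for illustration.

```python
import torch
from torch import nn

class CriticizingLM(nn.Module):
    """Toy discriminator over token distributions: scores how 'text-like'
    a sequence is (1 = real text, 0 = ASR output)."""
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.embed = nn.Linear(vocab, dim)   # accepts soft distributions
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_probs):          # (batch, time, vocab)
        h, _ = self.rnn(self.embed(token_probs))
        return self.score(h[:, -1]).squeeze(-1)

def adversarial_losses(clm, real_onehot, asr_probs):
    """CLM loss (discriminate) and ASR loss (fool the CLM)."""
    bce = nn.functional.binary_cross_entropy_with_logits
    d_loss = (bce(clm(real_onehot), torch.ones(real_onehot.size(0))) +
              bce(clm(asr_probs.detach()), torch.zeros(asr_probs.size(0))))
    g_loss = bce(clm(asr_probs), torch.ones(asr_probs.size(0)))
    return d_loss, g_loss
```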

In electrocatalysis, mechanistic analysis of reaction rate data often relies on the linearization of relatively simple rate equations; this is the basis for typical Tafel and reactant order dependence analyses. However, for more complex phenomena, such as surface coverage effects or mixed kinetic control, these common strategies will yield incomplete or uninterpretable results. Cohesive kinetic analysis, which is used in thermocatalysis and involves quantitative model fitting of data collected over a wide range of conditions, requires...

10.1021/acscentsci.3c01295 article EN cc-by ACS Central Science 2024-06-28
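
The contrast between linearization and quantitative model fitting can be shown numerically: under mixed kinetic/transport control, a straight-line Tafel fit of log j versus overpotential over the whole data range gives a biased slope, while nonlinear regression of the full model recovers the parameters. The sketch below uses a generic Koutecky-Levich-type rate law with made-up parameter values; it illustrates the methodological point, not the paper's specific mechanism.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(eta, j0, b, j_lim):
    """Mixed control: kinetic (Tafel) current capped by a transport limit."""
    jk = j0 * np.exp(eta / b)
    return jk * j_lim / (jk + j_lim)

eta = np.linspace(0.05, 0.40, 40)                  # overpotential, V
j = model(eta, 1e-3, 0.040, 50.0)                  # synthetic 'data'
j_noisy = j * np.random.default_rng(0).lognormal(0, 0.03, eta.size)

# Linearized Tafel analysis: only valid where j << j_lim, so fitting
# the whole range yields a distorted slope.
slope, _ = np.polyfit(eta, np.log10(j_noisy), 1)
print("apparent Tafel slope (mV/dec):", 1000 / slope)

# Nonlinear fit of the full model over the entire data range.
popt, _ = curve_fit(model, eta, j_noisy, p0=[1e-4, 0.03, 10.0])
print("fitted j0, b, j_lim:", popt)
```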

Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in the computational cost needed. Moreover, we show that the discovered subnetworks yield...

10.48550/arxiv.2106.05933 preprint EN other-oa arXiv (Cornell University) 2021-01-01
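
The baseline operation under discussion is magnitude pruning: zero the smallest-magnitude weights and keep a mask. A one-shot global variant is sketched below; LTH-style pruning repeats this iteratively with weight rewinding, which is the computational cost the paper refers to. The function name and the matrices-only heuristic are assumptions.

```python
import torch

def magnitude_prune(model, sparsity=0.7):
    """One-shot global magnitude pruning: zero the smallest |w| weights
    across all weight matrices and return the binary masks."""
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = max(1, int(sparsity * weights.numel()))
    threshold = weights.kthvalue(k).values
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            masks[name] = (p.detach().abs() > threshold).float()
            p.data *= masks[name]
    return masks   # reapply after each optimizer step to keep sparsity
```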

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new...

10.48550/arxiv.2210.07839 preprint EN cc-by arXiv (Cornell University) 2022-01-01
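
The contrastive half of this objective treats pooled audio and visual embeddings of the same clip as positives and all other pairs in the batch as negatives. A minimal symmetric InfoNCE sketch is below; the temperature value and mean-pooled inputs are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive(a, v, tau=0.05):
    """Symmetric InfoNCE over a batch of pooled embeddings:
    a, v: (batch, dim) audio and visual embeddings of matched clips."""
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = a @ v.t() / tau                  # (batch, batch) similarities
    target = torch.arange(a.size(0))          # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.t(), target))
```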

Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper [1] as a perception module and LLaMA [2] as a reasoning module, LTU-AS can simultaneously recognize and jointly...

10.1109/asru57964.2023.10389742 article EN 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023-12-16

Neural audio codecs were initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into discrete codes, which can be employed to develop audio language models (LMs). Numerous high-performance neural audio codecs and codec-based LMs have been developed. This paper aims to provide a thorough and systematic overview of neural audio codecs and codec-based LMs.

10.48550/arxiv.2402.13236 preprint EN arXiv (Cornell University) 2024-02-20
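
Many of the codecs in this space build on residual vector quantization (RVQ): each stage quantizes the residual left by the previous stage, producing coarse-to-fine token streams per frame. A NumPy sketch of the idea follows; the codebook count and sizes are arbitrary, and real codecs learn the codebooks jointly with encoder and decoder.

```python
import numpy as np

def residual_vq(frame, codebooks):
    """Residual VQ: stage i quantizes what stages 0..i-1 left over.
    frame: (dim,) vector; codebooks: list of (n_codes, dim) arrays."""
    residual, codes = frame.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, frame - residual            # token ids, reconstruction

rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 64)) for _ in range(4)]   # 4 quantizer stages
codes, recon = residual_vq(rng.normal(size=64), books)
print(codes)   # one discrete token per stage for this frame
```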

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves new...

10.1109/lsp.2022.3224688 article EN IEEE Signal Processing Letters 2022-01-01

Speech translation (ST) aims to learn transformations from speech in the source language to text in the target language. Previous works show that multitask learning improves ST performance, in which the recognition decoder generates text in the source language, and the translation decoder obtains the final translations based on the output of the recognition decoder. Because whether the output of the recognition decoder has the correct semantics is more critical than its accuracy, we propose to improve the multitask ST model by utilizing word embedding as the intermediate.

10.18653/v1/2020.acl-main.533 article EN cc-by Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020-01-01

In this article, we target speech translation (ST). We propose lightweight approaches that generally improve either cascaded or end-to-end ST models. We leverage continuous representations of words, known as word embeddings, to improve ASR in cascaded systems as well as end-to-end ST models. The benefit of using word embeddings is that they can be obtained easily by training on pure textual data, which alleviates the data scarcity issue. Also, word embeddings provide additional contextual information, which motivates us to distill the knowledge from word embeddings into ASR and to use embeddings as a regularizer...

10.1109/taslp.2020.3037543 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2020-11-12

In this paper, we investigate the benefit that off-the-shelf word embedding can bring to sequence-to-sequence (seq-to-seq) automatic speech recognition (ASR). We first introduce a regularization that maximizes the cosine similarity between a transformed decoder feature and the target word embedding. Based on the regularized decoder, we further propose a fused decoding mechanism. This allows the decoder to consider semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding. Initial results...

10.1109/icassp40776.2020.9053324 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
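
The regularization described above can be written as one extra loss term: maximize cosine similarity between a projected decoder feature and the pre-trained embedding of the target token. A sketch under that reading is below, where `proj` is an assumed learned linear map aligning dimensions; it illustrates the loss shape, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def embedding_regularizer(decoder_feat, target_ids, embedding_table, proj):
    """Cosine-similarity regularizer toward target word embeddings.
    decoder_feat: (batch, time, dec_dim); target_ids: (batch, time) longs;
    embedding_table: (vocab, emb_dim) fixed pre-trained embeddings."""
    target_emb = embedding_table[target_ids]       # (batch, time, emb_dim)
    pred = proj(decoder_feat)                      # align to emb_dim
    cos = F.cosine_similarity(pred, target_emb, dim=-1)
    return (1.0 - cos).mean()   # add to the usual cross-entropy loss
```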

Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models to different applications. However, the quality of these models is often measured by the performance on downstream tasks. How well the representations allow access to the information of interest is less studied. In this work, we take a closer look into existing self-supervised methods for speech from an information-theoretic perspective. We aim to develop metrics using mutual information to help practical problems such as model design...

10.1109/icassp48485.2024.10447758 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
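
One crude way to turn this perspective into a number is to discretize frame representations and compute the mutual information between the discrete codes and labels of interest (e.g., phones). The sketch below uses k-means plus sklearn's plug-in MI estimator; it is a stand-in for the idea, not the estimator developed in the paper.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def representation_mi(features, labels, n_clusters=64, seed=0):
    """Rough MI estimate between continuous representations and labels:
    discretize frames by k-means, then measure I(cluster; label) in nats.
    features: (n_frames, dim) array; labels: (n_frames,) discrete labels."""
    ids = KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(features)
    return mutual_info_score(labels, ids)

# e.g. compare two SSL models' layer outputs on phone-labeled frames:
# mi_a = representation_mi(feats_model_a, phone_labels)
# mi_b = representation_mi(feats_model_b, phone_labels)
```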

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network....

10.48550/arxiv.2305.10005 preprint EN cc-by arXiv (Cornell University) 2023-01-01
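
Two mechanical pieces of this recipe are easy to isolate: the EMA teacher update and the online clustering that turns teacher embeddings into discrete targets for the student. A simplified sketch follows; the update rules and hyperparameters are illustrative, not DinoSR's exact ones.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Teacher weights track an exponential moving average of the student's."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

@torch.no_grad()
def online_cluster_targets(teacher_emb, codebook, lr=0.1):
    """Assign teacher embeddings (n, dim) to nearest codewords (k, dim),
    nudging chosen codewords toward their members; the returned ids are
    the discrete targets the student predicts at masked positions."""
    idx = torch.cdist(teacher_emb, codebook).argmin(dim=1)
    for i in idx.unique():
        member_mean = teacher_emb[idx == i].mean(dim=0)
        codebook[i] = (1 - lr) * codebook[i] + lr * member_mean
    return idx
```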

Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning of both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: the amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge...

10.1109/icassp43922.2022.9747728 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Whispering is an important mode of human speech, but no end-to-end recognition results for it were reported yet, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech, considering its special characteristics and the scarcity of data. This includes a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structures of whispered speech, and a layer-wise transfer learning approach to pre-train the model with normal or normal-to-whispered...

10.1109/slt48900.2021.9383595 article EN 2021 IEEE Spoken Language Technology Workshop (SLT) 2021-01-19
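
A frequency-weighted SpecAugment policy can be approximated by biasing where the frequency mask starts, e.g., sampling the start bin with probability increasing toward high frequencies, where whispered speech differs most from normal speech. The power-law weighting below is an assumption for illustration; the paper's exact policy may differ.

```python
import numpy as np

def freq_weighted_mask(spec, max_width=20, bias=2.0,
                       rng=np.random.default_rng()):
    """SpecAugment-style frequency mask whose start bin is drawn with
    extra weight on high-frequency bins rather than uniformly.
    spec: (n_mels, time) log-mel spectrogram."""
    n_mels, _ = spec.shape
    width = rng.integers(1, max_width + 1)
    probs = np.arange(n_mels, dtype=float) ** bias   # heavier at high bins
    probs /= probs.sum()
    f0 = rng.choice(n_mels, p=probs)
    out = spec.copy()
    out[f0 : min(f0 + width, n_mels)] = 0.0
    return out
```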

Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models to different applications. However, the quality of these models is often measured by the performance on downstream tasks. How well the representations allow access to the information of interest is less studied. In this work, we take a closer look into existing self-supervised methods for speech from an information-theoretic perspective. We aim to develop metrics using mutual information to help practical problems such as model design...

10.48550/arxiv.2401.08833 preprint EN cc-by arXiv (Cornell University) 2024-01-01

Neural Audio Codecs, initially designed as a compression technique, have gained more attention recently for speech generation. Codec models represent each audio frame as a sequence of tokens, i.e., discrete embeddings. The discrete and low-frequency nature of neural codecs introduced a new way to generate speech with token-based models. As these tokens encode information at different levels of granularity, from coarse to fine, most existing works focus on how to better generate the tokens. In this paper, we focus on an equally important but...

10.48550/arxiv.2410.22448 preprint EN arXiv (Cornell University) 2024-10-29