Zexu Pan

ORCID: 0000-0002-8106-1176
Research Areas
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Music and Audio Processing
  • EEG and Brain-Computer Interfaces
  • Indoor and Outdoor Localization Technologies
  • Blind Source Separation Techniques
  • Underwater Acoustics Research
  • Neural dynamics and brain function
  • Advanced Memory and Neural Computing
  • Emotion and Mood Recognition
  • Face recognition and analysis
  • Video Analysis and Summarization
  • Hearing Loss and Rehabilitation
  • Advanced Adaptive Filtering Techniques
  • Hand Gesture Recognition Systems
  • Natural Language Processing Techniques
  • Wireless Communication Networks Research
  • Phonetics and Phonology Research
  • Human Pose and Action Recognition
  • Neural Networks and Applications

National University of Singapore
2020-2024

Mitsubishi Electric (United States)
2023-2024

Duke-NUS Medical School
2020-2021

Data Storage Institute
2020

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work where systems make a decision instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes a decision by taking both short-term and long-term features into consideration. TalkNet consists of temporal encoders for feature representation, cross-attention...

10.1145/3474085.3475587 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17
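
The abstract above mentions temporal encoders combined through audio-visual cross-attention. The following is a minimal sketch of that idea, assuming 128-dimensional, frame-aligned embeddings and the class name CrossModalAttention; it is an illustration of the mechanism, not the TalkNet reference code.

```python
# Minimal audio-visual cross-attention sketch (illustrative, not TalkNet itself).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Audio frames attend to visual frames and vice versa.
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim), assumed frame-aligned
        a, _ = self.a2v(query=audio, key=visual, value=visual)
        v, _ = self.v2a(query=visual, key=audio, value=audio)
        # Fuse the attended streams for a per-frame speaking/not-speaking score.
        return torch.cat([a, v], dim=-1)

fused = CrossModalAttention()(torch.randn(2, 100, 128), torch.randn(2, 100, 128))
print(fused.shape)  # torch.Size([2, 100, 256])
```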

Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses information. cLSTM-MMA is fused with other uni-modal sub-networks in late fusion. The experiments show...

10.21437/interspeech.2020-1653 article EN Interspeech 2020 2020-10-25
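
As a rough illustration of attention across modality embeddings followed by late fusion with uni-modal predictions, here is a much-simplified sketch; the dimensions, layer choices, and the name ModalityAttentionFusion are assumptions, not the published cLSTM-MMA configuration.

```python
# Simplified attention-based fusion across speech, visual and text embeddings
# with late fusion of uni-modal predictions (illustrative only).
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim=64, n_classes=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # scores each modality embedding
        self.fused_head = nn.Linear(dim, n_classes)
        self.unimodal_heads = nn.ModuleList([nn.Linear(dim, n_classes) for _ in range(3)])

    def forward(self, speech, visual, text):
        # each input: (batch, dim) utterance-level embedding
        stack = torch.stack([speech, visual, text], dim=1)   # (batch, 3, dim)
        weights = torch.softmax(self.score(stack), dim=1)    # attention over modalities
        fused_logits = self.fused_head((weights * stack).sum(dim=1))
        uni_logits = [h(m) for h, m in zip(self.unimodal_heads, [speech, visual, text])]
        # Late fusion: average the fused and uni-modal predictions.
        return torch.stack([fused_logits] + uni_logits).mean(dim=0)

logits = ModalityAttentionFusion()(torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64))
print(logits.shape)  # torch.Size([2, 4])
```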

A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker mixture. Prior studies focus mostly on highly overlapped mixtures. However, the target-interference overlapping ratio could vary over a wide range from 0% to 100% in natural communication; furthermore, the target speaker could be absent from the mixture. Mixtures in such universal scenarios are described as general mixtures. The speaker extraction algorithm requires an auxiliary...

10.1109/taslp.2022.3205759 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01

Prior studies on audio-visual speech recognition typically assume the visibility of speaking lips, ignoring the fact that visual occlusion occurs in real-world videos, thus adversely affecting recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework aims to achieve three tasks: detecting occluded frames, masking occluded areas, and reconstructing the masked regions. We tackle the first two issues using a Class Activation Map...

10.1609/aaai.v38i17.29882 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24
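
The abstract cites a Class Activation Map (CAM) for spotting and masking occluded regions. Below is a toy CAM computation, assuming a small CNN with global average pooling into a linear "occluded vs. clean" classifier; the network, frame size, and threshold are illustrative assumptions, not the paper's pipeline.

```python
# Toy Class Activation Map (CAM) for flagging occluded lip areas (illustrative).
import torch
import torch.nn as nn

conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
classifier = nn.Linear(32, 2)               # class 1 = "occluded" (assumption)

frame = torch.randn(1, 3, 88, 88)           # one lip-region frame
feat = conv(frame)                          # (1, 32, 88, 88)
logits = classifier(feat.mean(dim=(2, 3)))  # global average pooling + classifier

# CAM: weight each feature map by the classifier weights of the "occluded" class.
cam = torch.einsum('c,bchw->bhw', classifier.weight[1], feat)
mask = cam > cam.flatten(1).quantile(0.9)   # crude mask over high-activation areas
print(logits.argmax(-1).item(), mask.float().mean().item())
```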

10.1109/taslp.2024.3463498 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2024-01-01

Most of the prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose neural networks that use audio and visual signals for multi-speaker localization. The heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an...

10.1109/icassp39728.2021.9413776 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker mixture when given a cue that represents the target speaker, such as a pre-enrolled utterance or an accompanying video track. Visual cues are particularly useful when a pre-enrolled utterance is not available. In this work, we don't rely on the target speaker's pre-enrolled speech, but rather use the face track as the speaker cue, referred to as the auxiliary reference, to form an attractor towards the target speaker. We advocate that the temporal synchronization between speech and its accompanying lip movements is a direct and dominant audio-visual...

10.1109/taslp.2022.3153258 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-01-01
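
To make the attractor idea concrete, here is a conceptual sketch in which a lip-embedding sequence is pooled into a single attractor vector that steers a soft mask over the mixture's latent frames. The shapes, the mean-pooled attractor, and the dot-product scoring are illustrative assumptions, not the architecture of the paper.

```python
# Face-track embeddings as an "attractor" steering a mask over mixture frames
# (conceptual sketch only).
import torch
import torch.nn as nn

mix_frames = torch.randn(1, 200, 256)   # mixture encoded into latent frames
lip_frames = torch.randn(1, 50, 512)    # visual stream at a lower frame rate

visual_proj = nn.Linear(512, 256)
lips = visual_proj(lip_frames)
# Upsample the visual stream to the audio frame rate so the two are aligned.
lips = nn.functional.interpolate(lips.transpose(1, 2), size=200).transpose(1, 2)

attractor = lips.mean(dim=1, keepdim=True)                # (1, 1, 256) speaker attractor
scores = (mix_frames * attractor).sum(-1, keepdim=True)   # frame-wise similarity
mask = torch.sigmoid(scores)                              # soft mask for the target
target_latent = mask * mix_frames
print(target_latent.shape)  # torch.Size([1, 200, 256])
```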

Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining their discriminative capability for speech enhancement. To address these issues, we propose...

10.48550/arxiv.2501.10052 preprint EN arXiv (Cornell University) 2025-01-17

Decoding speech from brain signals is a challenging research problem that holds significant importance for studying speech processing in the brain. Although breakthroughs have been made in reconstructing the mel spectrograms of audio stimuli perceived by subjects at the word or letter level using noninvasive electroencephalography (EEG), there is still a critical gap in precisely reconstructing continuous speech features, especially at the minute level. To address this issue, this paper proposes a State Space Model (SSM) to reconstruct the mel spectrogram from EEG,...

10.48550/arxiv.2501.10402 preprint EN arXiv (Cornell University) 2025-01-03
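
For readers unfamiliar with SSMs, the core building block is a discrete linear state space recurrence, x_t = A x_{t-1} + B u_t, y_t = C x_t. The sketch below maps EEG channels to mel bins with that recurrence purely for illustration; the matrices, sizes, and stability assumption are mine, and real SSM layers are far more elaborate than this.

```python
# Bare-bones discrete linear state space recurrence (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_eeg, n_state, n_mel, T = 64, 16, 80, 1000

A = 0.95 * np.eye(n_state)             # stable state transition (assumption)
B = rng.normal(size=(n_state, n_eeg)) * 0.01
C = rng.normal(size=(n_mel, n_state))

eeg = rng.normal(size=(T, n_eeg))       # T time steps of 64-channel EEG
x = np.zeros(n_state)
mel = np.zeros((T, n_mel))
for t in range(T):
    x = A @ x + B @ eeg[t]              # state update driven by the EEG input
    mel[t] = C @ x                      # readout of the predicted mel frame
print(mel.shape)  # (1000, 80)
```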

10.1109/icassp49660.2025.10890477 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888872 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888785 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such a reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the target speech directly from the mixture during inference time, without the need for pre-recorded speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence...

10.1109/icassp39728.2021.9414023 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. There have been studies that use a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the gesture sequence, e.g., hand and body movements, as the speaker cue for speaker extraction, which could be easily obtained from low-resolution video recordings and is thus more available than face recordings. We propose two networks using...

10.1109/lsp.2022.3175130 article EN cc-by IEEE Signal Processing Letters 2022-01-01

End-to-end time-domain speech separation with a masking strategy has shown its performance advantage, where a 1-D convolutional layer is used as the encoder to encode a sliding window of the waveform into a latent feature representation, i.e., an embedding vector. A large window leads to low resolution in processing; on the other hand, a small window offers high resolution but at the expense of computational cost. In this work, we propose a graph encoding technique to model fine structural knowledge of the samples within a window of reasonable size. Specifically, we build...

10.1109/lsp.2023.3243764 article EN IEEE Signal Processing Letters 2023-01-01
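
To illustrate the notion of treating samples inside one analysis window as graph nodes, here is a toy sketch with a k-nearest-in-time adjacency and a single mean-aggregation step; both choices, and the helper name window_graph_embedding, are assumptions for illustration, not the paper's graph network.

```python
# Toy graph encoding of the samples inside one analysis window (illustrative).
import numpy as np

def window_graph_embedding(window, k=2):
    n = len(window)
    # Adjacency: connect each sample to its k temporal neighbours on each side.
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - k), min(n, i + k + 1)):
            if i != j:
                adj[i, j] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    feats = window[:, None]                          # 1-D node features (amplitude)
    aggregated = (adj @ feats) / np.maximum(deg, 1)  # one mean message-passing step
    nodes = np.concatenate([feats, aggregated], axis=1)
    return nodes.mean(axis=0)                        # pool nodes into one embedding

emb = window_graph_embedding(np.random.randn(40))
print(emb.shape)  # (2,)
```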

Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this only accounts for a small percentage of real-world conversations. In this paper, we aim at sparsely overlapped scenarios, in which the auxiliary reference needs to perform two tasks simultaneously: detect the target speaker's activity and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract,...

10.1109/icassp48485.2024.10448398 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker signal, in which the attention is derived from cortical activity. This activity is usually recorded using electroencephalography (EEG) devices. Though promising, current methods often have a high speaker confusion error, where the interfering speaker is extracted instead of the attended speaker, degrading the listening experience. In this work, we aim to reduce the speaker confusion error in the neuro-steered speaker extraction model through a jointly fine-tuned auxiliary...

10.1109/icassp48485.2024.10446333 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
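
To see where the confusion error comes from, the classical way to derive attention from EEG is to reconstruct a speech envelope with a linear decoder and pick the candidate speaker whose envelope correlates best; a wrong pick is a confusion error. The sketch below shows that correlation-based selection with synthetic data purely to explain the failure mode; it is not the paper's fine-tuning method, and the decoder here is a random stand-in.

```python
# Correlation-based attended-speaker selection from EEG (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
T, n_eeg = 2000, 64
env_a = np.abs(rng.normal(size=T))            # envelope of speaker A (attended)
env_b = np.abs(rng.normal(size=T))            # envelope of speaker B (interfering)
eeg = np.outer(env_a, rng.normal(size=n_eeg)) + 0.8 * rng.normal(size=(T, n_eeg))

decoder = rng.normal(size=n_eeg)               # stand-in for a trained linear decoder
decoded = eeg @ decoder                        # EEG-reconstructed envelope

corr = lambda x, y: np.corrcoef(x, y)[0, 1]
picked = 'A' if corr(decoded, env_a) > corr(decoded, env_b) else 'B'
print('attended speaker picked:', picked)      # picking 'B' here = confusion error
```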

Speaker tracking plays a significant role in numerous real-world human-robot interaction (HRI) applications. In recent years, there has been growing interest in utilizing multi-sensory information, such as complementary audio and visual signals, to address the challenges of speaker tracking. Despite promising results, existing approaches still encounter difficulties in accurately determining the speaker's true location, particularly under adverse conditions such as speech pauses, reverberation, or occlusions,...

10.1109/icassp48485.2024.10446460 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18

The prevailing noise-resistant and reverberation-resistant localization algorithms primarily emphasize separating and providing a directional output for each speaker in multi-speaker scenarios, without associating the outputs with the identities of the speakers. In this paper, we present a target speaker localization algorithm with a selective hearing mechanism. Given the reference speech of the target speaker, we first produce a speaker-dependent spectrogram mask to eliminate the interfering speakers' speech. Subsequently, a Long Short-Term Memory (LSTM) network is...

10.1109/icassp48485.2024.10447143 article EN ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024-03-18
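
A rough sketch of the mask-then-localize pipeline described above: a speaker-dependent mask suppresses interfering speakers in the mixture spectrogram, and an LSTM maps the masked frames to direction-of-arrival classes. The mask here is random, and the sizes and the 36-class azimuth grid are illustrative assumptions, not the paper's configuration.

```python
# Mask the mixture spectrogram, then predict per-frame DoA with an LSTM (sketch).
import torch
import torch.nn as nn

n_freq, n_frames, n_doa = 257, 100, 36           # 36 azimuth classes of 10 degrees
mixture_spec = torch.rand(1, n_frames, n_freq)   # magnitude spectrogram
speaker_mask = torch.rand(1, n_frames, n_freq)   # would come from the reference speech

lstm = nn.LSTM(input_size=n_freq, hidden_size=128, batch_first=True)
doa_head = nn.Linear(128, n_doa)

masked = speaker_mask * mixture_spec             # suppress interfering speakers
hidden, _ = lstm(masked)
doa_logits = doa_head(hidden)                    # per-frame DoA prediction
print(doa_logits.shape)  # torch.Size([1, 100, 36])
```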

The speaker extraction algorithm extracts the target speech from a mixture containing interference speech and background noise. The extraction process sometimes over-suppresses the extracted speech, which not only creates artifacts during listening but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to settle the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce...

10.21437/interspeech.2022-157 article EN Interspeech 2022 2022-09-16
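
The waveform-level term named above, SI-SDR, has a standard definition: project the estimate onto the target, then take the energy ratio between the scaled target component and the residual. The snippet below is a stand-alone reference implementation of that metric only, not the paper's hybrid continuity loss.

```python
# Scale-invariant SDR (SI-SDR) between an estimate and a target waveform.
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to get the scaled target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * np.log10(np.sum(s_target**2) / (np.sum(e_noise**2) + eps))

t = np.random.randn(16000)
print(si_sdr(t + 0.1 * np.random.randn(16000), t))  # high value for a clean estimate
```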

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such a reference can be auditory, i.e., pre-recorded speech; visual, i.e., lip movements; or contextual, i.e., a phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker...

10.21437/interspeech.2022-11183 article EN Interspeech 2022 2022-09-16