Xinyuan Qian

ORCID: 0000-0002-9511-6713
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Indoor and Outdoor Localization Technologies
  • CCD and CMOS Imaging Sensors
  • Advanced Adaptive Filtering Techniques
  • Infrared Target Detection Methodologies
  • Adaptive optics and wavefront sensing
  • Blind Source Separation Techniques
  • Image Processing Techniques and Applications
  • Face recognition and analysis
  • Tactile and Sensory Interactions
  • Video Surveillance and Tracking Methods
  • Advanced Sensor and Energy Harvesting Materials
  • Underwater Acoustics Research
  • Music Technology and Sound Studies
  • Inertial Sensor and Navigation
  • Advanced Optical Imaging Technologies
  • Generative Adversarial Networks and Image Synthesis
  • Infant Health and Development
  • Advanced MEMS and NEMS Technologies
  • Hearing Loss and Rehabilitation
  • Particle Detector Development and Performance
  • Natural Language Processing Techniques
  • Advanced Vision and Imaging

Affiliations

University of Science and Technology Beijing
2022-2025

University of Electronic Science and Technology of China
2024

National University of Singapore
2021-2023

Chinese University of Hong Kong, Shenzhen
2022

Shenzhen Research Institute of Big Data
2022

Queen Mary University of London
2017-2021

Shenyang University of Technology
2017

Nanyang Technological University
2011-2015

Heriot-Watt University
2014

Seoul National University
2013

Publications

Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning the lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. Moreover, ...

10.1109/cvpr52729.2023.01408 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
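The lip-reading-expert idea above lends itself to a compact auxiliary loss. Below is a minimal sketch, assuming a frozen lip-reading network `lipreader` that maps generated lip crops to per-frame character logits and a CTC objective aligning them with the spoken words; the function name, tensor shapes, and the CTC choice are illustrative assumptions, not the paper's exact formulation.

# Hedged sketch of a lip-reading-expert penalty (all names hypothetical).
import torch
import torch.nn.functional as F

def lipread_expert_loss(lipreader, generated_lips, text_targets, target_lengths):
    """Penalize generated lip regions that the expert cannot read correctly.

    generated_lips: (B, T, C, H, W) lip crops from the generator output.
    text_targets:   (B, L) token ids of the spoken words.
    """
    logits = lipreader(generated_lips)                   # (B, T, vocab)
    log_probs = logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab) for CTC
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    # CTC aligns frame-level predictions with the word sequence, so the
    # generator is rewarded for intelligible, content-correct lip motion.
    return F.ctc_loss(log_probs, text_targets, input_lengths, target_lengths)

In training, a term like this would be weighted and added to the usual reconstruction and lip-sync losses.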

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as the audio-visual interaction. Unlike prior work where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, a cross-attention ...

10.1145/3474085.3475587 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17
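A rough sketch of the cross-attention stage such a framework might use is given below; the layer sizes, names, and the per-frame binary classifier are assumptions for illustration, not the published TalkNet configuration.

# Minimal audio-visual cross-attention sketch: each stream attends to the
# other before a joint classifier decides, frame by frame, whether the
# visible person is speaking.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 2)  # speaking / not speaking

    def forward(self, audio, video):
        # audio, video: (B, T, dim) outputs of the temporal encoders.
        a_att, _ = self.a2v(query=audio, key=video, value=video)
        v_att, _ = self.v2a(query=video, key=audio, value=audio)
        joint = torch.cat([a_att, v_att], dim=-1)   # (B, T, 2*dim)
        return self.classifier(joint)               # per-frame ASD logits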

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the small size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of the speaker. This solution allows us to estimate, with a small ...

10.1109/tmm.2019.2902489 article EN IEEE Transactions on Multimedia 2019-03-01

Most of the prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose neural networks that combine audio and visual signals for multi-speaker localization. The heterogeneous sensors provide complementary information that can overcome uni-modal challenges such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an ...

10.1109/icassp39728.2021.9413776 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
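As a loose illustration of such audio-visual fusion, the sketch below embeds assumed audio features (e.g., GCC-PHAT vectors) and visual detection features separately, concatenates the embeddings, and predicts per-azimuth-bin speaker probabilities; every dimension, name, and the multi-label formulation are assumptions, not the paper's architecture.

# Hypothetical concatenation-fusion network for multi-speaker DoA estimation.
import torch
import torch.nn as nn

class AVDoANet(nn.Module):
    def __init__(self, audio_dim=51, video_dim=64, hidden=256, azimuth_bins=360):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # Multi-speaker localization cast as multi-label classification:
        # a bin is active if a speaker occupies that azimuth.
        self.head = nn.Linear(2 * hidden, azimuth_bins)

    def forward(self, audio_feat, video_feat):
        h = torch.cat([self.audio_net(audio_feat),
                       self.video_net(video_feat)], dim=-1)
        return torch.sigmoid(self.head(h))  # per-bin speaker probabilities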

Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of the spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, termed the T-F attention (TFA) module, that uses two parallel branches, i.e., ...

10.1109/taslp.2022.3225649 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-12-01
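A minimal sketch of such a two-branch T-F attention module is shown below, assuming average pooling along each axis and 1x1 convolutions to form per-axis attention vectors whose outer product re-weights the feature map; the exact pooling and convolution choices are assumptions.

# One branch summarizes energy across frequency to weight time frames, the
# other summarizes across time to weight frequency bins; their product
# forms a 2-D T-F attention map applied to the features.
import torch.nn as nn

class TFAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.time_branch = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.freq_branch = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        # x: (B, C, T, F) feature maps of a spectrogram-domain network.
        t_att = self.time_branch(x.mean(dim=3))        # (B, C, T), pooled over F
        f_att = self.freq_branch(x.mean(dim=2))        # (B, C, F), pooled over T
        att = t_att.unsqueeze(3) * f_att.unsqueeze(2)  # (B, C, T, F)
        return x * att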

GelSight sensors that estimate contact geometry and force by reconstructing the deformation of their soft elastomer from images yield poor measurements when the elastomer deforms uniformly or reaches saturation. Here we present an L³ F-TOUCH sensor that considerably enhances the three-axis force sensing capability of typical GelSight sensors. Specifically, ...

10.1109/lra.2023.3292575 article EN IEEE Robotics and Automation Letters 2023-07-05

Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism of Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical tasks: ...

10.48550/arxiv.2405.12609 preprint EN arXiv (Cornell University) 2024-05-21

Humans can perceive speakers' characteristics (e.g., identity, gender, personality and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) scholars have grounded their investigations on real-person faces, thereby restricting effective speech synthesis from applying to vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity and emotional ...

10.48550/arxiv.2501.03181 preprint EN arXiv (Cornell University) 2025-01-01

Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through ...

10.48550/arxiv.2501.03257 preprint EN arXiv (Cornell University) 2025-01-01

Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring variations in the noise characteristics, i.e., the interfering speakers and the background noise. That may result in extracting noisy signals from an incorrect sound source in challenging acoustic situations. To this end, we propose a ...

10.1109/taslpro.2025.3527766 article EN IEEE Transactions on Audio Speech and Language Processing 2025-01-01

10.1109/icassp49660.2025.10889128 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10887831 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1609/aaai.v39i24.34786 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple speakers with a de-emphasized acoustic map assisted by image detection-derived observations. The multi-modal observations are either assigned to existing tracks for ...

10.1109/tmm.2021.3061800 article EN IEEE Transactions on Multimedia 2021-02-24

Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. There have been studies that use a pre-recorded speech sample or a face image as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to perception. In this work, we explore the gesture sequence, e.g. hand and body movements, as the cue for speaker extraction, which could be easily obtained from low-resolution video recordings and is thus more available than face recordings. We propose two networks using ...

10.1109/lsp.2022.3175130 article EN cc-by IEEE Signal Processing Letters 2022-01-01

Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, an essential robotic function, was traditionally solved as a signal processing problem that now increasingly finds deep learning solutions. The question is how to fuse audio-visual information in an effective way. Speaker tracking is not only more desirable, but also potentially more accurate, than speaker localization ...

10.1109/taslp.2022.3226330 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-12-02

We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured by the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching, with detection results compared to a reference image in RGB space. Experiments ...

10.1109/icassp.2017.7952686 article EN 2017-03-01
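The adaptive weighting described above can be sketched as follows, assuming the audio reliability follows the frame's maximum GCF peak and the visual reliability follows a Bhattacharyya similarity between candidate and reference colour histograms; the normalization scheme is an illustrative assumption.

# Hedged sketch of adaptive audio-visual reliability weighting for a
# particle filter: cleaner audio (sharp GCF peak) or a better histogram
# match pulls the fused likelihood toward that modality.
import numpy as np

def fusion_weights(gcf_map, patch_hist, ref_hist, eps=1e-9):
    """Return (audio_weight, video_weight) in [0, 1], summing to 1."""
    audio_rel = float(np.max(gcf_map))  # GCF peaks near 1 indicate clean audio
    # Bhattacharyya coefficient between candidate and reference histograms.
    video_rel = float(np.sum(np.sqrt(patch_hist * ref_hist)))
    total = audio_rel + video_rel + eps
    return audio_rel / total, video_rel / total

def fused_likelihood(audio_lik, video_lik, w_a, w_v):
    # Per-particle likelihoods combined with the adaptive weights.
    return w_a * audio_lik + w_v * video_lik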

In this paper, we report a novel Time-Delay-Integration (TDI) CMOS image sensor for low-earth orbit (LEO) nano-satellite imaging applications, where limited exposure time and unexpected flight fluctuations are the major design challenges. The sensor features programmable integration per stage, a dynamic charge transfer path and tunable well capacity. A prototype chip of 1536×8 pixels was implemented using a TSMC 0.18µm process. Photodiodes and other transistors are floor-planned in different arrays, providing a small ...

10.1109/iscas.2012.6271566 article EN 2012 IEEE International Symposium on Circuits and Systems (ISCAS) 2012-05-01

In conjunction with huge recent progress in camera and computer vision technology, camera-based sensors have increasingly shown considerable promise in relation to tactile sensing. In comparison with competing technologies (be they resistive, capacitive or magnetic based), they offer super-high resolution while suffering from fewer wiring problems. The human tactile system is composed of various types of mechanoreceptors, each able to perceive and process distinct information such as force, pressure, texture, etc. ...

10.1109/icra48891.2023.10160634 article EN 2023-05-29

In this paper, we present a CMOS image sensor for star centroid measurement in star tracker applications. We propose a new capacitive transimpedance amplifier pixel architecture with an in-pixel charge subtraction scheme. The pixel is able to achieve a high signal-to-noise ratio for dim stars and, at the same time, avoid saturation for bright stars. A prototype was fabricated using a Global Foundry 65-nm mixed-signal process. Experimental results show that the sensor can achieve 3.8 V/lux·s ...

10.1109/jsen.2014.2365173 article EN IEEE Sensors Journal 2014-10-27

Robotic audition is a basic sense that helps robots perceive their surroundings and interact with humans. Sound Source Localization (SSL) is an essential module for a robotic system. However, the performance of most sound source localization techniques degrades in noisy and reverberant environments due to inaccurate Time Difference of Arrival (TDoA) estimation. In localization, we are more interested in detecting the arrival of human speech than of other sound sources. Ideally, we expect effective TDoA estimation to respond only ...

10.1109/icra48506.2021.9561885 article EN 2021-05-30
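The TDoA estimates such systems build on are commonly obtained with GCC-PHAT; a minimal NumPy version is sketched below (the speech-selective behaviour the paper targets is not reproduced here).

# Standard GCC-PHAT time-delay estimation between two microphone signals.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time delay of `sig` relative to `ref`, in seconds."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting keeps only phase, improving robustness to reverberation.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)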