Xinyuan Qian

ORCID: 0000-0002-9511-6713
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Indoor and Outdoor Localization Technologies
  • CCD and CMOS Imaging Sensors
  • Advanced Adaptive Filtering Techniques
  • Infrared Target Detection Methodologies
  • Adaptive optics and wavefront sensing
  • Blind Source Separation Techniques
  • Image Processing Techniques and Applications
  • Face recognition and analysis
  • Tactile and Sensory Interactions
  • Video Surveillance and Tracking Methods
  • Advanced Sensor and Energy Harvesting Materials
  • Underwater Acoustics Research
  • Music Technology and Sound Studies
  • Inertial Sensor and Navigation
  • Advanced Optical Imaging Technologies
  • Generative Adversarial Networks and Image Synthesis
  • Infant Health and Development
  • Advanced MEMS and NEMS Technologies
  • Hearing Loss and Rehabilitation
  • Particle Detector Development and Performance
  • Natural Language Processing Techniques
  • Advanced Vision and Imaging

Affiliations

University of Science and Technology Beijing
2022-2025

University of Electronic Science and Technology of China
2024

National University of Singapore
2021-2023

Chinese University of Hong Kong, Shenzhen
2022

Shenzhen Research Institute of Big Data
2022

Queen Mary University of London
2017-2021

Shenyang University of Technology
2017

Nanyang Technological University
2011-2015

Heriot-Watt University
2014

Seoul National University
2013

Publications

Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning the lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. Moreover, ...

10.1109/cvpr52729.2023.01408 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
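The lip-reading-expert idea above lends itself to a compact auxiliary loss. Below is a minimal sketch, assuming a frozen lip-reading network `lipreader` that maps generated lip crops to per-frame character logits and a CTC objective aligning them with the spoken words; the function name, tensor shapes, and the CTC choice are illustrative assumptions, not the paper's exact formulation.

# Hedged sketch of a lip-reading-expert penalty (all names hypothetical).
import torch
import torch.nn.functional as F

def lipread_expert_loss(lipreader, generated_lips, text_targets, target_lengths):
    """Penalize generated lip regions that the expert cannot read correctly.

    generated_lips: (B, T, C, H, W) lip crops from the generator output.
    text_targets:   (B, L) token ids of the spoken words.
    """
    logits = lipreader(generated_lips)                   # (B, T, vocab)
    log_probs = logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab) for CTC
    input_lengths = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    # CTC aligns frame-level predictions with the word sequence, so the
    # generator is rewarded for intelligible, content-correct lip motion.
    return F.ctc_loss(log_probs, text_targets, input_lengths, target_lengths)

In training, a term like this would be weighted and added to the usual reconstruction and lip-sync losses.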

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as the audio-visual interaction. Unlike prior work where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, a cross-attention ...

10.1145/3474085.3475587 article EN Proceedings of the 30th ACM International Conference on Multimedia 2021-10-17
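A rough sketch of the cross-attention stage such a framework might use is given below; the layer sizes, names, and the per-frame binary classifier are assumptions for illustration, not the published TalkNet configuration.

# Minimal audio-visual cross-attention sketch: each stream attends to the
# other before a joint classifier decides, frame by frame, whether the
# visible person is speaking.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 2)  # speaking / not speaking

    def forward(self, audio, video):
        # audio, video: (B, T, dim) outputs of the temporal encoders.
        a_att, _ = self.a2v(query=audio, key=video, value=video)
        v_att, _ = self.v2a(query=video, key=audio, value=audio)
        joint = torch.cat([a_att, v_att], dim=-1)   # (B, T, 2*dim)
        return self.classifier(joint)               # per-frame ASD logits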

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the small size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of the speaker. This solution allows us to estimate, with a small ...

10.1109/tmm.2019.2902489 article EN IEEE Transactions on Multimedia 2019-03-01

Most of the prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose neural networks that combine audio and visual signals for multi-speaker localization. The heterogeneous sensors provide complementary information that can overcome uni-modal challenges such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an ...

10.1109/icassp39728.2021.9413776 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
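As a loose illustration of such audio-visual fusion, the sketch below embeds assumed audio features (e.g., GCC-PHAT vectors) and visual detection features separately, concatenates the embeddings, and predicts per-azimuth-bin speaker probabilities; every dimension, name, and the multi-label formulation are assumptions, not the paper's architecture.

# Hypothetical concatenation-fusion network for multi-speaker DoA estimation.
import torch
import torch.nn as nn

class AVDoANet(nn.Module):
    def __init__(self, audio_dim=51, video_dim=64, hidden=256, azimuth_bins=360):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # Multi-speaker localization cast as multi-label classification:
        # a bin is active if a speaker occupies that azimuth.
        self.head = nn.Linear(2 * hidden, azimuth_bins)

    def forward(self, audio_feat, video_feat):
        h = torch.cat([self.audio_net(audio_feat),
                       self.video_net(video_feat)], dim=-1)
        return torch.sigmoid(self.head(h))  # per-bin speaker probabilities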

Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, these studies generally neglect the time-frequency (T-F) distribution information of the spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, termed the T-F attention (TFA) module, that uses two parallel branches, i.e., ...

10.1109/taslp.2022.3225649 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-12-01
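A minimal sketch of such a two-branch T-F attention module is shown below, assuming average pooling along each axis and 1x1 convolutions to form per-axis attention vectors whose outer product re-weights the feature map; the exact pooling and convolution choices are assumptions.

# One branch summarizes energy across frequency to weight time frames, the
# other summarizes across time to weight frequency bins; their product
# forms a 2-D T-F attention map applied to the features.
import torch.nn as nn

class TFAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.time_branch = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.freq_branch = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        # x: (B, C, T, F) feature maps of a spectrogram-domain network.
        t_att = self.time_branch(x.mean(dim=3))        # (B, C, T), pooled over F
        f_att = self.freq_branch(x.mean(dim=2))        # (B, C, F), pooled over T
        att = t_att.unsqueeze(3) * f_att.unsqueeze(2)  # (B, C, T, F)
        return x * att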

GelSight sensors that estimate contact geometry and force by reconstructing the deformation of their soft elastomer from images yield poor measurements when the elastomer deforms uniformly or reaches saturation. Here we present an L³ F-TOUCH sensor that considerably enhances the three-axis force sensing capability of typical GelSight sensors. Specifically, ...

10.1109/lra.2023.3292575 article EN IEEE Robotics and Automation Letters 2023-07-05

Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism of Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba exhibited its effectiveness in natural language processing and computer vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical tasks: ...

10.48550/arxiv.2405.12609 preprint EN arXiv (Cornell University) 2024-05-21

Humans can perceive speakers' characteristics (e.g., identity, gender, personality and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) scholars have grounded their investigations on real-person faces, thereby restricting effective speech synthesis from applying to vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity and emotional ...

10.48550/arxiv.2501.03181 preprint EN arXiv (Cornell University) 2025-01-01

Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through ...

10.48550/arxiv.2501.03257 preprint EN arXiv (Cornell University) 2025-01-01

Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring variations in the noise characteristics, i.e., the interfering speakers and the background noise. That may result in extracting noisy signals from an incorrect sound source in challenging acoustic situations. To this end, we propose a ...

10.1109/taslpro.2025.3527766 article EN IEEE Transactions on Audio Speech and Language Processing 2025-01-01

10.1109/icassp49660.2025.10889128 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10887831 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1609/aaai.v39i24.34786 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple speakers with a de-emphasized acoustic map assisted by image detection-derived observations. The multi-modal observations are either assigned to existing tracks for ...

10.1109/tmm.2021.3061800 article EN IEEE Transactions on Multimedia 2021-02-24

Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. There have been studies that use a pre-recorded speech sample or a face image as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to perception. In this work, we explore the gesture sequence, e.g. hand and body movements, as the cue for speaker extraction, which could be easily obtained from low-resolution video recordings and is thus more available than face recordings. We propose two networks using ...

10.1109/lsp.2022.3175130 article EN cc-by IEEE Signal Processing Letters 2022-01-01

Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, an essential robotic function, was traditionally solved as a signal processing problem that now increasingly finds deep learning solutions. The question is how to fuse audio-visual information in an effective way. Speaker tracking is not only more desirable, but also potentially more accurate, than speaker localization ...

10.1109/taslp.2022.3226330 article EN cc-by IEEE/ACM Transactions on Audio Speech and Language Processing 2022-12-02

We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured by the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching, with detection results compared to a reference image in RGB space. Experiments ...

10.1109/icassp.2017.7952686 article EN 2017-03-01
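The adaptive weighting described above can be sketched as follows, assuming the audio reliability follows the frame's maximum GCF peak and the visual reliability follows a Bhattacharyya similarity between candidate and reference colour histograms; the normalization scheme is an illustrative assumption.

# Hedged sketch of adaptive audio-visual reliability weighting for a
# particle filter: cleaner audio (sharp GCF peak) or a better histogram
# match pulls the fused likelihood toward that modality.
import numpy as np

def fusion_weights(gcf_map, patch_hist, ref_hist, eps=1e-9):
    """Return (audio_weight, video_weight) in [0, 1], summing to 1."""
    audio_rel = float(np.max(gcf_map))  # GCF peaks near 1 indicate clean audio
    # Bhattacharyya coefficient between candidate and reference histograms.
    video_rel = float(np.sum(np.sqrt(patch_hist * ref_hist)))
    total = audio_rel + video_rel + eps
    return audio_rel / total, video_rel / total

def fused_likelihood(audio_lik, video_lik, w_a, w_v):
    # Per-particle likelihoods combined with the adaptive weights.
    return w_a * audio_lik + w_v * video_lik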

In this paper, we report a novel Time-Delay-Integration (TDI) CMOS image sensor for low-earth orbit (LEO) nano-satellite imaging applications, where limited exposure time and unexpected flight fluctuations are the major design challenges. The sensor features programmable integration per stage, a dynamic charge transfer path and tunable well capacity. A prototype chip of 1536×8 pixels was implemented using a TSMC 0.18µm process. Photodiodes and other transistors are floor-planned in different arrays, providing a small ...

10.1109/iscas.2012.6271566 article EN 2012 IEEE International Symposium on Circuits and Systems (ISCAS) 2012-05-01

In conjunction with huge recent progress in camera and computer vision technology, camera-based sensors have increasingly shown considerable promise in relation to tactile sensing. In comparison with competing technologies (be they resistive, capacitive or magnetic based), they offer super-high resolution while suffering from fewer wiring problems. The human tactile system is composed of various types of mechanoreceptors, each able to perceive and process distinct information such as force, pressure, texture, etc. ...

10.1109/icra48891.2023.10160634 article EN 2023-05-29

In this paper, we present a CMOS image sensor for star centroid measurement in star tracker applications. We propose a new capacitive transimpedance amplifier pixel architecture with an in-pixel charge subtraction scheme. The pixel is able to achieve a high signal-to-noise ratio for dim stars and, at the same time, avoid saturation for bright stars. A prototype was fabricated using a Global Foundry 65-nm mixed-signal process. Experimental results show that the sensor can achieve 3.8 V/lux·s ...

10.1109/jsen.2014.2365173 article EN IEEE Sensors Journal 2014-10-27

Robotic audition is a basic sense that helps robots perceive their surroundings and interact with humans. Sound Source Localization (SSL) is an essential module for a robotic system. However, the performance of most sound source localization techniques degrades in noisy and reverberant environments due to inaccurate Time Difference of Arrival (TDoA) estimation. In localization, we are more interested in detecting the arrival of human speech than of other sound sources. Ideally, we expect effective TDoA estimation to respond only ...

10.1109/icra48506.2021.9561885 article EN 2021-05-30
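The TDoA estimates such systems build on are commonly obtained with GCC-PHAT; a minimal NumPy version is sketched below (the speech-selective behaviour the paper targets is not reproduced here).

# Standard GCC-PHAT time-delay estimation between two microphone signals.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time delay of `sig` relative to `ref`, in seconds."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting keeps only phase, improving robustness to reverberation.
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)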