- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Indoor and Outdoor Localization Technologies
- CCD and CMOS Imaging Sensors
- Advanced Adaptive Filtering Techniques
- Infrared Target Detection Methodologies
- Adaptive Optics and Wavefront Sensing
- Blind Source Separation Techniques
- Image Processing Techniques and Applications
- Face Recognition and Analysis
- Tactile and Sensory Interactions
- Video Surveillance and Tracking Methods
- Advanced Sensor and Energy Harvesting Materials
- Underwater Acoustics Research
- Music Technology and Sound Studies
- Inertial Sensor and Navigation
- Advanced Optical Imaging Technologies
- Generative Adversarial Networks and Image Synthesis
- Infant Health and Development
- Advanced MEMS and NEMS Technologies
- Hearing Loss and Rehabilitation
- Particle Detector Development and Performance
- Natural Language Processing Techniques
- Advanced Vision and Imaging
University of Science and Technology Beijing
2022-2025
University of Electronic Science and Technology of China
2024
National University of Singapore
2021-2023
Chinese University of Hong Kong, Shenzhen
2022
Shenzhen Research Institute of Big Data
2022
Queen Mary University of London
2017-2021
Shenyang University of Technology
2017
Nanyang Technological University
2011-2015
Heriot-Watt University
2014
Seoul National University
2013
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of the lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address this problem, we propose using a lipreading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. Moreover, ...
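One way to realize such a penalty, sketched below, is to pass the generated lip frames through a frozen lipreading network and score its output against the transcript with a CTC loss. The `lipreader` model and all tensor shapes here are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def lipreading_expert_loss(generated_lips, text_tokens, text_lens, lipreader):
    """Score generated lip frames with a frozen lipreading expert and
    penalize content that drifts from the spoken words via CTC loss.

    generated_lips: (B, T, C, H, W) generator output; gradients flow through it
    text_tokens:    (B, L) target transcript token ids (0 reserved for blank)
    text_lens:      (B,) true transcript lengths
    lipreader:      hypothetical frozen model returning (T, B, vocab) logits
    """
    for p in lipreader.parameters():
        p.requires_grad_(False)                      # expert stays fixed
    log_probs = lipreader(generated_lips).log_softmax(dim=-1)
    T, B = log_probs.shape[0], log_probs.shape[1]
    input_lens = torch.full((B,), T, dtype=torch.long)
    return F.ctc_loss(log_probs, text_tokens, input_lens, text_lens, blank=0)
```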
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, audio-visual cross-attention ...
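The cross-attention fusion named in the abstract can be sketched with standard attention layers, as below; the dimensions and the symmetric two-way design are illustrative assumptions rather than the released TalkNet code.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Audio queries attend to visual keys/values and vice versa,
    letting each modality emphasize the other's speaking-relevant frames."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) temporally aligned encoder outputs
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)
        return torch.cat([a_att, v_att], dim=-1)   # fused (B, T, 2*dim)

fusion = CrossModalAttention()
a = torch.randn(2, 50, 128)   # e.g. 2 s of audio frame embeddings
v = torch.randn(2, 50, 128)   # matching visual frame embeddings
print(fusion(a, v).shape)     # torch.Size([2, 50, 256])
```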
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the small size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows us to estimate, with small ...
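A minimal sketch of that plane constraint: build the acoustic search grid only on the horizontal plane at the vision-predicted mouth height, so the acoustic likelihood is never evaluated elsewhere. The grid extent and resolution below are assumed values.

```python
import numpy as np

def plane_constrained_grid(height, xy_extent=4.0, step=0.1):
    """Candidate source positions restricted to the plane z = height,
    where `height` comes from a visual (face/body) detection."""
    xs = np.arange(-xy_extent, xy_extent + step, step)
    ys = np.arange(-xy_extent, xy_extent + step, step)
    X, Y = np.meshgrid(xs, ys)
    # (N, 3) grid on which the acoustic map / likelihood is evaluated
    return np.stack([X.ravel(), Y.ravel(), np.full(X.size, height)], axis=1)
```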
Most prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose neural networks with audio and visual signals for multi-speaker localization. The heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these issues by introducing an ...
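A toy version of such an audio-visual DoA network is sketched below: per-frame audio features (e.g., GCC-PHAT vectors) and visual direction encodings are embedded separately, concatenated, and mapped to per-azimuth speaker probabilities. All layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AVDoANet(nn.Module):
    """Multi-speaker DoA as multi-label classification over azimuth bins."""
    def __init__(self, audio_dim=51, visual_dim=360, hidden=256, bins=360):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, bins)

    def forward(self, a, v):
        # a: (B, audio_dim) audio features; v: (B, visual_dim) visual encoding
        logits = self.head(torch.cat([self.audio(a), self.visual(v)], dim=-1))
        return torch.sigmoid(logits)  # independent per-bin speaker probabilities
```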
Speech enhancement plays an essential role in a wide range of speech processing applications. Recent studies on speech enhancement tend to investigate how to effectively capture the long-term contextual dependencies of speech signals to boost performance. However, they generally neglect the time-frequency (T-F) distribution information of speech spectral components, which is equally important for speech enhancement. In this paper, we propose a simple yet very effective network module, termed the T-F attention (TFA) module, that uses two parallel attention branches, i.e., ...
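A sketch of such a two-branch T-F attention module is given below: one branch pools over frequency to attend over time frames, the other pools over time to attend over frequency bins, and their outer product rescales the T-F representation. The exact layer choices here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Two parallel attention branches producing a 2-D T-F attention map."""
    def __init__(self, channels):
        super().__init__()
        self.time_branch = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.freq_branch = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):                         # x: (B, C, T, F)
        t = self.time_branch(x.mean(dim=3))       # (B, C, T) time attention
        f = self.freq_branch(x.mean(dim=2))       # (B, C, F) frequency attention
        att = t.unsqueeze(3) * f.unsqueeze(2)     # (B, C, T, F) attention map
        return x * att                            # rescaled representation
```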
GelSight sensors, which estimate contact geometry and force by reconstructing the deformation of their soft elastomer from images, yield poor force measurements when the elastomer deforms uniformly or reaches saturation. Here we present an L³ F-TOUCH sensor that considerably enhances the three-axis force sensing capability of typical GelSight sensors. Specifically, ...
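Three-axis force readout of this kind is often modeled as a linear-elastic mapping from a visually tracked suspension displacement to force; the stiffness values in the sketch below are hypothetical calibration constants, not the sensor's real parameters.

```python
import numpy as np

# Hypothetical calibration: a stiffness matrix maps the tracked displacement
# of the sensor's elastic suspension to a three-axis contact force.
K = np.diag([120.0, 120.0, 300.0])      # N/m, illustrative values only

def displacement_to_force(d_xyz):
    """Linear-elastic three-axis force estimate from displacement (metres)."""
    return K @ np.asarray(d_xyz)

print(displacement_to_force([0.001, 0.0, 0.002]))  # [0.12 0.   0.6 ] newtons
```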
Transformer and its derivatives have achieved success in diverse tasks across computer vision, natural language processing, and speech processing. To reduce the complexity of computations within the multi-head self-attention mechanism of the Transformer, Selective State Space Models (i.e., Mamba) were proposed as an alternative. Mamba has exhibited its effectiveness in processing vision tasks, but its superiority has rarely been investigated in speech signal processing. This paper explores solutions for applying Mamba to speech processing using two typical tasks: ...
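The core of such models is a selective scan: a per-step state-space recurrence whose step size and input/output projections depend on the input. Below is a plain-numpy sketch of that recurrence (clarity over speed; the shapes and discretization follow common Mamba write-ups, not this paper's exact formulation).

```python
import numpy as np

def selective_scan(x, dt, A, B, C):
    """Minimal selective state-space recurrence.

    x:  (T, D) input sequence        dt: (T, D) input-dependent step sizes
    A:  (D, N) state transition      B, C: (T, N) input-dependent projections
    """
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))
    y = np.zeros((T, D))
    for t in range(T):
        Abar = np.exp(dt[t][:, None] * A)        # per-step discretization
        Bbar = dt[t][:, None] * B[t][None, :]    # (D, N)
        h = Abar * h + Bbar * x[t][:, None]      # selective state update
        y[t] = (h * C[t][None, :]).sum(axis=1)   # project state to output
    return y

rng = np.random.default_rng(0)
T, D, N = 100, 4, 8
y = selective_scan(rng.standard_normal((T, D)), np.full((T, D), 0.1),
                   -np.ones((D, N)),             # negative A keeps the scan stable
                   rng.standard_normal((T, N)), rng.standard_normal((T, N)))
print(y.shape)  # (100, 4)
```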
Humans can perceive speakers' characteristics (e.g., identity, gender, personality and emotion) from their appearance, which is generally aligned with their voice style. Recently, vision-driven text-to-speech (TTS) scholars have grounded their investigations on real-person faces, thereby restricting effective speech synthesis from reaching vast potential usage scenarios with diverse characters and image styles. To solve this issue, we introduce a novel FaceSpeak approach. It extracts salient identity and emotional ...
Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through ...
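For context, the frame-by-frame handling of CTC posteriors can be illustrated with the simplest decoder, greedy best-path with blank collapsing, shown below; a WFST decoder instead walks these same per-frame posteriors through a static decoding graph.

```python
import numpy as np

def ctc_greedy_decode(posteriors, blank=0):
    """Pick the argmax token per frame, collapse repeats, then drop blanks."""
    best = posteriors.argmax(axis=1)          # (T,) best token per frame
    tokens, prev = [], blank
    for t in best:
        if t != prev and t != blank:          # collapse repeats, skip blanks
            tokens.append(int(t))
        prev = t
    return tokens

probs = np.array([[0.6, 0.3, 0.1],   # frame 1: blank
                  [0.1, 0.8, 0.1],   # frame 2: token 1
                  [0.2, 0.7, 0.1],   # frame 3: token 1 (repeat, collapsed)
                  [0.1, 0.1, 0.8]])  # frame 4: token 2
print(ctc_greedy_decode(probs))      # [1, 2]
```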
Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring variations in the noise characteristics, i.e., interference speakers and background noise. That may result in extracting noisy signals from an incorrect sound source in challenging acoustic situations. To this end, we propose a ...
Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple speakers on a de-emphasized acoustic map assisted by image detection-derived observations. The multi-modal observations are either assigned to existing tracks for ...
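For readers unfamiliar with the machinery, a single predict-update-resample cycle of a particle filter looks like the sketch below; the `likelihood` callback stands in for the paper's generative/discriminative audio-visual models, and the motion noise is an assumed constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, motion_std=0.05):
    """One predict-update-resample cycle over (N, 3) position particles.
    `likelihood(particles)` returns an (N,) observation score per particle."""
    particles = particles + rng.normal(0, motion_std, particles.shape)  # predict
    weights = weights * likelihood(particles)                           # update
    weights /= weights.sum()
    neff = 1.0 / np.sum(weights ** 2)            # effective sample size
    if neff < 0.5 * len(weights):                # resample when degenerate
        idx = rng.choice(len(weights), len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```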
Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. There have been studies that use a pre-recorded speech sample or a face image as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore a gesture sequence, e.g., hand and body movements, as the cue for speaker extraction, which can be easily obtained from low-resolution video recordings and is thus more available than face recordings. We propose two networks using ...
Audio-visual signals can be used jointly for robotic perception, as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, an essential robotic function, was traditionally solved as a signal processing problem but now increasingly finds deep learning solutions. The question is how to fuse audio-visual information in an effective way. Speaker tracking is not only more desirable, but also potentially more accurate than speaker localization ...
We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured from the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching with detection results compared to a reference image in RGB space. Experiments ...
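A minimal sketch of such reliability-weighted fusion is below: per-frame reliabilities derived from the GCF peak (audio) and histogram similarity (video) set the exponents of a geometric combination of the two likelihoods. The normalisation constants are assumptions, not the paper's calibrated values.

```python
import numpy as np

def fuse_likelihoods(audio_lik, video_lik, gcf_peak, hist_sim, gcf_max=1.0):
    """Adaptive audio-visual fusion of per-particle likelihood arrays."""
    r_a = np.clip(gcf_peak / gcf_max, 0.0, 1.0)   # audio reliability in [0, 1]
    r_v = np.clip(hist_sim, 0.0, 1.0)             # visual reliability in [0, 1]
    w_a = r_a / (r_a + r_v + 1e-9)                # normalised fusion weights
    w_v = r_v / (r_a + r_v + 1e-9)
    return audio_lik ** w_a * video_lik ** w_v    # geometric weighted fusion
```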
In this paper, we report a novel Time-Delay-Integration (TDI) CMOS image sensor for low-earth-orbit (LEO) nano-satellite imaging applications, where limited exposure time and unexpected flight fluctuations are the major design challenges. The sensor features programmable integration per stage, a dynamic charge transfer path, and tunable well capacity. A prototype chip of 1536×8 pixels was implemented using the TSMC 0.18 µm process. Photodiodes and other transistors are floor-planned in different arrays, providing small ...
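The benefit of TDI under short exposures can be illustrated numerically: each ground line is integrated by several pixel rows as it sweeps past, so signal adds linearly while read noise adds in quadrature, improving SNR by roughly the square root of the number of stages. The noise figures in this sketch are illustrative, not measured values.

```python
import numpy as np

def tdi_accumulate(scene_row, n_stages=8, noise_std=2.0, seed=1):
    """Each of n_stages pixel rows samples the same ground line once,
    each with independent read noise; charges are summed along the track."""
    rng = np.random.default_rng(seed)
    samples = scene_row[None, :] + rng.normal(0, noise_std,
                                              (n_stages, scene_row.size))
    return samples.sum(axis=0)

line = np.full(16, 10.0)                                  # faint uniform line
single = line + np.random.default_rng(2).normal(0, 2.0, 16)
tdi = tdi_accumulate(line)
print(round(single.std(), 2), round((tdi / 8).std(), 2))  # noise drops ~sqrt(8)
```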
In conjunction with huge recent progress in camera and computer vision technology, camera-based sensors have increasingly shown considerable promise in relation to tactile sensing. In comparison with competing technologies (be they resistive, capacitive or magnetic based), they offer super-high resolution while suffering from fewer wiring problems. The human tactile system is composed of various types of mechanoreceptors, each able to perceive and process distinct information such as force, pressure, texture, etc. ...
In this paper, we present a CMOS image sensor for star centroid measurement in star tracker applications. We propose a new capacitive transimpedance amplifier (CTIA) pixel architecture with an in-pixel charge subtraction scheme. The pixel is able to achieve a high signal-to-noise ratio for dim stars and, at the same time, avoid saturation for bright stars. A prototype was fabricated using the GlobalFoundries 65-nm mixed-signal process. Experimental results show that it can achieve 3.8 V/lux ...
Robotic audition is a basic sense that helps robots perceive their surroundings and interact with humans. Sound Source Localization (SSL) is an essential module for a robotic system. However, the performance of most sound source localization techniques degrades in noisy and reverberant environments due to inaccurate Time Difference of Arrival (TDoA) estimation. In robotic sound source localization, we are more interested in detecting the arrival of human speech than of other sound sources. Ideally, we expect an effective TDoA estimator to respond only ...
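The baseline such systems improve upon is GCC-PHAT, the classic TDoA estimator sketched below; the PHAT weighting whitens the cross-spectrum to sharpen the correlation peak, but, as noted above, it responds to any source and degrades under noise and reverberation.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDoA between two microphone signals via GCC-PHAT."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)   # PHAT-whitened correlation
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = np.roll(x, 8)                # simulate an 8-sample (0.5 ms) arrival delay
print(gcc_phat(y, x, fs=16000))  # approx 0.0005
```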