- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Advanced Adaptive Filtering Techniques
- Hearing Loss and Rehabilitation
- Acoustic Wave Phenomena Research
- Image and Video Quality Assessment
- Video Surveillance and Tracking Methods
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Ultrasonics and Acoustic Wave Propagation
- Advanced Data Compression Techniques
- Gait Recognition and Analysis
- Multimedia Communication and Technology
- Structural Health Monitoring Techniques
- Personal Information Management and User Behavior
- Video Coding and Compression Technologies
- Team Dynamics and Performance
- Indoor and Outdoor Localization Technologies
- Machine Learning and Data Classification
- Wireless Networks and Protocols
- Advanced Image and Video Retrieval Techniques
- Knowledge Management and Sharing
- Advanced Image Processing Techniques
Microsoft (United States)
2010-2025
Microsoft (Finland)
2018-2025
Microsoft Research (United Kingdom)
2002-2024
University of Maryland, College Park
1996-2003
We describe new techniques to detect and analyze periodic motion as seen from both a static and a moving camera. By tracking objects of interest, we compute an object's self-similarity as it evolves in time. For periodic motion, the self-similarity measure is also periodic, and we apply time-frequency analysis to detect and characterize the periodic motion. The periodicity is also analyzed robustly using the 2D lattice structures inherent in similarity matrices. A real-time system has been implemented to track and classify objects using periodicity. Examples of object classification (people, running dogs,...
This paper introduces a new interpolation technique for demosaicing of color images produced by single-CCD digital cameras. We show that the proposed simple linear filter can lead to an improvement in PSNR of over 5.5 dB when compared to bilinear demosaicing, and about 0.7 dB improvement in R and B interpolation when compared to a recently introduced linear interpolator. The proposed filter also outperforms most nonlinear demosaicing algorithms, without the artifacts due to nonlinear processing, and with much reduced computational complexity.
The INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed to maximize the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluate noise suppression methods is to use objective metrics on the test set obtained by splitting the original dataset. While the performance is good on the synthetic test set, often the model performance degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests...
The Deep Noise Suppression (DNS) challenge was designed to unify the research efforts in the area of noise suppression targeted for human perception. We recently organized a DNS special session at INTERSPEECH 2020 and ICASSP 2021. We open-sourced training and test datasets for the wideband scenario along with a subjective evaluation framework based on ITU-T standard P.808, which was used to evaluate participants of the challenge. Many researchers from academia and industry made significant contributions to push the field forward, yet...
Human subjective evaluation is the "gold standard" to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. The conventional and widely used metrics require a reference clean speech signal, which is unavailable in real recordings. Previous no-reference approaches correlate poorly with human ratings and are not adopted in the research community. One of the biggest use cases of these perceptual objective metrics is to evaluate noise suppression algorithms. This paper introduces a multi-stage self-teaching based...
Human subjective evaluation is the "gold standard" to evaluate speech quality optimized for human perception. Perceptual objective metrics serve as a proxy for subjective scores. We have recently developed a non-intrusive speech quality metric called Deep Noise Suppression Mean Opinion Score (DNSMOS) using the scores from ITU-T Rec. P.808 [1] subjective evaluation. The P.808 scores reflect the overall quality of the audio clip. The ITU-T Rec. P.835 [2] subjective evaluation framework gives the standalone quality scores of speech and background noise in addition to the overall quality. In this work, we train an objective metric based on P.835 human ratings that output 3...
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. This is the 4th DNS challenge, with the previous editions held at INTERSPEECH 2020 [1], ICASSP 2021 [2], and INTERSPEECH 2021 [3]. We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective evaluation framework based on ITU-T P.835 to rate and rank-order the challenge entries. We provide access to DNS-MOS and word accuracy (WAcc) APIs to challenge participants to help with iterative...
Presents a correspondence-free method to automatically estimate the spatio-temporal parameters of gait (stride length and cadence) of a walking person from video. Stride and cadence are functions of body height, weight and gender, and we use these biometrics for identification and verification of people. The cadence is estimated using the periodicity of a walking person. Using a calibrated camera system, the stride length is estimated by first tracking the person and estimating their distance travelled over a period of time. By counting the number of steps (again using periodicity) and assuming...
The common meeting is an integral part of everyday life for most workgroups. However, due to travel, time, or other constraints, people are often not able to attend all the meetings they need to. Teleconferencing and recording of meetings can address this problem. In this paper we describe a system that provides these features, as well as a user study evaluation of the system. The system uses a variety of capture devices (a novel 360° camera, a whiteboard camera, an overview camera, and a microphone array) to provide a rich experience for people who want to participate in a meeting from...
The authors have developed a real-time, view-based gesture recognition system. Optical flow is estimated and segmented into motion blobs. Gestures are recognized using a rule-based technique based on characteristics of the motion blobs such as relative motion and size. Parameters of the gesture (e.g., frequency) are then estimated using context-specific techniques. The system has been applied to create an interactive environment for children.
A motion-based, correspondence-free technique for human gait recognition in monocular video is presented. We contend that the planar dynamics of a walking person are encoded in a 2D plot consisting of the pairwise image similarities of the sequence of images of the person, and that gait recognition can be achieved via standard pattern classification of these plots. We use background modelling to track the person for a number of frames and extract a sequence of segmented images of the person. The self-similarity plot is computed via correlation of each pair of images in this sequence. For recognition, the method applies...
Background noise is a major source of quality impairments in Voice over Internet Protocol (VoIP) and Public Switched Telephone Network (PSTN) calls. Recent work shows the efficacy of deep learning for noise suppression, but the datasets have been relatively small compared to those used in other domains (e.g., ImageNet) and the associated evaluations have been more focused. In order to better facilitate deep learning research in Speech Enhancement, we present a noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number...
This paper investigates several aspects of training an RNN (recurrent neural network) that impact the objective and subjective quality of enhanced speech for real-time single-channel speech enhancement. Specifically, we focus on an RNN that enhances short-time speech spectra on a single-frame-in, single-frame-out basis, a framework adopted by most classical signal processing methods. We propose two novel mean-squared-error-based learning objectives that enable separate control over the importance of speech distortion versus noise reduction. The...
Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563,...
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS special session at INTERSPEECH 2020 where we open-sourced training and test datasets for researchers to train their noise suppression models. We also open-sourced a subjective evaluation framework and used the tool to evaluate and select the final winners. Many researchers from academia and industry made significant contributions to push the field forward. We also learned that as a research community,...
The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed to maximize the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluate noise suppression methods is to use objective metrics on the test set obtained by splitting the original dataset. Many publications report reasonable performance on the synthetic test set drawn from the same distribution as that of the training set. However, often the model performance degrades...
The ITU-T Recommendation P.808 provides a crowdsourcing approach for conducting a subjective assessment of speech quality using the Absolute Category Rating (ACR) method. We provide an open-source implementation of ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended our implementation to include Degradation Category Ratings (DCR) and Comparison Category Ratings (CCR) test methods. We also significantly speed up the test process by integrating the participant qualification step into the main rating task, compared to a two-stage qualification and rating solution. We provide program scripts...
The ICASSP 2022 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and still a top issue in audio communication. This is the third AEC challenge, and it is enhanced by including mobile scenarios, adding speech recognition word accuracy rate as a metric, and making the audio 48 kHz. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices...
The ICASSP 2023 Deep Noise Suppression (DNS) Challenge marks the fifth edition of the DNS challenge series. DNS challenges were organized from 2019 to 2023 to foster research in the field of DNS. Previous DNS challenges were held at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, and ICASSP 2022. This challenge aims to advance models capable of jointly addressing denoising, dereverberation, and interfering talker suppression, with separate tracks focusing on headset and speakerphone scenarios. The challenge facilitates personalized deep noise suppression by providing accompanying enrollment clips for...
The ICASSP 2023 Speech Signal Improvement Challenge is intended to stimulate research in the area of improving the speech signal quality in communication systems. The speech signal quality can be measured with SIG in ITU-T P.835 and is still a top issue in audio communication and conferencing systems. For example, in the ICASSP 2022 Deep Noise Suppression challenge, the improvement in the background and overall quality is impressive, but the improvement in the speech signal is not statistically significant. To improve the speech signal, the following speech impairment areas must be addressed: coloration, discontinuity, loudness, reverberation, and noise. A training and test set...
Subjective speech quality assessment is the gold standard for evaluating speech enhancement processing and telecommunication systems. The commonly used standard ITU-T Rec. P.800 defines how to measure speech quality in lab environments, and ITU-T Rec. P.808 extended it to crowdsourcing. ITU-T Rec. P.835 extends P.800 to measure the quality of speech in the presence of noise. ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech quality dimensions which are measured during the listening phase of the conversation. The perceptual dimensions are noisiness, coloration, discontinuity, and loudness. We create a crowd-sourcing implementation...
Gait is one of the few biometrics that can be measured at a distance, and is hence useful for passive surveillance as well as biometric applications. Gait recognition research is still in its infancy, however, and we have yet to solve the fundamental issue of finding gait features which at once have sufficient discrimination power and can be extracted robustly and accurately from low-resolution video. This paper describes a novel gait recognition technique based on the image self-similarity of a walking person. We contend that the similarity plot encodes a projection of gait dynamics. It...
The visual motion of the mouth and the corresponding audio data generated when a person speaks are highly correlated. This fact has been exploited for lip/speech-reading and for improving speech recognition. We describe a method of automatically detecting a talking person (both spatially and temporally) using video and audio data from a single microphone. The audio-visual correlation is learned using a time-delayed neural network, which is then used to perform a spatio-temporal search for a speaking person. Applications include videoconferencing, video indexing...
Speech quality, as perceived by humans, is an important performance metric for telephony and voice services. It is typically measured through subjective listening tests, which can be tedious and expensive. Algorithms such as PESQ and POLQA serve as a computational proxy for subjective listening tests. Here we propose using a convolutional neural network to predict the perceived quality of speech with noise, reverberation, and distortions, both intrusively and non-intrusively, i.e., with and without a clean reference signal. The model is trained and evaluated on a corpus...