- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Blind Source Separation Techniques
- Advanced Adaptive Filtering Techniques
- Hearing Loss and Rehabilitation
- Topic Modeling
- Authorship Attribution and Profiling
- Adversarial Robustness in Machine Learning
- Natural Language Processing Techniques
- Digital Media Forensic Detection
- Anomaly Detection Techniques and Applications
- Hate Speech and Cyberbullying Detection
- Advanced Data Compression Techniques
- Neural Networks and Applications
- User Authentication and Security Systems
- Traffic control and management
- Indoor and Outdoor Localization Technologies
- Advanced Malware Detection Techniques
- Autonomous Vehicle Technology and Safety
- Water Systems and Optimization
- RFID technology advancements
- Human-Automation Interaction and Safety
- Generative Adversarial Networks and Image Synthesis
- Speech and dialogue systems
MSB Medical School Berlin
2024
Technische Universität Berlin
2004-2024
Ruhr University Bochum
2014-2023
General Directorate for National Roads and Motorways
2022
Signal Processing (United States)
2019-2021
Bucknell University
2021
Université de Toulouse
2014
Centre National de la Recherche Scientifique
2014
Laboratoire d'Analyse et d'Architecture des Systèmes
2014
Université Toulouse III - Paul Sabatier
2014
Voice interfaces are becoming widely accepted as input methods for a diverse set of devices. This development is driven by rapid improvements in automatic speech recognition (ASR), which now performs on par with human listening in many tasks. These improvements are based on an ongoing evolution of deep neural networks (DNNs) as the computational core of ASR. However, recent research results show that DNNs are vulnerable to adversarial perturbations, which allow attackers to force the transcription into a malicious output. In this paper, we...
Deep neural networks can generate images that are astonishingly realistic, so much so that it is often hard for humans to distinguish them from actual photos. These achievements have been largely made possible by Generative Adversarial Networks (GANs). While deep fake images have been thoroughly investigated in the image domain - a classical approach from the area of image forensics - an analysis in the frequency domain has been missing so far. In this paper, we address this shortcoming and our results reveal that in frequency space, GAN-generated images exhibit severe artifacts that can be...
With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual information enhances recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of the audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we...
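The stream weighting mentioned in this abstract is commonly realized as a log-linear combination of per-stream scores. The following is a minimal sketch of that general idea, not the method of the paper: the function names, the state labels, and the assumption that a per-frame audio weight between 0 and 1 is already available from some confidence estimator are all hypothetical.

```python
def combine_streams(audio_loglik, video_loglik, audio_weight):
    """Log-linear fusion of two streams for one frame.

    audio_loglik / video_loglik: dicts mapping a state label to a
    log-likelihood (hypothetical inputs). audio_weight is assumed to
    lie in [0, 1]; the video stream receives the remaining weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        state: audio_weight * audio_loglik[state]
        + (1.0 - audio_weight) * video_loglik[state]
        for state in audio_loglik
    }


def decode_frames(audio_frames, video_frames, weights):
    """Pick the best-scoring state per frame with a frame-specific weight."""
    decisions = []
    for audio, video, weight in zip(audio_frames, video_frames, weights):
        scores = combine_streams(audio, video, weight)
        decisions.append(max(scores, key=scores.get))
    return decisions


if __name__ == "__main__":
    # Toy scores: the audio stream favours "ba", the video stream "ga".
    audio = {"ba": -1.0, "ga": -3.0}
    video = {"ba": -4.0, "ga": -1.0}
    # Same observations, once trusting audio and once trusting video.
    print(decode_frames([audio, audio], [video, video], [0.9, 0.1]))
```

With a high audio weight the decision follows the audio stream, and with a low one it follows the video stream, which is why the weight needs to track how reliable each stream currently is.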
Authorship verification is the task of analyzing the linguistic patterns of two or more texts to determine whether they were written by the same author or not. The analysis is traditionally performed by experts who consider linguistic features, which include spelling mistakes, grammatical inconsistencies, and stylistics, for example. Machine learning algorithms, on the other hand, can be trained to accomplish the same, but have relied on so-called stylometric features. The disadvantage of such features is that their reliability is greatly diminished...
In order to improve the ASR performance in noisy environments, distorted speech is typically pre-processed by a speech enhancement algorithm, which usually results in a speech estimate containing residual noise and distortion. We may also have some measures of uncertainty or variance of the estimate. Uncertainty decoding is a framework that utilizes this knowledge of uncertainty in the input features during acoustic model scoring. Such frameworks have been well explored for traditional probabilistic models, but their optimal use for deep neural network...
Automatic speech recognition (ASR) systems can be fooled via targeted adversarial examples, which induce the ASR to produce arbitrary transcriptions in response to altered audio signals. However, state-of-the-art adversarial examples typically have to be fed into the ASR system directly, and are not successful when played in a room. Previously published over-the-air adversarial examples fall into one of three categories: they are either handcrafted, so conspicuous that human listeners can easily recognize the target transcription once alerted to its content, or...
We introduce VIBA, a novel approach for explainable video classification by adapting Information Bottlenecks for Attribution (IBA) to video sequences. While most traditional explainability methods are designed for image models, our IBA framework addresses the need for explainability in temporal models used for video analysis. To demonstrate its effectiveness, we apply VIBA to video deepfake detection, testing it on two architectures: the Xception model for spatial features and a VGG11-based model for capturing motion dynamics through optical flow. Using a custom...
We discuss how desirable it is that Large Language Models (LLMs) be able to adapt or align their language behavior with users who may be diverse in their language use. User diversity may come about, among others, due to i) age differences; ii) gender characteristics, and/or iii) multilingual experience, and associated differences in language processing and use. We consider potential consequences for usability, communication, and LLM development.
Authorship verification tries to answer the question of whether two documents with unknown authors were written by the same author or not. A range of successful technical approaches has been proposed for this task, many of which are based on traditional linguistic features such as n-grams. These algorithms achieve good results for certain types of written documents like books and novels. Forensic authorship verification for social media, however, is a much more challenging task since messages tend to be relatively short, with a large variety of different...
Room reverberation and background noise severely degrade the quality of hands-free speech communication systems. In this work, we address the problem of combined speech dereverberation and noise reduction using a variational Bayesian (VB) inference approach. Our method relies on a multichannel state-space model for the acoustic channels that combines frame-based observation equations in the frequency domain with a first-order Markov model to describe the time-varying nature of the room impulse responses. By modeling the source signal as a latent...
Audio-visual speech recognition is a promising approach to tackling the problem of reduced recognition rates under adverse acoustic conditions. However, finding an optimal mechanism for combining multi-modal information remains a challenging task. Various methods are applicable for integrating acoustic and visual information in Gaussian-mixture-model-based speech recognition, e.g., via dynamic stream weighting. The recent advances in deep neural network (DNN)-based speech recognition promise improved performance when using audio-visual information. The question...
Despite remarkable improvements, automatic speech recognition is susceptible to adversarial perturbations. Compared to standard machine learning architectures, these attacks are significantly more challenging, especially since the inputs to a speech recognition system are time series that contain both acoustic and linguistic properties of speech. Extracting all recognition-relevant information requires complex pipelines and an ensemble of specialized components. Consequently, an attacker needs to consider the entire pipeline. In this...
Time-frequency masking has emerged as a powerful technique for source separation of noisy and convolved speech mixtures. It has also been applied successfully for noisy speech recognition. But while significant SNR gains are possible by adequate masking functions, speech recognition performance suffers from the involved nonlinear operations so that the greatly improved SNR often contrasts with only slight improvements in the recognition rate. To address this problem, marginalization techniques have been used for speech recognition, but they rely on speech recognition to be carried out...
Speaker localization using microphone arrays is typically based on the expected phase and amplitude differences between microphones as a function of the wave arrival direction. However, in rooms with significant reverberation, the direct sound is contaminated by reflections and localization often fails. Recently, a reverberation-robust method was proposed, which uses only direct-path bins of the short-time Fourier transform (STFT) of the speech signals. The thresholding is performed according to the ratio of the first two singular values of the spatial spectrum...
To prevent abuses of Internet services, CAPTCHAs are used to distinguish humans from programs, where an audio-based scheme is beneficial to support visually impaired people. Previous studies show that most audio CAPTCHAs, albeit hard to solve for humans, are lacking in security strength. In this work we propose an audio CAPTCHA that is far more robust against automated attacks than is reported for current schemes. The CAPTCHA exhibits a good trade-off between human usability and security. This is achieved by exploiting the fact...
Linear discriminant analysis (LDA) is a powerful technique in pattern recognition to reduce the dimensionality of data vectors. It maximizes discriminability by retaining only those directions that minimize the ratio of within-class to between-class variance. In this paper, using the same principles as for conventional LDA, we propose to employ uncertainties of the noisy or distorted input data in order to estimate maximally discriminant directions. We demonstrate the efficiency of the proposed uncertain LDA on two applications using state-of-the-art...
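The variance-ratio criterion of conventional LDA described in this abstract can be illustrated with a small two-class, two-dimensional example. This is only a sketch of conventional Fisher LDA, not of the uncertain LDA the paper proposes; the function names and the toy data are assumptions made for illustration.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, mu):
    """2x2 scatter matrix of the points around the mean mu."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to Sw^{-1} (mu_b - mu_a)."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_b[0] - mu_a[0], mu_b[1] - mu_a[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]


def variance_ratio(class_a, class_b, w):
    """Within-class over between-class scatter after projecting onto w."""
    proj_a = [p[0] * w[0] + p[1] * w[1] for p in class_a]
    proj_b = [p[0] * w[0] + p[1] * w[1] for p in class_b]
    m_a = sum(proj_a) / len(proj_a)
    m_b = sum(proj_b) / len(proj_b)
    within = (sum((x - m_a) ** 2 for x in proj_a)
              + sum((x - m_b) ** 2 for x in proj_b))
    between = (m_a - m_b) ** 2
    return within / between


if __name__ == "__main__":
    # Toy data: classes differ along x, but spread widely along y.
    class_a = [(0.0, -3.0), (0.0, 3.0), (1.0, -3.0), (1.0, 3.0)]
    class_b = [(4.0, -3.0), (4.0, 3.0), (5.0, -3.0), (5.0, 3.0)]
    w = fisher_direction(class_a, class_b)
    print("direction:", w)
    print("ratio along w:", variance_ratio(class_a, class_b, w))
    print("ratio along (0.6, 0.8):",
          variance_ratio(class_a, class_b, [0.6, 0.8]))
```

In the toy data the retained direction lies along the x axis, where the classes are separated, and the ratio along it is smaller than along an oblique direction that mixes in the high-variance y axis.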
Automatic speech recognition (ASR) has become a widespread and convenient mode of human-machine interaction, but it is still not sufficiently reliable when used under highly noisy or reverberant conditions. One option for achieving far greater robustness is to include another modality that is unaffected by acoustic noise, such as video information. Currently the most successful approaches for such audiovisual ASR systems, coupled hidden Markov models (HMMs) and turbo decoding, both allow for slight asynchrony...
Most of the objective measures employed for speech intelligibility prediction require a clean reference signal, which is not accessible in all realistic scenarios. In this paper, we propose to re-synthesize the relevant features of the clean signal using only the noisy signal and to utilize them inside an intelligibility prediction framework that requires a reference. A statistical model called twin hidden Markov model (THMM) is used to synthesize the clean features. For the intelligibility prediction framework, the short-time objective intelligibility (STOI) measure is used as an accurate and well-known method. The experimental results show high...