- Speech and Audio Processing
- Music and Audio Processing
- Speech Recognition and Synthesis
- Indoor and Outdoor Localization Technologies
- Advanced Adaptive Filtering Techniques
- Video Surveillance and Tracking Methods
- Speech and Dialogue Systems
- Emotion and Mood Recognition
- Music Technology and Sound Studies
- Multimodal Machine Learning Applications
- Hearing Loss and Rehabilitation
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Human Pose and Action Recognition
- Topic Modeling
- Blind Source Separation Techniques
- Underwater Acoustics Research
- Gait Recognition and Analysis
- Animal Vocal Communication and Behavior
- Anomaly Detection Techniques and Applications
- Domain Adaptation and Few-Shot Learning
- Target Tracking and Data Fusion in Sensor Networks
- Sentiment Analysis and Opinion Mining
- Human Mobility and Location-Based Analysis
- Gaze Tracking and Assistive Technology
Fondazione Bruno Kessler
2016-2025
Free University of Bozen-Bolzano
2022
Queen Mary University of London
2016
Istituto Centrale per la Ricerca Scientifica e Tecnologica Applicata al Mare
2005
Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, their compact size makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows us to estimate, with small...
Outdoor acoustic event detection is an exciting research field, but it is challenged by the need for complex algorithms and deep learning techniques, which typically require many computational, memory and energy resources. This challenge discourages IoT implementation, where an efficient use of resources is required. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing efficiency. This paper addresses the application of sound event detection at the edge, optimizing techniques on...
Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio or text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate the task-relevant information from each modality. As cross-modal attention is seen as an effective...
Comparing the different sound source localization techniques proposed in the literature during the last decade represents a relevant topic in order to establish the advantages and disadvantages of a given approach in a real-time implementation. Traditionally, algorithms for sound source localization rely on an estimation of the time difference of arrival (TDOA) at microphone pairs through the GCC-PHAT. When several pairs are available, the source position can be estimated as the point in space that best fits the set of TDOA measurements by applying a global coherence field (GCF), also...
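The GCC-PHAT step this abstract refers to can be sketched in a few lines: the cross-power spectrum of two microphone signals is whitened so that only phase drives the correlation peak, and the peak lag gives the TDOA. This is a minimal illustrative sketch, not code from the paper; the function name and defaults are my own.

```python
import numpy as np

def gcc_phat(sig, refsig, fs, max_tau=None):
    """Estimate the TDOA (seconds) between sig and refsig via GCC-PHAT."""
    # Zero-pad to the sum of lengths to avoid circular wrap-around.
    n = len(sig) + len(refsig)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(refsig, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    # Re-centre so that index max_shift corresponds to lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                   # positive when sig lags refsig
```

With several microphone pairs, each pairwise TDOA estimate contributes one constraint, and the GCF evaluates how well a candidate point in space explains all of them jointly.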
We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured based on the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching, with detection results compared to a reference image in RGB space. Experiments...
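The adaptive fusion described here can be illustrated with a small sketch: per-particle audio and video likelihoods are combined with weights derived from the GCF peak value and the histogram similarity. The specific mapping from reliability scores to weights below is a hypothetical stand-in; the paper's exact scheme is not reproduced here.

```python
import numpy as np

def fuse_likelihoods(p_audio, p_video, gcf_peak, hist_sim, gcf_max=1.0):
    """Adaptively weight per-particle audio/video likelihoods.

    gcf_peak : max GCF value this frame (audio reliability cue).
    hist_sim : colour-histogram similarity in [0, 1] (video reliability cue).
    """
    alpha = np.clip(gcf_peak / gcf_max, 0.0, 1.0)   # audio weight (raw)
    beta = np.clip(hist_sim, 0.0, 1.0)              # video weight (raw)
    s = alpha + beta + 1e-12
    alpha, beta = alpha / s, beta / s               # normalise to sum to 1
    # Weighted geometric mean keeps the fused value a valid likelihood.
    return (p_audio ** alpha) * (p_video ** beta)
```

When one modality becomes unreliable (e.g. the GCF peak collapses in a silent frame), its exponent shrinks and the other modality dominates the particle weights.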
Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple speakers on a de-emphasized acoustic map assisted by the image detection-derived observations. The multi-modal observations are either assigned to existing tracks for...
An interface for distant-talking control of home devices requires the possibility of identifying the positions of multiple users. Acoustic maps, based either on the global coherence field (GCF) or the oriented GCF (OGCF), have already been exploited successfully to determine the position and head orientation of a single speaker. This paper proposes a new method using acoustic maps to deal with the case of two simultaneous speakers. The method is a two-step analysis of the map: first the dominant speaker is localized; then the map is modified by compensating for the effects...
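The two-step analysis can be sketched on a discretised acoustic map. Here the paper's compensation step is replaced by a simple stand-in (suppressing a neighbourhood around the dominant peak) purely to illustrate the localize-compensate-localize structure; the function and radius parameter are illustrative.

```python
import numpy as np

def two_speaker_peaks(gcf_map, suppress_radius):
    """Localize two speakers on a 2-D acoustic map (illustrative sketch).

    Step 1: take the global maximum as the dominant speaker.
    Step 2: suppress its neighbourhood (stand-in for the paper's
    compensation of the dominant speaker's effects), then take the
    new maximum as the second speaker.
    """
    m = gcf_map.copy()
    p1 = np.unravel_index(np.argmax(m), m.shape)
    yy, xx = np.ogrid[:m.shape[0], :m.shape[1]]
    mask = (yy - p1[0]) ** 2 + (xx - p1[1]) ** 2 <= suppress_radius ** 2
    m[mask] = m.min()
    p2 = np.unravel_index(np.argmax(m), m.shape)
    return p1, p2
```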
This paper describes a surveillance system for intrusion detection which is based only on information derived from the processing of audio signals acquired by a distributed microphone network (DMN). In particular, it exploits different acoustic features and estimates event positions in order to detect and reject possible false alarms that may be generated by sound sources inside or outside the monitored room. An evaluation has been conducted to measure performance in terms of missed detections in the presence of events produced by test...
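The position-based false-alarm rejection can be reduced to a simple geometric gate: an acoustic event is kept only if its estimated position falls inside the monitored room. This is a hypothetical minimal sketch of that gating idea, not the paper's actual decision rule.

```python
def inside_room(pos, room_min, room_max, margin=0.0):
    """Accept an event only if its estimated 3-D position lies within
    the monitored room's bounding box (illustrative gating rule)."""
    return all(lo - margin <= p <= hi + margin
               for p, lo, hi in zip(pos, room_min, room_max))
```

Events localized outside the box (e.g. a loud sound in an adjacent room or in the street) are discarded as false alarms before any classification stage.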
Recently, a fully supervised speaker diarization approach was proposed (UIS-RNN), which models speakers using multiple instances of a parameter-sharing recurrent neural network. In this paper we propose qualitative modifications to the model that significantly improve learning efficiency and overall diarization performance. In particular, we introduce a novel loss function, called Sample Mean Loss, and present a better modelling of turn behaviour, by devising an analytical expression to compute the probability of a new speaker joining...
This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in the DCASE 2022 Challenge. The task was a continuation from previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including zero-valued ones, is 128 K, with parameters being represented using the INT8 numerical format; and the maximum number of multiply-accumulate operations at inference time is 30 million. The provided baseline system is a convolutional neural network which employs...
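The two complexity limits quoted above are easy to check mechanically for a candidate CNN. The sketch below counts parameters and multiply-accumulates for plain 2-D convolutions; the helper names and the toy layer shapes are illustrative, not taken from the challenge baseline.

```python
PARAM_LIMIT = 128_000       # 128 K parameters (INT8, zero-valued ones included)
MAC_LIMIT = 30_000_000      # 30 M multiply-accumulate operations per inference

def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of one k x k conv layer with bias."""
    params = c_out * (c_in * k * k + 1)
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

def within_limits(layers):
    """True if the summed costs of all layers respect both task limits."""
    total_params = sum(p for p, _ in layers)
    total_macs = sum(m for _, m in layers)
    return total_params <= PARAM_LIMIT and total_macs <= MAC_LIMIT
```

Note that for convolutions the parameter count is independent of the input resolution, while the MAC count scales with the output feature-map size, so the two limits constrain a model in different ways.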
Acoustic maps created on the basis of signals acquired by distributed networks of microphones allow one to identify the position and orientation of an active talker in an enclosure. In adverse situations, such as high background noise, reverberation or unavailability of direct paths to the microphones, localization may fail. This paper proposes a novel approach to head orientation estimation based on classification of global coherence field (GCF) and oriented GCF maps. Preliminary experiments with data obtained by simulated propagation as well as in a real room...
Domestic environments are particularly challenging for distant speech recognition: reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard processing algorithms. In this application scenario, a crucial task is the detection and localization of events generated by users within the various rooms. A specific challenge of multi-room settings is the inter-room interference that negatively affects activity detectors. In this paper, we...
A Smart City based on data acquisition, handling and intelligent analysis requires an efficient design and implementation of the respective AI technologies and underlying infrastructure for seamlessly analyzing large amounts of data in real-time. The EU project MARVEL will research solutions that can improve the integration of multiple data sources in a smart city environment, harnessing the advantages rooted in multimodal perception of the surrounding environment.
In this paper, we carry out an analysis on the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection on each estimated speaker signal. In particular, we compare two low-latency separation models. Moreover, we show a post-processing algorithm that significantly reduces the false alarm errors of the pipeline. We perform our experiments on two datasets: the Fisher Corpus Part 1 and CALLHOME, evaluating both metrics. Notably,...
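The SSGD pipeline, as summarized in this abstract, is separation followed by per-source voice activity detection. The sketch below shows that structure with a simple energy-based VAD standing in for a trained VAD, and `separate` as a placeholder for any 2-speaker separation model returning an array of shape (2, n_samples); all names and thresholds are illustrative.

```python
import numpy as np

def energy_vad(sig, fs, frame=0.025, hop=0.010, thresh_db=-40.0):
    """Frame-level energy VAD on one estimated speaker signal.
    Returns a boolean array, one value per frame."""
    flen, hlen = int(frame * fs), int(hop * fs)
    frames = [sig[i:i + flen] for i in range(0, len(sig) - flen + 1, hlen)]
    energy = np.array([np.mean(f ** 2) + 1e-12 for f in frames])
    db = 10 * np.log10(energy / (energy.max() + 1e-12))
    return db > thresh_db

def ssgd(mixture, separate, fs):
    """SSGD sketch: separate the mixture, then run VAD on each source."""
    sources = separate(mixture)         # shape (2, n_samples)
    return [energy_vad(s, fs) for s in sources]
```

Because each VAD output is already tied to one separated source, speaker labels come for free and no clustering step is needed, which is what makes the approach attractive for low-latency telephone use.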
We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural (EEND), and separation guided (SSGD) paradigms. We studied their inference-time computational requirements and accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off in terms...
Domestic environments are particularly challenging for distant speech recognition and audio processing in general. Reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard algorithms. The DIRHA EU project addresses the development of distant-speech interaction with devices and services within the multiple rooms of typical apartments. A corpus of multichannel data has been created to represent realistic scenes,...