Alessio Brutti

ORCID: 0000-0003-4146-3071
Research Areas
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech Recognition and Synthesis
  • Indoor and Outdoor Localization Technologies
  • Advanced Adaptive Filtering Techniques
  • Video Surveillance and Tracking Methods
  • Speech and dialogue systems
  • Emotion and Mood Recognition
  • Music Technology and Sound Studies
  • Multimodal Machine Learning Applications
  • Hearing Loss and Rehabilitation
  • Natural Language Processing Techniques
  • Phonetics and Phonology Research
  • Human Pose and Action Recognition
  • Topic Modeling
  • Blind Source Separation Techniques
  • Underwater Acoustics Research
  • Gait Recognition and Analysis
  • Animal Vocal Communication and Behavior
  • Anomaly Detection Techniques and Applications
  • Domain Adaptation and Few-Shot Learning
  • Target Tracking and Data Fusion in Sensor Networks
  • Sentiment Analysis and Opinion Mining
  • Human Mobility and Location-Based Analysis
  • Gaze Tracking and Assistive Technology

Fondazione Bruno Kessler
2016-2025

Free University of Bozen-Bolzano
2022

Queen Mary University of London
2016

Istituto Centrale per la Ricerca Scientifica e Tecnologica Applicata al Mare
2005

10.1109/icassp49660.2025.10889251 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the small size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the likelihood on a horizontal plane defined by the predicted height of the speaker. This solution allows estimating, with a small...

10.1109/tmm.2019.2902489 article EN IEEE Transactions on Multimedia 2019-03-01
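
A minimal sketch of the plane-constrained search idea described in the abstract above: instead of scanning a full 3-D volume, the acoustic likelihood is evaluated only on a horizontal grid at the height predicted from a visual detection. The `srp_score` function, the room extent, and the predicted height are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def srp_score(xyz):
    """Placeholder acoustic map value (e.g. SRP/GCF) at a 3-D point.
    A real tracker would compute this from multi-channel audio."""
    target = np.array([1.0, 2.0, 1.6])            # hypothetical true speaker position
    return np.exp(-np.sum((xyz - target) ** 2))

def localize_on_plane(height, x_range=(0, 4), y_range=(0, 4), step=0.1):
    """Evaluate the acoustic likelihood only on the plane z = height
    predicted from the visual detection, instead of the full 3-D volume."""
    best, best_xy = -np.inf, None
    for x in np.arange(*x_range, step):
        for y in np.arange(*y_range, step):
            s = srp_score(np.array([x, y, height]))
            if s > best:
                best, best_xy = s, (x, y)
    return (*best_xy, height), best

predicted_height = 1.6     # assumed to come from the visual detection
print(localize_on_plane(predicted_height))
```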

Outdoor acoustic event detection is an exciting research field, but it is challenged by the need for complex algorithms and deep learning techniques that typically require substantial computational, memory and energy resources. This challenge discourages IoT implementations, where an efficient use of resources is required. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing energy efficiency. This paper addresses the application of sound event detection at the edge, optimizing the techniques on...

10.1109/jstsp.2020.2969775 article EN IEEE Journal of Selected Topics in Signal Processing 2020-01-27
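
The abstract above is about fitting sound event detection onto resource-constrained devices. As one illustrative (not paper-specific) optimization, the sketch below shows plain post-training INT8 weight quantization with NumPy, mapping float weights to 8-bit integers with a single per-tensor scale; layer sizes are made up.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q, with q in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 40).astype(np.float32)   # toy layer weights
q, scale = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))
print("bytes: float32 =", w.nbytes, "| int8 =", q.nbytes)
```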

Humans express their emotions via facial expressions, voice intonation and word choices. To infer the nature of the underlying emotion, recognition models may use a single modality, such as vision, audio or text, or a combination of modalities. Generally, models that fuse complementary information from multiple modalities outperform their uni-modal counterparts. However, a successful model that fuses modalities requires components that can effectively aggregate task-relevant information from each modality. As cross-modal attention is seen as an effective...

10.1109/icassp43922.2022.9746924 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27
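
A minimal sketch of cross-modal attention in the sense used above, built on PyTorch's standard `nn.MultiheadAttention`: queries come from one modality (e.g. audio frames) and keys/values from another (e.g. visual features). Dimensions, sequence lengths and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

d_model, n_heads = 128, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Toy sequences: 50 audio frames attend to 20 visual feature vectors.
audio = torch.randn(1, 50, d_model)    # queries
visual = torch.randn(1, 20, d_model)   # keys and values

fused, attn_weights = cross_attn(query=audio, key=visual, value=visual)
print(fused.shape, attn_weights.shape)   # (1, 50, 128), (1, 50, 20)
```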

Comparing the different sound source localization techniques proposed in the literature during the last decade is a relevant topic in order to establish the advantages and disadvantages of a given approach in a real-time implementation. Traditionally, localization algorithms rely on an estimation of the time difference of arrival (TDOA) at microphone pairs through the GCC-PHAT. When several pairs are available, the source position can be estimated as the point in space that best fits the set of TDOA measurements, by applying the global coherence field (GCF), also...

10.1109/hscma.2008.4538690 article EN 2008-05-01
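
A minimal sketch of the GCC-PHAT estimation of a TDOA between two microphone signals, as referenced in the abstract above; the sampling rate and the synthetic delay are illustrative.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the TDOA (in seconds) of sig relative to ref via GCC-PHAT."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lag = np.argmax(np.abs(cc)) - max_lag
    return lag / fs

fs = 16000
x = np.random.randn(fs)                      # 1 s of noise at microphone 1
delay = 12                                   # delay in samples at microphone 2
y = np.concatenate((np.zeros(delay), x))[:fs]
print(gcc_phat(y, x, fs))                    # ~ 12 / 16000 s
```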

10.1155/2010/147495 article EN cc-by EURASIP Journal on Audio Speech and Music Processing 2010-01-01

10.1109/icassp49660.2025.10890639 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting cues from the individual modalities, we fuse them adaptively, according to their reliability, in a particle filter framework. The reliability of the audio signal is measured based on the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching of the detection results against a reference image in RGB space. Experiments...

10.1109/icassp.2017.7952686 article EN 2017-03-01
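
A toy sketch of the adaptive fusion idea: particle weights combine audio and video likelihoods, with the audio contribution scaled by a reliability term derived from a (placeholder) GCF peak value. All functions, positions and numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles = 500
particles = rng.uniform([0, 0, 1.0], [4, 4, 2.0], size=(n_particles, 3))  # 3-D positions
weights = np.full(n_particles, 1.0 / n_particles)

def audio_likelihood(p, src=np.array([1.0, 2.0, 1.6])):
    return np.exp(-np.sum((p - src) ** 2, axis=-1) / 0.1)   # placeholder GCF-like score

def video_likelihood(p, src=np.array([1.1, 1.9, 1.6])):
    return np.exp(-np.sum((p - src) ** 2, axis=-1) / 0.05)  # placeholder detection score

gcf_peak = 0.7                     # assumed GCF peak value at this frame, in [0, 1]
alpha = gcf_peak                   # audio reliability weight (illustrative mapping)

# Adaptive fusion of per-particle likelihoods, then weight update and resampling.
lik = alpha * audio_likelihood(particles) + (1 - alpha) * video_likelihood(particles)
weights = weights * lik
weights /= weights.sum()
estimate = (weights[:, None] * particles).sum(axis=0)
idx = rng.choice(n_particles, size=n_particles, p=weights)   # multinomial resampling
particles = particles[idx]
print("estimated speaker position:", estimate)
```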

Audio-visual tracking of an unknown number of concurrent speakers in 3D is a challenging task, especially when sound and video are collected with a compact sensing platform. In this paper, we propose a tracker that builds on generative and discriminative audio-visual likelihood models formulated in a particle filtering framework. We localize multiple speakers on a de-emphasized acoustic map, assisted by image detection-derived observations. The multi-modal observations are either assigned to existing tracks for...

10.1109/tmm.2021.3061800 article EN IEEE Transactions on Multimedia 2021-02-24
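
A toy sketch of the observation-to-track step mentioned above: each multi-modal observation is assigned to the nearest existing track if it falls within a gating distance, otherwise it spawns a new track. The gate value, track update rule and data are illustrative, not the paper's method.

```python
import numpy as np

GATE = 0.5                                  # assumed gating distance in metres
tracks = [np.array([1.0, 2.0, 1.6])]        # current track positions (one active speaker)

observations = [np.array([1.1, 2.1, 1.6]),  # near the existing track
                np.array([3.0, 0.5, 1.7])]  # far away: likely a new speaker

for obs in observations:
    dists = [np.linalg.norm(obs - t) for t in tracks]
    j = int(np.argmin(dists))
    if dists[j] < GATE:
        tracks[j] = 0.7 * tracks[j] + 0.3 * obs     # simple update of the matched track
        print(f"obs {obs} -> updated track {j}")
    else:
        tracks.append(obs)                           # birth of a new track
        print(f"obs {obs} -> new track {len(tracks) - 1}")
```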

An interface for distant-talking control of home devices requires the possibility of identifying the positions of multiple users. Acoustic maps, based either on the global coherence field (GCF) or on the oriented GCF (OGCF), have already been exploited successfully to determine the position and head orientation of a single speaker. This paper proposes a new method using acoustic maps to deal with the case of two simultaneous speakers. The method is based on a two-step analysis of the map: first the dominant speaker is localized; then the map is modified by compensating for the effects...

10.1109/icassp.2008.4518618 article EN Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing 2008-03-01
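
A toy sketch of the two-step map analysis described above: find the dominant peak on a (synthetic) acoustic map, de-emphasize the region around it, then pick the second speaker's peak. The Gaussian suppression is an illustrative stand-in for the paper's compensation of the dominant speaker's effects.

```python
import numpy as np

x, y = np.meshgrid(np.linspace(0, 4, 81), np.linspace(0, 4, 81), indexing="ij")

def bump(cx, cy, h):
    return h * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / 0.1)

# Synthetic GCF-like map with a dominant and a weaker simultaneous speaker.
gcf_map = bump(1.0, 1.0, 1.0) + bump(3.0, 2.5, 0.6) + 0.01 * np.random.rand(*x.shape)

def peak(m):
    i, j = np.unravel_index(np.argmax(m), m.shape)
    return x[i, j], y[i, j]

p1 = peak(gcf_map)                                   # step 1: dominant speaker
suppress = np.exp(-((x - p1[0]) ** 2 + (y - p1[1]) ** 2) / 0.3)
p2 = peak(gcf_map * (1 - suppress))                  # step 2: second speaker on modified map
print("dominant:", p1, "second:", p2)
```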

This paper describes a surveillance system for intrusion detection which is based only on information derived from the processing of audio signals acquired by a distributed microphone network (DMN). In particular, it exploits different acoustic features and estimates event positions in order to detect intrusions and reject possible false alarms that may be generated by sound sources inside or outside the monitored room. An evaluation has been conducted to measure the performance in terms of missed detections in the presence of events produced by test...

10.1109/avss.2009.49 article EN 2009-09-01
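
A small sketch of the position-based false-alarm rejection logic suggested above: an acoustic event is kept as an intrusion alarm only if its estimated position falls inside the monitored room boundary. The room rectangle and the event list are made-up examples.

```python
# Monitored room modelled as an axis-aligned rectangle (metres); assumed geometry.
ROOM = {"xmin": 0.0, "xmax": 5.0, "ymin": 0.0, "ymax": 4.0}

def inside_room(x, y, room=ROOM):
    return room["xmin"] <= x <= room["xmax"] and room["ymin"] <= y <= room["ymax"]

# (label, estimated x, estimated y) produced by detection + localization (toy values).
events = [("glass_break", 2.1, 1.3), ("door_slam", 6.4, 2.0), ("speech", 1.0, 3.5)]

for label, ex, ey in events:
    verdict = "ALARM" if inside_room(ex, ey) else "rejected (outside room)"
    print(f"{label:12s} at ({ex:.1f}, {ey:.1f}) -> {verdict}")
```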

Recently, a fully supervised speaker diarization approach was proposed (UIS-RNN) which models speakers using multiple instances of a parameter-sharing recurrent neural network. In this paper we propose qualitative modifications to the model that significantly improve learning efficiency and overall performance. In particular, we introduce a novel loss function, called Sample Mean Loss, and present a better modelling of speaker turn behaviour, by devising an analytical expression to compute the probability of a new speaker joining...

10.1109/icassp40776.2020.9053477 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09

This paper presents an analysis of the Low-Complexity Acoustic Scene Classification task in the DCASE 2022 Challenge. The task was a continuation from previous years, but the low-complexity requirements were changed to the following: the maximum number of allowed parameters, including zero-valued ones, was 128 K, with the parameters being represented using the INT8 numerical format; and the maximum number of multiply-accumulate operations at inference time was 30 million. The provided baseline system is a convolutional neural network which employs...

10.48550/arxiv.2206.03835 preprint EN other-oa arXiv (Cornell University) 2022-01-01
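
A quick back-of-the-envelope check of the complexity limits quoted above (128 K INT8 parameters, 30 M multiply-accumulate operations per inference), applied to a made-up two-layer CNN on a 40x64 log-mel input; the layer shapes are illustrative and not the challenge baseline.

```python
def conv2d_cost(in_ch, out_ch, k, out_h, out_w):
    params = in_ch * out_ch * k * k + out_ch        # weights + biases
    macs = out_h * out_w * k * k * in_ch * out_ch   # one MAC per weight per output cell
    return params, macs

layers = [
    conv2d_cost(1, 16, 3, 40, 64),     # conv1 on a 40x64 log-mel patch (same padding)
    conv2d_cost(16, 32, 3, 20, 32),    # conv2 after 2x2 pooling
]
# Global average pooling, then a 32 -> 10 dense classifier.
dense_params, dense_macs = 32 * 10 + 10, 32 * 10

params = sum(p for p, _ in layers) + dense_params
macs = sum(m for _, m in layers) + dense_macs
print(f"params = {params:,} (limit: 128 K, INT8)")
print(f"MACs   = {macs:,} (limit: 30 M per inference)")
```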

Acoustic maps created on the basis of signals acquired by distributed networks of microphones allow one to identify the position and orientation of an active talker in an enclosure. In adverse situations with high background noise, reverberation or unavailability of direct paths to the microphones, localization may fail. This paper proposes a novel approach to the estimation of head orientation based on the classification of global coherence field (GCF) and oriented GCF maps. Preliminary experiments with data obtained from simulated propagation as well as from a real room...

10.1109/icassp.2007.366957 article EN 2007-04-01

Domestic environments are particularly challenging for distant speech recognition: reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard processing algorithms. In this application scenario, a crucial task is the detection and localization of speech events generated by users within the various rooms. A specific challenge of multi-room environments is the inter-room interference that negatively affects activity detectors. In this paper, we...

10.1109/eusipco.2015.7362588 article EN 2015-08-01

In this paper, we carry out an analysis on the use of speech separation guided diarization (SSGD) in telephone conversations. SSGD performs diarization by separating the speakers' signals and then applying voice activity detection to each estimated speaker signal. In particular, we compare two low-latency separation models. Moreover, we show a post-processing algorithm that significantly reduces the false alarm errors of the pipeline. We perform our experiments on two datasets: Fisher Corpus Part 1 and CALLHOME, evaluating both separation and diarization metrics. Notably,...

10.1109/slt54892.2023.10023280 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09
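
A schematic sketch of the SSGD pipeline described above: a (placeholder) separation front-end yields one waveform per speaker, a simple energy-based VAD is run on each estimated signal, and the resulting speech segments, labelled by source index, form the diarization output. The separation model is only stubbed out here; thresholds and durations are made up.

```python
import numpy as np

def separate(mixture, n_speakers=2):
    """Stub for a speech separation model; it just returns copies of the mixture
    so that the rest of the pipeline runs end to end."""
    return [mixture.copy() for _ in range(n_speakers)]

def energy_vad(signal, fs, frame=0.02, thr_db=-35.0):
    """Very simple frame-level energy VAD returning (start, end) segments in seconds."""
    hop = int(frame * fs)
    frames = signal[: len(signal) // hop * hop].reshape(-1, hop)
    db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = db > thr_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * frame
        elif not a and start is not None:
            segments.append((start, i * frame))
            start = None
    if start is not None:
        segments.append((start, len(active) * frame))
    return segments

fs = 8000
mixture = np.random.randn(5 * fs) * 0.1          # toy 5 s "conversation"
for spk, est in enumerate(separate(mixture)):
    for seg in energy_vad(est, fs):
        print(f"speaker {spk}: {seg[0]:.2f}-{seg[1]:.2f} s")
```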

We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and separation guided diarization (SSGD) paradigms. We studied their inference-time computational requirements and accuracy on four CTS datasets with different characteristics and languages. We found that, among all the methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off in terms...

10.1016/j.csl.2023.101534 article EN cc-by Computer Speech & Language 2023-05-30
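
For reference, a small helper for the diarization error rate typically used to score such systems, DER = (missed speech + false alarm + speaker confusion) / total reference speech; the component durations below are made-up numbers, not results from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech

# Toy component durations in seconds for one test conversation.
print(f"DER = {diarization_error_rate(12.0, 7.5, 20.5, 600.0):.2%}")
```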

Domestic environments are particularly challenging for distant speech recognition and audio processing in general. Reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard algorithms. The DIRHA EU project addresses the development of distant-speech interaction with devices and services within the multiple rooms of typical apartments. A corpus of multichannel data has been created to represent realistic scenes,...

10.1109/hscma.2014.6843271 article EN 2014-05-01