- Speech and dialogue systems
- Social Robot Interaction and HRI
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Face recognition and analysis
- Phonetics and Phonology Research
- Multisensory perception and integration
- Hearing Impairment and Communication
- Human Motion and Animation
- Natural Language Processing Techniques
- Hand Gesture Recognition Systems
- Hearing Loss and Rehabilitation
- Music and Audio Processing
- Language, Metaphor, and Cognition
- Face Recognition and Perception
- Language, Discourse, Communication Strategies
- Gaze Tracking and Assistive Technology
- Robotics and Automated Systems
- Tactile and Sensory Interactions
- Human Pose and Action Recognition
- AI in Service Interactions
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Subtitles and Audiovisual Media
KTH Royal Institute of Technology
2015-2024
CBot (Sweden)
2018-2019
Royal College of Music in Stockholm
2017
Swedish e-Science Research Centre
1998-2007
Google (United States)
2006
Columbia University
2005
University of California, Santa Cruz
1998
In the speech technology research community there is an increasing trend to use open source solutions. We present a new tool in that spirit, WaveSurfer, which has been developed at the Centre for Speech Technology at KTH. It is designed for tasks such as viewing, editing, and labeling of audio data. WaveSurfer is built around a small core, with most of the functionality added in the form of plug-ins. The tool works on common platforms, with the aim that it should be easy to configure and extend. It is provided as open source under a GPL license, with the explicit goal of jointly...
Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies....
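As a rough illustration of the approach summarised above, the sketch below shows a single affine coupling step of a normalising flow whose scale and shift are conditioned on an LSTM summary of the preceding pose frames, giving an exact log-likelihood for the next pose. All class names, layer sizes and the pose dimensionality are placeholders of ours; the actual model in the paper differs in many details.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One affine coupling layer: half of the pose vector is transformed
    with a scale/shift predicted from the other half plus an LSTM context."""
    def __init__(self, pose_dim: int, ctx_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x, ctx):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales well-behaved
        yb = xb * log_s.exp() + t            # invertible given xa and ctx
        log_det = log_s.sum(dim=-1)          # contribution to the exact likelihood
        return torch.cat([xa, yb], dim=-1), log_det


class AutoregressiveFlowStep(nn.Module):
    """Autoregressive wrapper: an LSTM over past poses provides the
    conditioning context, enabling long time-dependencies."""
    def __init__(self, pose_dim: int = 63, ctx_dim: int = 128, n_layers: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, ctx_dim, batch_first=True)
        self.couplings = nn.ModuleList(
            [ConditionalAffineCoupling(pose_dim, ctx_dim) for _ in range(n_layers)]
        )

    def log_prob(self, history, next_pose):
        # history: (batch, time, pose_dim), next_pose: (batch, pose_dim)
        _, (h, _) = self.lstm(history)
        ctx = h[-1]
        z, total_log_det = next_pose, 0.0
        for layer in self.couplings:
            z, log_det = layer(z, ctx)
            z = z.flip(-1)                   # permute halves between layers
            total_log_det = total_log_det + log_det
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z).sum(-1) + total_log_det   # exact log-likelihood
```

Training would then simply minimise the negative log-likelihood, e.g. `(-model.log_prob(history, next_pose)).mean()`, which is the efficiency advantage over GANs and VAEs that the abstract points to.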
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we...
Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since such motion is complex and highly ambiguous given the audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over...
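To make the general recipe concrete, here is a minimal sketch of one denoising-diffusion training step on audio-conditioned pose sequences. A plain Transformer encoder stands in for the paper's Conformer-based DiffWave variant, and all names, dimensions and the noise schedule are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseDenoiser(nn.Module):
    """Predicts the noise added to a 3D pose sequence, given the diffusion
    step and per-frame audio features (stand-in for the Conformer stack)."""
    def __init__(self, pose_dim=63, audio_dim=80, d_model=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + audio_dim, d_model)
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_poses, audio, t):
        # noisy_poses: (B, T, pose_dim), audio: (B, T, audio_dim), t: (B,)
        h = self.in_proj(torch.cat([noisy_poses, audio], dim=-1))
        h = h + self.t_embed(t[:, None, None].float())
        return self.out_proj(self.encoder(h))


def diffusion_training_step(model, poses, audio, n_steps=1000):
    """One DDPM-style training step: noise the poses at a random step,
    then train the model to predict the added noise."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=poses.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, n_steps, (poses.shape[0],), device=poses.device)
    a = alpha_bar[t][:, None, None]
    noise = torch.randn_like(poses)
    noisy = a.sqrt() * poses + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, audio, t), noise)
```

Sampling then runs the reverse process, repeatedly denoising from Gaussian noise while conditioning on the same audio features.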
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure that each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline...
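To illustrate what the OT-CFM objective looks like, here is a minimal, generic sketch of the conditional flow matching loss with straight (optimal-transport) probability paths between noise and data. The `vector_field` network, its call signature, and the `sigma_min` value are placeholders and simplifications, not the actual Matcha-TTS decoder.

```python
import torch
import torch.nn.functional as F

def ot_cfm_loss(vector_field, x1, cond, sigma_min=1e-4):
    """Optimal-transport conditional flow matching loss.

    x1:   (B, T, D) target acoustic frames (e.g. mel-spectrogram frames)
    cond: conditioning information (e.g. text-encoder output), passed through
    vector_field(x_t, t, cond) -> (B, T, D): the learned ODE vector field
    """
    b = x1.shape[0]
    t = torch.rand(b, 1, 1, device=x1.device)            # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                             # noise sample
    # Straight-line (OT) path from noise to data at time t:
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0                     # target velocity
    return F.mse_loss(vector_field(x_t, t, cond), u_t)
```

At synthesis time one integrates the learned ODE dx/dt = v(x, t, cond) from noise at t = 0 to t = 1 with a handful of Euler steps, which is why fewer synthesis steps are needed than with score-matching-trained decoders.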
In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss multi-modal redundancy...
Speech synthesis applications have become ubiquitous, in navigation systems, digital assistants, or as screen and audio book readers. Despite their impact on the acceptability of the systems in which they are embedded, and despite the fact that different applications probably need different types of TTS voices, TTS evaluation is still largely treated as an isolated problem. Even though there is strong agreement among researchers that the mainstream approaches to Text-to-Speech (TTS) evaluation are often insufficient and may even be misleading, there exist few clear-cut...
Dance requires skillful composition of complex movements that follow the rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as the music context, using a multimodal...
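Complementing the flow sketch given earlier, the snippet below illustrates one plausible way the multimodal context (a window of recent poses plus a window of music features) could be encoded into a single conditioning vector for such an autoregressive model. The encoder choice, names and dimensions are assumptions for illustration only, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class MultimodalContextEncoder(nn.Module):
    """Encodes a window of past poses and music features into a single
    conditioning vector for an autoregressive pose model."""
    def __init__(self, pose_dim=63, music_dim=80, d_model=256):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, past_poses, music_window):
        # past_poses: (B, T_p, pose_dim); music_window: (B, T_m, music_dim)
        # The music window can extend slightly into the future so that the
        # generated motion is able to anticipate upcoming beats.
        tokens = torch.cat([self.pose_proj(past_poses),
                            self.music_proj(music_window)], dim=1)
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1)   # (B, d_model) conditioning vector
```

This conditioning vector would then replace the plain LSTM context in the earlier normalizing-flow sketch.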
A system for rule-based audiovisual text-to-speech synthesis has been created. The system is based on the KTH text-to-speech system, which has been complemented with a three-dimensional parameterized model of a human face. The face can be animated in real time, synchronized with the auditory speech. The facial model is controlled by the same rule software as the speech synthesizer. A set of rules that takes coarticulation into account has been developed. The system has also been incorporated into a spoken man-machine dialogue system being developed at the department.
The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of dialogue and participant attention, for deictic referencing, and for conveying attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have displayed them to the human conversant using 2D displays, such as flat monitors. This...
In this paper, we present Furhat, a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented in which we investigate how the head might facilitate human-robot face-to-face interaction. First, we show how the animated lips increase the intelligibility of the spoken output, and compare this to an agent on a flat screen as well as to a human face. Second, we measure the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, with the participants sitting around a table. The accuracy is measured depending on the eye...
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is the generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing behavior. Those that do typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we...
Synthesising spontaneous speech is a difficult task due to disfluencies, high variability and syntactic conventions different from those of written language. Using found data, as opposed lab-rec ...
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding of speech and gesture with the aim of capturing the semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion model in order to achieve semantically-aware co-speech gesture generation. Our entry...
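As a sketch of what a contrastive speech-motion pretraining objective can look like, the function below computes a CLIP-style symmetric InfoNCE loss over a batch of paired speech and motion segment embeddings. The encoders producing those embeddings, the function name and the temperature are our placeholders, not the authors' CSMP implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired speech/motion embeddings of shape (B, D).

    Matching pairs (same batch index) are pulled together and all other
    pairings are pushed apart, encouraging a joint embedding space that
    captures the semantic coupling between the two modalities."""
    speech = F.normalize(speech_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)
    logits = speech @ motion.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_s2m = F.cross_entropy(logits, targets)       # speech -> motion
    loss_m2s = F.cross_entropy(logits.t(), targets)   # motion -> speech
    return 0.5 * (loss_s2m + loss_m2s)
```

The frozen embeddings from such a pretrained module can then serve as the conditioning signal for the downstream diffusion-based gesture generator.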
A general overview of the AdApt project and the research performed within it is presented. In this project, various aspects of human-computer interaction in a multimodal conversational dialogue system are investigated. The project will also include studies on the integration of user-, system- and dialogue-dependent speech recognition and synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely finding available apartments in Stockholm. A Wizard-of-Oz data collection is also described.
In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterance ...
EVA is a new class of emotion-aware autonomous systems delivering intelligent personal assistant functionality. EVA requires a multi-disciplinary approach, combining a number of critical building blocks into a cybernetic systems/software architecture: emotion-aware algorithms, multimodal interaction design, cognitive modelling, decision making and recommender systems, sensing as feedback for learning, and distributed (edge) computing services.
Social robots are now part of human society, destined for schools, hospitals, and homes to perform a variety of tasks. To engage their human users, social robots must be equipped with the essential social skill of facial expression communication. Yet, even state-of-the-art social robots are limited in this ability because they often rely on a restricted set of facial expressions derived from theory, with well-known limitations such as lacking naturalistic dynamics. With no agreed methodology to objectively engineer a broader variance of more psychologically...
Social robots must be able to generate realistic and recognizable facial expressions to engage their human users. Many social robots are equipped with a standardized set of facial expressions of emotion that are widely considered to be universally recognized across all cultures. However, mounting evidence shows that these expressions are not universal; for example, they elicit significantly lower recognition accuracy in East Asian cultures than they do in Western cultures. Therefore, without culturally sensitive facial expressions, state-of-the-art social robots are restricted in their ability to engage a diverse range of users,...