- Speech and dialogue systems
- Social Robot Interaction and HRI
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Face recognition and analysis
- Phonetics and Phonology Research
- Multisensory perception and integration
- Hearing Impairment and Communication
- Human Motion and Animation
- Natural Language Processing Techniques
- Hand Gesture Recognition Systems
- Hearing Loss and Rehabilitation
- Music and Audio Processing
- Language, Metaphor, and Cognition
- Face Recognition and Perception
- Language, Discourse, Communication Strategies
- Gaze Tracking and Assistive Technology
- Robotics and Automated Systems
- Tactile and Sensory Interactions
- Human Pose and Action Recognition
- AI in Service Interactions
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Subtitles and Audiovisual Media
KTH Royal Institute of Technology
2015-2024
CBot (Sweden)
2018-2019
Royal College of Music in Stockholm
2017
Swedish e-Science Research Centre
1998-2007
Google (United States)
2006
Columbia University
2005
University of California, Santa Cruz
1998
In the speech technology research community there is an increasing trend to use open source solutions. We present a new tool in that spirit, WaveSurfer, which has been developed at the Centre for Speech Technology at KTH. It is designed for tasks such as viewing, editing, and labeling of audio data. WaveSurfer is built around a small core, with most of the functionality added in the form of plug-ins. The tool works on common platforms, with the aim that it should be easy to configure and extend. It is provided as open source under a GPL license, with the explicit goal of jointly...
Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies....
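As a rough illustration of the approach summarised above, the sketch below shows a single affine coupling step of a normalising flow whose scale and shift are conditioned on an LSTM summary of the preceding pose frames, giving an exact log-likelihood for the next pose. All class names, layer sizes and the pose dimensionality are placeholders of ours; the actual model in the paper differs in many details.

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One affine coupling layer: half of the pose vector is transformed
    with a scale/shift predicted from the other half plus an LSTM context."""
    def __init__(self, pose_dim: int, ctx_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x, ctx):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales well-behaved
        yb = xb * log_s.exp() + t            # invertible given xa and ctx
        log_det = log_s.sum(dim=-1)          # contribution to the exact likelihood
        return torch.cat([xa, yb], dim=-1), log_det


class AutoregressiveFlowStep(nn.Module):
    """Autoregressive wrapper: an LSTM over past poses provides the
    conditioning context, enabling long time-dependencies."""
    def __init__(self, pose_dim: int = 63, ctx_dim: int = 128, n_layers: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, ctx_dim, batch_first=True)
        self.couplings = nn.ModuleList(
            [ConditionalAffineCoupling(pose_dim, ctx_dim) for _ in range(n_layers)]
        )

    def log_prob(self, history, next_pose):
        # history: (batch, time, pose_dim), next_pose: (batch, pose_dim)
        _, (h, _) = self.lstm(history)
        ctx = h[-1]
        z, total_log_det = next_pose, 0.0
        for layer in self.couplings:
            z, log_det = layer(z, ctx)
            z = z.flip(-1)                   # permute halves between layers
            total_log_det = total_log_det + log_det
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z).sum(-1) + total_log_det   # exact log-likelihood
```

Training would then simply minimise the negative log-likelihood, e.g. `(-model.log_prob(history, next_pose)).mean()`, which is the efficiency advantage over GANs and VAEs that the abstract points to.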
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we...
Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since such motion is complex and highly ambiguous given the audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over...
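To make the general recipe concrete, here is a minimal sketch of one denoising-diffusion training step on audio-conditioned pose sequences. A plain Transformer encoder stands in for the paper's Conformer-based DiffWave variant, and all names, dimensions and the noise schedule are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseDenoiser(nn.Module):
    """Predicts the noise added to a 3D pose sequence, given the diffusion
    step and per-frame audio features (stand-in for the Conformer stack)."""
    def __init__(self, pose_dim=63, audio_dim=80, d_model=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + audio_dim, d_model)
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_poses, audio, t):
        # noisy_poses: (B, T, pose_dim), audio: (B, T, audio_dim), t: (B,)
        h = self.in_proj(torch.cat([noisy_poses, audio], dim=-1))
        h = h + self.t_embed(t[:, None, None].float())
        return self.out_proj(self.encoder(h))


def diffusion_training_step(model, poses, audio, n_steps=1000):
    """One DDPM-style training step: noise the poses at a random step,
    then train the model to predict the added noise."""
    betas = torch.linspace(1e-4, 0.02, n_steps, device=poses.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, n_steps, (poses.shape[0],), device=poses.device)
    a = alpha_bar[t][:, None, None]
    noise = torch.randn_like(poses)
    noisy = a.sqrt() * poses + (1 - a).sqrt() * noise
    return F.mse_loss(model(noisy, audio, t), noise)
```

Sampling then runs the reverse process, repeatedly denoising from Gaussian noise while conditioning on the same audio features.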
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure that each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline...
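To illustrate what the OT-CFM objective looks like, here is a minimal, generic sketch of the conditional flow matching loss with straight (optimal-transport) probability paths between noise and data. The `vector_field` network, its call signature, and the `sigma_min` value are placeholders and simplifications, not the actual Matcha-TTS decoder.

```python
import torch
import torch.nn.functional as F

def ot_cfm_loss(vector_field, x1, cond, sigma_min=1e-4):
    """Optimal-transport conditional flow matching loss.

    x1:   (B, T, D) target acoustic frames (e.g. mel-spectrogram frames)
    cond: conditioning information (e.g. text-encoder output), passed through
    vector_field(x_t, t, cond) -> (B, T, D): the learned ODE vector field
    """
    b = x1.shape[0]
    t = torch.rand(b, 1, 1, device=x1.device)            # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                             # noise sample
    # Straight-line (OT) path from noise to data at time t:
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0                     # target velocity
    return F.mse_loss(vector_field(x_t, t, cond), u_t)
```

At synthesis time one integrates the learned ODE dx/dt = v(x, t, cond) from noise at t = 0 to t = 1 with a handful of Euler steps, which is why fewer synthesis steps are needed than with score-matching-trained decoders.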
In this paper, we present a dialog system that was exhibited at the Swedish National Museum of Science and Technology. Two visitors at a time could play a collaborative card sorting game together with the robot head Furhat, where the three players discuss the solution together. The cards are shown on a touch table between the players, thus constituting a target for joint attention. We describe how the system was implemented in order to manage turn-taking and attention to users and objects in the shared physical space. We also discuss multi-modal redundancy...
Speech synthesis applications have become ubiquitous, in navigation systems, digital assistants, or as screen and audio book readers. Despite their impact on the acceptability of the systems in which they are embedded, and despite the fact that different applications probably need different types of TTS voices, TTS evaluation is still largely treated as an isolated problem. Even though there is strong agreement among researchers that the mainstream approaches to Text-to-Speech (TTS) evaluation are often insufficient and may even be misleading, there exist few clear-cut...
Dance requires skillful composition of complex movements that follow the rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as the music context, using a multimodal...
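Complementing the flow sketch given earlier, the snippet below illustrates one plausible way the multimodal context (a window of recent poses plus a window of music features) could be encoded into a single conditioning vector for such an autoregressive model. The encoder choice, names and dimensions are assumptions for illustration only, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class MultimodalContextEncoder(nn.Module):
    """Encodes a window of past poses and music features into a single
    conditioning vector for an autoregressive pose model."""
    def __init__(self, pose_dim=63, music_dim=80, d_model=256):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, past_poses, music_window):
        # past_poses: (B, T_p, pose_dim); music_window: (B, T_m, music_dim)
        # The music window can extend slightly into the future so that the
        # generated motion is able to anticipate upcoming beats.
        tokens = torch.cat([self.pose_proj(past_poses),
                            self.music_proj(music_window)], dim=1)
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1)   # (B, d_model) conditioning vector
```

This conditioning vector would then replace the plain LSTM context in the earlier normalizing-flow sketch.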
A system for rule-based audiovisual text-to-speech synthesis has been created. The system is based on the KTH text-to-speech system, which has been complemented with a three-dimensional parameterized model of a human face. The face can be animated in real time, synchronized with the auditory speech. The facial model is controlled by the same rule software as the speech synthesizer. A set of rules that takes coarticulation into account has been developed. The system has also been incorporated into a spoken man-machine dialogue system being developed at the department.
The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of dialogue and participant attention, for deictic referencing, and for conveying attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have displayed them to the human conversant using 2D displays, such as flat monitors. This...
In this paper, we present Furhat, a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented in which we investigate how the head might facilitate human-robot face-to-face interaction. First, we show how the animated lips increase the intelligibility of the spoken output, and compare this to an agent on a flat screen as well as to a human face. Second, we measure the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, with the participants sitting around a table. The accuracy is measured depending on the eye...
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is the generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing behavior. Those that do typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we...
Synthesising spontaneous speech is a difficult task due to disfluencies, high variability and syntactic conventions different from those of written language. Using found data, as opposed lab-rec ...
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding of speech and gesture with the aim of capturing the semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion model in order to achieve semantically-aware co-speech gesture generation. Our entry...
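As a sketch of what a contrastive speech-motion pretraining objective can look like, the function below computes a CLIP-style symmetric InfoNCE loss over a batch of paired speech and motion segment embeddings. The encoders producing those embeddings, the function name and the temperature are our placeholders, not the authors' CSMP implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired speech/motion embeddings of shape (B, D).

    Matching pairs (same batch index) are pulled together and all other
    pairings are pushed apart, encouraging a joint embedding space that
    captures the semantic coupling between the two modalities."""
    speech = F.normalize(speech_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)
    logits = speech @ motion.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_s2m = F.cross_entropy(logits, targets)       # speech -> motion
    loss_m2s = F.cross_entropy(logits.t(), targets)   # motion -> speech
    return 0.5 * (loss_s2m + loss_m2s)
```

The frozen embeddings from such a pretrained module can then serve as the conditioning signal for the downstream diffusion-based gesture generator.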
A general overview of the AdApt project and the research performed within it is presented. In this project, various aspects of human-computer interaction in a multimodal conversational dialogue system are investigated. The project will also include studies on the integration of user-, system- and dialogue-dependent speech recognition and synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely finding available apartments in Stockholm. A Wizard-of-Oz data collection is also described.
In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterance ...
EVA is a new class of emotion-aware autonomous systems delivering intelligent personal assistant functionality. EVA requires a multi-disciplinary approach, combining a number of critical building blocks into a cybernetic systems/software architecture: emotion-aware algorithms, multimodal interaction design, cognitive modelling, decision making and recommender systems, sensing as feedback for learning, and distributed (edge) computing services.
Social robots are now part of human society, destined for schools, hospitals, and homes to perform a variety of tasks. To engage their human users, social robots must be equipped with the essential social skill of facial expression communication. Yet, even state-of-the-art social robots are limited in this ability because they often rely on a restricted set of facial expressions derived from theory, with well-known limitations such as lacking naturalistic dynamics. With no agreed methodology to objectively engineer a broader variance of more psychologically...
Social robots must be able to generate realistic and recognizable facial expressions to engage their human users. Many social robots are equipped with a standardized set of facial expressions of emotion that are widely considered to be universally recognized across all cultures. However, mounting evidence shows that these expressions are not universal; for example, they elicit significantly lower recognition accuracy in East Asian cultures than they do in Western cultures. Therefore, without culturally sensitive facial expressions, state-of-the-art social robots are restricted in their ability to engage a diverse range of users,...