Catherine Lai

ORCID: 0000-0003-2411-8954
Research Areas
  • Speech and dialogue systems
  • Emotion and Mood Recognition
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Music and Audio Processing
  • Topic Modeling
  • Language, Discourse, Communication Strategies
  • Phonetics and Phonology Research
  • Speech and Audio Processing
  • Sentiment Analysis and Opinion Mining
  • Music Technology and Sound Studies
  • Language, Metaphor, and Cognition
  • Acute Myeloid Leukemia Research
  • Diverse Musicological Studies
  • Social Robot Interaction and HRI
  • Language and cultural evolution
  • Linguistic Variation and Morphology
  • Infant Health and Development
  • Semantic Web and Ontologies
  • Histone Deacetylase Inhibitors Research
  • Advanced Text Analysis Techniques
  • Neutropenia and Cancer Infections
  • Discourse Analysis in Language Studies
  • Algorithms and Data Compression
  • Digital Communication and Language

University of Edinburgh
2016-2025

SpeechTech (Czechia)
2024

University of Pennsylvania
2008-2024

CereProc (United Kingdom)
2023

Cambridge University Press
2023

Advanced Neural Dynamics (United States)
2021

University of Bremen
2018

Fraunhofer Institute of Optronics, System Technologies and Image Exploitation
2018

Worcester Polytechnic Institute
2018

Cornell University
2018

Self-supervised speech models have grown fast during the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to look at the characteristics of these models, yet many concerns have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model, wav2vec 2.0. Via a set of quantitative analyses, we mainly demonstrate that: 1) wav2vec 2.0 appears to discard paralinguistic information that is less useful for word recognition...

10.1109/slt54892.2023.10023428 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09
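
A minimal layer-wise probing sketch in the spirit of this study: extract wav2vec 2.0 hidden states per transformer layer and mean-pool them as utterance embeddings for an emotion probe. The checkpoint name, pooling choice, and random placeholder waveform are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layerwise_embeddings(waveform, sample_rate=16000):
    """Return one mean-pooled embedding per transformer layer."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (num_layers + 1) tensors of shape (1, frames, dim)
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Placeholder audio; each layer's embedding would feed a simple probe
# (e.g. logistic regression) to compare emotion information across depth.
embeddings = layerwise_embeddings(np.random.randn(16000).astype(np.float32))
```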

Recognizing emotional reactions of movie audiences to affective content is a challenging task in affective computing. Previous research on induced emotion recognition has mainly focused on using audio-visual movie content. Nevertheless, the relationship between the audience's perceptions of the content (perceived emotions) and the emotions evoked in the audience (induced emotions) remains unexplored. In this work, we studied both perceived and induced emotions of movie audiences. Moreover, we investigated multimodal modelling approaches to predict induced emotions from content-based features, as well as from physiological and behavioral reactions of the audience. To carry out...

10.1109/taffc.2019.2902091 article EN publisher-specific-oa IEEE Transactions on Affective Computing 2019-02-27

Automatic emotion recognition is vital for building natural and engaging human-computer interaction systems. Combining information from multiple modalities typically improves performance. In previous work, features from different modalities have generally been fused at the same level with two types of fusion strategies: Feature-Level fusion, which concatenates feature sets before recognition, and Decision-Level fusion, which makes the final decision based on the outputs of unimodal models. However, different modalities may describe the data at different time scales or...

10.1109/slt.2016.7846319 article EN 2016 IEEE Spoken Language Technology Workshop (SLT) 2016-12-01
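
A small sketch contrasting the two fusion strategies described in this abstract, using scikit-learn with random placeholder audio and lexical features; the shapes, classifier, and four-class label set are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(200, 40))   # placeholder acoustic features
X_text = rng.normal(size=(200, 30))    # placeholder lexical features
y = rng.integers(0, 4, size=200)       # placeholder emotion labels

# Feature-Level fusion: concatenate feature sets before recognition.
feature_level = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_text]), y)

# Decision-Level fusion: combine the outputs of unimodal models.
clf_audio = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
avg_posterior = (clf_audio.predict_proba(X_audio) + clf_text.predict_proba(X_text)) / 2
decision_level = avg_posterior.argmax(axis=1)
```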

Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion-labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training with SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various fusion methods, our...

10.1109/icassp43922.2022.9746289 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

10.1109/icassp49660.2025.10890660 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10888591 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

In this work, we compare emotion recognition on two types of speech: spontaneous and acted dialogues. Experiments were conducted on the AVEC2012 database of spontaneous dialogues and the IEMOCAP database of acted dialogues. We studied the performance of two types of acoustic features for emotion recognition: knowledge-inspired disfluency and nonverbal vocalisation (DIS-NV) features, and statistical Low-Level Descriptor (LLD) based features. Both Support Vector Machines (SVM) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) were built using each feature set on each emotional...

10.1109/acii.2015.7344645 article EN 2015-09-01
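
A toy sketch of what knowledge-inspired DIS-NV style features can look like: rates of disfluency and nonverbal-vocalisation tokens in a transcript segment. The token inventories are small illustrative assumptions, not the paper's exact feature definitions.

```python
DISFLUENCIES = {"uh", "um", "er", "mm"}
NONVERBAL = {"<laughter>", "<sigh>", "<breath>"}

def dis_nv_features(tokens):
    """Return simple disfluency / nonverbal vocalisation rates for a segment."""
    n = max(len(tokens), 1)
    return {
        "filler_rate": sum(t.lower() in DISFLUENCIES for t in tokens) / n,
        "nonverbal_rate": sum(t.lower() in NONVERBAL for t in tokens) / n,
    }

print(dis_nv_features("um i <laughter> i guess so".split()))
```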

Predicting the emotional response of movie audiences to affective content is a challenging task in affective computing. Previous work has focused on using audiovisual movie content to predict induced emotions. However, the relationship between the audience's perceptions of the content (perceived emotions) and the emotions evoked in the audience (induced emotions) remains unexplored. In this work, we address the prediction of both perceived and induced emotions for movies, and identify the features and modelling approaches effective for predicting them. First, we extend the LIRIS-ACCEDE database by annotating perceived emotions in a crowd-sourced manner,...

10.1109/acii.2017.8273575 article EN 2017-10-01

Current multimodal sentiment analysis frames sentiment score prediction as a general Machine Learning task. However, what the sentiment score actually represents has often been overlooked. As a measurement of opinions and affective states, a sentiment score generally consists of two aspects: polarity and intensity. We decompose sentiment scores into these two aspects and study how they are conveyed through individual modalities and combined multimodal models in a naturalistic monologue setting. In particular, we build unimodal and multimodal multi-task learning models with sentiment score prediction as the main task and polarity and/or intensity...

10.18653/v1/w18-3306 article EN cc-by 2018-01-01
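
A minimal illustration of the polarity/intensity decomposition described above; the [-3, 3] score range is an assumption borrowed from common multimodal sentiment corpora and the example values are placeholders.

```python
import numpy as np

def decompose(scores):
    """Split sentiment scores into polarity (sign) and intensity (magnitude)."""
    scores = np.asarray(scores, dtype=float)
    return np.sign(scores), np.abs(scores)

polarity, intensity = decompose([-2.4, 0.0, 1.2, 3.0])
# In a multi-task setup, score prediction would be the main task and
# polarity and/or intensity classification the auxiliary tasks.
```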

The Lothian Diary Project is an interdisciplinary effort to collect self-recorded audio or video diaries of people's experiences of COVID-19 in and around Edinburgh, Scotland. In this paper we describe how the project emerged from a desire to support community members. The diaries have been disseminated through public events, a website, an oral history project, and engagement with policymakers. The data collection method encouraged the participation of people with disabilities, racialized individuals, immigrants,...

10.1515/lingvan-2021-0053 article EN cc-by Linguistics Vanguard 2022-02-21

10.1109/slt61566.2024.10832240 article EN 2024 IEEE Spoken Language Technology Workshop (SLT) 2024-12-02

This paper investigates how rising intonation affects the interpretation of cue words in dialogue. Both cue words and rises express a range of speaker attitudes like uncertainty and surprise. However, it is unclear how the perception of these attitudes relates to dialogue structure and belief co-ordination. Perception experiment results suggest that rises reflect difficulty integrating new information rather than signaling a lack of credibility. This leads to a general analysis of rises as marking the current question under discussion as unresolved. The interaction with cue word...

10.21437/interspeech.2010-429 article EN Interspeech 2010 2010-09-26

In this study we present a performance comparison for five pitch extraction algorithms: auto-correlation (AC), cross-correlation (CC), and sub-harmonic summation (SHS) (as implemented in PRAAT [Boersma and Weenick (2010)]), the robust algorithm for pitch tracking (RAPT) in ESPS [Talkin (1995)], and SWIPE’ [Camacho (2007)]. Recent research showed that SHS outperformed the other algorithms on two speech databases with EGG-derived reference values. That study, however, used a fixed search range of 40–800 Hz for all speakers, regardless of sex or...

10.1121/1.3508047 article EN The Journal of the Acoustical Society of America 2010-10-01
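
A sketch of the search-range issue raised here, using Praat's autocorrelation tracker via the parselmouth package: a single fixed 40-800 Hz range versus narrower speaker-group ranges. The synthetic tone and the range values are placeholders for illustration.

```python
import numpy as np
import parselmouth

sr = 16000
t = np.arange(sr) / sr
snd = parselmouth.Sound(np.sin(2 * np.pi * 200 * t), sampling_frequency=sr)

# One fixed search range applied to every speaker.
pitch_fixed = snd.to_pitch(pitch_floor=40.0, pitch_ceiling=800.0)

# Narrower, speaker-group-specific ranges (illustrative values only).
pitch_low = snd.to_pitch(pitch_floor=60.0, pitch_ceiling=300.0)
pitch_high = snd.to_pitch(pitch_floor=120.0, pitch_ceiling=500.0)

print(pitch_fixed.selected_array["frequency"][:5])
```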

In this paper we investigate how participant involvement and turn-taking features relate to extractive summarization of meeting dialogues. In particular, we examine whether automatically derived measures of group-level involvement, like participation equality and turn-taking freedom, can help detect where relevant segments will be. Results show that classification using these features performed better than the majority class baseline for data from both the AMI and ICSI corpora in identifying segments that contain summary dialogue acts. The feature...

10.21437/interspeech.2013-625 article EN Interspeech 2013 2013-08-25
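
One way to operationalise the "participation equality" measure mentioned above: normalised entropy of per-speaker speaking time within a segment. This is an illustrative formulation, not necessarily the exact measure used in the paper.

```python
import math

def participation_equality(speaking_time):
    """speaking_time: dict mapping speaker -> seconds spoken in a segment."""
    total = sum(speaking_time.values())
    probs = [t / total for t in speaking_time.values() if t > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(speaking_time))  # 1.0 = perfectly equal

print(participation_equality({"A": 30.0, "B": 25.0, "C": 5.0, "D": 0.1}))
```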

Speech synthesis has improved in both expressiveness and voice quality in recent years. However, obtaining full expressiveness when dealing with large multi-sentential synthesized discourse is still a challenge, since speech synthesizers do not take into account the prosodic differences that have been observed in discourse units such as paragraphs. The current study validates and extends previous work by analyzing the prosody of paragraph units in a diverse corpus of TED Talks using automatically extracted F0, intensity and timing features. In...

10.21437/speechprosody.2016-235 article EN Speech prosody 2016-05-31
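
A rough sketch of paragraph-level prosodic summaries of the kind described above (F0, intensity, timing), using librosa; the feature definitions and the synthetic test tone are simplified assumptions rather than the study's pipeline.

```python
import numpy as np
import librosa

def prosodic_summary(y, sr):
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    return {
        "f0_mean_hz": float(np.nanmean(f0)),
        "intensity_mean": float(rms.mean()),
        "duration_s": len(y) / sr,
        "voiced_ratio": float(np.mean(voiced)),
    }

y = librosa.tone(180, sr=16000, duration=2.0)  # placeholder "paragraph" audio
print(prosodic_summary(y, sr=16000))
```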

10.1007/s10849-009-9086-9 article EN Journal of Logic, Language and Information 2009-05-27

We address the task of automatically predicting group satisfaction in meetings using acoustic, lexical, and turn-taking features. Participant satisfaction is measured using post-meeting ratings from the AMI corpus. We focus on three aspects of satisfaction: overall satisfaction, participant attention, and information overload. All predictions are made at the aggregated group level. In general, we find that combining features across modalities improves prediction performance. However, feature ablation significantly improves prediction performance. Our experiments also...

10.1145/3279810.3279840 article EN 2018-10-16
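
A simple leave-one-modality-out ablation in the spirit of these experiments, with placeholder features and scikit-learn; the shapes, the Ridge regressor, and R^2 scoring are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
modalities = {
    "acoustic": rng.normal(size=(120, 20)),
    "lexical": rng.normal(size=(120, 50)),
    "turn_taking": rng.normal(size=(120, 8)),
}
y = rng.normal(size=120)  # placeholder aggregated satisfaction ratings

def score(blocks):
    return cross_val_score(Ridge(), np.hstack(blocks), y, cv=5, scoring="r2").mean()

print("all modalities:", score(list(modalities.values())))
for name in modalities:
    print(f"without {name}:", score([X for m, X in modalities.items() if m != name]))
```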

This work aims to explore the correlation between the discourse structure of a spoken monologue and its prosody by predicting discourse relations from different prosodic attributes. For this purpose, a corpus of semi-spontaneous monologues in English has been automatically annotated according to Rhetorical Structure Theory, which models the coherence of text via rhetorical relations. From the corresponding audio files, prosodic features such as pitch, intensity, and speech rate have been extracted for different contexts of a relation. Supervised classification...

10.21437/interspeech.2017-710 article EN Interspeech 2017 2017-08-16

Being able to detect topics and speaker stances in conversations is a key requirement for developing spoken language understanding systems that are personalized and adaptive. In this work, we explore how topic-oriented stance is expressed in conversational speech. To do this, we present a new set of topic and stance annotations of the CallHome corpus of spontaneous dialogues. Specifically, we focus on six stances (positivity, certainty, surprise, amusement, interest, and comfort) which are useful for characterizing important aspects...

10.21437/interspeech.2019-2632 article EN Interspeech 2019 2019-09-13

In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. We utilize four ASR systems,...

10.21437/interspeech.2023-2078 article EN Interspeech 2023 2023-08-14
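
A sketch of the kind of error analysis described above: word error rate of ASR hypotheses against reference transcripts, grouped by emotion label, using the jiwer package. The in-line reference/hypothesis pairs are placeholders, not corpus data.

```python
from collections import defaultdict
import jiwer

samples = [
    {"emotion": "neutral", "ref": "i will see you tomorrow", "hyp": "i will see you tomorrow"},
    {"emotion": "angry", "ref": "i told you not to do that", "hyp": "i told you not do that"},
]

grouped = defaultdict(lambda: {"refs": [], "hyps": []})
for s in samples:
    grouped[s["emotion"]]["refs"].append(s["ref"])
    grouped[s["emotion"]]["hyps"].append(s["hyp"])

for emotion, d in grouped.items():
    print(emotion, jiwer.wer(d["refs"], d["hyps"]))
```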