- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Topic Modeling
- Natural Language Processing Techniques
- Social Robot Interaction and HRI
- Intelligent Tutoring Systems and Adaptive Learning
- IoT-based Smart Home Systems
- AI in Service Interactions
- Multimodal Machine Learning Applications
- Machine Learning and ELM
University of Cambridge
2018-2021
User simulators are one of the major tools that enable offline training of task-oriented dialogue systems. For this task the Agenda-Based User Simulator (ABUS) is often used. The ABUS is based on hand-crafted rules and its output is in semantic form. Issues arise from both properties, such as limited diversity and the inability to interface a text-level belief tracker. This paper introduces the Neural User Simulator (NUS), whose behaviour is learned from a corpus and which generates natural language, hence needing a less labelled dataset than simulators...
Bo-Hsiang Tseng, Yinpei Dai, Florian Kreyssig, Bill Byrne. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.
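The agenda-based simulation that the NUS improves upon can be sketched minimally: the user goal becomes a stack of semantic acts, popped by hand-crafted rules. The slot names and act types below are illustrative, not taken from the paper.

```python
# Toy sketch of an agenda-based user simulator (ABUS-style). The user goal is
# a stack of semantic dialogue acts; a hand-crafted rule pops the top act each
# turn. Output is semantic, not natural language - the limitation the NUS
# abstract highlights. All slot/act names are hypothetical.
class AgendaBasedSimulator:
    def __init__(self, goal):
        # agenda: inform acts for each goal constraint, then a request act
        self.agenda = [("inform", slot, value) for slot, value in goal.items()]
        self.agenda.append(("request", "phone", None))

    def next_act(self, system_act=None):
        # hand-crafted rule: always emit the top of the agenda stack
        if self.agenda:
            return self.agenda.pop()
        return ("bye", None, None)

sim = AgendaBasedSimulator({"food": "thai", "area": "centre"})
acts = [sim.next_act() for _ in range(4)]
```

A corpus-driven simulator like the NUS replaces this rule-based stack with a learned model emitting text directly.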
In this paper, we propose Discriminative Neural Clustering (DNC), which formulates data clustering with a maximum number of clusters as a supervised sequence-to-sequence learning problem. Compared to traditional unsupervised clustering algorithms, DNC learns clustering patterns from training data without requiring an explicit definition of a similarity measure. An implementation based on the Transformer architecture is shown to be effective on a speaker diarisation task using the challenging AMI dataset. Since AMI contains only 147 complete...
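Casting clustering as sequence-to-sequence learning requires the output labels to be permutation-invariant: two label sequences that differ only in how speakers are named must map to the same target. A common canonicalisation, sketched here as an illustrative re-implementation, indexes clusters by order of first appearance.

```python
# Canonicalise cluster labels by first appearance, so that e.g. the speaker
# sequences B-A-A-C-B and A-B-B-C-A yield the same seq2seq target. This is a
# sketch of the label encoding idea behind DNC, not the paper's exact code.
def canonicalise_labels(labels):
    mapping, out = {}, []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)  # first-seen speaker gets next index
        out.append(mapping[lab])
    return out

target = canonicalise_labels(["B", "A", "A", "C", "B"])  # -> [0, 1, 1, 2, 0]
```

With labels canonicalised this way, a Transformer can be trained to emit the cluster-index sequence directly from the sequence of speaker embeddings.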
Time delay neural networks (TDNNs) are an effective acoustic model for large vocabulary speech recognition. The strength of the model can be attributed to its ability to effectively model long temporal contexts. However, current TDNN models are relatively shallow, which limits their modelling capability. This paper proposes a method of increasing network depth by deepening the kernel used in the temporal convolutions. The best performing kernel consists of three fully connected layers with a residual (ResNet) connection from the output of the first to the third layer....
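The "deepened kernel" idea can be sketched in a few lines: instead of a single linear projection over the spliced context, the kernel becomes a small stack of fully connected layers with a residual connection. Dimensions and weight scales below are illustrative, assuming a 3-frame splice of 8-dimensional features.

```python
import numpy as np

# Sketch of a deepened TDNN kernel: three fully connected layers with a
# residual connection from the first layer's output to the third layer's
# output. A minimal NumPy sketch with toy dimensions, not the paper's setup.
rng = np.random.default_rng(0)
d_in, d_hid = 8 * 3, 8   # spliced context of 3 frames x 8-dim features
W1 = rng.standard_normal((d_in, d_hid)) * 0.1
W2 = rng.standard_normal((d_hid, d_hid)) * 0.1
W3 = rng.standard_normal((d_hid, d_hid)) * 0.1

def deep_kernel(x):
    h1 = np.maximum(0.0, x @ W1)   # first FC layer (ReLU)
    h2 = np.maximum(0.0, h1 @ W2)  # second FC layer
    return h1 + h2 @ W3            # third layer plus residual from the first

frames = rng.standard_normal((5, d_in))  # 5 spliced input frames
out = deep_kernel(frames)
```

The shallow baseline would be a single `x @ W` projection; the deep kernel adds depth per convolution step without changing the TDNN's temporal structure.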
This paper describes PyHTK, a Python-based library and associated pipeline that facilitates the construction of large-scale complex automatic speech recognition (ASR) systems using the hidden Markov model toolkit (HTK). PyHTK can be used to generate sophisticated artificial neural network (ANN) models with versatile architectures by converting a compact configuration file defining the ANN into the form required by HTK tools, as well as supporting a range of capabilities to train and test ANN models. The ASR pipeline is divided into multiple...
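The compact-configuration idea can be illustrated with a toy parser that expands a one-line network spec into per-layer definitions. The shorthand format below is hypothetical and is not PyHTK's actual configuration syntax.

```python
# Illustrative sketch of expanding a compact ANN spec into per-layer
# definitions, in the spirit of PyHTK's config-to-HTK conversion. The
# "feat:... hidden:WxD out:..." shorthand is invented for this example.
def expand_config(spec):
    parts = dict(p.split(":") for p in spec.split())
    width, depth = map(int, parts["hidden"].split("x"))
    layers = [("input", int(parts["feat"]))]
    layers += [("sigmoid", width)] * depth       # D hidden layers of width W
    layers.append(("softmax", int(parts["out"])))
    return layers

layers = expand_config("feat:40 hidden:512x5 out:6000")
```

A real tool would then serialise each layer tuple into the verbose model definition format that the downstream toolkit expects.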
Reinforcement learning (RL) is a promising approach to dialogue policy optimisation, but traditional RL algorithms fail to scale to large domains. Recently, Feudal Dialogue Management (FDM) has been shown to increase scalability to large domains by decomposing the dialogue management decision into two steps, making use of the domain ontology to abstract the dialogue state in each step. In order to abstract the state space, however, previous work on FDM relies on handcrafted feature functions. In this work, we show that these feature functions can be learned jointly with the model...
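The two-step feudal decomposition can be sketched with stub policies: a master policy first chooses an action family from an abstract state view, then a sub-policy picks the concrete action. In the paper both the policies and the state feature functions are learned; the rules below are placeholders.

```python
# Minimal sketch of Feudal Dialogue Management's two-step decision. The master
# policy chooses between slot-independent and slot-dependent action families;
# a sub-policy then selects the concrete action. Rule-based stubs stand in for
# the learned policies and feature functions.
def master_policy(state):
    # abstract, slot-agnostic view of the state drives the family choice
    return "slot_dependent" if state["unfilled_slots"] else "slot_independent"

def sub_policy(family, state):
    if family == "slot_dependent":
        return ("request", state["unfilled_slots"][0])
    return ("inform", "summary")

state = {"unfilled_slots": ["area"]}
action = sub_policy(master_policy(state), state)
```

The scalability gain comes from each step facing a much smaller action space than a flat policy over all (act, slot) combinations would.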
Cross-domain natural language generation (NLG) is still a difficult task within spoken dialogue modelling. Given a semantic representation provided by the dialogue manager, the generator should generate sentences that convey the desired information. Traditional template-based generators can produce sentences with all necessary information, but these sentences are not sufficiently diverse. With RNN-based models, the diversity of the generated sentences can be high, however, in the process some information is lost. In this work, we improve an RNN-based generator by considering...
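The template-vs-neural trade-off the abstract describes can be made concrete: a template generator never drops slots but produces one fixed surface form, and a simple coverage check exposes when any generator omits information. The template and slot names are illustrative.

```python
# Sketch of the NLG task: realise a dialogue act plus slot-value pairs as a
# sentence. The template generator below is information-complete but has zero
# diversity; the coverage check would flag a neural generator that drops a
# slot value. Templates and slots are hypothetical examples.
def template_nlg(act, slots):
    if act == "inform":
        return "It is a " + " ".join(slots.values()) + " restaurant."
    return "Sorry, I did not understand."

def covers_all_slots(sentence, slots):
    # crude check: every slot value must appear verbatim in the sentence
    return all(value in sentence for value in slots.values())

slots = {"food": "thai", "price": "cheap"}
sentence = template_nlg("inform", slots)
```

An RNN-based generator would replace `template_nlg` with a learned decoder; the challenge the abstract points to is keeping `covers_all_slots` true while gaining diversity.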
In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces the requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against...
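The VAT-style smoothness loss can be sketched as: perturb the input slightly and penalise how much the embedding moves. Real VAT finds the perturbation direction by power iteration on the gradient; the sketch below uses a single random direction and a toy linear embedder, so it is illustrative only.

```python
import numpy as np

# Sketch of a VAT-style loss on speaker embeddings: the loss measures how
# much a (length-normalised) embedding changes under a small input
# perturbation. A random direction stands in for the adversarial one found by
# power iteration in real VAT; the linear embedder is a toy stand-in.
rng = np.random.default_rng(1)
W = rng.standard_normal((20, 4)) * 0.5

def embed(x):
    e = x @ W
    return e / np.linalg.norm(e)           # length-normalised embedding

def vat_loss(x, eps=1e-2):
    d = rng.standard_normal(x.shape)
    d = eps * d / np.linalg.norm(d)        # small perturbation of size eps
    diff = embed(x + d) - embed(x)
    return float(np.sum(diff ** 2))        # embedding should barely move

x = rng.standard_normal(20)
loss = vat_loss(x)
```

Because this loss needs no speaker labels, it can be computed on unlabelled data, which is what makes the approach semi-supervised.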
One of the difficulties in training dialogue systems is the lack of training data. We explore the possibility of creating dialogue data through the interaction between a dialogue system and a user simulator. Our goal is to develop a modelling framework that can incorporate new scenarios through self-play between the two agents. In this framework, we first pre-train the agents on a collection of source domain dialogues, which equips them to converse with each other via natural language. With further fine-tuning on a small amount of target domain data, the agents continue to interact with the aim of improving their...
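The self-play loop itself is simple to sketch: a system agent and a user-simulator agent alternate utterances until the user ends the dialogue, and the generated dialogues become new training data. The trivial rule-based stubs below stand in for the pre-trained, fine-tuned models.

```python
# Sketch of the self-play loop between a user-simulator agent and a system
# agent. Both agents here are rule-based stubs standing in for pre-trained
# neural models; the loop structure is the point.
def user_agent(history):
    # the user ends the dialogue once the booking is confirmed
    return "bye" if any("booked" in h for h in history) else "book a table"

def system_agent(history):
    return "booked" if history and "book" in history[-1] else "how can I help?"

def self_play(max_turns=6):
    history = []
    for _ in range(max_turns):
        utterance = user_agent(history)
        history.append(utterance)
        if utterance == "bye":
            break
        history.append(system_agent(history))
    return history

dialogue = self_play()
```

In the framework the abstract describes, dialogues generated this way in the target domain provide the extra data that further fine-tuning exploits.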
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this variant of MPPT allows low-footprint streaming models to be trained effectively by computing the loss on unmasked frames. These approaches are...
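The masked-prediction loss and the unmasked-frame variant mentioned in the abstract can be sketched together: a per-frame cross-entropy against discrete targets, averaged over either the masked positions only or over all frames. The model outputs and target sequence below are toy stand-ins.

```python
import numpy as np

# Sketch of the MPPT loss: per-frame cross-entropy against discrete targets,
# normally averaged over masked frames only. The `loss_on_unmasked` flag
# illustrates the streaming-friendly variant from the abstract that also
# scores unmasked frames. Logits and targets are random toy data.
rng = np.random.default_rng(2)
logits = rng.standard_normal((10, 5))     # model outputs: 10 frames, 5 classes
targets = rng.integers(0, 5, size=10)     # target-sequence labels per frame
mask = np.zeros(10, dtype=bool)
mask[2:5] = True                          # frames 2-4 are masked

def mppt_loss(logits, targets, mask, loss_on_unmasked=False):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    keep = np.ones_like(mask) if loss_on_unmasked else mask
    return float(nll[keep].mean())

masked_only = mppt_loss(logits, targets, mask)
all_frames = mppt_loss(logits, targets, mask, loss_on_unmasked=True)
```

Biasing, as the abstract describes it, changes where `targets` comes from: a slightly finetuned model produces the target sequence instead of a purely self-supervised one.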
This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training that can operate efficiently in a distributed manner. It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods. A solution to a numerical issue in CG allows effective parameter updates to be generated with far fewer iterations than usually used (e.g. 5-8 instead of 200). This work also presents a preconditioning approach...
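The CG machinery at the heart of such second-order methods solves G d = g for the update direction d, where G is a curvature matrix (Fisher for NG, Gauss-Newton for HF) and g is the gradient, running only a handful of iterations. The sketch below applies textbook linear CG to a small synthetic positive-definite system; it illustrates the solver, not the paper's full framework.

```python
import numpy as np

# Textbook linear conjugate gradient solving G d = g for the update
# direction, as used inside NG/HF-style optimisers. G is a small synthetic
# symmetric positive-definite matrix standing in for the Fisher or
# Gauss-Newton curvature; only a few iterations are run.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 6))
G = A @ A.T + 6 * np.eye(6)        # synthetic SPD curvature matrix
g = rng.standard_normal(6)         # gradient vector

def conjugate_gradient(G, g, iters=8):
    d = np.zeros_like(g)
    r = g - G @ d                  # residual; starts equal to g
    p = r.copy()                   # initial search direction
    for _ in range(iters):
        alpha = (r @ r) / (p @ G @ p)
        d += alpha * p
        r_new = r - alpha * (G @ p)
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return d

d = conjugate_gradient(G, g)       # approximate update direction
```

On this 6-dimensional system CG converges to machine precision within the 8 iterations; the paper's contribution includes making similarly few iterations suffice at neural-network scale.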