- Topic Modeling
- Natural Language Processing Techniques
- Speech and dialogue systems
- Geographic Information Systems Studies
- Multimodal Machine Learning Applications
- AI in Service Interactions
- Semantic Web and Ontologies
- Advanced Text Analysis Techniques
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Neural Networks and Applications
- Recommender Systems and Techniques
- Linguistic Variation and Morphology
- Software Engineering Research
- Web Data Mining and Analysis
- Translation Studies and Practices
- Data Management and Algorithms
- Biomedical Text Mining and Ontologies
- Geological Modeling and Analysis
- Cognitive Science and Education Research
- Geochemistry and Geologic Mapping
- Software Testing and Debugging Techniques
- Fractal and DNA sequence analysis
- Sentiment Analysis and Opinion Mining
- Data-Driven Disease Surveillance
Huawei Technologies (United Kingdom)
2020-2023
Huawei Technologies (China)
2021
University of Cambridge
2017-2019
Center for Applied Linguistics
2017-2018
Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information for many real-world applications such as emergency response, real-time social media event analysis, understanding location instructions in auto-response systems and more. However, geoparsing is still widely regarded as a challenge because of domain language diversity, place name ambiguity, metonymic...
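The first step described above, resolving place names in text to coordinates, can be sketched as a simple gazetteer lookup. This is a minimal illustration only: the gazetteer, its entries and the matching rule are hypothetical, and real geoparsers (e.g. those built on GeoNames) must handle the ambiguity and metonymy the abstract mentions.

```python
import re

# Hypothetical toy gazetteer; real systems use resources such as GeoNames
# and must disambiguate between many places sharing one name.
GAZETTEER = {
    "london": (51.5074, -0.1278),
    "melbourne": (-37.8136, 144.9631),
    "paris": (48.8566, 2.3522),
}

def geoparse(text):
    """Return (place_name, (lat, lon)) pairs for gazetteer matches in text."""
    found = []
    for token in re.findall(r"[A-Z][a-z]+", text):
        coords = GAZETTEER.get(token.lower())
        if coords:
            found.append((token, coords))
    return found

print(geoparse("Flooding reported near Melbourne and Paris today."))
# → [('Melbourne', (-37.8136, 144.9631)), ('Paris', (48.8566, 2.3522))]
```

Note that this naive matcher would happily geocode a metonymic mention (e.g. "Paris" referring to a government), which is exactly the failure mode the abstract highlights.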
Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real-world usage, by the lack of distinction between different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript...
The purpose of text geolocation is to associate geographic information contained in a document with a set (or sets) of coordinates, either implicitly by using linguistic features and/or explicitly by using metadata combined with heuristics. We introduce a geocoder (location mention disambiguator) that achieves state-of-the-art (SOTA) results on three diverse datasets by exploiting the implicit lexical clues. Moreover, we propose a new method for systematic encoding of GPS coordinates to generate two distinct views of the same text. To this end, Map...
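One common way to encode GPS coordinates systematically is to discretise them into grid cells, which is loosely the spirit of the map-based view mentioned above. The function below is a hedged sketch under assumed parameters (cell size, indexing scheme), not the paper's actual encoding.

```python
def grid_cell(lat, lon, cell_deg=2.0):
    """Map a (lat, lon) pair to a coarse grid-cell index.

    Assumed scheme: shift latitude/longitude to non-negative ranges,
    then flatten (row, col) into a single integer index.
    """
    row = int((lat + 90) // cell_deg)    # 0 .. 89 for 2-degree cells
    col = int((lon + 180) // cell_deg)   # 0 .. 179 for 2-degree cells
    n_cols = int(360 // cell_deg)
    return row * n_cols + col

print(grid_cell(51.5074, -0.1278))  # London → 12689
print(grid_cell(0.0, 0.0))          # equator/meridian → 8190
```

A one-hot or count vector over such cells gives a fixed-size geographic "view" of a document that can be combined with its lexical view.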
We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions....
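The masked-modelling idea in the second stage can be illustrated with a token-masking sketch. The masking rate, mask token and scheme below are assumptions for illustration, not PanGu-Coder's actual preprocessing.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="<mask>", seed=0):
    """Randomly replace a fraction of tokens with a mask symbol;
    the model is then trained to reconstruct the originals."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [mask if rng.random() < rate else t for t in tokens]

code = "def add ( a , b ) : return a + b".split()
print(mask_tokens(code, rate=0.3))
```

During training, the loss is computed only at the masked positions, complementing the left-to-right CLM objective.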
Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional training signal inference is not suitable for non-deterministic agent behavior, namely, considering multiple actions as valid in identical dialogue states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited...
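The core idea, merging identical dialogue states so that multiple valid agent actions become alternative outgoing edges rather than conflicting labels, can be sketched with a plain adjacency map. The state/action names below are hypothetical; ConvGraph's actual state representation is richer.

```python
from collections import defaultdict

# state -> set of valid agent actions observed from that state
graph = defaultdict(set)

def add_turn(state, action):
    """Merge a dialogue turn into the graph; identical states share a node."""
    graph[state].add(action)

# Two dialogues reach the same state but the agent acted differently:
add_turn(("inform_cuisine",), "request_area")
add_turn(("inform_cuisine",), "offer_restaurant")

print(sorted(graph[("inform_cuisine",)]))  # → ['offer_restaurant', 'request_area']
```

A supervised learner can then treat any action in the edge set as correct, instead of penalising the model for not reproducing one arbitrary trajectory.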
Named entities are frequently used in a metonymic manner. They serve as references to related entities such as people and organisations. Accurate identification and interpretation of metonymy can be directly beneficial to various NLP applications such as Named Entity Recognition and Geographical Parsing. Until now, metonymy resolution (MR) methods mainly relied on parsers, taggers, dictionaries, external word lists and other handcrafted lexical resources. We show how a minimalist neural approach combined with a novel predicate window method...
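A predicate window, in rough terms, keeps only the few context words immediately surrounding the target entity as input features. The sketch below is an assumed simplification (fixed symmetric window) rather than the paper's exact method.

```python
def predicate_window(tokens, entity_idx, size=2):
    """Return up to `size` tokens on each side of the entity, entity excluded."""
    left = tokens[max(0, entity_idx - size):entity_idx]
    right = tokens[entity_idx + 1:entity_idx + 1 + size]
    return left + right

tokens = "Talks between Moscow and Kiev resumed today".split()
print(predicate_window(tokens, 2))  # → ['Talks', 'between', 'and', 'Kiev']
```

Here "Moscow" is used metonymically (the government, not the city), and the surrounding predicate context carries most of the evidence for that reading.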
The introduction of transformer-based cross-lingual language models brought decisive improvements to multilingual NLP tasks. However, the lack of labelled task data has necessitated a variety of methods that aim to close the gap to high-resource languages. Zero-shot methods in particular often use translated task data as a training signal to bridge the performance gap between the source and target language(s). We introduce XeroAlign, a simple method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R. XeroAlign uses...
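The alignment idea can be illustrated as pulling the pooled representation of a source sentence and its translation together with an auxiliary distance loss. The toy vectors and the plain squared-distance loss below are assumptions for illustration; XeroAlign operates on XLM-R sentence embeddings.

```python
def alignment_loss(src_vec, tgt_vec):
    """Mean squared distance between two pooled sentence embeddings."""
    assert len(src_vec) == len(tgt_vec)
    return sum((s - t) ** 2 for s, t in zip(src_vec, tgt_vec)) / len(src_vec)

# Toy embeddings of an English sentence and its translation:
print(alignment_loss([1.0, 0.0, 0.5], [0.5, 0.0, 0.5]))
```

Minimising such a loss alongside the task loss encourages the encoder to place translations near each other, so a classifier trained on the source language transfers to the target language(s).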
Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets covering multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets,...
Achieving robust language technologies that can perform well across the world’s many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at...
Task-oriented personal assistants enable people to interact with a host of devices and services using natural language. One of the challenges of making neural dialogue systems available to more users is the lack of training data for all but a few languages. Zero-shot methods try to solve this issue by acquiring task knowledge in a high-resource language such as English with the aim of transferring it to the low-resource language(s). To this end, we introduce CrossAligner, the principal method of a variety of effective approaches for zero-shot...
Language models (LMs) used as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but...
Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation have observed corresponding drops in functional correctness. To this end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting while...
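Turning correctness and runtime signals into preference data can be sketched as follows. The pairing rules below (any passing solution preferred to any failing one; quickest passing preferred to slowest passing) are an assumed simplification for illustration, not Code-Optimise's exact procedure.

```python
def preference_pairs(candidates):
    """candidates: list of (solution_id, passed, runtime_seconds).

    Returns (preferred, rejected) id pairs usable for preference optimisation.
    """
    passed = sorted([c for c in candidates if c[1]], key=lambda c: c[2])
    failed = [c for c in candidates if not c[1]]
    pairs = []
    # Correctness signal: passing beats failing.
    for p in passed:
        for f in failed:
            pairs.append((p[0], f[0]))
    # Runtime signal: quickest passing beats slowest passing.
    if len(passed) >= 2:
        pairs.append((passed[0][0], passed[-1][0]))
    return pairs

print(preference_pairs([("a", True, 0.4), ("b", False, 0.1), ("c", True, 0.2)]))
# → [('c', 'b'), ('a', 'b'), ('c', 'a')]
```

Because both signals come from executing the model's own samples, no human annotation is needed to construct the preference dataset.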
The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models including the lack of on-policyness during training and partial observability. To address these...
Transfer learning has become the dominant paradigm for many natural language processing tasks. In addition to models being pretrained on large datasets, they can be further trained on intermediate (supervised) tasks that are similar to the target task. For small Natural Language Inference (NLI) datasets, language modelling is typically followed by pretraining on a (labelled) NLI dataset before fine-tuning with each target subtask. In this work, we explore Gradient Boosted Decision Trees (GBDTs) as an alternative to the commonly used...
We undertake the task of comparing lexicon-based sentiment classification of film reviews with machine learning approaches. We look at existing methodologies and attempt to emulate and improve on them using a 'given' lexicon and a bag-of-words approach. We also utilise syntactical information such as part-of-speech and dependency relations. We will show that a simple approach achieves good results, however machine learning techniques prove to be the superior tool. We also show that more features do not necessarily deliver better performance, as well as elaborate on three further...
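A lexicon / bag-of-words classifier of the kind compared above fits in a few lines. The tiny lexicon here is invented for illustration; the paper's 'given' lexicon is not reproduced.

```python
# Hypothetical miniature sentiment lexicon (word -> polarity score).
LEXICON = {"great": 1, "superb": 1, "dull": -1, "awful": -1}

def classify(review):
    """Sum lexicon scores over the bag of words; sign decides the label."""
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in review.split())
    return "positive" if score > 0 else "negative"

print(classify("A great cast, superb pacing, never dull."))  # → positive
```

The appeal of this baseline is that it needs no training data at all, which is precisely why supervised machine-learning classifiers, with access to labelled reviews, tend to beat it.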