Milan Gritta

ORCID: 0000-0003-0014-7275
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Speech and dialogue systems
  • Geographic Information Systems Studies
  • Multimodal Machine Learning Applications
  • AI in Service Interactions
  • Semantic Web and Ontologies
  • Advanced Text Analysis Techniques
  • Text Readability and Simplification
  • Speech Recognition and Synthesis
  • Neural Networks and Applications
  • Recommender Systems and Techniques
  • Linguistic Variation and Morphology
  • Software Engineering Research
  • Web Data Mining and Analysis
  • Translation Studies and Practices
  • Data Management and Algorithms
  • Biomedical Text Mining and Ontologies
  • Geological Modeling and Analysis
  • Cognitive Science and Education Research
  • Geochemistry and Geologic Mapping
  • Software Testing and Debugging Techniques
  • Fractal and DNA sequence analysis
  • Sentiment Analysis and Opinion Mining
  • Data-Driven Disease Surveillance

Huawei Technologies (United Kingdom)
2020-2023

Huawei Technologies (China)
2021

University of Cambridge
2017-2019

Center for Applied Linguistics
2017-2018

Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information for many real-world applications such as emergency responses, real-time social media event analysis, understanding location instructions in auto-response systems and more. However, geoparsing is still widely regarded as a challenge because of domain language diversity, name ambiguity, metonymic...

10.1007/s10579-017-9385-8 article EN cc-by Language Resources and Evaluation 2017-03-07

Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real-world usage, by the lack of distinction between the different types of toponyms, which necessitates new guidelines, consolidation, and a detailed toponym taxonomy with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript...

10.1007/s10579-019-09475-3 article EN cc-by Language Resources and Evaluation 2019-09-19

The purpose of text geolocation is to associate geographic information contained in a document with a set (or sets) of coordinates, either implicitly by using linguistic features and/or explicitly by using metadata combined with heuristics. We introduce a geocoder (location mention disambiguator) that achieves state-of-the-art (SOTA) results on three diverse datasets by exploiting the implicit lexical clues. Moreover, we propose a new method for systematic encoding to generate two distinct views of the same text. To that end, Map...

10.18653/v1/p18-1119 article EN cc-by Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018-01-01

We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw data, while the second uses a combination of CLM and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation on loosely curated pairs of program definitions and code functions....

10.48550/arxiv.2207.11280 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional signal inference is not suitable for non-deterministic agent behavior, namely, considering multiple actions as valid in identical states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited...

10.1162/tacl_a_00352 article EN cc-by Transactions of the Association for Computational Linguistics 2021-02-01

Named entities are frequently used in a metonymic manner. They serve as references to related entities such as people and organisations. Accurate identification and interpretation of metonymy can be directly beneficial to various NLP applications, such as Named Entity Recognition and Geographical Parsing. Until now, metonymy resolution (MR) methods mainly relied on parsers, taggers, dictionaries, external word lists and other handcrafted lexical resources. We show how a minimalist neural approach combined with a novel predicate window method...

10.18653/v1/p17-1115 article EN cc-by Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2017-01-01

The introduction of transformer-based crosslingual language models brought decisive improvements to multilingual NLP tasks. However, the lack of labelled data has necessitated a variety of methods that aim to close the gap to high-resource languages. Zero-shot methods in particular often use translated task data as a training signal to bridge the performance gap between the source and target language(s). We introduce XeroAlign, a simple method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R. XeroAlign uses...

10.18653/v1/2021.findings-acl.32 article EN cc-by 2021-01-01

Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the landscape of multilingual datasets,...

10.48550/arxiv.2307.14031 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Achieving robust language technologies that can perform well across the world’s many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that the disparities depend on a number of factors: the nature of the ToD task at...

10.18653/v1/2023.emnlp-main.422 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Task-oriented personal assistants enable people to interact with a host of devices and services using natural language. One of the challenges of making neural dialogue systems available to more users is the lack of training data for all but a few languages. Zero-shot methods try to solve this issue by acquiring task knowledge in a high-resource language such as English with the aim of transferring it to the low-resource language(s). To this end, we introduce CrossAligner, the principal method of a variety of effective approaches for zero-shot...

10.18653/v1/2022.findings-acl.319 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement; however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary judges and/or knowledge-based tasks is scalable but...

10.48550/arxiv.2405.09186 preprint EN arXiv (Cornell University) 2024-05-15

Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation observed corresponding drops in functional correctness. To that end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is lightweight and robust as it dynamically selects solutions to reduce overfitting while...

10.48550/arxiv.2406.12502 preprint EN arXiv (Cornell University) 2024-06-18

The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest speeds. However, we identify several limitations of SD, including the lack of on-policyness during training and partial observability. To address these...

10.48550/arxiv.2410.03804 preprint EN arXiv (Cornell University) 2024-10-04

Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the landscape of multilingual...

10.1162/tacl_a_00609 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

Transfer learning has become the dominant paradigm for many natural language processing tasks. In addition to models being pretrained on large datasets, they can be further trained on intermediate (supervised) tasks that are similar to the target task. For small Natural Language Inference (NLI) datasets, language modelling is typically followed by pretraining on a large (labelled) NLI dataset before fine-tuning with each subtask. In this work, we explore Gradient Boosted Decision Trees (GBDTs) as an alternative to the commonly used...

10.18653/v1/2021.findings-acl.26 preprint EN cc-by 2021-01-01

Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real-world usage, by the lack of distinction between the different types of toponyms, which necessitates new guidelines, consolidation, and a detailed toponym taxonomy with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript introduces...

10.48550/arxiv.1810.12368 preprint EN other-oa arXiv (Cornell University) 2018-01-01

We undertake the task of comparing lexicon-based sentiment classification of film reviews with machine learning approaches. We look at existing methodologies and attempt to emulate and improve on them using a 'given' lexicon and a bag-of-words approach. We also utilise syntactical information such as part-of-speech and dependency relations. We will show that a simple lexicon achieves good results; however, machine learning techniques prove to be the superior tool. We also show that more features do not necessarily deliver better performance, as well as elaborate on three further...

10.48550/arxiv.1905.04727 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Achieving robust language technologies that can perform well across the world's many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that the disparities depend on a number of factors: the nature of the ToD task at...

10.48550/arxiv.2310.12892 preprint EN other-oa arXiv (Cornell University) 2023-01-01