Chenghao Xiao

ORCID: 0000-0001-7623-8232
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Semantic Web and Ontologies
  • Advanced Text Analysis Techniques
  • Domain Adaptation and Few-Shot Learning
  • Music and Audio Processing
  • Online Learning and Analytics
  • Intelligent Tutoring Systems and Adaptive Learning
  • Academic Writing and Publishing
  • Text Readability and Simplification
  • Computational Physics and Python Applications
  • Speech Recognition and Synthesis
  • Neural Networks and Applications
  • Traditional Chinese Medicine Studies
  • Humor Studies and Applications
  • Speech and Audio Processing
  • Advanced Image and Video Retrieval Techniques
  • Machine Learning in Healthcare
  • Interpreting and Communication in Healthcare
  • Radiology practices and education
  • Ideological and Political Education
  • AI-based Problem Solving and Planning
  • Biomedical Text Mining and Ontologies
  • Video Analysis and Summarization

Durham University
2022-2023

Tsinghua University
2022

Beijing Jiaotong University
2022

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, code...

10.48550/arxiv.2502.13595 preprint EN arXiv (Cornell University) 2025-02-19
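
As a concrete illustration of how such a benchmark is typically consumed, below is a minimal sketch that runs a single evaluation task with the open-source mteb package; the task and model names are illustrative choices, and the exact API may differ across mteb versions.

import mteb
from sentence_transformers import SentenceTransformer

# Load any embedding model and pick one quality-controlled task from the benchmark.
model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])

# Run the evaluation and write the scores to disk.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")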

Incorporating contrastive learning objectives in sentence representation learning (SRL) has yielded significant improvements on many sentence-level NLP tasks. However, it is not well understood why contrastive learning works for learning sentence-level semantics. In this paper, we aim to help guide future designs of sentence representation learning methods by taking a closer look at contrastive SRL through the lens of isotropy, contextualization and learning dynamics. We interpret its successes through the geometry of the representation shifts and show that contrastive learning brings isotropy and drives high intra-sentence similarity: when in the same sentence, tokens converge...

10.18653/v1/2023.findings-acl.778 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2023 2023-01-01
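
The paper's observation about high intra-sentence similarity can be probed directly: the sketch below measures the average pairwise cosine similarity among the token embeddings of one sentence, for a vanilla encoder versus a contrastively tuned one. It is a toy probe assuming the cited checkpoints are available from the Hugging Face hub, not the paper's exact analysis.

import torch
from transformers import AutoTokenizer, AutoModel

def intra_sentence_similarity(model_name: str, sentence: str) -> float:
    # Mean pairwise cosine similarity among a sentence's token embeddings.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    hidden = torch.nn.functional.normalize(hidden, dim=-1)
    sim = hidden @ hidden.T
    n = sim.shape[0]
    return sim[~torch.eye(n, dtype=torch.bool)].mean().item()

sentence = "Tokens of the same sentence converge after contrastive tuning."
print(intra_sentence_similarity("bert-base-uncased", sentence))
print(intra_sentence_similarity("princeton-nlp/sup-simcse-bert-base-uncased", sentence))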

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their enormous parameter size and extremely high requirements for compute power pose challenges for their practical deployment. Recent research has revealed that specific capabilities of LLMs, such as numerical reasoning, can be transferred to smaller models through distillation. Some studies explore the potential of leveraging LLMs to perform table-based reasoning. However, there...

10.48550/arxiv.2309.13182 preprint EN other-oa arXiv (Cornell University) 2023-01-01
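
For context, transferring a capability from a large teacher to a small student is commonly done with a Hinton-style distillation objective; the sketch below shows that generic recipe (soft teacher targets plus hard gold labels), not necessarily the exact objective used in this paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.tensor([1, 0, 3, 7]))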

Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance on image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging...

10.48550/arxiv.2401.13478 preprint EN other-oa arXiv (Cornell University) 2024-01-01
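
Benchmarks like this are usually scored with ranking metrics such as Recall@k over an image-text similarity matrix. The snippet below is a self-contained sketch of that metric, with random embeddings standing in for a real model's outputs.

import numpy as np

def recall_at_k(image_emb, text_emb, k=10):
    # Rank all texts for each image by cosine similarity; the ground-truth
    # caption of image i is assumed to sit at index i (the diagonal).
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    ranks = (-(image_emb @ text_emb.T)).argsort(axis=1)
    hits = (ranks[:, :k] == np.arange(len(image_emb))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
print(recall_at_k(rng.normal(size=(100, 64)), rng.normal(size=(100, 64))))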

Pretrained language models have long been known to be subpar in capturing sentence and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains an unsolved problem. This is largely due to the discreteness of subword units brought by the tokenization of language models, limiting small perturbations of inputs to form semantics-preserved positive pairs. In this work, we conceptualize sentence-level textual semantics as a...

10.48550/arxiv.2402.08183 preprint EN arXiv (Cornell University) 2024-02-12

Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-Augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of these models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without being specifically trained for...

10.48550/arxiv.2404.06347 preprint EN arXiv (Cornell University) 2024-04-09
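
The core move of the work, turning a reasoning problem into a retrieval problem, can be sketched in a few lines: embed the question as the query and the answer candidates as the corpus, then "retrieve" the answer. The model, question, and candidates below are illustrative, not drawn from the benchmark.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
candidates = ["$0.05", "$0.10", "$1.00", "$0.55"]

# Treat the candidates as documents and pick the nearest one as the "answer".
q_emb = model.encode(question, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]
print(candidates[int(scores.argmax())])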

Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. In order to quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation datasets derived from fundamental physical laws, created by varying physical constants such as gravity and mass. We...

10.48550/arxiv.2405.17450 preprint EN arXiv (Cornell University) 2024-05-21
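
A toy analogue of such a diagnostic dataset makes the idea concrete: render an object in free fall as a frame sequence and vary a single physical constant. The sketch below illustrates the setup; it is not the paper's actual data generator.

import numpy as np

def falling_ball_frames(g, n_frames=20, dt=0.1, size=32):
    # Render a one-pixel "ball" in free fall as a sequence of binary frames.
    frames = np.zeros((n_frames, size, size), dtype=np.float32)
    for t in range(n_frames):
        y = 0.5 * g * (t * dt) ** 2  # displacement under constant gravity
        frames[t, min(int(y), size - 1), size // 2] = 1.0
    return frames

# Varying the constant yields visibly different dynamics to diagnose.
earth = falling_ball_frames(g=9.8)
moon = falling_ball_frames(g=1.6)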

Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models. However, evaluation in the domain suffers from a lack of fair and robust metrics. We reveal that high performance on RRG under existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU score only by learning the template of reports. This has become an urgent problem for RRG due to the highly patternized nature of these reports. In this work, we un-intuitively approach this problem by proposing the Layman's...

10.48550/arxiv.2406.17911 preprint EN arXiv (Cornell University) 2024-06-25
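
The "BLEU mirage" is easy to reproduce: a report that is nothing but memorised boilerplate can score perfectly against a reference that shares the template. The toy example below, with invented reports, uses NLTK's sentence-level BLEU.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy reference reports sharing heavy boilerplate.
references = [
    "the heart is normal in size . the lungs are clear . no pleural effusion .".split(),
    "the heart is normal in size . the lungs are clear . no pneumothorax .".split(),
]
# A "model" output that only memorised the template, with no image-specific findings.
template = "the heart is normal in size . the lungs are clear . no pleural effusion .".split()

smooth = SmoothingFunction().method1
print(sentence_bleu(references, template, smoothing_function=smooth))  # ~1.0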

Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly in the domain of Biomedical Method NER, the task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research on Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we...

10.48550/arxiv.2406.20038 preprint EN arXiv (Cornell University) 2024-06-28

Rigour is crucial for scientific research as it ensures the reproducibility and validity of results and findings. Despite its importance, little work exists on modelling rigour computationally, and there is a lack of analysis on whether rigour criteria can effectively signal or measure the rigour of papers in practice. In this paper, we introduce a bottom-up, data-driven framework to automatically identify and define rigour criteria and assess their relevance in scientific writing. Our framework includes rigour keyword extraction, detailed definition generation, salient...

10.48550/arxiv.2410.04981 preprint EN arXiv (Cornell University) 2024-10-07

Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic models often encode the contextual information of documents while ignoring the contextual details of candidate centroid words, leading to the inaccurate selection of topic words due to the contextualization gap. In parallel, it is found that functional words are frequently selected over topical words. To address these limitations, we introduce CAST: Corpus-Aware Self-similarity...

10.48550/arxiv.2410.15136 preprint EN arXiv (Cornell University) 2024-10-19
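
One plausible reading of word-level self-similarity, averaging the cosine similarity of a word's contextualised embeddings across contexts, is sketched below; it is a simplified stand-in for the paper's corpus-aware scoring, with the model and contexts as illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

def self_similarity(word, contexts, model_name="bert-base-uncased"):
    # Mean pairwise cosine similarity of a word's embeddings across contexts.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    first_id = tok(word, add_special_tokens=False)["input_ids"][0]
    vecs = []
    for ctx in contexts:
        enc = tok(ctx, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        pos = (enc["input_ids"][0] == first_id).nonzero(as_tuple=True)[0]
        if len(pos) > 0:
            vecs.append(hidden[pos[0]])  # vector of the first occurrence
    vecs = torch.nn.functional.normalize(torch.stack(vecs), dim=-1)
    sim = vecs @ vecs.T
    n = len(vecs)
    return sim[~torch.eye(n, dtype=torch.bool)].mean().item()

contexts = ["the model extracts topics", "the topics of the corpus shift"]
print(self_similarity("topics", contexts))
print(self_similarity("the", contexts))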

Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder scalability and flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task...

10.48550/arxiv.2411.10503 preprint EN arXiv (Cornell University) 2024-11-15

Audio classification plays a crucial role in speech and sound processing tasks, with a wide range of applications. There still remains a challenge in striking the right balance between fitting the model to the training data (avoiding overfitting) and enabling it to generalise well to a new domain. Leveraging the transferability of contrastive learning, we introduce Audio Contrastive-based Fine-tuning (AudioConFit), an efficient approach characterised by robust generalisability. Empirical experiments on a variety of audio classification tasks demonstrate...

10.48550/arxiv.2309.11895 preprint EN other-oa arXiv (Cornell University) 2023-01-01
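
A standard supervised contrastive objective, in the spirit of contrastive fine-tuning though not necessarily the paper's exact loss, pulls together embeddings of clips that share a class label. A minimal sketch:

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    # features: (batch, dim) embeddings; labels: (batch,) class ids.
    features = F.normalize(features, dim=-1)
    sim = features @ features.T / temperature
    eye = torch.eye(len(features), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))  # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Positives: other samples carrying the same label.
    pos = (labels[:, None] == labels[None, :]) & ~eye
    loss = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

loss = supervised_contrastive_loss(torch.randn(8, 128),
                                   torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))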

Taking inspiration from how human children learn language, we pose a question: can a "baby language model" gradually internalize a concept by exposing itself to the concept in unlimited, oftentimes irrelevant contexts, and what does this mean under limited pretraining resources (both data-wise and GPU-wise)? Throughout the study, we restrict our experiments to two data-limited settings, 10M and 100M tokens, which are respectively 1/3000 and 1/300 of what was available for the training of RoBERTa. Our best performing recipe performs within 1.2% of RoBERTa, on-par...

10.18653/v1/2023.conll-babylm.28 article EN cc-by 2023-01-01

Incorporating contrastive learning objectives in sentence representation learning (SRL) has yielded significant improvements on many sentence-level NLP tasks. However, it is not well understood why contrastive learning works for learning sentence-level semantics. In this paper, we aim to help guide future designs of sentence representation learning methods by taking a closer look at contrastive SRL through the lens of isotropy, contextualization and learning dynamics. We interpret its successes through the geometry of the representation shifts and show that contrastive learning brings isotropy and drives high intra-sentence similarity: when in the same sentence, tokens converge...

10.48550/arxiv.2212.09170 preprint EN cc-by-nc-nd arXiv (Cornell University) 2022-01-01

In recent years, contrastive learning (CL) has been extensively utilized to recover the sentence and document-level encoding capability of pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical...

10.48550/arxiv.2310.16193 preprint EN other-oa arXiv (Cornell University) 2023-01-01
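
One way to read "positives from the length signal alone" is to pair a document with a truncation of itself, so the two views differ only in length; this is a loose sketch of the idea, not the paper's exact construction.

import random

def length_positive_pair(document, min_ratio=0.3):
    # The anchor and its random-length prefix form an unsupervised positive pair.
    words = document.split()
    cut = random.randint(max(1, int(len(words) * min_ratio)), len(words))
    return " ".join(words), " ".join(words[:cut])

anchor, positive = length_positive_pair(
    "contrastive learning recovers document-level encoding capability "
    "from pre-trained language models")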

In recent years, contrastive learning (CL) has been extensively utilized to recover the sentence and document-level encoding capability of pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical...

10.18653/v1/2023.emnlp-main.86 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Medical triage robots leverage natural language processing algorithms to provide accurate medical information and services, ultimately alleviating the strain on healthcare specialists. However, their effectiveness often hinges on the quality of topic assignment. This study proposes a Knowledge-Constrained Labeled Latent Dirichlet Allocation (KC-LLDA) method, which incorporates domain-specific knowledge constraints into labeled LDA. KC-LLDA was compared with other existing similar topic extraction methods and demonstrated...

10.24846/v32i4y202304 article EN other-oa Studies in Informatics and Control 2023-12-18
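
KC-LLDA itself is a labeled-LDA variant, but the general mechanism of injecting domain knowledge into a topic model can be sketched with gensim by seeding the topic-word prior (eta); the vocabulary and seed words below are invented for illustration.

import numpy as np
from gensim import corpora
from gensim.models import LdaModel

docs = [["fever", "cough", "triage"], ["fracture", "xray", "triage"],
        ["fever", "infection", "ward"], ["xray", "bone", "scan"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Encode domain knowledge as an asymmetric topic-word prior:
# boost "fever" under topic 0 and "xray" under topic 1.
eta = np.full((2, len(dictionary)), 0.01)
eta[0, dictionary.token2id["fever"]] = 1.0
eta[1, dictionary.token2id["xray"]] = 1.0

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, eta=eta, random_state=0)
print(lda.print_topics())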

With the identification of the inequality encoded in information acquisition among social classes, we propose to leverage a powerful concept that has never been studied as a linguistic construct, "fun", to deconstruct inequality. Inspired by theories in sociology, we draw a connection between social class and information cocoon, and, through the lens of fun, hypothesize that a measurement of "how fun one's dominating information cocoon is" can be an indicator of the social class of an individual. Following this, we propose an NLP framework to combat the issue by measuring how fun one's cocoon is, and empower individuals to emancipate...

10.18653/v1/2022.nlp4pi-1.12 article EN cc-by 2022-01-01