Yunbo Cao

ORCID: 0009-0005-2558-5206
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Text and Document Classification Technologies
  • Web Data Mining and Analysis
  • Speech and dialogue systems
  • Multimodal Machine Learning Applications
  • Advanced Text Analysis Techniques
  • Expert finding and Q&A systems
  • Data Quality and Management
  • Text Readability and Simplification
  • Sentiment Analysis and Opinion Mining
  • Information Retrieval and Search Behavior
  • Intelligent Tutoring Systems and Adaptive Learning
  • Semantic Web and Ontologies
  • Online Learning and Analytics
  • Domain Adaptation and Few-Shot Learning
  • Speech Recognition and Synthesis
  • Algorithms and Data Compression
  • Mobile Crowdsensing and Crowdsourcing
  • Spam and Phishing Detection
  • Second Language Acquisition and Learning
  • Human Pose and Action Recognition
  • Data Stream Mining Techniques
  • Video Analysis and Summarization
  • Machine Learning and Data Classification

Tencent (China)
2017-2023

Peking University
2023

Beihang University
2022

Chinese University of Hong Kong
2021

University of Electronic Science and Technology of China
2019

Shanghai Jiao Tong University
2008-2014

Microsoft Research Asia (China)
2002-2013

Agency for Science, Technology and Research
2011-2013

Microsoft (United States)
2003-2012

Dongbei University of Finance and Economics
2012

The paper is concerned with applying learning to rank to document retrieval. Ranking SVM is a typical method of learning to rank. We point out that there are two factors one must consider when applying Ranking SVM, or in general a "learning to rank" method, to document retrieval. First, correctly ranking documents on the top of the result list is crucial for an Information Retrieval system. One must conduct training in a way such that the top-ranked results are accurate. Second, the number of relevant documents can vary from query to query. One must avoid training a model biased toward queries with a large number of relevant documents. Previously, existing...

10.1145/1148170.1148205 article EN 2006-08-06

Chinese Spell Checking (CSC) aims to detect and correct erroneous characters for user-generated text in the Chinese language. Most of the Chinese spelling errors are misused semantically, phonetically or graphically similar characters. Previous attempts notice this phenomenon and try to utilize the similarity relationship for this task. However, these methods use either heuristics or handcrafted confusion sets to predict the correct character. In this paper, we propose a Chinese spell checker called REALISE, by directly leveraging the multimodal information...

10.18653/v1/2021.findings-acl.64 article EN cc-by 2021-01-01

In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose...

10.48550/arxiv.2305.17926 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Community-based question answering (cQA) services have accumulated millions of questions and their answers over time. In the process of accumulation, cQA services assume that questions always have unique best answers. However, with an in-depth analysis of questions and answers on cQA services, we find that the assumption cannot be true. According to the analysis, at least 78% of the cQA best answers are reusable when similar questions are asked again, but no more than 48% of them are indeed the unique best answers. We conduct the analysis by proposing taxonomies for cQA questions and answers. To better reuse the cQA content, we also propose applying automatic summarization...

10.3115/1599081.1599144 article EN 2008-01-01

Recent advances cast the entity-relation extraction to a multi-turn question answering (QA) task and provide an effective solution based on the machine reading comprehension (MRC) models. However, they use a single question to characterize the meaning of entities and relations, which is intuitively not enough because of the variety of context semantics. Meanwhile, existing models enumerate all relation types to generate questions, which is inefficient and easily leads to confusing questions. In this paper, we improve the existing MRC-based entity-relation extraction model through...

10.24963/ijcai.2020/546 article EN 2020-07-01

Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors, which are mainly caused by the phonological or visual similarity. Recently, pre-trained language models (PLMs) promote the progress of the CSC task. However, there exists a gap between the learned knowledge of PLMs and the goal of the CSC task. PLMs focus on the semantics in text and tend to correct the erroneous characters to semantically proper or commonly used ones, but these aren’t the ground-truth corrections. To address this issue, we propose an Error-driven COntrastive Probability...

10.18653/v1/2022.findings-acl.252 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

Hierarchical text classification (HTC) is a challenging subtask of multi-label classification due to its complex label hierarchy. Recently, the pretrained language models (PLMs) have been widely adopted in HTC through a fine-tuning paradigm. However, in this paradigm, there exists a huge gap between the classification tasks with sophisticated label hierarchy and the masked language model (MLM) pretraining tasks of PLMs, and thus the potential of PLMs cannot be fully tapped. To bridge the gap, in this paper, we propose HPT, a Hierarchy-aware Prompt Tuning method to handle HTC from a multi-label MLM...

10.18653/v1/2022.emnlp-main.246 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Exercise group recommendation plays an important role in many intelligent education tasks. However, existing approaches make recommendations based on the intrinsic features of exercises without considering students' learning abilities, or make selections from several pre-built exercise groups at the expense of flexibility. Furthermore, although cognitive diagnosis approaches have successfully revealed students' learning abilities, how to leverage the cognitive diagnosis results for exercise group recommendation is hardly explored. To flexibly recommend suitable exercise groups to students, this paper proposes...

10.1109/tetci.2022.3220812 article EN IEEE Transactions on Emerging Topics in Computational Intelligence 2023-01-16

We consider here the problem of Base Noun Phrase translation. We propose a new method to perform the task. For a given Base NP, we first search its translation candidates from the web. We next determine the possible translation(s) from among the candidates using one of the two methods that we have developed. In one method, we employ an ensemble of Naïve Bayesian Classifiers constructed with the EM Algorithm. In the other method, we use TF-IDF vectors also constructed with the EM Algorithm. Experimental results indicate that the coverage and accuracy of our method are significantly better than those of the baseline methods relying on existing...

10.3115/1072228.1072239 article EN Proceedings of the 19th international conference on Computational linguistics 2002-01-01

This paper is concerned with the problem of mining competitors from the Web automatically. Nowadays the fierce competition in the market necessitates every company not only to know which companies are its primary competitors, but also in which fields the company's rivals compete with itself and what its competitors' strength is in a specific competitive domain. The task of competitor mining that we address includes mining all of this information, such as competitors, competing fields and competitors' strength. A novel algorithm called CoMiner is proposed, which tries to conduct a Web-scale...

10.1109/tkde.2008.98 article EN IEEE Transactions on Knowledge and Data Engineering 2008-08-27

Wikification for tweets aims to automatically identify each concept mention in a tweet and link it to a concept referent in a knowledge base (e.g., Wikipedia). Due to the shortness of a tweet, a collective inference model incorporating global evidence from multiple mentions and concepts is more appropriate than a noncollective approach which links each mention at a time. In addition, it is challenging to generate sufficient high quality labeled data for supervised models with low cost. To tackle these challenges, we propose a novel semi-supervised graph...

10.3115/v1/p14-1036 article EN cc-by 2014-01-01

In a multi-turn knowledge-grounded dialog, the difference between the knowledge selected at different turns usually provides potential clues to knowledge selection, which has been largely neglected in previous research. In this paper, we propose a difference-aware knowledge selection method. It first computes the difference between the candidate knowledge sentences provided at the current turn and those chosen in the previous turns. Then, the differential information is fused with or disentangled from the contextual information to facilitate final knowledge selection. Automatic, human observational, and interactive...

10.18653/v1/2020.findings-emnlp.11 article EN cc-by 2020-01-01

In multi-label text classification (MLTC), each given document is associated with a set of correlated labels. To capture label correlations, previous classifier-chain and sequence-to-sequence models transform MLTC to a sequence prediction task. However, they tend to suffer from label order dependency, label combination over-fitting and error propagation problems. To address these problems, we introduce a novel approach with multi-task learning to enhance label correlation feedback. We first utilize a joint embedding (JE) mechanism...

10.18653/v1/2021.findings-acl.101 article EN cc-by 2021-01-01

Math Word Problem (MWP) solving needs to discover the quantitative relationships over natural language narratives. Recent work shows that existing models memorize procedures from context and rely on shallow heuristics to solve MWPs. In this paper, we look at this issue and argue that the cause is a lack of overall understanding of MWP patterns. We first investigate how a neural network understands patterns only from semantics, and observe that, if the prototype equations are the same, most problems get closer representations and those...

10.18653/v1/2022.findings-acl.195 article EN cc-by Findings of the Association for Computational Linguistics: ACL 2022 2022-01-01

Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, Zhifang Sui. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.369 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

Addressed in this paper is the issue of 'email data cleaning' for text mining. Many text mining applications need to take emails as input. Email data is usually noisy and thus it is necessary to clean it before mining. Several products offer email cleaning features; however, the types of noises that can be eliminated are restricted. Despite the importance of the problem, email cleaning has received little attention in the research community. A thorough and systematic investigation on the issue is thus needed. In this paper, email cleaning is formalized as a problem of non-text filtering and text normalization. In this way,...

10.1145/1081870.1081926 article EN 2005-08-21

The paper is concerned with the problem of question recommendation. Specifically, given a question as query, we are to retrieve and rank other questions according to their likelihood of being good recommendations of the queried question. A good recommendation provides alternative aspects around users' interest. We tackle the problem of question recommendation in two steps: first represent questions as graphs of topic terms, and then rank recommendations on the basis of the graphs. We formalize both steps as tree-cutting problems and then employ the MDL (Minimum Description Length) for selecting the best cuts. Experiments have...

10.1145/1367497.1367509 article EN 2008-04-21

Yixuan Su, Deng Cai, Qingyu Zhou, Zibo Lin, Simon Baker, Yunbo Cao, Shuming Shi, Nigel Collier, Yan Wang. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.137 article EN cc-by 2021-01-01

Cognitive diagnosis is a fundamental issue of intelligent education platforms, whose goal is to reveal the mastery of students on knowledge concepts. Recently, certain efforts have been made to improve the diagnosis precision, by designing deep neural networks-based diagnostic functions or incorporating more rich context features to enhance the representation of students and exercises. However, how to interpretably infer a student's mastery over non-interactive knowledge concepts (i.e., knowledge concepts not related to his/her exercising records) still remains challenging,...

10.1145/3511808.3557372 article EN Proceedings of the 31st ACM International Conference on Information & Knowledge Management 2022-10-16

This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method...

10.1145/1076034.1076079 article EN 2005-08-15

Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on varieties of NLP tasks. Recent works have shown that attention-based models can benefit from more focused attention over local regions. Most of them restrict the attention scope within a linear span, or confine to certain tasks such as machine translation and question answering. In this paper, we propose a syntax-aware local attention, where the attention scopes are restrained based on the distances in the syntactic structure. The proposed syntax-aware local attention can be...

10.18653/v1/2021.findings-acl.57 article EN cc-by 2021-01-01