Xuanhui Wang

ORCID: 0009-0000-1388-1423
Research Areas
  • Information Retrieval and Search Behavior
  • Topic Modeling
  • Web Data Mining and Analysis
  • Recommender Systems and Techniques
  • Text and Document Classification Technologies
  • Domain Adaptation and Few-Shot Learning
  • Natural Language Processing Techniques
  • Machine Learning and Algorithms
  • Data Management and Algorithms
  • Machine Learning and Data Classification
  • Advanced Image and Video Retrieval Techniques
  • Advanced Bandit Algorithms Research
  • Expert Finding and Q&A Systems
  • Explainable Artificial Intelligence (XAI)
  • Image Retrieval and Classification Techniques
  • Mobile Crowdsensing and Crowdsourcing
  • Face and Expression Recognition
  • Multimodal Machine Learning Applications
  • Data Mining Algorithms and Applications
  • Imbalanced Data Classification Techniques
  • Personal Information Management and User Behavior
  • Data Quality and Management
  • Optimization and Search Problems
  • Complex Network Analysis Techniques
  • Neural Networks and Applications

The First Affiliated Hospital, Sun Yat-sen University
2024-2025

China University of Mining and Technology
2020-2025

China Coal Research Institute (China)
2025

China Coal Technology and Engineering Group Corp (China)
2025

Sun Yat-sen University
2024-2025

Google (United States)
2016-2024

University of Waterloo
2023-2024

University of Massachusetts Amherst
2023

Qingdao Agricultural University
2018-2022

Meta (United States)
2012-2013

Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult and modeling bias is usually...

10.1145/1935826.1935878 preprint EN 2011-02-01

Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can easily be collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance of individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair, and this makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information...

10.1145/2911451.2911537 article EN Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval 2016-07-07

A well-known challenge in learning from click data is its inherent bias, most notably position bias. Traditional click models aim to extract the ‹query, document› relevance, and the estimated bias is usually discarded after relevance is extracted. In contrast, recent work on unbiased learning-to-rank can effectively leverage the bias and thus focuses on estimating bias rather than relevance [20, 31]. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because it can negatively impact users'...

10.1145/3159652.3159732 article EN 2018-02-02

Software errors are a major cause of system failures. Effectively designing tools and support for detecting and recovering from software failures requires a deep understanding of bug characteristics. Recently, software and its development process have changed significantly in many ways, including more help from bug detection tools, a shift towards multi-threading architecture, the open-source development paradigm, and increasing concerns about security and user-friendly interfaces. Therefore, results from previous studies may not be applicable to...

10.1145/1181309.1181314 article EN 2006-10-21

Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations...

10.1145/1281192.1281276 article EN 2007-08-12

Effective organization of search results is critical for improving the utility of any search engine. Clustering search results is an effective way to organize them, which allows a user to navigate into relevant documents quickly. However, two deficiencies of this approach mean it does not always work well: (1) the clusters discovered do not necessarily correspond to the interesting aspects of a topic from the user's perspective; and (2) the cluster labels generated are not informative enough to allow a user to identify the right cluster. In this paper, we propose to address these two deficiencies by...

10.1145/1277741.1277759 article EN 2007-07-23

Dyadic data arises in many real-world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in dyadic data, many topic modeling techniques were proposed. The typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both of these algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside on or close to an...

10.1145/1553374.1553388 article EN 2009-06-14

How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to optimize directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for has remained unknown until now. Due to this, there is no...

10.1145/3269206.3271784 article EN 2018-10-17
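
The "flat or discontinuous everywhere" property mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common NDCG convention with gain 2^rel - 1 and discount log2(rank + 1); the labels and scores are hypothetical and are not taken from the paper.

```python
import math

def dcg(relevances):
    """DCG of graded relevance labels listed in ranked order (top first).

    Assumes gain 2^rel - 1 and discount log2(rank + 1) with 1-based ranks.
    """
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(scores, labels):
    """NDCG of the ranking obtained by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0

labels = [2, 0, 1]                        # hypothetical graded labels
base = ndcg([0.9, 0.5, 0.40], labels)     # ranking: doc0, doc1, doc2
nudged = ndcg([0.9, 0.5, 0.49], labels)   # same ordering, metric unchanged
swapped = ndcg([0.9, 0.5, 0.51], labels)  # doc2 overtakes doc1, metric jumps
```

Moving the third score from 0.40 to 0.49 leaves the metric unchanged (zero gradient), while moving it just past 0.5 changes the metric abruptly. This is why such metrics are usually optimized through surrogate losses.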

To solve the problem whereby available on-site input data are too scarce to predict the level of groundwater, this paper proposes an algorithm to make this prediction, called the canonical correlation forest algorithm with a combination of random features. To assess the effectiveness of the proposed algorithm, groundwater levels and meteorological data for the Daguhe River groundwater source field, in Qingdao, China, were used. First, the results of a comparison among three regressors showed that the proposed algorithm is superior in terms of forecasting variations in groundwater level. Second, experiments...

10.1007/s13201-018-0742-6 article EN cc-by Applied Water Science 2018-07-24

While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to univariate scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of the other documents in the list. To overcome this...

10.1145/3341981.3344218 preprint EN 2019-09-26

Language model information retrieval depends on accurate estimation of document models. In this paper, we propose a document expansion technique to deal with the problem of insufficient sampling of documents. We construct a probabilistic neighborhood for each document, and expand the document with its neighborhood information. The expanded document provides a more accurate estimation of the document model, and thus improves retrieval accuracy. Moreover, since document expansion and pseudo feedback exploit different corpus structures, they can be combined to further improve performance. The experiment results on several different data sets...

10.3115/1220835.1220887 article EN 2006-01-01

Negative relevance feedback is a special case of relevance feedback where we do not have any positive example; this often happens when the topic is difficult and the search results are poor. Although in principle any standard relevance feedback technique can be applied to negative relevance feedback, it may not perform well due to the lack of positive examples. In this paper, we conduct a systematic study of methods for negative relevance feedback. We compare a set of representative negative feedback methods, covering vector-space models and language models, as well as several special heuristics for negative feedback. Evaluating negative feedback methods requires a test set with sufficient difficult topics,...

10.1145/1390334.1390374 article EN 2008-07-20

Search engine logs are an emerging new type of data that offers interesting opportunities for data mining. Existing work on mining such data has mostly attempted to discover knowledge at the level of queries (e.g., query clusters). In this paper, we propose to mine search engine logs for patterns at the level of terms by analyzing the relations of terms inside a query. We define two novel term association patterns (i.e., context-sensitive term substitutions and term additions) and propose new methods for mining such patterns from search engine logs. These two patterns can be used to address the mis-specification and under-specification...

10.1145/1458082.1458147 article EN 2008-10-26

With the explosive growth of online news readership, recommending interesting news articles to users has become extremely important. While existing Web services such as Yahoo! and Digg attract users' initial clicks by leveraging various kinds of signals, how to engage such users algorithmically after their initial visit is largely under-explored. In this paper, we study the problem of post-click news recommendation. Given that a user has perused a current news article, our idea is to automatically identify "related" news articles which the user would like to read afterwards....

10.1145/1963405.1963417 article EN 2011-03-28

One of the challenges of learning-to-rank for information retrieval is that ranking metrics are not smooth and as such cannot be optimized directly with gradient descent optimization methods. This gap has given rise to a large body of research that reformulates the problem to fit into existing machine learning frameworks or defines a surrogate, ranking-appropriate loss function. One such loss is ListNet's, which measures the cross entropy between a distribution over documents obtained from scores and another from ground-truth labels. This loss was...

10.1145/3341981.3344221 article EN 2019-09-26
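
The loss described in the abstract above, a cross entropy between a score-derived distribution over documents and a label-derived one, can be sketched roughly as follows. The abstract does not say how labels are turned into a distribution; applying a softmax to the raw labels is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(scores, labels):
    """Cross entropy between softmax(labels) and softmax(scores) for one list.

    Assumption: the ground-truth distribution is a softmax over raw labels.
    """
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

labels = [2.0, 1.0, 0.0]                                 # hypothetical labels
aligned = softmax_cross_entropy([2.0, 1.0, 0.0], labels)   # scores agree
reversed_ = softmax_cross_entropy([0.0, 1.0, 2.0], labels) # scores disagree
```

Unlike a rank-based metric, this quantity is smooth in the scores, so it can be minimized with gradient descent. In the example, scores that agree with the labels give a lower loss than scores that reverse them.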

This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are first encoded using BERT [1], and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is shown to be effective on the public MS MARCO benchmark [3]. Our first two submissions achieve the best performance for the passage re-ranking task [4], and the second best performance for the passage full-ranking task as of April 10, 2020 [5]. To leverage the latest development...

10.48550/arxiv.2004.08476 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Presentation bias is one of the key challenges when learning from implicit feedback in search engines, as it confounds the relevance signal. While it was recently shown how counterfactual learning-to-rank (LTR) approaches (Joachims et al., 2017) can provably overcome presentation bias when observation propensities are known, it remains to show how to effectively estimate these propensities. In this paper, we propose the first method for producing consistent propensity estimates without manual relevance judgments, disruptive...

10.1145/3289600.3291017 preprint EN 2019-01-30

Modern search engines leverage a variety of sources, beyond the conventional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on search request time and location. We...

10.1145/3038912.3052648 article EN 2017-04-03

Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that the clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model the noise as position-dependent trust bias and propose a noise-aware Position-Based Model, named...

10.1145/3308558.3313697 article EN 2019-05-13
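
As a rough illustration of the Inverse Propensity Scoring idea named in the abstract above, the sketch below reweights logged clicks by the probability that their display position was examined. The propensity values, the click log, and the function name are all hypothetical. The sketch also keeps the noise-free click assumption that the abstract says the paper relaxes.

```python
def ips_click_rate(clicks, impressions, propensity):
    """Click rate corrected by 1 / P(position examined).

    Assumes clicks are noise-free: a click happens only if the result was
    examined and is relevant.
    """
    if not 0.0 < propensity <= 1.0:
        raise ValueError("propensity must be in (0, 1]")
    return clicks / (impressions * propensity)

# Hypothetical examination propensities per display position.
propensity_by_position = {1: 1.0, 2: 0.5, 3: 0.25}

# Hypothetical click log: document -> (position, clicks, impressions).
log = {"doc_a": (1, 40, 100), "doc_b": (3, 10, 100)}

naive = {doc: c / n for doc, (_, c, n) in log.items()}
corrected = {doc: ips_click_rate(c, n, propensity_by_position[pos])
             for doc, (pos, c, n) in log.items()}
```

In this toy log the raw click rates differ by a factor of four, but after reweighting both documents receive the same estimate, which suggests the gap is explained by position alone.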

Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We introduce TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly...

10.1145/3292500.3330677 preprint EN 2019-07-25

Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but...

10.1145/3539618.3592047 article EN Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18