- Information Retrieval and Search Behavior
- Topic Modeling
- Web Data Mining and Analysis
- Recommender Systems and Techniques
- Text and Document Classification Technologies
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Machine Learning and Algorithms
- Data Management and Algorithms
- Machine Learning and Data Classification
- Advanced Image and Video Retrieval Techniques
- Advanced Bandit Algorithms Research
- Expert finding and Q&A systems
- Explainable Artificial Intelligence (XAI)
- Image Retrieval and Classification Techniques
- Mobile Crowdsensing and Crowdsourcing
- Face and Expression Recognition
- Multimodal Machine Learning Applications
- Data Mining Algorithms and Applications
- Imbalanced Data Classification Techniques
- Personal Information Management and User Behavior
- Data Quality and Management
- Optimization and Search Problems
- Complex Network Analysis Techniques
- Neural Networks and Applications
The First Affiliated Hospital, Sun Yat-sen University
2024-2025
China University of Mining and Technology
2020-2025
China Coal Research Institute (China)
2025
China Coal Technology and Engineering Group Corp (China)
2025
Sun Yat-sen University
2024-2025
Google (United States)
2016-2024
University of Waterloo
2023-2024
University of Massachusetts Amherst
2023
Qingdao Agricultural University
2018-2022
Meta (United States)
2012-2013
Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. Offline evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating the simulator itself is often difficult and modeling bias is usually...
Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can easily be collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance of individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair, and this makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information...
A well-known challenge in learning from click data is its inherent bias and most notably position bias. Traditional click models aim to extract the ‹query, document› relevance, and the estimated bias is usually discarded after relevance is extracted. In contrast, the most recent work on unbiased learning-to-rank can effectively leverage the bias and thus focuses on estimating bias rather than relevance [20, 31]. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because result randomization can negatively impact users'...
Software errors are a major cause for system failures. To effectively design tools and support for detecting and recovering from software failures requires a deep understanding of bug characteristics. Recently, software and its development process have significantly changed in many ways, including more help from bug detection tools, a shift towards multi-threading architecture, the open-source development paradigm, and increasing concerns about security and user-friendly interface. Therefore, results from previous studies may not be applicable to...
Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations...
Effective organization of search results is critical for improving the utility of any search engine. Clustering search results is an effective way to organize search results, which allows a user to navigate into relevant documents quickly. However, two deficiencies of this approach make it not always work well: (1) the clusters discovered do not necessarily correspond to the interesting aspects of a topic from the user's perspective; and (2) the cluster labels generated are not informative enough to allow a user to identify the right cluster. In this paper, we propose to address these two deficiencies by...
Dyadic data arises in many real world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in the dyadic data, many topic modeling techniques were proposed. The typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both of these two algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside close to an...
How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to be optimized directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for remains unknown until now. Due to this, there is no...
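As a rough illustration of why a metric such as NDCG is "flat or discontinuous" in the model scores, the sketch below computes NDCG from a list of scores and graded labels. It assumes the common exponential-gain, log2-discount form of DCG; the function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def dcg(labels_in_ranked_order, k=None):
    """Discounted cumulative gain with gain 2^label - 1 and a log2 discount."""
    labels = labels_in_ranked_order if k is None else labels_in_ranked_order[:k]
    return sum((2 ** label - 1) / math.log2(rank + 2)
               for rank, label in enumerate(labels))


def ndcg(labels_in_ranked_order, k=None):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0


def ndcg_from_scores(scores, labels, k=None):
    """Rank documents by descending score, then evaluate NDCG on the labels."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return ndcg([label for _, label in ranked], k)


# Hypothetical scores: a small perturbation that keeps the ordering intact
# leaves NDCG unchanged (flat), while a swap changes it abruptly.
labels = [3, 2, 1]
same_order_a = ndcg_from_scores([0.9, 0.5, 0.1], labels)
same_order_b = ndcg_from_scores([0.8, 0.6, 0.1], labels)
swapped = ndcg_from_scores([0.5, 0.9, 0.1], labels)
```

Because the metric depends on the scores only through the induced ordering, its gradient with respect to the scores is zero wherever it is defined, which is the difficulty the abstract refers to.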
To solve the problem whereby the available on-site input data are too scarce to predict the level of groundwater, this paper proposes an algorithm to make this prediction called the canonical correlation forest algorithm with a combination of random features. To assess the effectiveness of the proposed algorithm, groundwater levels and meteorological data for the Daguhe River groundwater source field, in Qingdao, China, were used. First, the results of a comparison among three regressors showed that the proposed algorithm is superior in terms of forecasting variations in groundwater level. Second, experiments...
While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to univariate scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of other documents in the list. To overcome this...
Language model information retrieval depends on accurate estimation of document models. In this paper, we propose a document expansion technique to deal with the problem of insufficient sampling of documents. We construct a probabilistic neighborhood for each document, and expand the document with its neighborhood information. The expanded document provides a more accurate estimation of the document model, and thus improves retrieval accuracy. Moreover, since document expansion and pseudo feedback exploit different corpus structures, they can be combined to further improve performance. The experiment results on several data sets...
Negative relevance feedback is a special case of relevance feedback where we do not have any positive example; this often happens when the topic is difficult and the search results are poor. Although in principle any standard relevance feedback technique can be applied to negative relevance feedback, it may not perform well due to the lack of positive examples. In this paper, we conduct a systematic study of methods for negative relevance feedback. We compare a set of representative negative feedback methods, covering vector-space models and language models, as well as several heuristics for negative feedback. Evaluating negative feedback methods requires a test set with sufficient topics,...
Search engine logs are an emerging new type of data that offers interesting opportunities for data mining. Existing work on mining such data has mostly attempted to discover knowledge at the level of queries (e.g., query clusters). In this paper, we propose to mine search engine logs for patterns at the level of terms through analyzing the relations of terms inside a query. We define two novel term association patterns (i.e., context-sensitive term substitutions and term additions) and propose new methods for mining such patterns from search engine logs. These two patterns can be used to address the mis-specification and under-specification...
With the explosive growth of online news readership, recommending interesting news articles to users has become extremely important. While existing Web services such as Yahoo! and Digg attract users' initial clicks by leveraging various kinds of signals, how to engage such users algorithmically after their initial visit is largely under-explored. In this paper, we study the problem of post-click news recommendation. Given that a user has perused a current news article, our idea is to automatically identify "related" news articles which the user would like to read afterwards....
One of the challenges of learning-to-rank for information retrieval is that ranking metrics are not smooth and as such cannot be optimized directly with gradient descent optimization methods. This gap has given rise to a large body of research that reformulates the problem to fit into existing machine learning frameworks or defines a surrogate, ranking-appropriate loss function. One such loss is ListNet's, which measures the cross entropy between a distribution over documents obtained from scores and another from ground-truth labels. This loss was...
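The cross-entropy loss described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the "top-one" form in which both distributions are softmaxes (over the model scores and over the ground-truth labels); the function names are hypothetical and the paper's exact formulation may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores, labels):
    """Cross entropy between softmax(labels) and softmax(scores).

    Assumption: top-one probabilities are used for both distributions.
    """
    target = softmax([float(label) for label in labels])
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Hypothetical list of three documents with graded labels 3, 2, 1.
labels = [3, 2, 1]
loss_aligned = listnet_loss([3.0, 2.0, 1.0], labels)   # scores agree with labels
loss_reversed = listnet_loss([1.0, 2.0, 3.0], labels)  # scores invert the order
```

Unlike a rank-based metric, this loss is smooth in the scores, so it can be minimized with gradient descent; it is smallest when the score distribution matches the label distribution.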
This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are firstly encoded using BERT [1], and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is proved to be effective in a public MS MARCO benchmark [3]. Our first two submissions achieve the best performance for the passage re-ranking task [4], and the second best performance for the passage full-ranking task as of April 10, 2020 [5]. To leverage the lately development...
Presentation bias is one of the key challenges when learning from implicit feedback in search engines, as it confounds the relevance signal. While it was recently shown how counterfactual learning-to-rank (LTR) approaches [Joachims et al., 2017] can provably overcome presentation bias when observation propensities are known, it remains to show how to effectively estimate these propensities. In this paper, we propose the first method for producing consistent propensity estimates without manual relevance judgments, disruptive...
Modern search engines leverage a variety of sources, beyond the conventional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on search request time and location. We...
Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that the clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model the noise as the position-dependent trust bias and propose a noise-aware Position-Based Model, named...
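The plain IPS idea that the abstract above takes as its starting point can be sketched in a few lines. The example assumes a position-based click model in which a click happens with probability (examination propensity at the position) × (relevance), and that clicks are noise-free; it does not implement the noise-aware model the paper proposes. The log format, the propensity values, and the function names are hypothetical.

```python
def naive_click_rate(click_log):
    """Per-document click-through rate, ignoring the display position."""
    clicks, shows = {}, {}
    for doc, _position, clicked in click_log:
        clicks[doc] = clicks.get(doc, 0) + clicked
        shows[doc] = shows.get(doc, 0) + 1
    return {doc: clicks[doc] / shows[doc] for doc in shows}


def ips_relevance(click_log, propensities):
    """IPS estimate: each click is weighted by 1 / propensity of its position."""
    weighted, shows = {}, {}
    for doc, position, clicked in click_log:
        weighted[doc] = weighted.get(doc, 0.0) + clicked / propensities[position]
        shows[doc] = shows.get(doc, 0) + 1
    return {doc: weighted[doc] / shows[doc] for doc in shows}


# Hypothetical propensities: position 0 is always examined, position 1 half the time.
propensities = [1.0, 0.5]

# Hypothetical log of (doc_id, position, clicked) impressions.
click_log = (
    [("a", 1, 1)] + [("a", 1, 0)] * 3    # "a" shown low: 1 click in 4 impressions
    + [("b", 0, 1)] * 2 + [("b", 0, 0)] * 2  # "b" shown high: 2 clicks in 4
)

naive = naive_click_rate(click_log)          # "a" looks half as good as "b"
debiased = ips_relevance(click_log, propensities)  # both estimated at 0.5
```

The inverse weighting compensates for documents that were shown where users rarely look, but it still treats every click as a true relevance signal, which is the assumption the abstract says the paper relaxes.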
Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We introduce TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly...
Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but...