Michael Bendersky

ORCID: 0000-0002-2941-6240
Research Areas
  • Topic Modeling
  • Information Retrieval and Search Behavior
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Web Data Mining and Analysis
  • Text and Document Classification Technologies
  • Multimodal Machine Learning Applications
  • Advanced Text Analysis Techniques
  • Recommender Systems and Techniques
  • Data Quality and Management
  • Machine Learning and Algorithms
  • Advanced Image and Video Retrieval Techniques
  • Image Retrieval and Classification Techniques
  • Personal Information Management and User Behavior
  • Machine Learning and Data Classification
  • Sentiment Analysis and Opinion Mining
  • Explainable Artificial Intelligence (XAI)
  • Semantic Web and Ontologies
  • Optimization and Search Problems
  • Data Management and Algorithms
  • Mobile Crowdsensing and Crowdsourcing
  • Expert finding and Q&A systems
  • Advanced Bandit Algorithms Research
  • Speech and dialogue systems
  • Video Analysis and Summarization

Google (United States)
2015-2024

University of Waterloo
2023-2024

University of Michigan–Ann Arbor
2024

Holon Institute of Technology
2015-2021

University of Massachusetts Amherst
2008-2013

Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can easily be collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance of individual query-document pairs in the context of web search. These models typically require a large quantity of clicks for each query-document pair, which makes them difficult to apply to systems where click data is highly sparse due to personalized corpora and information...

10.1145/2911451.2911537 article EN Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval 2016-07-07
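The click-model idea above can be illustrated with a minimal position-based sketch: the observed click-through rate is corrected by the probability that the position was examined. The examination probability here is assumed known; real click models estimate it jointly from data (e.g., via EM), and this is not the paper's exact model.

```python
# Minimal position-based click model sketch (illustrative only).
# Assumes P(click) = P(examined) * P(relevant), with a known
# examination probability per position.

def estimate_relevance(clicks, impressions, exam_prob):
    """Correct the raw click-through rate by the examination
    probability of the position the document was shown at."""
    ctr = clicks / impressions
    return ctr / exam_prob

# Hypothetical numbers: shown 1000 times at position 3, clicked 100
# times; assume position 3 is examined with probability 0.5.
rel = estimate_relevance(clicks=100, impressions=1000, exam_prob=0.5)
print(rel)  # 0.2
```

The division by the examination probability is what separates "not clicked because not relevant" from "not clicked because not seen".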

A well-known challenge in learning from click data is its inherent bias, most notably position bias. Traditional click models aim to extract the ‹query, document› relevance, and the estimated bias is usually discarded after relevance is extracted. In contrast, recent work on unbiased learning-to-rank can effectively leverage such bias, and thus focuses on estimating bias rather than relevance [20, 31]. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because result randomization can negatively impact users'...

10.1145/3159652.3159732 article EN 2018-02-02
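The unbiased learning-to-rank setup above typically reweights the loss of each clicked document by the inverse of its examination propensity (Inverse Propensity Scoring), so clicks at deep, rarely-examined positions count more. A minimal sketch, with assumed propensity values:

```python
# IPS-weighted loss sketch for unbiased learning-to-rank
# (illustrative; propensities are assumed known here).

def ips_weighted_loss(losses, clicked, propensities):
    """Sum per-document losses over clicked documents, each divided
    by the examination propensity of its position."""
    total = 0.0
    for loss, c, p in zip(losses, clicked, propensities):
        if c:
            total += loss / p
    return total

losses = [0.2, 0.4, 0.1]         # per-document ranking losses
clicked = [True, False, True]    # observed clicks
propensities = [1.0, 0.6, 0.25]  # P(position examined), assumed
print(ips_weighted_loss(losses, clicked, propensities))  # 0.6
```

The click at the low-propensity third position contributes four times its raw loss, which in expectation removes the position bias from the training signal.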

The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage high-quality visio-linguistic datasets for learning complementary information across the image and text modalities. In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.5 million entity-rich image-text...

10.1145/3404835.3463257 article EN Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11

Current search engines do not, in general, perform well with longer, more verbose queries. One of the main issues in processing these queries is identifying the key concepts that will have the most impact on effectiveness. In this paper, we develop and evaluate a technique that uses query-dependent, corpus-dependent, and corpus-independent features for automatic extraction of key concepts from verbose queries. We show that our method achieves higher accuracy in key concept identification than standard weighting methods such as inverse document frequency. Finally,...

10.1145/1390334.1390419 article EN 2008-07-20
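The feature-based concept weighting described above can be sketched as a linear combination of per-concept features. The features, feature values, and weights below are hypothetical, chosen only to show how a content-bearing concept would outscore a stopword-like one:

```python
# Hypothetical sketch of feature-based key-concept weighting for a
# verbose query. Feature values and weights are made up.

def concept_score(features, weights):
    """Linear combination of named feature values."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"idf": 0.7, "query_log_freq": 0.3}
candidates = {
    "civil war": {"idf": 4.2, "query_log_freq": 0.9},
    "members of": {"idf": 0.8, "query_log_freq": 0.1},
}
ranked = sorted(candidates,
                key=lambda c: concept_score(candidates[c], weights),
                reverse=True)
print(ranked[0])  # "civil war" outscores the stopword-like concept
```

In a trained model the weights would be learned from labeled key-concept data rather than set by hand.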

Modeling query concepts through term dependencies has been shown to have a significant positive effect on retrieval performance, especially for tasks such as web search, where relevance at high ranks is particularly critical. Most previous work, however, treats all concepts as equally important, an assumption that often does not hold, especially for longer, more complex queries. In this paper, we show that one of the most effective existing term dependence models can be naturally extended by assigning weights to concepts. We...

10.1145/1718487.1718492 article EN 2010-02-04
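The weighted-concept idea above amounts to scaling each concept's matching score by a learned importance weight before summing. The weights and matching scores in this sketch are illustrative, not taken from the paper:

```python
# Sketch of a weighted concept model: each query concept (single
# terms, phrases, term pairs) contributes its document matching
# score scaled by an importance weight. Values are illustrative.

def weighted_concept_score(concepts):
    """concepts: list of (weight, match_score) pairs for one doc."""
    return sum(w * s for w, s in concepts)

# Query "white house rose garden": term concepts plus a phrase.
doc_concepts = [
    (0.8, 1.2),  # "rose garden" exact phrase, high importance
    (0.3, 2.0),  # "white", common term, lower importance
    (0.3, 1.5),  # "house"
]
print(round(weighted_concept_score(doc_concepts), 2))  # 2.01
```

Setting all weights equal recovers the unweighted dependence model; learning them lets discriminative phrases dominate frequent but uninformative terms.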

Many existing retrieval approaches do not take into account the content quality of the retrieved documents, although link-based measures such as PageRank are commonly used as a form of document prior. In this paper, we present a quality-biased ranking method that promotes documents containing high-quality content, and penalizes low-quality documents. The quality of a document can be determined by its readability, layout and ease-of-navigation, among other factors. Accordingly, instead of using a single estimate for document quality, we consider...

10.1145/1935826.1935849 article EN 2011-02-01
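The quality-biased idea above can be sketched as adding a weighted combination of query-independent quality features to the query-dependent retrieval score. The feature names, values, and weights below are hypothetical:

```python
# Quality-biased ranking sketch: combine a query-dependent
# relevance score with query-independent quality features.
# Features and weights are illustrative, not the paper's model.

def quality_biased_score(rel_score, quality_features, quality_weights):
    quality = sum(quality_weights[k] * v
                  for k, v in quality_features.items())
    return rel_score + quality

weights = {"readability": 0.5, "link_density_ok": 0.3}
good = quality_biased_score(1.0, {"readability": 0.9,
                                  "link_density_ok": 1.0}, weights)
spam = quality_biased_score(1.1, {"readability": 0.2,
                                  "link_density_ok": 0.0}, weights)
print(good > spam)  # quality prior overturns a small relevance gap
```

The point of the example: a slightly less topically relevant but well-written page can outrank a marginally more relevant low-quality one.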

How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to optimize directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for remained unknown until now. Due to this, there is no...

10.1145/3269206.3271784 article EN 2018-10-17
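The LambdaRank mechanism referenced above scales a pairwise logistic gradient by the NDCG change obtained by swapping the two documents. A minimal sketch, using the standard gain 2^label - 1 and log2 discount (sign conventions vary across implementations; the numbers are illustrative):

```python
# LambdaRank-style "lambda" sketch: pairwise logistic gradient
# scaled by |delta NDCG| of swapping the two documents.
import math

def dcg_term(label, rank):  # rank is 1-based
    return (2 ** label - 1) / math.log2(rank + 1)

def delta_ndcg(label_i, label_j, rank_i, rank_j, ideal_dcg):
    before = dcg_term(label_i, rank_i) + dcg_term(label_j, rank_j)
    after = dcg_term(label_i, rank_j) + dcg_term(label_j, rank_i)
    return abs(after - before) / ideal_dcg

def lambda_ij(score_i, score_j, label_i, label_j,
              rank_i, rank_j, ideal_dcg):
    sig = 1.0 / (1.0 + math.exp(score_i - score_j))
    return -sig * delta_ndcg(label_i, label_j, rank_i, rank_j,
                             ideal_dcg)

# Relevant doc (label 3, rank 2) below an irrelevant one (label 0,
# rank 1); ideal_dcg is assumed for the example.
lam = lambda_ij(0.1, 0.5, 3, 0, 2, 1, ideal_dcg=7.0)
print(lam < 0)  # nonzero push toward swapping the pair
```

The metric-dependent scaling is exactly the part whose underlying loss the paper characterizes theoretically.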

Semantic text matching is one of the most important research problems in many domains, including, but not limited to, information retrieval, question answering, and recommendation. Among the different types of semantic matching, long-document-to-long-document matching has many applications, but has rarely been studied. Most existing approaches for semantic text matching have limited success in this setting, due to their inability to capture and distill the main ideas and topics from long-form text.

10.1145/3308558.3313707 article EN 2019-05-13

While in a classification or regression setting a label or a value is assigned to each individual document, in ranking we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of existing learning-to-rank algorithms model such relativity at the loss level, using pairwise or listwise loss functions. However, they are restricted to univariate scoring functions, i.e., the score of a document is computed based on the document itself, regardless of the other documents in the list. To overcome this...

10.1145/3341981.3344218 preprint EN 2019-09-26
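The univariate-vs-multivariate distinction above can be shown with a toy groupwise scoring function: each document's score depends on the other documents in the list, here illustrated by normalizing against the group mean (a stand-in for a learned groupwise model, not the paper's architecture):

```python
# Toy multivariate (groupwise) scoring sketch: a document's score
# depends on the whole list, unlike univariate scoring.

def groupwise_scores(raw_features):
    """Score each document relative to the group (deviation from
    the group mean, standing in for a learned groupwise model)."""
    mean = sum(raw_features) / len(raw_features)
    return [x - mean for x in raw_features]

print(groupwise_scores([3.0, 1.0, 2.0]))  # [1.0, -1.0, 0.0]
```

A univariate model would map 3.0 to the same score in every list; here the same feature value can score differently depending on its competitors.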

The majority of current information retrieval models weight the query concepts (e.g., terms or phrases) in an unsupervised manner, based solely on collection statistics. In this paper, we go beyond the unsupervised estimation of concept weights, and propose a parameterized concept weighting model. In our model, the weight of each concept is determined using a combination of diverse importance features. Unlike existing supervised ranking methods, our model learns importance weights not only for the explicit query concepts, but also for latent concepts that are associated with the query through...

10.1145/2009916.2009998 article EN Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval 2011-07-24

We propose to use the search log to study long queries, in order to understand the types of information needs that are behind them, and to design techniques to improve effectiveness when they are used. Long queries arise in many different applications, such as CQA (community-based question answering) and literature search, and have been studied to some extent using TREC data. They are also, however, quite common in web search, as can be seen by looking at the distribution of query lengths in a large scale query log.

10.1145/1507509.1507511 article EN 2009-02-09

Most standard information retrieval models use a single source of information (e.g., the corpus) for query formulation tasks such as term and phrase weighting and query expansion. In contrast, in this paper, we present a unified framework that automatically optimizes the combination of information sources used for effective query formulation. The proposed framework produces fully weighted and expanded queries that are both more effective and more compact than those produced by current state-of-the-art expansion methods. We conduct an empirical evaluation of our approach on newswire and web corpora....

10.1145/2124295.2124349 article EN 2012-02-08

One of the challenges of learning-to-rank for information retrieval is that ranking metrics are not smooth, and as such cannot be optimized directly with gradient descent optimization methods. This gap has given rise to a large body of research that reformulates the problem to fit into existing machine learning frameworks, or defines a surrogate, ranking-appropriate loss function. One such loss is ListNet's, which measures the cross entropy between a distribution over documents obtained from scores and another obtained from ground-truth labels. This loss was...

10.1145/3341981.3344221 article EN 2019-09-26
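The ListNet-style loss described above can be written in a few lines: take the softmax of the labels as the target distribution, the softmax of the scores as the predicted distribution, and compute their cross entropy. A minimal sketch:

```python
# ListNet-style listwise loss sketch: cross entropy between
# softmax(labels) and softmax(scores).
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(scores, labels):
    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

perfect = listnet_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
flipped = listnet_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0])
print(perfect < flipped)  # matching score order gives lower loss
```

Because both distributions come from softmax, the loss is smooth in the scores and can be minimized by ordinary gradient descent.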

Recently, the focus of many novel search applications has shifted from short keyword queries to verbose natural language queries. Examples include question answering systems and dialogue systems, voice search on mobile devices, and entity search engines like Facebook's Graph Search or Google's Knowledge Graph. However, the performance of textbook information retrieval techniques for such verbose queries is not as good as that for their shorter counterparts. Thus, effective handling of verbose queries has become a critical factor for adoption in this new breed...

10.1561/1500000050 article EN Foundations and Trends® in Information Retrieval 2015-01-01

Learning-to-Rank is a branch of supervised machine learning that seeks to produce an ordering of a list of items such that the utility of the ranked list is maximized. Unlike most machine learning techniques, however, the objective cannot be directly optimized using gradient descent methods as it is either discontinuous or flat everywhere. As such, learning-to-rank methods often optimize a loss function that is only loosely related to, or upper-bounds, the ranking utility instead. A notable exception is the approximation framework originally proposed by Qin et al., which facilitates a more direct...

10.1145/3331184.3331347 article EN 2019-07-18
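The approximation framework referenced above replaces the discontinuous rank of a document with a smooth estimate: one plus a sum of sigmoids of score differences, which makes rank-based metrics differentiable in the scores. A minimal sketch:

```python
# Smooth rank approximation sketch (the idea behind ApproxNDCG):
# rank_i ~= 1 + sum_j sigmoid((s_j - s_i) / temperature).
import math

def approx_ranks(scores, temperature=1.0):
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x / temperature))
    ranks = []
    for i, si in enumerate(scores):
        r = 1.0 + sum(sigmoid(sj - si)
                      for j, sj in enumerate(scores) if j != i)
        ranks.append(r)
    return ranks

# With well-separated scores, smooth ranks approach the true ranks.
ranks = approx_ranks([10.0, 5.0, 0.0], temperature=0.5)
print([round(r) for r in ranks])  # [1, 2, 3]
```

The temperature trades off smoothness against fidelity: lower values approximate the true ranks more tightly but make gradients sharper.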

Modern search engines leverage a variety of sources, beyond the conventional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on the time and location of the search request. We...

10.1145/3038912.3052648 article EN 2017-04-03

Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model click noise as position-dependent trust bias and propose a noise-aware Position-Based Model, named...

10.1145/3308558.3313697 article EN 2019-05-13

Learning-to-Rank deals with maximizing the utility of a list of examples presented to a user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, the support for learning-to-rank in deep learning has been limited. We introduce TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly...

10.1145/3292500.3330677 preprint EN 2019-07-25

Learning to Rank, a central problem in information retrieval, is a class of machine learning algorithms that formulate ranking as an optimization task. The objective is to learn a function that produces an ordering of a set of documents in such a way that the utility of the entire ordered list is maximized. Learning-to-rank methods do so by learning a function that computes a score for each document in the set. A ranked list is then compiled by sorting documents according to their scores. While this deterministic mapping of scores to permutations makes sense during inference, where stability of ranked lists...

10.1145/3336191.3371844 article EN 2020-01-20
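One standard way to make the score-to-permutation mapping stochastic, as the paper above motivates, is to perturb scores with Gumbel noise and sort: the result is a sample from the Plackett-Luce distribution over permutations. A minimal sketch (the seed and score values are illustrative):

```python
# Stochastic ranking sketch: Gumbel-perturbed scores sorted in
# descending order sample from a Plackett-Luce distribution.
import math
import random

def gumbel_sample_ranking(scores, rng):
    noisy = [s + -math.log(-math.log(rng.random())) for s in scores]
    # return document indices ordered by perturbed score
    return sorted(range(len(scores)), key=lambda i: noisy[i],
                  reverse=True)

rng = random.Random(0)
# A strong first document usually, but not always, stays on top.
samples = [gumbel_sample_ranking([3.0, 1.0, 0.0], rng)
           for _ in range(1000)]
top_counts = sum(1 for s in samples if s[0] == 0)
print(top_counts > 500)  # doc 0 wins most sampled rankings
```

At inference time one can drop the noise and sort the raw scores, recovering the usual deterministic ranking.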

Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak labels for Reddit posts with an LLM, and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce...

10.1145/3543873.3587324 article EN 2023-04-28

Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but...

10.1145/3539618.3592047 article EN Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18

Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. In this work, we conduct a case study approaching this problem with semi-supervised learning using a large language model (LLM). We select Reddit as the target social media platform due to its broad coverage of topics and content types. Our pipeline first generates weak labels...

10.1145/3543873.3587605 article EN 2023-04-28

10.1145/3626772.3657923 article EN Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10