Corby Rosset

ORCID: 0000-0001-9167-6214
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Information Retrieval and Search Behavior
  • Speech and dialogue systems
  • Advanced Graph Neural Networks
  • Optimization and Search Problems
  • Web Data Mining and Analysis
  • Domain Adaptation and Few-Shot Learning
  • Image Retrieval and Classification Techniques
  • Semantic Web and Ontologies
  • Multi-Agent Systems and Negotiation
  • Explainable Artificial Intelligence (XAI)
  • Expert finding and Q&A systems
  • Advanced Data Compression Techniques
  • Advanced Image and Video Retrieval Techniques
  • Intelligent Tutoring Systems and Adaptive Learning
  • Spam and Phishing Detection
  • Multimodal Machine Learning Applications
  • Recommender Systems and Techniques
  • Speech Recognition and Synthesis
  • Advanced Bandit Algorithms Research
  • Digital Rights Management and Security
  • Cognitive and developmental aspects of mathematical skills
  • Advanced Wireless Communication Techniques
  • Scheduling and Timetabling Solutions

Microsoft (United States)
2018-2024

Microsoft Research (United Kingdom)
2019-2023

Bellevue Hospital Center
2018-2019

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further...

10.48550/arxiv.2404.14219 preprint EN arXiv (Cornell University) 2024-04-22
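
For context, a minimal inference sketch for a small instruct model of this kind, assuming the Hugging Face transformers library and the public "microsoft/Phi-3-mini-4k-instruct" checkpoint; the model id, prompt, and generation settings are assumptions, not taken from the abstract.

```python
# Minimal sketch: chat-style generation with a small instruct model.
# Assumes: pip install transformers accelerate torch, and that the
# "microsoft/Phi-3-mini-4k-instruct" checkpoint is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # place weights on GPU/CPU automatically
    trust_remote_code=True,  # the repo may ship custom modeling code
)

messages = [{"role": "user", "content": "Explain axiomatic IR in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```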

Orca 1 learns from rich signals, such as explanation traces, allowing it to outperform conventional instruction-tuned models on benchmarks like BigBench Hard and AGIEval. In Orca 2, we continue exploring how improved training signals can enhance smaller LMs' reasoning abilities. Research on training small LMs has often relied on imitation learning to replicate the output of more capable models. We contend that excessive emphasis on imitation may restrict their potential; we instead seek to teach them to employ different solution strategies for different tasks,...

10.48550/arxiv.2311.11045 preprint EN cc-by arXiv (Cornell University) 2023-01-01

How much knowledge do pretrained language models hold? Recent research observed that transformers are adept at modeling semantics, but it is unclear to what degree they grasp human knowledge, or how to ensure they do. In this paper we incorporate knowledge-awareness into language model pretraining without changing the transformer architecture, inserting explicit knowledge layers, or adding external storage of semantic information. Rather, we simply signal the existence of entities in the input during pretraining, with an entity-extended tokenizer;...

10.48550/arxiv.2007.00655 preprint EN other-oa arXiv (Cornell University) 2020-01-01
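
A toy illustration of the "entity-extended tokenizer" idea described above: alongside the ordinary token sequence, emit a parallel channel of entity ids wherever a surface span matches a known entity. The dictionary, whitespace tokenization, and alignment below are simplified assumptions, not the paper's implementation.

```python
# Toy sketch: produce word tokens plus an aligned entity-id channel.
# Real systems use a learned subword tokenizer and a large entity linker;
# this dictionary lookup is only for illustration.
ENTITY_VOCAB = {"seattle": 1, "microsoft": 2, "bing": 3}  # assumed toy vocab
NO_ENTITY = 0

def entity_extended_tokenize(text: str):
    tokens = text.lower().split()  # stand-in for a real subword tokenizer
    entity_ids = [ENTITY_VOCAB.get(tok.strip(".,"), NO_ENTITY) for tok in tokens]
    # In pretraining, the entity ids would be embedded and added to the
    # corresponding token embeddings to signal "an entity exists here".
    return tokens, entity_ids

tokens, entity_ids = entity_extended_tokenize("Microsoft built Bing in Seattle.")
print(list(zip(tokens, entity_ids)))
# [('microsoft', 2), ('built', 0), ('bing', 3), ('in', 0), ('seattle.', 1)]
```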

Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances on offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address...

10.48550/arxiv.2502.11357 preprint EN arXiv (Cornell University) 2025-02-16

A long-standing challenge for search and conversational assistants is query intention detection in ambiguous queries. Asking clarifying questions has been widely studied and is considered an effective solution to resolve ambiguity. Existing work has explored various approaches to clarifying question ranking and generation. However, due to the lack of real data, they use artificial datasets for training, which limits their generalizability to real-world scenarios. As a result, industry has shown reluctance to implement them in reality,...

10.1145/3543507.3583420 article EN cc-by Proceedings of the ACM Web Conference 2023 2023-04-26

This paper presents the GEneric iNtent Encoder (GEN Encoder), which learns a distributed representation space for user intent in search. Leveraging large-scale user clicks from Bing search logs as weak supervision of intent, GEN Encoder learns to map queries with shared clicks into similar embeddings end-to-end and then fine-tunes on multiple paraphrase tasks. Experimental results on an intrinsic evaluation task - query intent similarity modeling - demonstrate GEN Encoder's robust and significant advantages over previous methods. Ablation...

10.1145/3331184.3331198 preprint EN 2019-07-18
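
A minimal sketch of the weak-supervision idea in GEN Encoder: queries that share clicked documents are pushed toward similar embeddings with a cosine objective, while unrelated queries are pushed apart. The tiny hashing encoder and toy pairs below are assumptions for illustration, not the paper's architecture or data.

```python
# Sketch: train an encoder so that co-clicked query pairs get high cosine
# similarity and unrelated pairs get low similarity.
import torch
import torch.nn as nn

VOCAB, DIM = 10_000, 64

class TinyQueryEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB, DIM, mode="mean")  # bag-of-words stand-in
        self.proj = nn.Linear(DIM, DIM)

    def forward(self, token_ids, offsets):
        return self.proj(self.emb(token_ids, offsets))

def batchify(queries):
    ids, offsets, pos = [], [], 0
    for q in queries:
        toks = [hash(w) % VOCAB for w in q.lower().split()]  # toy hashing tokenizer
        offsets.append(pos); ids.extend(toks); pos += len(toks)
    return torch.tensor(ids), torch.tensor(offsets)

encoder = TinyQueryEncoder()
loss_fn = nn.CosineEmbeddingLoss(margin=0.2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# toy weak supervision: (+1) pairs share a clicked result, (-1) pairs do not
pairs = [("cheap flights nyc", "low cost airfare new york", 1.0),
         ("cheap flights nyc", "python list comprehension", -1.0)]

for q1, q2, label in pairs:
    a = encoder(*batchify([q1]))
    b = encoder(*batchify([q2]))
    loss = loss_fn(a, b, torch.tensor([label]))
    opt.zero_grad(); loss.backward(); opt.step()
```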

In web search, typically a candidate generation step selects a small set of documents---from collections containing as many as billions of pages---that are subsequently ranked and pruned before being presented to the user. In Bing, candidate generation involves scanning the index using statically designed match plans that prescribe sequences of different match criteria and stopping conditions. In this work, we pose match planning as a reinforcement learning task and observe up to a 20% reduction in index blocks accessed, with little or no degradation in the quality of the candidate sets.

10.1145/3209978.3210127 article EN 2018-06-27
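
The match-planning idea above can be framed, in a very reduced form, as learning when to stop scanning the index: the agent trades candidate quality against index blocks touched. The toy simulator, state bucketing, and reward shaping below are illustrative assumptions, not the production setup described in the paper.

```python
# Toy sketch: tabular Q-learning for a stop/continue decision while scanning.
import random
from collections import defaultdict

BLOCK_COST = 0.05          # penalty per index block accessed (assumed)
ACTIONS = ("continue", "stop")
Q = defaultdict(float)     # Q[(state, action)] -> value
alpha, gamma, eps = 0.1, 1.0, 0.1

def simulate_block():
    """Each scanned block yields 0-3 good candidates (toy model)."""
    return random.choice([0, 0, 1, 2, 3])

def state(blocks, candidates):
    # coarse buckets keep the Q-table small
    return (min(blocks, 10), min(candidates // 5, 4))

def run_episode():
    blocks = candidates = 0
    while True:
        s = state(blocks, candidates)
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(s, x)])
        if a == "stop" or blocks >= 50:
            reward = min(candidates, 20) - BLOCK_COST * blocks  # quality vs. cost
            Q[(s, a)] += alpha * (reward - Q[(s, a)])
            return blocks, candidates
        blocks += 1
        candidates += simulate_block()
        s2 = state(blocks, candidates)
        target = gamma * max(Q[(s2, x)] for x in ACTIONS)  # no per-step reward
        Q[(s, a)] += alpha * (target - Q[(s, a)])

for _ in range(5000):
    run_episode()
print("average blocks scanned:", sum(run_episode()[0] for _ in range(100)) / 100)
```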

Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Additionally, they employ ensembling, where the outputs of up to 100 model runs are combined to arrive at a more...

10.48550/arxiv.2402.14830 preprint EN arXiv (Cornell University) 2024-02-16
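
The ensembling mentioned above is, in its simplest form, self-consistency: sample many solutions and keep the most common final answer. A minimal sketch, where solve_once is a hypothetical stand-in for one sampled SLM run.

```python
# Sketch: majority-vote ensembling over repeated sampled solutions.
import random
from collections import Counter

def solve_once(problem: str) -> str:
    """Stand-in for one sampled SLM run; returns the final numeric answer.
    A real implementation would call the model with temperature > 0."""
    return random.choice(["42", "42", "42", "41", "40"])  # toy answer distribution

def majority_vote(problem: str, n_runs: int = 100) -> str:
    answers = [solve_once(problem) for _ in range(n_runs)]
    answer, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n_runs} runs agreed on {answer!r}")
    return answer

majority_vote("Natalia sold clips to 48 of her friends in April...")
```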

Axiomatic information retrieval (IR) seeks a set of principled properties desirable in IR models. These properties, when formally expressed, provide guidance in the search for better relevance estimation functions. Neural ranking models typically contain many learnable parameters. The training of these models involves finding appropriate parameter values based on large quantities of labeled examples. Intuitively, axioms that can guide the search for better traditional IR models should also help machine learning based rankers. This work explores the use of IR axioms to augment the direct...

10.1145/3331184.3331296 article EN 2019-07-18
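
One way to read the approach above: keep the usual supervised ranking loss and add a regularizer that penalizes the neural ranker whenever it disagrees with an IR axiom on a perturbed document pair. The scoring interface, perturbation, and weighting below are simplified assumptions rather than the paper's exact recipe.

```python
# Sketch: supervised pairwise ranking loss + axiom-based regularization term.
import torch
import torch.nn.functional as F

def axiomatic_loss(score_fn, query, doc_pos, doc_neg, doc_perturbed, lam=0.1):
    """
    score_fn(query, doc) -> scalar relevance score from a differentiable ranker.
    doc_perturbed is a variant of doc_pos that an axiom says should score
    no higher (e.g., the same text diluted with off-topic terms).
    """
    s_pos = score_fn(query, doc_pos)
    s_neg = score_fn(query, doc_neg)
    s_pert = score_fn(query, doc_perturbed)

    # standard supervised objective: relevant doc above non-relevant doc
    supervised = F.relu(1.0 - (s_pos - s_neg))

    # axiom regularizer: the original doc should not score below its diluted copy
    axiom = F.relu(s_pert - s_pos)

    return supervised + lam * axiom
```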

In this work, we focus on the contextual document ranking task, which deals with the challenge of user interaction modeling for conversational search. Given a history of user feedback behaviors, such as issuing a query, clicking a document, and skipping a document, we propose to introduce behavior awareness into a neural ranker, resulting in a Hierarchical Behavior Aware Transformers (HBA-Transformers) model. The hierarchy is composed of an intra-behavior attention layer and an inter-behavior attention layer to let the system effectively distinguish and model...

10.1145/3397271.3401276 article EN 2020-07-25
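
A compact PyTorch sketch of the two-level attention hierarchy described above: self-attention first within each behavior's tokens, then across pooled behavior representations. The dimensions, mean pooling, and absence of positional or behavior-type embeddings are simplifying assumptions.

```python
# Sketch: intra-behavior attention over tokens, then inter-behavior attention
# over one pooled vector per behavior (query / click / skip events).
import torch
import torch.nn as nn

class HierarchicalBehaviorAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, behaviors):
        # behaviors: list of (1, tokens_i, dim) tensors, one per user behavior
        pooled = []
        for tokens in behaviors:
            attended, _ = self.intra(tokens, tokens, tokens)  # within one behavior
            pooled.append(attended.mean(dim=1))               # (1, dim) summary
        seq = torch.stack(pooled, dim=1)                      # (1, n_behaviors, dim)
        session, _ = self.inter(seq, seq, seq)                # across behaviors
        return session                                        # contextual session repr.

model = HierarchicalBehaviorAttention()
history = [torch.randn(1, 6, 128),   # issued query tokens
           torch.randn(1, 12, 128),  # clicked document tokens
           torch.randn(1, 9, 128)]   # skipped document tokens
print(model(history).shape)  # torch.Size([1, 3, 128])
```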

Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distributions, provides rich information for various kinds of downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next...

10.1145/3589335.3648327 preprint EN other-oa 2024-05-12

Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data by powerful models to teach a new...

10.48550/arxiv.2407.03502 preprint EN arXiv (Cornell University) 2024-07-03

Existing question answering (QA) datasets are no longer challenging to the most powerful Large Language Models (LLMs). Traditional QA benchmarks like TriviaQA, NaturalQuestions, ELI5 and HotpotQA mainly study ``known unknowns'' with clear indications of both what information is missing and how to find it to answer the question. Hence, good performance on these benchmarks provides a false sense of security. A yet unmet need of the NLP community is a bank of non-factoid, multi-perspective questions involving a great deal of unclear information needs,...

10.48550/arxiv.2402.17896 preprint EN arXiv (Cornell University) 2024-02-27

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each judge's annotations on all...

10.18653/v1/2024.acl-long.745 preprint EN 2024-01-01
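
A small numerical sketch of the combination step described above: each rubric question yields an LLM distribution over response options, those distributions are flattened into features, and a simple regression maps them to a particular judge's scores. The feature construction and least-squares fit are illustrative assumptions, not the paper's estimator, and the data here is synthetic.

```python
# Sketch: combine per-question LLM response distributions to predict one
# human judge's overall score via least squares.
import numpy as np

rng = np.random.default_rng(0)
N_TEXTS, N_QUESTIONS, N_OPTIONS = 200, 5, 3  # e.g., options = no / partly / yes

# llm_dists[i, q] is the LLM's probability distribution over options for
# rubric question q on text i (random stand-ins for real model outputs).
llm_dists = rng.dirichlet(np.ones(N_OPTIONS), size=(N_TEXTS, N_QUESTIONS))

# toy "ground truth": judge scores correlate with the expected option index
judge_scores = (llm_dists @ np.arange(N_OPTIONS)).mean(axis=1) \
    + rng.normal(0, 0.05, N_TEXTS)

X = llm_dists.reshape(N_TEXTS, -1)            # flatten distributions into features
X = np.hstack([X, np.ones((N_TEXTS, 1))])     # bias term
weights, *_ = np.linalg.lstsq(X, judge_scores, rcond=None)

pred = X @ weights
print("correlation with judge:", np.corrcoef(pred, judge_scores)[0, 1].round(3))
```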

In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora ("external memories"), with the option to "plug in" new memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from the end retrieval task, paired with hard negatives from the memory mixture. We instantiate the model in a dense retrieval setting by augmenting a strong T5-based retriever with MoMA. Our model,...

10.48550/arxiv.2302.03754 preprint EN cc-by arXiv (Cornell University) 2023-01-01
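
A minimal sketch of the mixture-of-memory retrieval step: a query embedding retrieves nearest neighbors from several independent corpora ("memories"), and a new memory can be plugged in simply by adding another index. The encode function and corpora are placeholders; the joint training with latent labels and hard negatives is not shown.

```python
# Sketch: retrieve top-k augmentation documents from a mixture of memories.
import numpy as np

rng = np.random.default_rng(0)
DIM = 32

def encode(texts):
    """Placeholder dense encoder; a real system would use a T5-based retriever."""
    return rng.normal(size=(len(texts), DIM))

class Memory:
    def __init__(self, name, docs):
        self.name, self.docs = name, docs
        self.vecs = encode(docs)                      # pre-computed doc embeddings

    def search(self, qvec, k=2):
        scores = self.vecs @ qvec                     # dot-product relevance
        top = np.argsort(-scores)[:k]
        return [(self.name, self.docs[i], float(scores[i])) for i in top]

memories = [
    Memory("wikipedia", ["doc w1", "doc w2", "doc w3"]),
    Memory("medical",   ["doc m1", "doc m2"]),
]
memories.append(Memory("news", ["doc n1", "doc n2"]))  # "plug in" a new memory

qvec = encode(["what causes migraines"])[0]
hits = sorted((h for m in memories for h in m.search(qvec)), key=lambda h: -h[2])
print(hits[:3])  # augmentation documents drawn from the memory mixture
```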

This paper studies post-training large language models (LLMs) using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning and subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fails to express complex intransitive or cyclic preference relations. While...

10.48550/arxiv.2404.03715 preprint EN arXiv (Cornell University) 2024-04-04
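
For intuition, a sketch of one contrastive update on an oracle-labeled preference pair, using a DPO-style objective; the actual Direct Nash Optimization procedure iterates batched on-policy sampling against a general preference oracle, which is not reproduced here. The log-probability values and beta are assumed inputs.

```python
# Sketch: DPO-style contrastive loss on a (preferred, dispreferred) pair,
# given summed token log-probs under the current policy and a frozen reference.
import torch
import torch.nn.functional as F

def contrastive_preference_loss(policy_logp_w, policy_logp_l,
                                ref_logp_w, ref_logp_l, beta=0.1):
    """w = response preferred by the oracle, l = dispreferred response."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin)

# toy scalars standing in for real sequence log-probabilities
loss = contrastive_preference_loss(
    policy_logp_w=torch.tensor(-12.3, requires_grad=True),
    policy_logp_l=torch.tensor(-10.8, requires_grad=True),
    ref_logp_w=torch.tensor(-12.0),
    ref_logp_l=torch.tensor(-11.0),
)
loss.backward()  # gradients push the policy toward the oracle-preferred response
print(float(loss))
```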

Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for training has yet to be realized, owing to computational and statistical...

10.48550/arxiv.2405.21046 preprint EN arXiv (Cornell University) 2024-05-31

This is the first year of the TREC Product Search track. The focus this year was the creation of a reusable collection and evaluation of the impact of using metadata and multi-modal data on retrieval accuracy. This year we leverage a new product search corpus, which includes contextual metadata. Our analysis shows that in the product search domain, traditional retrieval systems are highly effective and commonly outperform general-purpose pretrained embedding models. Our analysis also evaluates the impact of using simplified and metadata-enhanced collections, finding no clear trend in the impact of the expanded collection. We...

10.48550/arxiv.2311.07861 preprint EN public-domain arXiv (Cornell University) 2023-01-01

In this paper we improve the zero-shot generalization ability of language models via Mixture-Of-Memory Augmentation (MoMA), a mechanism that retrieves augmentation documents from multiple information corpora (external memories), with the option to "plug in" unseen memory at inference time. We develop a joint learning mechanism that trains the augmentation component with latent labels derived from the end retrieval task, paired with hard negatives from the memory mixture. We instantiate the model in a dense retrieval setting by augmenting strong T5-based retrievers with MoMA. With only...

10.18653/v1/2023.emnlp-main.111 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01