NFDI4DS | UHH-SEMS - Publication Details

Anytime-valid off-policy Inference for Contextual Bandits

FOS: Computer and information sciences; Computer Science - Machine Learning; Mathematics - Statistics Theory; Machine Learning (stat.ML); Statistics Theory (math.ST); 01 natural sciences; Machine Learning (cs.LG); Methodology (stat.ME); Statistics - Machine Learning; FOS: Mathematics; 0101 mathematics; Statistics - Methodology

DOI: 10.1145/3643693
Publication Date: 2024-01-31T12:03:41Z

AUTHORS (5)

Ian Waudby-Smith

Lili Wu

Aaditya Ramdas

Nikos Karampatziakis

Paul Mineiro

ABSTRACT

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts X_t to actions A_t in an attempt to maximize stochastic rewards R_t. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data, a problem known as “off-policy evaluation” (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works (such as performing inference at prespecified sample sizes, uniformly bounded importance weights, constant logging policies, constant policy values, among others), significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times; (b) only make nonparametric assumptions; (c) do not require importance weights to be uniformly bounded and, if they are, we do not need to know these bounds; and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
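To make the off-policy evaluation setting concrete, the following minimal sketch computes two standard point estimates of a target policy's mean reward from logged bandit data: an importance-weighted (IPW) estimate and a doubly robust (DR) estimate. It is an illustration only, not the authors' implementation. It produces point estimates rather than the confidence sequences described in the abstract, and it assumes for simplicity that the target policy is fixed. All names (`ipw_estimate`, `dr_estimate`, `target_prob`, `reward_model`, the toy `logs`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One logged round: (context, action, reward, logging_prob), where
# logging_prob is the probability the logging policy gave the logged action.
LoggedRound = Tuple[float, int, float, float]


def ipw_estimate(
    logs: List[LoggedRound],
    target_prob: Callable[[float, int], float],
) -> float:
    """Importance-weighted estimate of the target policy's mean reward."""
    terms = [target_prob(x, a) / p * r for x, a, r, p in logs]
    return sum(terms) / len(terms)


def dr_estimate(
    logs: List[LoggedRound],
    target_prob: Callable[[float, int], float],
    reward_model: Callable[[float, int], float],
    actions: Sequence[int],
) -> float:
    """Doubly robust estimate: model prediction plus weighted residual."""
    terms = []
    for x, a, r, p in logs:
        w = target_prob(x, a) / p  # importance weight for the logged action
        # Expected model reward under the target policy in this context.
        baseline = sum(target_prob(x, b) * reward_model(x, b) for b in actions)
        terms.append(baseline + w * (r - reward_model(x, a)))
    return sum(terms) / len(terms)


# Toy data (made up): uniform logging over two actions; action 1 pays 1.0.
logs = [(0.0, 1, 1.0, 0.5), (0.0, 0, 0.0, 0.5),
        (1.0, 1, 1.0, 0.5), (1.0, 0, 0.0, 0.5)]


def always_one(x: float, a: int) -> float:
    """Target policy that always plays action 1."""
    return 1.0 if a == 1 else 0.0


def perfect_model(x: float, a: int) -> float:
    """Reward model that happens to match the toy rewards exactly."""
    return float(a)


print(ipw_estimate(logs, always_one))
print(dr_estimate(logs, always_one, perfect_model, actions=[0, 1]))
```

With a reward model that is identically zero, the DR estimate reduces to the IPW estimate. A better reward model shrinks the weighted residual term, which is the usual motivation for doubly robust estimators.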

REFERENCES (65)

CITATIONS (0)
