Anytime-valid Off-policy Inference for Contextual Bandits

Fields of science: Computer and information sciences; Mathematics
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML); Statistics Theory (math.ST); Methodology (stat.ME)
DOI: 10.1145/3643693 Publication Date: 2024-01-31T12:03:41Z
ABSTRACT
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts X_t to actions A_t in an attempt to maximize stochastic rewards R_t. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data, a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works (such as performing inference at prespecified sample sizes, uniformly bounded importance weights, constant logging policies, and constant policy values, among others), significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times; (b) only make nonparametric assumptions; (c) do not require importance weights to be uniformly bounded and, if they are, we do not need to know these bounds; and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
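To make the off-policy evaluation setup concrete, the following is a minimal sketch of the standard importance-weighted (inverse propensity) point estimate of a target policy's mean reward from data logged under a different policy. It is not the confidence-sequence construction of the paper, and it provides no anytime-valid guarantee on its own. The simulated environment, the two policies, and every name in the code (`logging_policy`, `target_policy`, `reward_mean`, `ipw_estimate`, `TRUE_VALUE`) are hypothetical choices made for illustration.

```python
import random

def logging_policy(x):
    """Hypothetical logging policy: probability of playing action 1 given context x."""
    return 0.5

def target_policy(x):
    """Hypothetical target policy: strongly prefers action 1 when x > 0.5."""
    return 0.9 if x > 0.5 else 0.1

def reward_mean(x, a):
    """Assumed environment: reward is Bernoulli(0.8) if the action matches the context, else Bernoulli(0.2)."""
    matches = (a == 1) == (x > 0.5)
    return 0.8 if matches else 0.2

# Under the assumptions above, the target policy matches with probability 0.9,
# so its mean reward is 0.9 * 0.8 + 0.1 * 0.2 = 0.74.
TRUE_VALUE = 0.74

def ipw_estimate(n, seed=0):
    """Average of importance-weighted rewards w_t * R_t over n logged rounds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random()                      # context X_t, uniform on [0, 1)
        h1 = logging_policy(x)
        a = 1 if rng.random() < h1 else 0     # action A_t drawn from the logging policy
        r = 1.0 if rng.random() < reward_mean(x, a) else 0.0  # reward R_t
        pi_a = target_policy(x) if a == 1 else 1.0 - target_policy(x)
        h_a = h1 if a == 1 else 1.0 - h1
        total += (pi_a / h_a) * r             # importance weight times reward
    return total / n

if __name__ == "__main__":
    print(ipw_estimate(20000), "vs. true value", TRUE_VALUE)
```

In this toy setting the importance weights are bounded because the logging policy plays each action with probability 0.5; the abstract's point (c) concerns precisely the cases where such a bound is unavailable or unknown.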