Anytime-Valid Off-Policy Inference for Contextual Bandits
FOS: Computer and information sciences
Computer Science - Machine Learning
Mathematics - Statistics Theory
Machine Learning (stat.ML)
Statistics Theory (math.ST)
01 natural sciences
Machine Learning (cs.LG)
Methodology (stat.ME)
Statistics - Machine Learning
FOS: Mathematics
0101 mathematics
Statistics - Methodology
DOI:
10.1145/3643693
Publication Date:
2024-01-31T12:03:41Z
AUTHORS (5)
ABSTRACT
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts X_t to actions A_t in an attempt to maximize stochastic rewards R_t. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data, a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works (such as performing inference at prespecified sample sizes, uniformly bounded importance weights, constant logging policies, constant policy values, among others), significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times; (b) only make nonparametric assumptions; (c) do not require importance weights to be uniformly bounded and, if they are, we do not need to know these bounds; and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
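To make the off-policy evaluation setup in the abstract concrete, the following is a minimal sketch of the textbook importance-weighted (inverse propensity) point estimate of a target policy's mean reward, updated after every logged round. It is not the paper's method: it produces no confidence sequence and carries no anytime-valid guarantee. The names `running_ipw_estimates`, `always_action_one`, and the toy `logs` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# One logged round: (context X_t, action A_t, reward R_t,
# probability with which the logging policy chose A_t given X_t).
LoggedRound = Tuple[float, int, float, float]


def running_ipw_estimates(
    logs: Iterable[LoggedRound],
    target_prob: Callable[[float, int], float],
) -> List[float]:
    """Return the importance-weighted estimate of the target policy's
    mean reward after each round (a point estimate only)."""
    estimates: List[float] = []
    total = 0.0
    for t, (x, a, r, logging_prob) in enumerate(logs, start=1):
        if logging_prob <= 0.0:
            raise ValueError("logging probability must be positive")
        # Importance weight: target probability over logging probability.
        weight = target_prob(x, a) / logging_prob
        total += weight * r
        estimates.append(total / t)
    return estimates


def always_action_one(x: float, a: int) -> float:
    """Hypothetical target policy that always plays action 1."""
    return 1.0 if a == 1 else 0.0


# Toy log; the logging probability may differ from round to round.
logs = [
    (0.2, 1, 1.0, 0.5),
    (0.7, 0, 1.0, 0.5),
    (0.4, 1, 0.0, 0.25),
    (0.9, 1, 1.0, 0.5),
]
estimates = running_ipw_estimates(logs, always_action_one)
print(estimates)
```

Because the logging probability is read per round, the sketch tolerates a logging policy that changes over time. The running average alone says nothing about uncertainty at a data-dependent stopping time, which is the gap the confidence sequences described in the abstract are meant to close.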
REFERENCES (65)