The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits
Current algorithms for the standard multi-armed bandit problem have good empirical performance and optimal regret bounds. However, real-world problems often differ from the standard formulation in several ways. First, feedback may be delayed instead of arriving immediately. Second, the real world often contains structure which suggests heuristics, which we wish to incorporate while retaining strong theoretical guarantees. Third, we may wish to make use of an arbitrary prior dataset without negatively impacting performance. Fourth, we may wish to efficiently evaluate algorithms using a previously collected dataset. Surprisingly, these seemingly-disparate problems can be addressed using algorithms inspired by a recently-developed queueing technique. We present the Stochastic Delayed Bandits (SDB) algorithm as a solution to these four problems, which takes black-box bandit algorithms (including heuristic approaches) as input while achieving good theoretical guarantees. We present empirical results from both synthetic simulations and real-world data drawn from an educational game. Our results show that SDB outperforms state-of-the-art approaches to handling delay, heuristics, prior data, and evaluation.
Travis Mandel
Yun-En Liu
Emma Brunskill
Zoran Popović
The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits
Travis Mandel, Yun-En Liu, Emma Brunskill, Zoran Popović
AAAI Conference on Artificial Intelligence (AAAI 2015)
[Full Paper (1.8 MB PDF)]
[Appendix (Contains proofs and additional empirical results) (1.1 MB PDF)]
Note: Paper updated in July 2016 to correct a small error in the statement of the pseudocode.
Treefrog Treasure (educational numberline game used in our experiments)
Project Website
[Play Treefrog Treasure! (Requires Adobe Flash Player)]