On Tuesday, May 1st, I'll be talking about stochastic bandits with delayed feedback at the University of Washington. It's a real honor and pleasure to speak there and I truly thank Joseph Salmon and Zaid Harchaoui for their invitation.
This is a joint work with Olivier Cappé and Vianney Perchet that was accepted at UAI 2017.
Almost all real world implementations of bandit algorithms actually deal with bandit feedback: after a customer is presented an ad, his click (if any) is not sent within milliseconds but rather minutes or even hours or days, depending on the application. Moreover, this problem is coupled with an observation ambiguity: while the system is waiting for a click feedback, the customer might already have decided not to click at all and the learner will never get the awaited reward.
In this talk we introduce a delayed feedback model for stochastic bandits. We first consider the situation when the learner has an infinite patience and show that in that case the problem is actually not harder than the non-delayed one and can be solved similarly. However, this comes at a huge memory cost in O(T), T being the length of the experiment. Thus, we introduce a short-memory setting that mitigates the previously mentioned issue at the price of an additional censoring effect on the feedback that we carefully handle. We present an asymptotically optimal algorithm together with a regret bound and demonstrate empirically its behavior on simulated data.