Thompson Sampling for A/B testing

Assume you are facing a binary choice problem, basically betting on outcome A versus outcome B. A and B can be two webpage designs, two advertising websites ... or two "Bandits" in a casino.
Obviously, if you knew the average return on investment of A and B, you would pick the more rewarding option.
But you don't have this information, and the only way to get it is to try both A and B a few times.
Your decision is now the following: how many times do I need to try A and B before I can confidently determine which is best? In a real-life problem like advertising, both A and B have a cost, so you need a reliable testing strategy.
Choose $\Theta_{a}=\left(\alpha_{a},\beta_{a}\right)$ and $\Theta_{b}=\left(\alpha_{b},\beta_{b}\right)$, the parameters of two Beta distributions $B_{a}$ and $B_{b}$ that describe our belief about the success rate of each action. In our case we model the outcome of each ad campaign as a Bernoulli trial. At each round we select one ad, and thus we update only $\Theta_{a}$ or $\Theta_{b}$. Note that at each round we sample from the two Betas to get Bernoulli probabilities $P_{a}$ and $P_{b}$. This sampling gives us our action: choose the action a/b that has the highest sampled probability.
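
The update of $\Theta_{x}$ mentioned above follows from the standard Beta-Bernoulli conjugacy. As a sketch, writing $r\in\left\{0,1\right\}$ for the reward observed after playing action $x$ (the symbol $r$ is introduced here only for illustration):

$$
P_{x}\sim\mathrm{Beta}\left(\alpha_{x},\beta_{x}\right),\quad r\mid P_{x}\sim\mathrm{Bernoulli}\left(P_{x}\right)\;\Longrightarrow\;P_{x}\mid r\sim\mathrm{Beta}\left(\alpha_{x}+r,\ \beta_{x}+1-r\right)
$$

So a success adds 1 to $\alpha_{x}$, a failure adds 1 to $\beta_{x}$, and the posterior mean of the success rate is $\alpha_{x}/\left(\alpha_{x}+\beta_{x}\right)$.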

Strategy / Algorithm

Sample estimates $\hat{P_{a}}$ and $\hat{P_{b}}$ from the Beta distributions $B_{a}$ and $B_{b}$
Select the action $x$ that has the highest $\hat{P_{x}}$ value
Perform action $x$
If $x$ succeeds, increment $\alpha_{x}$
Otherwise, increment $\beta_{x}$
Repeat the steps above
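
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `thompson_sampling`, the uniform Beta(1, 1) starting parameters, and the simulated Bernoulli rewards are all assumptions made for the example.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Run the loop on simulated Bernoulli actions.

    true_rates: dict mapping action name -> true success probability
                (unknown in a real test; used here only to simulate rewards).
    Returns (params, pulls): the final [alpha, beta] per action and the
    number of times each action was played.
    """
    rng = random.Random(seed)
    # Assumption: start every action from a uniform Beta(1, 1) prior.
    params = {name: [1, 1] for name in true_rates}
    pulls = {name: 0 for name in true_rates}

    for _ in range(n_rounds):
        # Sample an estimated success rate for each action from its Beta.
        samples = {name: rng.betavariate(a, b) for name, (a, b) in params.items()}
        # Select the action with the highest sampled value.
        chosen = max(samples, key=samples.get)
        pulls[chosen] += 1
        # Perform the action (simulated) and update alpha or beta.
        if rng.random() < true_rates[chosen]:
            params[chosen][0] += 1  # success: increment alpha
        else:
            params[chosen][1] += 1  # failure: increment beta
    return params, pulls

params, pulls = thompson_sampling({"a": 0.3, "b": 0.4}, n_rounds=2000)
print(params, pulls)
```

In a real test the simulated reward line would be replaced by the observed outcome of the action, for example whether the displayed ad was clicked.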

Notes:
- This loop does not need to be stopped: as a winning action emerges, the other one will be sampled more and more rarely.
- As the losing action is sampled less and less often, its estimate stays imprecise and will not converge toward its real value in practice.

Example

For this experiment we have action A rewarding 30% of the time and action B rewarding 40% of the time.

1st chart

2nd chart : posterior density of the success rates of A and B

3rd chart : probability of picking the best action

Here we use the running parameters $\Theta_{a}$ and $\Theta_{b}$ to compute $P\left(B_{b}\geq B_{a}\right)$.
Remark : There is no simple closed-form formula for this in general, but it only requires a simple 1D numerical integration.
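
One way to set up this integration is $P\left(B_{b}\geq B_{a}\right)=\int_{0}^{1}f_{b}(x)\,F_{a}(x)\,dx$, where $f_{b}$ is the density of $B_{b}$ and $F_{a}$ the cumulative distribution function of $B_{a}$. The sketch below evaluates it with a midpoint rule using only the Python standard library. The function names, the grid size `n`, and the example parameters are hypothetical, and the code assumes all Beta parameters are at least 1 so that the densities stay bounded.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at a point 0 < x < 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x))

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b, n=2000):
    """Midpoint-rule estimate of P(B_b >= B_a) = integral of f_b(x) * F_a(x) on [0, 1]."""
    h = 1.0 / n
    cdf_a = 0.0  # integral of f_a from 0 to the left edge of the current cell
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h  # cell midpoint, never exactly 0 or 1
        mass_a = beta_pdf(x, alpha_a, beta_a) * h
        # F_a at the midpoint: mass of earlier cells plus half of this cell.
        total += beta_pdf(x, alpha_b, beta_b) * (cdf_a + 0.5 * mass_a) * h
        cdf_a += mass_a
    return total

# Hypothetical running parameters after some rounds of the experiment.
print(prob_b_beats_a(alpha_a=30, beta_a=70, alpha_b=40, beta_b=60))
```

Calling this with the running parameters after each round gives the quantity described for the 3rd chart.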

Also called the "Bandit" problem or the "Explore vs Exploit" problem

Mathieu ZARADZKI - 2017
