This document contains an overview on how you can use beta distributions for certain inference tasks concerning AB-tests. The goal is to provide an overview of the problem from a distribution perspective as well as to provide some basic R code for the reader to play with.

The problem

People often wonder for how long they need to run their AB test before they know which version is better. This document contains sensible guidelines as well as some R code to help make this decision easier. We will focus the blogpost on the use of the beta distribution to help us deal with uncertainty.

Example Data

Let's assume that we've got some results from an AB test from two recommenders that were live on our website. The results of these two recommenders are listed below.

One engine named foo has a slightly lower click through rate (CTR), but is this enough to argue that bar is better? The difference between the two might just be due to chance and we would like to keep this into account before we decide that one alternative is better over another.

Describing Uncertainty via Simulations

In order to get an intuitive feeling on the problem at hand, we might first go ahead and simulate data. I'll pretend that we have a process that has the probability of causing a click with probability $0.015$ and another process that has the probability of causing a click with probability $0.02$. I will then simulate this process 10000 times to help us get an impression of what values we can expect. The results of this simulation are shown in the histogram below.

sample_func<-function(prob_hit=0.01,n_draws=1000){sample(c(0,1),n_draws,replace=TRUE,prob=c(1-prob_hit,prob_hit))%>%sum()%>%(function(x)x/n_draws)}to_plot<-data.frame(ctr1=1:5000%>%map_dbl(~sample_func(0.015,1000)),ctr2=1:5000%>%map_dbl(~sample_func(0.02,1000)))%>%gather(k,v)ggplot()+geom_histogram(data=to_plot,aes(v),binwidth=0.001)+facet_grid(k~.)+ggtitle("Notice the overlap")

Describing Uncertainty via Beta Distributions

We notice that there is less overlap when we allow for more samples. We can show this via simulation, but simulations can be rather slow. Turns out that there is a very nice distribution called the Beta distribution that calculates the curve we're interested in such that we don't need to perform all these simulations ourselves. To prevent this blog post from being very mathy, instead of showing all the involved math I'll instead show that the dbeta function in R does exactly what we want.

I'll first defined two helper functions in R that accept a dataframe.

calc_quantiles<-function(df){qtiles<-seq(0.0,0.005,0.00001)gen_betas<-function(shown,click){dbeta(qtiles,click,shown-click)}to_plot<-data.frame(x=qtiles)for(iin1:nrow(df)){name<-as.character(df[i,'recommender'])to_plot[name]<-gen_betas(df[i,'show'],df[i,'click'])}to_plot%>%gather(k,v,-x)}plot_quantiles<-function(to_plot){ggplot()+geom_ribbon(data=to_plot,aes(ymin=0,x=x,ymax=v,fill=k),alpha=0.5)+ggtitle("belief over two recommenders")}

These helper functions allow use to make pretty plots that help us prove a point. For example, for our AB results we can plot the curves directly now without the need of any simulation. We can increase the size of one of the A or B groups or CTR values to also glimpse at the overlap.

The right choice for sample size depends on your willingness of overlap between two distributions. When in doubt it seems the safest option to sample more! It is safer to make few decisions that are correct than make many that might be wrong.

A normal attitude is to predetermine beforehand how big the samples are before you make any conclusion based on the data. To determine an appropriate sample size you'll first need to determine how much overlap you are willig to have. Then, depending on the click rate you think you'll expect you can perform a calculation. The math for this is rather involved, so you may feel free to instead use the following formula/tool instead. It will calculate the probability that the result for group B are better than for group B given the results of an AB test.

You might be tempted to think that 84.7% is quite high and that you might decide that B is better than A. Realize that in practice this might mean that 15% of your conclusions will be wrong. It usually is better to be conservative and pick a larger sample size instead.

Determining Sample Size

The last function works in hindsight, but how do we determine the size of an A/B group beforehand?

I'll assume that you know the following;

the base rate, the CTR of what you currently have which is an estimate of how well your A group will be doing

the level of overlap that you're willing to accept

the division of traffic between group A and group B

For example. Suppose that my current CTR is 0.01, that I want to be able to measure a 0.1% difference with accuracy and that half my traffic will go to A and the other half will go to B. Then the following script visualises how many interactions I'll need to measure.

ctr_a<-0.01ctr_diff<-0.001ratio_a<-0.5num_samples<-seq(1,300000,1000)probs<-num_samples%>%map_dbl(~p_b_better_a(hits_a=ceiling(ctr_a*. *ratio_a)+1,shows_a=. *ratio_a,hits_b=ceiling(. *(1-ratio_a)*(ctr_a+ctr_diff))+1,shows_b=. *(1-ratio_a)))c95<-data.frame(p=probs,s=num_samples)%>%filter(p>=0.95)%>%head(1)%>%.$sc975<-data.frame(p=probs,s=num_samples)%>%filter(p>=0.975)%>%head(1)%>%.$sc99<-data.frame(p=probs,s=num_samples)%>%filter(p>=0.99)%>%head(1)%>%.$sggplot()+geom_line(data=data.frame(p=probs,s=num_samples),aes(s,p),colour='steelblue')+geom_vline(aes(xintercept=c95))+geom_vline(aes(xintercept=c975))+geom_vline(aes(xintercept=c99))+ggtitle("playground for sample size")

Conclusion

In this document we've described a mathematical view into how big your samples need to be. Be mindful that this should be considered as a mere lower bound.

Preferably you would have an AB-test run two weeks independant of the number of users that it hits simply because you want to make sure that you have some estimate of what the effect of weekend is. We've seen recommenders that perform very well on weekends and very poorly during weekdays.

Another thing that makes it hard to do AB testing in the real world is that the world usually changes. Your recommender might perform very well in the holiday season but very poorly during spring. You should always be testing against these time effects.