Sample Sizes and Beta Distributions


by Vincent WARMERDAM, NPO on 04 Aug 2016


Vincent D. Warmerdam - GoDataDriven - 2016-07-26

This document contains an overview of how you can use beta distributions for certain inference tasks concerning AB tests. The goal is to give an overview of the problem from a distribution perspective, as well as to provide some basic R code for the reader to play with.

The problem

People often wonder how long they need to run their AB test before they know which version is better. This document contains sensible guidelines as well as some R code to help make this decision easier. This blog post focuses on the use of the beta distribution to help us deal with the uncertainty involved.

Example Data

Let's assume that we've got some results from an AB test from two recommenders that were live on our website. The results of these two recommenders are listed below.

library(dplyr)               
library(tidyr)               
library(purrr)               
library(ggplot2)               

df <- data.frame(               
  recommender = c("foo", "bar"),                
  show = c(1100, 1200),               
  click = c(40, 54)               
) %>% mutate(ctr = click/show)               

df %>% print               

This script outputs the following:

  recommender show click        ctr
1         foo 1100    40 0.03636364
2         bar 1200    54 0.04500000

One engine named foo has a slightly lower click-through rate (CTR), but is this enough to argue that bar is better? The difference between the two might just be due to chance, and we would like to take this into account before we decide that one alternative is better than the other.

Describing Uncertainty via Simulations

To get an intuitive feeling for the problem at hand, we might first simulate some data. I'll pretend that we have one process that causes a click with probability $0.015$ and another process that causes a click with probability $0.02$. I will then run the simulation 5000 times for each process (with 1000 draws per simulation) to get an impression of what values we can expect. The results of this simulation are shown in the histogram below.

# simulate `n_draws` impressions and return the observed click-through rate
sample_func <- function(prob_hit = 0.01, n_draws = 1000){
  sample(c(0,1), n_draws, replace=TRUE, prob = c(1 - prob_hit, prob_hit)) %>%
    sum() %>%
    (function(x) x/n_draws)
}

to_plot <- data.frame(               
  ctr1 = 1:5000 %>% map_dbl(~ sample_func(0.015, 1000)),                
  ctr2 = 1:5000 %>% map_dbl(~ sample_func(0.02, 1000))               
) %>% gather(k, v)               

ggplot() +                
  geom_histogram(data=to_plot, aes(v), binwidth = 0.001) +                
  facet_grid(k ~ .) +                
  ggtitle("Notice the overlap")               

http://ebu.io/media/uploads/uploads/ebu1.png

Describing Uncertainty via Beta Distributions

We would notice that there is less overlap if we allowed for more samples per simulation. We can show this via simulation, but simulations can be rather slow. It turns out that there is a very nice distribution, the Beta distribution, which describes exactly the curve we're interested in, so we don't need to perform all these simulations ourselves. To prevent this blog post from getting very mathy, instead of showing all the involved math I'll simply show that the dbeta function in R does exactly what we want.
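To make this concrete, here is a minimal sketch (reusing sample_func and the libraries loaded above) that overlays the beta density implied by roughly 15 clicks out of 1000 shows on the simulated click-through rates for the 0.015 process. The 15-out-of-1000 figure is simply the expected click count for that process, and the Beta(clicks, shows - clicks) parameterisation matches the helper functions defined below.

# simulated CTRs for the 0.015 process, 5000 repetitions of 1000 shows each
sim_df  <- data.frame(ctr = 1:5000 %>% map_dbl(~ sample_func(0.015, 1000)))

# beta density implied by 15 clicks out of 1000 shows
beta_df <- data.frame(x = seq(0, 0.04, 0.0001)) %>%
  mutate(density = dbeta(x, 15, 1000 - 15))

ggplot() +
  geom_histogram(data = sim_df, aes(x = ctr, y = ..density..), binwidth = 0.001) +
  geom_line(data = beta_df, aes(x = x, y = density), colour = "red") +
  ggtitle("simulated CTRs vs. dbeta(x, 15, 985)")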

I'll first define two helper functions in R that accept a dataframe.

calc_quantiles <- function(df){
  # grid of CTR values on which we evaluate the beta densities
  qtiles <- seq(0.0, 0.005, 0.00001)

  # beta density implied by a number of shows and clicks
  gen_betas <- function(shown, click){
    dbeta(qtiles, click, shown - click)
  }

  to_plot <- data.frame(x = qtiles)               
  for(i in 1:nrow(df)){               
    name <- as.character(df[i, 'recommender'])               
    to_plot[name] <- gen_betas(df[i, 'show'], df[i, 'click'])               
  }               

  to_plot %>% gather(k, v, -x)               
}               

plot_quantiles <- function(to_plot){               
  ggplot() +                
    geom_ribbon(data=to_plot, aes(ymin = 0, x = x, ymax = v, fill = k), alpha = 0.5) +                
    ggtitle("belief over two recommenders")               
}               

http://ebu.io/media/uploads/uploads/ebu2.png

These helper functions allow us to make pretty plots that help us prove a point. For example, for our AB results we can now plot the curves directly without the need for any simulation. We can also increase the size of the A or B group, or change the CTR values, to glimpse at the overlap.

s <- 90000               
ctr_a <- 0.001               
ctr_b <- 0.0015               
size_a <- s * 0.5               
size_b <- s - size_a                
data.frame(               
  recommender = c("foo", "bar"),                
  show = c(size_a, size_b),               
  click = c(size_a*ctr_a, size_b*ctr_b)               
) %>% calc_quantiles() %>% plot_quantiles()               

Maths

The right choice of sample size depends on how much overlap between the two distributions you are willing to accept. When in doubt it seems safest to sample more! It is better to make a few decisions that are correct than to make many that might be wrong.

A sensible attitude is to determine beforehand how big the samples need to be before you draw any conclusion from the data. To determine an appropriate sample size you'll first need to decide how much overlap you are willing to have. Then, depending on the click rate you expect, you can perform a calculation. The math for this is rather involved, so feel free to use the following function instead. It calculates the probability that the results for group B are better than those for group A, given the results of an AB test.
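For reference, the quantity the function below computes is the standard closed form for the probability that one beta-distributed rate exceeds another. Writing $\alpha_A = \text{hits}_A$ and $\beta_A = \text{shows}_A - \text{hits}_A$ (and similarly for B), it is

$$P(p_B > p_A) = \sum_{i=0}^{\alpha_B - 1} \frac{B(\alpha_A + i,\ \beta_A + \beta_B)}{(\beta_B + i)\, B(1 + i,\ \beta_B)\, B(\alpha_A,\ \beta_A)}$$

where $B(\cdot,\cdot)$ is the beta function. The R implementation evaluates this sum on the log scale (via lbeta) for numerical stability.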

# closed-form probability that the rate behind B beats the rate behind A,
# assuming Beta(hits, shows - hits) beliefs over each rate
p_b_better_a <- function(hits_a, shows_a, hits_b, shows_b){
  a_A <- hits_a               
  b_A <- shows_a - hits_a               
  a_B <- hits_b               
  b_B <- shows_b - hits_b               
  total <- 0.0               
  for(i in 0:(a_B-1)){               
    total <- total + exp(lbeta(a_A+i, b_A+b_B) - log(b_B+i) - lbeta(1+i, b_B) - lbeta(a_A, b_A))               
  }               
  total               
}               

For example:

p_b_better_a(hits_a = 10, shows_a = 1000,                
             hits_b = 15, shows_b = 1000) # = 0.8477431               
p_b_better_a(hits_a = 10, shows_a = 1000,                
             hits_b = 15, shows_b = 1100) # = 0.7852795               
p_b_better_a(hits_a = 10, shows_a = 1000,                
             hits_b = 15, shows_b = 2000) # = 0.2526912               
p_b_better_a(hits_a = 100, shows_a = 10000,                
             hits_b = 150, shows_b = 10000) # = 0.9993086               

You might be tempted to think that 84.7% is quite high and decide that B is better than A. Realise that in practice this would mean that about 15% of your conclusions will be wrong. It is usually better to be conservative and pick a larger sample size instead.

Determining Sample Size

The last function works in hindsight, but how do we determine the size of an A/B group beforehand?

I'll assume that you know the following:

  • the base rate: the CTR of what you currently have, which is an estimate of how well your A group will do
  • the level of overlap that you're willing to accept
  • the division of traffic between group A and group B

For example, suppose that my current CTR is 0.01, that I want to be able to measure a 0.1% (absolute) difference with accuracy, and that half my traffic will go to A and the other half to B. Then the following script visualises how many interactions I'll need to measure.

ctr_a <- 0.01               
ctr_diff <- 0.001               
ratio_a <- 0.5               

num_samples <- seq(1, 300000, 1000)               

probs <- num_samples %>%                
  map_dbl(~ p_b_better_a(hits_a = ceiling(ctr_a * . * ratio_a)+1,                
                         shows_a = . * ratio_a,                
                         hits_b = ceiling(. * (1-ratio_a) * (ctr_a + ctr_diff))+1,                
                         shows_b = . * (1-ratio_a)))               

c95 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.95) %>% head(1) %>% .$s               
c975 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.975) %>% head(1) %>% .$s               
c99 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.99) %>% head(1) %>% .$s               

ggplot() +                
  geom_line(data=data.frame(p = probs, s = num_samples),               
             aes(s, p), colour = 'steelblue') +                
  geom_vline(aes(xintercept = c95)) +                
  geom_vline(aes(xintercept = c975)) +                
  geom_vline(aes(xintercept = c99)) +                
  ggtitle("playground for sample size")               

http://ebu.io/media/uploads/uploads/ebu3.png

Conclusion

In this document we've described a mathematical view of how big your samples need to be. Be mindful that this should be considered a mere lower bound.

Preferably you would run an AB test for two weeks, independent of the number of users that it hits, simply because you want to make sure that you have some estimate of what the effect of the weekend is. We've seen recommenders that perform very well on weekends and very poorly during weekdays.

Another thing that makes it hard to do AB testing in the real world is that the world usually changes. Your recommender might perform very well in the holiday season but very poorly during spring. You should always be testing against these time effects.


Behaviour Driven Development (BDD) and testing in Python


by Gil BRECHBUEHLER, EBU on 07 Jul 2016


Introduction

First we need to introduce the concepts of unit testing and functional testing, along with their differences:

  • Unit testing means testing each component of a feature in isolation.
  • Functional testing means testing a full feature, with all its components and their interactions.

The tests we present here are functional tests.

Behaviour driven development is the process of defining scenarios for your features and testing that the features behave correctly under each scenario.
Actually, this is only one part of behaviour driven development; it is also a methodology. Normally, for each feature you have to:

  • write a test scenario for the feature
  • implement the feature
  • run the tests and update the code until the tests on the feature pass

As you can see, the methodology of behaviour driven development is close to that of test driven development.

The goal of this blog post, however, is not to explain behaviour driven development, but to show how it can be implemented in Python. If you want to learn more about behaviour driven development you can read here, for example.

Before starting: Python itself does not give us any BDD tools, so to be able to use BDD in Python we use the following packages:

  • pytest, the testing framework used to run the tests;
  • pytest-bdd, a pytest plugin that adds behaviour driven testing capabilities.

Finally, here is the example Python function we will test with BDD:

def foo(a, b):    
    if a > 10 or b > 10:    
        raise ValueError    
    if (a * b) % 2 == 0:    
        return "foo"    
    else:    
        return "bar"    

It is a simple example, but I think it is enough to explain how to do behaviour driven testing in Python. If you want to follow the BDD methodology strictly, you have to write your functional tests before implementing the feature; however, for an example it is easier to first introduce the functionality we want to test.

Note: the fact that the foo function does not accept numbers strictly bigger than ten is just for example purposes.

Gherkin language

BDD has one great feature: it allows us to define tests by writing scenarios using the Gherkin language.

Here are the scenarios we will use for our example:

Feature: Foo function    

    A function for foo-bar    

    Scenario: Should work    
        Given <a> and <b>    
        Then foo answers <c>    

        Examples:    
        | a | b | c   |    
        | 2 | 3 | foo |    
        | 5 | 3 | bar |    

    Scenario: Should raise an error    
        Given <a> and <b>    
        Then foo raises an error    

        Examples:    
        | a   | b  |    
        | 2   | 15 |    
        | 21  |  2 |    
        | 45  | 11 |    

A feature in Gherkin represents a feature of our project. Each feature has a set of scenarios we want to test. In scenarios we can define variables, such as <a> and <b> in this case, and examples that define values for these variables. Each scenario will be run once for each line of its Examples table.
Given lines allow us to define context for our scenarios. Then lines allow us to define the behaviour that our function should have in the defined context.

Features and scenarios are defined in .feature files.

Tests definition

Scenarios are great for describing functionalities in a widely understandable way, but scenarios themselves are not enough to have working tests. Along with our feature file we need a test file in which we define functions that correspond to each line of the scenarios. We will first show the full Python file and then explain it in detail:

from moduleA import foo
from pytest_bdd import scenarios, given, then
import pytest

scenarios('foo.feature', example_converters=dict(a=int, b=int, c=str))    

@given('<a> and <b>')    
def numbers(a, b):    
    return [a, b]    

@then('foo answers <c>')    
def compare_answers(numbers, c):    
    assert foo(numbers[0], numbers[1]) == c    

@then('foo raises an error')    
def raise_error(numbers):    
    with pytest.raises(ValueError):    
        foo(numbers[0], numbers[1])    

In our case this file is named test_foo.py. Note that for pytest to be able to automatically find your test files, they have to be named with the pattern test_*.py.

scenarios('foo.feature', example_converters=dict(a=int, b=int, c=str))    

This line tells pytest that the functions defined in this file have to be mapped to the scenarios in the foo.feature file. The example_converters parameter tells pytest to which type each variable from the Examples tables should be converted. This argument is optional; if omitted, pytest will give us each variable as a string of characters (str).

Then:

@given('<a> and <b>')    
def numbers(a, b):    
    return [a, b]    

@then('foo answers <c>')    
def compare_answers(numbers, c):
    assert foo(numbers[0], numbers[1]) == c    

@then('foo raises an error')    
def raise_error(numbers):
    with pytest.raises(ValueError):    
        foo(numbers[0], numbers[1])    

In these three functions we define what has to be done for each line of the scenarios; the mapping is done with the decorators placed before each function. We get the values of the a, b and c variables by giving the functions arguments with the same names.

Pytest-bdd also makes use of fixtures, a feature of pytest: giving the numbers function as an argument to the compare_answers and raise_error functions allows us to directly access whatever the numbers function returned. Here it is a list containing the two integers to pass to the foo function. For more details on how fixtures work in pytest, see the pytest documentation.

Running the tests

To run the tests we simply call the py.test command:

$ py.test -v    
============================== test session starts ==============================    
platform linux2 -- Python 2.7.11, pytest-2.9.2, py-1.4.31, pluggy-0.3.1 -- /home/gil/.pyenv/versions/2.7.11/envs/evaluate-tests/bin/python2.7    
cachedir: .cache    
rootdir: /home/gil/work/blog, inifile:    
plugins: cov-2.2.1, bdd-2.16.1    
collected 5 items    

test_foo.py::test_normal_behavior[2-3-foo] PASSED    
test_foo.py::test_normal_behavior[5-3-bar] PASSED    
test_foo.py::test_should_raise_an_error[2-15] PASSED    
test_foo.py::test_should_raise_an_error[21-2] PASSED    
test_foo.py::test_should_raise_an_error[45-11] PASSED    

============================== 5 passed in 0.02 seconds ==============================    

If we launch pytest without giving it any file, it searches for files named with the pattern test_*.py in the current folder and recursively in any subfolder.
We see that five tests have actually run, one for each line of the Examples sections of the scenarios.

Conclusion

Behaviour Driven Development is a great tool, especially because it allows us to define functionalities and their behaviour in a really easy and widely understandable way. Moreover, writing BDD tests in Python is easy with pytest-bdd.

Note that pytest-bdd is not the only package that brings BDD to Python: there is also planterbox for Nose2, another testing framework for Python, and Behave, another framework for behaviour driven development in Python.



Version support


by Frans De Jong, EBU on 15 Jan 2015

Profiles now support versions

We have added new functionality to the Profiles you can create in EBU.IO/QC.

From now on each Test has a version indicated next to its ID.

If a newer version is available, a little warning sign appears next to it.

This way users can manage their profiles as they like (there is no forced update to the latest Test version), but at the same time they are encouraged to check out newer versions of the Tests they are using.

Version selection

Users can decide the version to use in the Profile Manager, using a simple drop-down list.

Visibility

The version information is visible to users regardless of their login status, but as there is currently only one publicly published version of each Test, the general audience will not really make use of it yet.

However, for editors (who are working with many different draft versions of the Tests), the new functionality is already practically relevant.

A large batch of updates to all Tests is expected in Q2, when the EBU QC Output subgroup has completed its work.



Quality Wishes for 2015


by Frans De Jong, EBU on 03 Jan 2015


Adding ranges with Glühwein


by Frans De Jong, EBU on 22 Dec 2014

Developing under the Xmas tree...

With a hot chocolate and a glühwein* next to our laptops, we've been working on improvements to the EBU.IO/QC back-end.

* It is ~+12 degrees Celsius in Geneva...

Linked parameter types

We have now added linked definitions of parameter types:

  • Type (e.g. integer);
  • Representation (e.g. a single digit);
  • Range (e.g. [0-5));
  • Units (e.g. m/s)

Goal

The idea is that we want to help EBU QC Test editors be stricter in the way they define parameter inputs and outputs.

By facilitating managed lists of types, representations, ranges and units, we encourage reuse and minimize mistakes.

Regexes

We've decided to use regexes to help check correct instantiation of ranges.

For example [0,4) is a valid instantiation of [a,b).
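To make this a bit more concrete, below is a minimal sketch of what such a check could look like. The pattern is purely illustrative (it is not the regex used in EBU.IO/QC), and it is shown in R simply to match the code used elsewhere on this blog: it accepts half-open ranges of the form [number,number).

# hypothetical pattern for instantiations of [a,b): an opening "[",
# two (possibly negative, possibly decimal) numbers separated by a comma,
# and a closing ")"
half_open_range <- "^\\[-?[0-9]+(\\.[0-9]+)?,-?[0-9]+(\\.[0-9]+)?\\)$"

grepl(half_open_range, "[0,4)")   # TRUE  - valid instantiation of [a,b)
grepl(half_open_range, "[0,4]")   # FALSE - wrong closing bracket
grepl(half_open_range, "[0-4)")   # FALSE - dash instead of comma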

We also use Regexes to check the names of the managed 'types'.

But we did not dare to go so far as to ask editors to specify their ranges as regexes directly... :-o

In a next step we plan to check default (and later user) values against the instantiated ranges.

Merry Christmas!

Frans and Julien

