Roll-out of the Amazon Alexa Eurovision Song Contest Skill!

by Michael Barroco, EBU on 12 May 2017

The European Broadcasting Union (EBU) has launched a new skill for Amazon's Alexa Voice Service, which allows users to discover and listen to every Eurovision Song Contest winner on devices including Amazon Echo and Echo Dot. Amazon Echo and Echo Dot are voice-controlled speakers powered by Alexa.

The Eurovision Song Contest skill was jointly developed by the EBU's Technology & Innovation and Media departments and allows users in the UK, Germany, Austria and the US to easily discover who has won every Eurovision Song Contest since the event began in 1956.


Users can simply ask in English or German: "Alexa, ask Eurovision who won in" a particular year. Alexa will then ask users if they want to hear the winner and can play the winning entry. Users can also "Ask Eurovision, when did France (or other nations) last win", or "When is the Grand Final" as well as "Who has won the most" and "Which countries have never won", amongst other combinations.

Amazon Echo owners in the UK will also be able to listen to a live stream of the Eurovision Song Contest Grand Final through the skill via EBU Member BBC.

Echo owners in the UK, Germany, Austria and the US with an Amazon account can enable the skill at these links: UK, DE/AT, US.

Original article

alexa eurovision song contest

The role of diversity in recommender systems for public broadcasters

by Veronika EICKHOFF, BR on 27 Apr 2017

Based on:

  1. "Understanding the role of latent feature diversification on choice difficulty and satisfaction" by Martijn C. Willemsen, Mark P. Graus and Bart P. Knijnenburg.
  2. "Recommender Systems for Self-Actualization" by Bart P. Knijnenburg, Saadhika Sivakumar and Daricia Wilkinson.
  3. “Matrix Factorization Techniques for Recommender Systems” by Yehuda Koren, Robert Bell and Chris Volinsky.
  4. “Recommender Systems Handbook”, Section 8.3.8, by Guy Shani and Asela Gunawardana

Why diversify recommendations

People like to choose from large sets, but large sets often contain similar items causing choice overload. Increasing diversity can lead to more attractive and satisfactory results even with smaller sets, thereby reducing choice difficulty.

To be precise, the goal is to diversify a set of recommended items while controlling for the overall quality of the set.

Tests of the diversification algorithm against traditional Top-N recommenders conducted in [1] show that diverse, small item sets are just as satisfying as Top-N recommendations and less effortful to choose from. While diversification might reduce the average predicted quality of recommendation lists, the increased diversity might still result in higher satisfaction because of the reduced difficulty. Additionally, relying on the highest predicted relevance can mean ignoring other factors that influence user satisfaction. Experiments with user surveys show that despite the lower precision and recall of the diversified recommendations, diversification has a positive effect on users' perception of the quality of item sets produced by the recommender algorithm.

Why diversity is especially important in recommendations for public broadcasters

As public broadcasters, our mission is to educate our audiences, to extend their potential areas of interest and to inform in a balanced way. Recommending content which suits a user's taste the most might lead to constraining them into a filter bubble, therefore failing on all three goals (an extreme example would be recommending only content confirming a person's existing political leaning).

Put positively, diversifying recommendations is a good first step towards living up to our values, an effect we aim to verify with user surveys.

In addition, while commercial organisations depend on user retention and might therefore be more careful about offering content that does not seem the most suitable, we can afford to do so, and are even expected to.

While recommendation techniques are being employed in more and more online services, most research is concentrated on developing top-N style algorithms with very little emphasis being put to goals beyond accuracy. The authors of [2] propose the development of Recommender Systems for Self-Actualisation: “personalised systems that have the explicit goal to not just present users with the best possible items, but to support users in developing, exploring, and understanding their own unique tastes and preferences”. In particular, the authors give a sociological motivation for our goals as public broadcasters: “a deep understanding of one’s own tastes is important for cultural diversity—we want people to make lifestyle choices (e.g., music, movies and fashion) based on carefully developed personal tastes, rather than blindly followed recommendations”.

Diversification of matrix-factorisation-based collaborative filtering

The idea is to diversify item latent features*. Because latent feature diversification provides maximum control over item quality and item set variety on an individual level, it can increase the diversity (and thus reduce the choice difficulty) of an item set while maintaining perceived attractiveness and satisfaction.

*Latent features are the per-item output of matrix-factorisation-based collaborative filtering. They make it possible to relate items to each other along several abstract (latent) dimensions, which are supposed to approximate true characteristics of the item. See Figure 2 in [3] for a simplified illustration of latent factors approach.

Like classical collaborative filtering, the diversification algorithm gives higher scores to content liked by users similar to the subject user (in the case of implicit feedback, "watched the longest" implies "liked"). The difference is that the diversification algorithm selects a subset of the most different or distant items (see below) among the highest-scored ones, so the resulting set is diverse and still feels relevant.

The diversification algorithm can reuse the same model that matrix-factorisation-based collaborative filtering learned. Upon a recommendation request, the latent features of items and the user are retrieved from the model and a predicted rating of each item for the user is calculated. Then, in contrast to the classical collaborative algorithm which returns random/top N recommendations, the diversification algorithm finds the most diverse N items to be recommended. This is done by first selecting the top item from the initial recommendation set, then iteratively adding items into the selection based on their distance to the rest of the current selection.
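The greedy selection loop just described can be sketched in Python. This is a toy illustration with our own names, not PEACH code: each candidate is a (predicted rating, latent feature vector) pair, sorted by rating, and distance is the Manhattan distance over latent features.

```python
def manhattan(a, b):
    # Manhattan distance between two latent feature vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def diversify(candidates, n):
    """Greedily pick n items from `candidates`, a list of
    (predicted_rating, latent_features) tuples sorted by rating,
    highest first. Start from the top-rated item, then repeatedly
    add the candidate farthest from the current selection."""
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while len(selected) < n and remaining:
        # a candidate's distance to the selection is the sum of
        # its distances to every already-selected item
        best = max(remaining,
                   key=lambda c: sum(manhattan(c[1], s[1]) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

For instance, given three candidates where the second is nearly identical to the top item, the sketch picks the top item and then the distant third one rather than the near-duplicate.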

The mathematical notion of diversity

The distance between items is defined using the Manhattan distance:
d(a, b) = Sum_k |a_k - b_k|, with a, b as items and k iterating over the latent features.
As argued in [1], this distance metric is the most suitable as it ensures that differences along different latent features are considered in an additive way, and large distances along one feature cannot be compensated by shorter distances along other features. This means that two items differing one unit along two dimensions are considered as different as two items differing two steps along one dimension. According to [1], this is more in line with how people perceive differences between choice alternatives with real attribute dimensions.

The diversity of a recommendation set X is defined as the average difference per feature i_k between the highest and lowest scoring items along that feature where i ∈ X, i_k is the score of item i on feature k, and D is the number of latent factors (dimensions):
Diversity(X) = Sum(k=1,D)((max(i_k) - min(i_k)) / D), i ∈ X.

This notion of diversity is called AFSR (Average Feature Score Range).
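As a quick illustration, AFSR is straightforward to compute. The sketch below uses our own names (not PEACH code) and treats each item as its vector of latent feature scores:

```python
def afsr(items):
    """Average Feature Score Range: for each latent feature,
    take the range (max - min) across the item set, then
    average these ranges over the D features."""
    D = len(items[0])
    ranges = [max(i[k] for i in items) - min(i[k] for i in items)
              for k in range(D)]
    return sum(ranges) / D
```

For two items with feature vectors [0, 0] and [1, 2], the feature ranges are 1 and 2, so AFSR is 1.5; a set of identical items has an AFSR of 0.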

Experimental evaluation

The authors of [1] conducted several studies to understand how latent feature diversification affects user perception of recommendations, how diversity relates to choice difficulty and choice satisfaction, and to measure perceived diversity and the effort needed to choose items.

As mentioned in the first section, the authors got promising results, which motivates us to repeat the evaluation in our context.

Diversification algorithm as part of PEACH

The PEACH (Personalisation for EACH) project is a software solution empowering broadcasters and editorial teams to deliver personalised media services and experiences. The solution includes not only architecture required to collect, store and query data, but also a set of custom data science tools and recommendation algorithms. The diversification algorithm is implemented and offered as part of the platform.

To learn more about PEACH, stay tuned: it will be the subject of an upcoming article.

algorithms data science diversity media content recommendation personalisation public service media recommendations recommender systems

Sample Sizes and Beta Distributions

by Vincent WARMERDAM, NPO on 04 Aug 2016

Vincent D. Warmerdam - GoDataDriven - 2016-07-26

This document contains an overview on how you can use beta distributions for certain inference tasks concerning AB-tests. The goal is to provide an overview of the problem from a distribution perspective as well as to provide some basic R code for the reader to play with.

The problem

People often wonder for how long they need to run their AB test before they know which version is better. This document contains sensible guidelines as well as some R code to help make this decision easier. We will focus the blogpost on the use of the beta distribution to help us deal with uncertainty.

Example Data

Let's assume that we've got some results from an AB test from two recommenders that were live on our website. The results of these two recommenders are listed below.


library(dplyr)

df <- data.frame(
  recommender = c("foo", "bar"),
  show = c(1100, 1200),
  click = c(40, 54)
) %>% mutate(ctr = click/show)

df %>% print

This script outputs the following:

recommender show click ctr               
1 foo 1100 40 0.03636364               
2 bar 1200 54 0.04500000               

One engine, named foo, has a slightly lower click-through rate (CTR), but is this enough to argue that bar is better? The difference between the two might just be due to chance, and we would like to take this into account before we decide that one alternative is better than the other.

Describing Uncertainty via Simulations

In order to get an intuitive feeling for the problem at hand, we might first go ahead and simulate data. I'll pretend that we have one process that causes a click with probability 0.015 and another that causes a click with probability 0.02. I will then simulate each process 5000 times to get an impression of what values we can expect. The results of this simulation are shown in the histogram below.

library(purrr)
library(tidyr)
library(ggplot2)

sample_func <- function(prob_hit = 0.01, n_draws = 1000){
  sample(c(0,1), n_draws, replace=TRUE, prob = c(1 - prob_hit, prob_hit)) %>%
    sum() %>%
    (function(x) x/n_draws)
}

to_plot <- data.frame(
  ctr1 = 1:5000 %>% map_dbl(~ sample_func(0.015, 1000)),
  ctr2 = 1:5000 %>% map_dbl(~ sample_func(0.02, 1000))
) %>% gather(k, v)

ggplot() +
  geom_histogram(data=to_plot, aes(v), binwidth = 0.001) +
  facet_grid(k ~ .) +
  ggtitle("Notice the overlap")

Describing Uncertainty via Beta Distributions

We notice that there is less overlap when we allow for more samples. We can show this via simulation, but simulations can be rather slow. It turns out that there is a very nice distribution, the Beta distribution, that describes exactly the curve we're interested in, so we don't need to perform all these simulations ourselves. To keep this blog post from getting very mathy, instead of showing all the maths involved I'll show that the dbeta function in R does exactly what we want.

I'll first define two helper functions in R that accept a dataframe.

calc_quantiles <- function(df){
  qtiles <- seq(0.0, 0.005, 0.00001)

  gen_betas <- function(shown, click){
    dbeta(qtiles, click, shown - click)
  }

  to_plot <- data.frame(x = qtiles)
  for(i in 1:nrow(df)){
    name <- as.character(df[i, 'recommender'])
    to_plot[name] <- gen_betas(df[i, 'show'], df[i, 'click'])
  }

  to_plot %>% gather(k, v, -x)
}

plot_quantiles <- function(to_plot){
  ggplot() +
    geom_ribbon(data=to_plot, aes(ymin = 0, x = x, ymax = v, fill = k), alpha = 0.5) +
    ggtitle("belief over two recommenders")
}
These helper functions allow us to make pretty plots that help us prove a point. For example, for our AB results we can now plot the curves directly, without the need for any simulation. We can increase the size of one of the A or B groups, or the CTR values, to glimpse at the overlap.

s <- 90000
ctr_a <- 0.001
ctr_b <- 0.0015
size_a <- s * 0.5
size_b <- s - size_a

data.frame(
  recommender = c("foo", "bar"),
  show = c(size_a, size_b),
  click = c(size_a*ctr_a, size_b*ctr_b)
) %>% calc_quantiles() %>% plot_quantiles()


The right choice of sample size depends on your tolerance for overlap between the two distributions. When in doubt, it seems safest to sample more! It is better to make a few decisions that are correct than many that might be wrong.

A normal attitude is to predetermine how big the samples should be before you draw any conclusion from the data. To determine an appropriate sample size you'll first need to decide how much overlap you are willing to have. Then, depending on the click rate you expect, you can perform a calculation. The math for this is rather involved, so feel free to use the following function instead. It calculates the probability that the results for group B are better than for group A given the results of an AB test.

p_b_better_a <- function(hits_a, shows_a, hits_b, shows_b){
  a_A <- hits_a
  b_A <- shows_a - hits_a
  a_B <- hits_b
  b_B <- shows_b - hits_b
  total <- 0.0
  for(i in 0:(a_B-1)){
    total <- total + exp(lbeta(a_A+i, b_A+b_B) - log(b_B+i) - lbeta(1+i, b_B) - lbeta(a_A, b_A))
  }
  total
}

For example:

p_b_better_a(hits_a = 10, shows_a = 1000,                 
             hits_b = 15, shows_b = 1000) # = 0.8477431                
p_b_better_a(hits_a = 10, shows_a = 1000,                 
             hits_b = 15, shows_b = 1100) # = 0.7852795                
p_b_better_a(hits_a = 10, shows_a = 1000,                 
             hits_b = 15, shows_b = 2000) # = 0.2526912                
p_b_better_a(hits_a = 100, shows_a = 10000,                 
             hits_b = 150, shows_b = 10000) # = 0.9993086                

You might be tempted to think that 84.7% is quite high and that you might decide that B is better than A. Realize that in practice this might mean that 15% of your conclusions will be wrong. It usually is better to be conservative and pick a larger sample size instead.
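For readers who prefer Python, the same closed-form probability can be computed with only the standard library. This is our own port of the R function above, building lbeta from lgamma:

```python
from math import lgamma, log, exp

def lbeta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def p_b_better_a(hits_a, shows_a, hits_b, shows_b):
    """Probability that the true rate behind B exceeds the one
    behind A, with the same Beta(hits, shows - hits) beliefs
    as in the R version above."""
    a_A, b_A = hits_a, shows_a - hits_a
    a_B, b_B = hits_b, shows_b - hits_b
    total = 0.0
    for i in range(a_B):
        total += exp(lbeta(a_A + i, b_A + b_B) - log(b_B + i)
                     - lbeta(1 + i, b_B) - lbeta(a_A, b_A))
    return total
```

Called with the first example above, p_b_better_a(10, 1000, 15, 1000) reproduces the roughly 0.848 probability reported by the R version.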

Determining Sample Size

The last function works in hindsight, but how do we determine the size of an A/B group beforehand?

I'll assume that you know the following:

  • the base rate, the CTR of what you currently have which is an estimate of how well your A group will be doing
  • the level of overlap that you're willing to accept
  • the division of traffic between group A and group B

For example, suppose that my current CTR is 0.01, that I want to be able to measure a 0.1% difference with accuracy, and that half my traffic will go to A and the other half to B. Then the following script visualises how many interactions I'll need to measure.

ctr_a <- 0.01                
ctr_diff <- 0.001                
ratio_a <- 0.5                

num_samples <- seq(1, 300000, 1000)                

probs <- num_samples %>%                 
  map_dbl(~ p_b_better_a(hits_a = ceiling(ctr_a * . * ratio_a)+1,                 
                         shows_a = . * ratio_a,                 
                         hits_b = ceiling(. * (1-ratio_a) * (ctr_a + ctr_diff))+1,                 
                         shows_b = . * (1-ratio_a)))                

c95 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.95) %>% head(1) %>% .$s                
c975 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.975) %>% head(1) %>% .$s                
c99 <- data.frame(p = probs, s = num_samples) %>% filter(p >= 0.99) %>% head(1) %>% .$s                

ggplot() +                 
  geom_line(data=data.frame(p = probs, s = num_samples),                
             aes(s, p), colour = 'steelblue') +                 
  geom_vline(aes(xintercept = c95)) +                 
  geom_vline(aes(xintercept = c975)) +                 
  geom_vline(aes(xintercept = c99)) +                 
  ggtitle("playground for sample size")        


In this document we've described a mathematical view into how big your samples need to be. Be mindful that this should be considered as a mere lower bound.

Preferably you would run an AB-test for two weeks, independent of the number of users it hits, simply because you want some estimate of the weekend effect. We've seen recommenders that perform very well on weekends and very poorly during weekdays.

Another thing that makes it hard to do AB testing in the real world is that the world usually changes. Your recommender might perform very well in the holiday season but very poorly during spring. You should always be testing against these time effects.

Behaviour Driven Development (BDD) and testing in Python

by Gil BRECHBUEHLER, EBU on 07 Jul 2016


First we need to introduce the concepts of unit testing and functional testing, along with their differences:

  • Unit testing means testing each component of a feature in isolation.
  • Functional testing means testing a full feature, with all its components and their interactions.

The tests we present here are functional tests.

Behaviour driven development is the process of defining scenarios for your features and testing that the features behave correctly under each scenario.
Actually this is only one part of behaviour driven development; it is also a methodology. Normally, for each feature you have to:

  • write a test scenario for the feature
  • implement the feature
  • run the tests and update the code until the tests on the feature pass

As you can see the methodology of behaviour driven development is close to the methodology of test driven development.

The goal of this blog however is not to explain behaviour driven development, but to show how it can be implemented in Python. If you want to learn more about behaviour driven development you can read here for example.

Before starting: Python itself does not give us any BDD tools, so to be able to use BDD in Python we use the following packages:

  • pytest, the test runner
  • pytest-bdd, a pytest plugin that adds BDD support

Finally, here is the example Python function we will test with BDD:

def foo(a, b):
    if a > 10 or b > 10:
        raise ValueError
    if (a * b) % 2 == 0:
        return "foo"
    return "bar"

It is a simple example, but I think it is enough to explain how to do behaviour driven testing in Python. If you want to follow the BDD methodology strictly, you have to write your functional tests before implementing the feature; however, for an example it is easier to first introduce the functionality we want to test.

Note : the fact that the function "foo" does not accept numbers strictly bigger than ten is just for example purposes.

Gherkin language

BDD has one great feature: it allows us to define tests by writing scenarios using the Gherkin language.

Here are the scenarios we will use for our example:

Feature: Foo function

    A function for foo-bar

    Scenario Outline: Should work
        Given <a> and <b>
        Then foo answers <c>

        Examples:
        | a | b | c   |
        | 2 | 3 | foo |
        | 5 | 3 | bar |

    Scenario Outline: Should raise an error
        Given <a> and <b>
        Then foo raises an error

        Examples:
        | a   | b  |
        | 2   | 15 |
        | 21  |  2 |
        | 45  | 11 |

A feature in Gherkin represents a feature of our project. Each feature has a set of scenarios we want to test. In scenarios we can define variables, such as <a> and <b> in this case, and examples that define values for these variables. Each scenario will be run once for each line of its Examples table.
Given lines allow us to define context for our scenarios. Then lines allow us to define the behaviour that our function should have in the defined context.

Features and scenarios are defined in .feature files.

Tests definition

Scenarios are great for describing functionalities in a widely understandable way, but scenarios themselves are not enough to have working tests. Along with our feature file we need a test file in which we define functions that correspond to each line of the scenarios. We will first show the full Python file and then explain it in detail:

from moduleA import foo
from pytest_bdd import scenarios, given, then
import pytest

scenarios('foo.feature', example_converters=dict(a=int, b=int, c=str))    

@given('<a> and <b>')    
def numbers(a, b):    
    return [a, b]    

@then('foo answers <c>')    
def compare_answers(numbers, c):    
    assert foo(numbers[0], numbers[1]) == c    

@then('foo raises an error')    
def raise_error(numbers):    
    with pytest.raises(ValueError):    
        foo(numbers[0], numbers[1])    

Note that for pytest to be able to automatically find your test files, they have to be named with the pattern test_*.py; our test file is named accordingly.

scenarios('foo.feature', example_converters=dict(a=int, b=int, c=str))    

This line tells pytest that the functions defined in this file have to be mapped to the scenarios in the foo.feature file. The example_converters parameter tells pytest which type each variable from the Examples should be converted to. This argument is optional; if omitted, pytest will give us each variable as a string of characters (str).

Then :

@given('<a> and <b>')    
def numbers(a, b):    
    return [a, b]    

@then('foo answers <c>')    
def compare_answers(numbers, c):    
    assert foo(numbers[0], numbers[1]) == c    

@then('foo raises an error')    
def raise_error(numbers):    
    with pytest.raises(ValueError):    
        foo(numbers[0], numbers[1])    

In these three functions we define what has to be done for each line of the scenarios, the mapping is done with the tags used before each function. We get the values of the a, b and c variables by giving arguments with the same name to the functions.

Pytest-bdd also makes use of fixtures, a feature of pytest: giving the numbers function as an argument to the compare_answers and raise_error functions allows us to directly access anything the numbers function returned. Here it is an array containing the two integers to pass to the foo function. For more details on how fixtures work in pytest see pytest documentation.

Running the tests

To run the tests we simply call the py.test command :

$ py.test -v    
============================== test session starts ==============================    
platform linux2 -- Python 2.7.11, pytest-2.9.2, py-1.4.31, pluggy-0.3.1 -- /home/gil/.pyenv/versions/2.7.11/envs/evaluate-tests/bin/python2.7    
cachedir: .cache    
rootdir: /home/gil/work/blog, inifile:    
plugins: cov-2.2.1, bdd-2.16.1    
collected 5 items    

[2-3-foo] PASSED    
[5-3-bar] PASSED    
[2-15] PASSED    
[21-2] PASSED    
[45-11] PASSED    

============================== 5 passed in 0.02 seconds ==============================    

If we launch pytest without giving any file to it, it searches for file names matching the pattern test_*.py in the current folder and recursively in any subfolder.
We see that five tests have actually run, one for each line of the Examples section of the scenarios.


Behaviour Driven Development is a great tool especially because it allows us to define functionalities and their behaviour in a really easy and largely understandable way. Moreover writing BDD tests in Python is easy with pytest-bdd.

Note that pytest-bdd is not the only Python package that brings BDD to Python, there is also planterbox for Nose2, another testing framework for Python. Behave is another framework for behaviour driven development in Python.

BDD python Test Testing

Version support

by Frans De Jong, EBU on 15 Jan 2015

Profiles now support versions

We have added new functionality to the Profiles you can create in EBU.IO/QC.

From now on each Test has a version indicated next to its ID.

If a newer version is available, a little warning sign appears next to it.

This way users can manage their profiles as they like (there is no forced update to the latest Test version), but at the same time they are encouraged to check out newer versions of the Tests they are using.

Version selection

Users can decide the version to use in the Profile Manager, using a simple drop-down list.


The version information is visible to users regardless of their login status, but as there is currently only a single version of each Test published publicly, the general audience will not really make use of it yet.

However, for editors (who are working with many different draft versions of the Tests), the new functionality is already practically relevant.

A large batch of updates to all Tests is expected in Q2, when the EBU QC Output subgroup has completed its work.

QC quality Quality Control version versions