After a long break we get back to the topic of Small Data problems (read the original article HERE). This time we evaluate a modelling framework for metrics like Click-through Rate and Conversion Rate. In most business scenarios these are estimated as simple ratios: Clicks to Impressions and Orders to Visitors. Most reporting suites use this methodology, and most analysts are conditioned to ignore the occasional 0% CTR. Single-value estimates do not work well when the amount of data is very small. What should we do if an ad has only 4 impressions and no clicks? Is the Click-through Rate actually zero? What if we had 4 impressions and 3 clicks? Is CTR close to 75%? Not really! When we have thousands of visitors, an accurate estimate can be achieved with a simple fraction. Once we enter the Small Data World, a different apparatus is needed to deal with uncertainty. The solution comes again from the world of Bayesian statistics.

# Modeling clicks with the Binomial Distribution

Each display of an ad is an experiment. A mathematician will see it as a coin toss: a customer sees the ad and tosses a biased coin. If it lands heads, she clicks on the ad. This is the classic description of a Bernoulli Trial, a random experiment with two outcomes: success (click) and failure (ignore the ad). The probability of success is the Click-through Rate. To simulate a group of customers with a virtual coin toss we can use the rbinom(n, size, prob) function in R. Here:

- n – number of observations
- size – number of trials in each observation (1 in our case, since each customer makes a single choice between Click and Ignore)
- prob – probability of success (CTR)

To simulate a record of 100 impressions of an ad with 10% CTR we use the following code (“1” indicates a click).

```r
> rbinom(100, 1, 0.1)
 [1] 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[38] 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[75] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
```

A series of Bernoulli Trials is called a Binomial Experiment. We have \(n\) trials (Customers) with \(p\) as the probability of success (CTR). A random variable corresponding to the number of successes in a Binomial Experiment (also called a Bernoulli Process) is said to have a Binomial Distribution \(B(n,p)\). The probability of observing \(k\) successes (Clicks) in \(n\) trials (Impressions) with average CTR equal to \(p\) is:

$$B(k,n,p) = {n \choose k} p^k (1-p)^{n-k}$$

For example, if we have 100 visitors and the Click-through Rate on our ad is 10%, the probability distribution of the number of clicks is presented on the following chart:

```r
> plot(0:100, dbinom(0:100, 100, 0.1), type="l", xlab="Clicks", ylab="Probability")
```

Even with a true CTR of 10% there is still quite a high chance of seeing 8, 9 or 12 clicks. The smaller the number of observations, the higher the noise. With 4 observations we have an approx. 65% chance of seeing no clicks, about a 30% chance of seeing one click and almost a 5% chance of seeing two clicks.
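The figures for 4 observations can be checked directly with dbinom. This is a minimal sketch; the variable names are illustrative and the 10% CTR is the value used in the example above.

```r
# Exact Binomial probabilities for 0..4 clicks in 4 impressions at a 10% CTR.
n_impressions <- 4
ctr <- 0.1

clicks <- 0:n_impressions
probs <- dbinom(clicks, size = n_impressions, prob = ctr)
names(probs) <- clicks  # label each probability with its click count

print(round(probs, 4))
# 0 clicks: 0.6561, 1 click: 0.2916, 2 clicks: 0.0486,
# 3 clicks: 0.0036, 4 clicks: 0.0001
```

With so few impressions, the naive Clicks/Impressions estimate would report 0% CTR roughly two times out of three, even though the true rate is 10%.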

# Beta distribution for Click-through Rate

So how do we estimate CTR with only a handful of observations? Instead of using a single value for CTR, we can use a distribution to account for the uncertainty. A good trick when building probabilistic models is to use a generative approach. How would you design an algorithm to simulate artificial data from the model? To simulate clicks on an ad we would first select the value of Click-through Rate from some distribution and then use it as the probability of success in the Binomial Distribution to simulate clicks like we did above. This means we need two random variables. First, a continuous distribution for Click-through Rate with values in the [0,1] range. Second, a Binomial Distribution that uses the CTR as its parameter.

Which distribution would be good for the Click-through Rate? Your first choice might be the Normal Distribution, but note that it can take any real value, even a negative one. A negative CTR makes no sense. What is more, it is symmetric and there is no reason to limit the model to only symmetric distributions. The answer is the Beta Distribution: it is continuous, takes values on the [0,1] interval and comes in a variety of shapes. The two parameters that define it are \(a\) and \(b\).

There are multiple ways of estimating \(a\) and \(b\) from data, but one is especially useful. It is called the “Mean and Sample Size” parametrization. Assume we are to estimate the distribution for Click-through Rate from a “huge” sample of 10 impressions and 2 clicks. Let \(\nu\) be the sample size, in our case \(\nu = 10\), and let \(\mu\) be the mean CTR. We have 2 clicks over 10 impressions, which gives \(\mu = \frac{2}{10} = 0.2\). The parametrization of the Beta distribution is then:

$$a = \mu \nu, \enspace b = (1 - \mu) \nu$$

In our simple case:

$$a = 0.2 \cdot 10 = 2, \enspace b = (1 - 0.2) \cdot 10 = 0.8 \cdot 10 = 8$$

We can plot this distribution in R:

```r
> plot(seq(0,1,0.01), dbeta(seq(0,1,0.01), 2, 8), type="l", ylab="Density")
```

The uncertainty from having just 10 observations is accounted for in the spread of the distribution. As an exercise, try to compute the parametrization if we had 1000 observations and 200 clicks. You will notice it becomes highly concentrated around the 20% Click-through Rate.
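One possible way to work through this exercise is sketched below. The helper names beta_from_counts and interval_width are made up for this illustration, and the width of the central 95% interval is just one convenient measure of spread.

```r
# Hypothetical helper: "Mean and Sample Size" parametrization of a Beta distribution.
beta_from_counts <- function(clicks, impressions) {
  mu <- clicks / impressions   # mean CTR
  nu <- impressions            # sample size
  c(a = mu * nu, b = (1 - mu) * nu)
}

# Width of the central 95% interval of Beta(a, b).
interval_width <- function(params) {
  bounds <- qbeta(c(0.025, 0.975), params[["a"]], params[["b"]])
  bounds[2] - bounds[1]
}

small <- beta_from_counts(2, 10)       # a = 2,   b = 8
large <- beta_from_counts(200, 1000)   # a = 200, b = 800

print(interval_width(small))
print(interval_width(large))
```

Both distributions have the same mean of 0.2, but the interval for the larger sample should come out far narrower.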

A combination of Beta and Binomial distributions is often called a Beta-Binomial distribution.
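The generative recipe described earlier (first draw a CTR, then draw clicks) can be sketched as follows. The number of simulated ads and the seed are arbitrary assumptions; the Beta parameters are the ones from the 10-impression example.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n_ads <- 10000      # number of simulated ads (assumption for this sketch)
impressions <- 10   # impressions per ad
a <- 2
b <- 8

# Step 1: draw a CTR for each ad from the Beta distribution.
ctr_draws <- rbeta(n_ads, a, b)
# Step 2: use each CTR as the success probability of a Binomial.
click_draws <- rbinom(n_ads, size = impressions, prob = ctr_draws)

# Reference: clicks simulated with the CTR fixed at the Beta mean.
fixed_draws <- rbinom(n_ads, size = impressions, prob = a / (a + b))

print(c(beta_binomial = var(click_draws), binomial = var(fixed_draws)))
```

Both simulations average about 2 clicks per ad, but the two-stage version is more spread out because it also carries the uncertainty about the CTR itself.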

# Bayesian inference

Usually, we are working with multiple ads running over extended periods of time. As we accumulate observations, we would like to produce an estimate of the Click-through Rate while accounting for the uncertainty. Some ads might have thousands of impressions and clicks while some might just have a couple. This is where Bayesian inference comes into play. We start with a prior belief about the Click-through Rate and, as new observations are recorded, we update our beliefs in a step-by-step fashion. The prior can be chosen in multiple ways. If we have enough data, we can use the parametrization above based on the mean and sample size. If we have no information, we can use a non-informative prior, e.g. the Jeffreys prior \(\mathrm{Beta}(0.5, 0.5)\).

Once we have defined our prior belief, we need to specify the likelihood function. In our case, the likelihood of observations (clicks) under a given set of parameters (CTR) is given by the Binomial distribution. The Beta distribution is a conjugate prior for the Binomial likelihood, which gives us a closed-form equation for the posterior distribution. If our prior is \(\mathrm{Beta}(a,b)\) and we observe \(x\) clicks after \(N\) impressions, the posterior also has a Beta distribution: \(\mathrm{Beta}(a + x, b + N - x)\). For example, starting with a Jeffreys prior and observing 1 click after 4 impressions, we get the posterior \(\mathrm{Beta}(1.5, 3.5)\) shown in the figure below.
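The update rule is short enough to write out. The sketch below uses two hypothetical helpers, update_beta and summarise_posterior, and applies them to the 1-in-4 example as well as to the two cases from the introduction (0 clicks and 3 clicks in 4 impressions). Using a central 95% interval as the summary is a choice made here, not something prescribed by the method.

```r
# Hypothetical helper: conjugate Beta-Binomial update.
update_beta <- function(prior_a, prior_b, clicks, impressions) {
  c(a = prior_a + clicks, b = prior_b + impressions - clicks)
}

# Hypothetical helper: posterior mean and central 95% interval.
summarise_posterior <- function(post) {
  a <- post[["a"]]
  b <- post[["b"]]
  c(mean  = a / (a + b),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

# Jeffreys prior Beta(0.5, 0.5) combined with three small samples.
post_1_of_4 <- update_beta(0.5, 0.5, clicks = 1, impressions = 4)  # Beta(1.5, 3.5)
post_0_of_4 <- update_beta(0.5, 0.5, clicks = 0, impressions = 4)  # Beta(0.5, 4.5)
post_3_of_4 <- update_beta(0.5, 0.5, clicks = 3, impressions = 4)  # Beta(3.5, 1.5)

print(rbind(`1 of 4` = summarise_posterior(post_1_of_4),
            `0 of 4` = summarise_posterior(post_0_of_4),
            `3 of 4` = summarise_posterior(post_3_of_4)))
```

The posterior means are 0.3, 0.1 and 0.7 respectively, so under this prior an ad with no clicks in 4 impressions is not assigned a CTR of zero, and 3 clicks in 4 impressions gives an estimate of 70% rather than 75%, with a wide interval around it in each case.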

A Bayesian model can be analyzed either through summaries of the posterior (means, medians, standard deviations etc.) or through random sampling. For example, after the above Bayesian Inference step we would like to see the distribution of total cost after 10000 impressions of an ad with 0.35 Cost-per-Click.

This can be simulated as follows (including a Jeffreys prior calculation):

```r
cpc <- 0.35

# Jeffreys prior
priorA <- 0.5
priorB <- 0.5

clicks <- 1
impressions <- 4

hist(rbeta(10000, priorA + clicks, priorB + impressions - clicks) * 10000 * cpc, 20, main="Cost distribution", xlab="Cost")
```
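Besides the histogram, the same posterior samples can be reduced to a few numbers. The sketch below is one way to do it; the seed and the choice of the 5%, 50% and 95% quantiles are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

cpc <- 0.35
n_impressions <- 10000

# Posterior samples of CTR: Jeffreys prior updated with 1 click in 4 impressions.
ctr_samples <- rbeta(10000, 0.5 + 1, 0.5 + 4 - 1)

# Cost implied by each sampled CTR.
cost_samples <- ctr_samples * n_impressions * cpc

cost_summary <- c(mean = mean(cost_samples),
                  quantile(cost_samples, c(0.05, 0.5, 0.95)))
print(round(cost_summary, 2))
```

The posterior mean CTR is 0.3, so the mean cost should land near 0.3 × 10000 × 0.35 = 1050, while the quantiles show how wide the plausible range remains after only 4 impressions.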

# Finishing word

The above example may seem over-engineered, but it is a good building block towards larger Small Data frameworks. Once you start adding more metrics with high degrees of uncertainty (e.g. Click-Through Rate * Conversion Rate * Average Order Value), the Bayesian methodology coupled with posterior sampling shows its true power. It will allow you to make informed decisions even in the absence of large volumes of data.
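As a rough illustration of chaining several uncertain metrics, the sketch below multiplies posterior samples of CTR and Conversion Rate by an Average Order Value. Only the CTR data (1 click in 4 impressions) comes from the example above; the order counts and the fixed order value are hypothetical numbers chosen purely for illustration, and treating the two rates as independent is a simplifying assumption.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
n_samples <- 10000

# CTR posterior: Jeffreys prior updated with 1 click in 4 impressions.
ctr <- rbeta(n_samples, 0.5 + 1, 0.5 + 4 - 1)

# Conversion Rate posterior: Jeffreys prior with hypothetical data
# (2 orders from 15 visitors).
orders <- 2
visitors <- 15
conversion <- rbeta(n_samples, 0.5 + orders, 0.5 + visitors - orders)

aov <- 40  # hypothetical Average Order Value, treated as fixed

# Revenue per impression implied by each joint posterior sample.
revenue_per_impression <- ctr * conversion * aov
print(quantile(revenue_per_impression, c(0.05, 0.5, 0.95)))
```

Each sample carries the uncertainty of both rates at once, which a product of two point estimates would hide.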

Nice article which smoothly leads to the idea of randomized probability matching techniques for bandit problems (Thompson sampling) in an intuitive and non-technical way.

I just worked on the estimation of CTRs in online advertising and I encountered the same “small data in big data” problem described here. Many ads have only a small number of clicks and impressions which makes it difficult to directly estimate the CTRs based on the clicks/impressions ratio only.

Recently, I carried out a series of experiments on real online advertising data. It turned out that the observed (aggregated) CTRs could actually be fitted almost perfectly to a beta distribution, but ONLY if I take out all samples with a low impression count, which would otherwise lead to a huge peak at CTR=0 (no clicks). My question is: Am I right to leave out all samples with CTR=0 and use the resulting beta distribution for the prediction of future ad CTRs? After all, we know that there is a large fraction of ads that won’t be clicked, so I wonder if neglecting them in the analysis would lead to estimated CTRs that are too large. Or could you argue that any statistical analysis of CTRs only makes sense for CTR>0 because the term “click-through rate” only has a real meaning when at least one click is observed?