A/B testing has become the standard way of optimizing website and campaign structure. Tools like Google Analytics Content Experiments and Webtrends Optimize make test setup effortless. Unfortunately, the common questions remain – *Is my sample size big enough?* *Are my results statistically significant?*

With a little bit of R we can answer those questions.

## Basics

First, some definitions and terminology (no worries, I will try to make it easy to understand).

- We say that a result of a test is **statistically significant** if it is unlikely to have occurred by chance alone in the given setting.
- We decide between two rival hypotheses – the *Null Hypothesis* and the *Alternative Hypothesis*.
- The **Null Hypothesis** is the assumption that there is no relationship between the values. Usually, we set it to the exact opposite of what we would like to see. When expecting a higher conversion rate on the new landing page, use a Null Hypothesis stating that the mean conversion rates are equal.
- The **Alternative Hypothesis** is what we assume once the Null Hypothesis is rejected. There are many types of Alternative Hypotheses, but in most A/B tests it is enough to know that the two values are different; in these cases we use a *two-tailed* alternative. A *one-tailed* alternative allows us to specify the direction of the inequality.
- The result of the test is either to **reject the Null Hypothesis and accept the Alternative Hypothesis**, or to conclude that there is **not enough evidence to reject the Null Hypothesis**. Note that we have no way of confirming the Null. If you fail to reject the hypothesis that there is no difference in Conversion Rates between two ads, you should not conclude that the two rates are equal – perhaps your sample was simply too small to detect the difference.
- We select a **Significance Level** – the probability of detecting an effect that is not there. Statisticians call this the probability of a Type I error, or of a false-positive detection. Usually we choose a value of 5% or less (so we have at most a 5% chance of making a mistake when rejecting the Null).
- We also select the **Power of the Test** – the probability of finding an effect that does exist, also called the probability of a true positive. Commonly chosen values are 80% or 90%.
- We calculate the **p-value** from the *test statistic*. It is the probability of observing a result at least as extreme as the one obtained, assuming the Null Hypothesis holds. The p-value is what you usually see reported instead of the test statistic because it is easier to interpret.
- To **reject the Null Hypothesis** we need a *p-value* lower than the selected Significance Level.
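As a toy illustration (the counts below are made up, not from any real campaign), the decision rule boils down to comparing a test's p-value against the chosen Significance Level:

```r
# Hypothetical data: 100 vs 180 conversions out of 2000 visitors per group.
result <- prop.test(c(100, 180), c(2000, 2000))

result$p.value          # probability of a result at least this extreme under the Null
result$p.value < 0.05   # TRUE here, so we reject the Null Hypothesis
```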

There are two basic Hypothesis Testing methodologies that come in handy when doing A/B testing.

## Test of proportion

We use a Test of Proportion when working with percentage metrics like Conversion Rate, Click-through Rate etc. The key assumption is independence of the groups, i.e. customers in one group always see only one version of the landing page. The Null Hypothesis here is that the percentage is equal in both groups.

The optimal sample size can be obtained in R using two functions.

**power.prop.test {stats}** is part of the base R distribution and computes either the power of a test or the required sample size.

```r
power.prop.test(n = NULL, p1 = NULL, p2 = NULL,
                sig.level = 0.05, power = NULL,
                alternative = c("two.sided", "one.sided"),
                strict = FALSE)
```

where:

- *n* – number of observations (per group)
- *p1* – probability in one group
- *p2* – probability in the other group
- *sig.level* – significance level (Type I error probability)
- *power* – power of test (1 minus Type II error probability)
- *alternative* – one- or two-sided test
- *strict* – use strict interpretation in the two-sided case

We use this function by providing all but one parameter. R will calculate the left-out argument. Use *p1* to give the known value (e.g. current Conversion Rate) and *p2* to state the smallest effect you want to detect with the test.

For example, to compute the sample size needed to pick up a one-percentage-point increase in conversion rate at a 5% significance level and 80% power:

```r
> power.prop.test(p1 = 0.015, p2 = 0.025, sig.level = 0.05, power = 0.8)

     Two-sample comparison of proportions power calculation

              n = 3075.582
             p1 = 0.015
             p2 = 0.025
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
```

From the above, we need at least 3076 samples in each group. Similarly, we can compute the smallest increase in Conversion Rate detectable with a sample of 1000 customers per group:

```r
> power.prop.test(n = 1000, p1 = 0.015, sig.level = 0.05, power = 0.8)

     Two-sample comparison of proportions power calculation

              n = 1000
             p1 = 0.015
             p2 = 0.03443593
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
```

A great package for computing the power of a test is **pwr**. It contains functions to compute sample sizes and power for the most popular tests. You can install it easily from CRAN:

```r
> install.packages("pwr")
> library(pwr)
```

Use **pwr.2p2n.test {pwr}** to compute sample size needed with two unequal groups.

```r
pwr.2p2n.test(h = NULL, n1 = NULL, n2 = NULL,
              sig.level = 0.05, power = NULL,
              alternative = c("two.sided", "less", "greater"))

- *h* – effect size
- *n1* – number of observations in the first sample
- *n2* – number of observations in the second sample
- *sig.level* – significance level (Type I error probability)
- *power* – power of test (1 minus Type II error probability)
- *alternative* – a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "greater" or "less"

The recommended benchmark values for *h* are: 0.2 for small effects, 0.5 for medium and 0.8 for large effects. Provide one of *n1* or *n2* to give the sample size of one group and omit the other to compute it. For example, to calculate the size of the test group needed to detect a small effect with 1000 customers in the control group:

```r
> pwr.2p2n.test(h = 0.2, n1 = 1000, sig.level = 0.05, power = 0.8)

     difference of proportion power calculation for binomial distribution (arcsine transformation)

              h = 0.2
             n1 = 1000
             n2 = 244.1239
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: different sample sizes
```

From the above: we need at least 245 customers in the test group.
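Rather than guessing between the 0.2/0.5/0.8 benchmarks, the pwr package can also compute *h* directly from two proportions with its `ES.h` function (an arcsine transformation); the proportions below are just an illustration reusing the earlier Conversion Rate example:

```r
library(pwr)

# h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
h <- ES.h(0.025, 0.015)
h  # about 0.072 – far smaller than the "small effect" benchmark of 0.2

# Feeding it back into an equal-groups power calculation:
pwr.2p.test(h = h, sig.level = 0.05, power = 0.8)  # n ≈ 3029 per group
```

Note that the resulting sample size is close to, but not identical with, the 3076 given by power.prop.test – the two functions use different approximations.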

The actual test is performed using **prop.test {stats}** from the base distribution.

```r
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
```

- *x* – a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively
- *n* – a vector of counts of trials
- *p* – a vector of probabilities of success
- *alternative* – a character string specifying the alternative hypothesis
- *conf.level* – confidence level of the returned confidence interval
- *correct* – a logical indicating whether Yates' continuity correction should be applied where possible

For example, to test two campaigns with 1000 displays each, yielding 32 and 54 conversions respectively:

```r
> prop.test(c(32, 54), c(1000, 1000))

	2-sample test for equality of proportions with continuity correction

data:  c(32, 54) out of c(1000, 1000)
X-squared = 5.3583, df = 1, p-value = 0.02062
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.040754721 -0.003245279
sample estimates:
prop 1 prop 2
 0.032  0.054
```

The p-value is less than 0.05, so we can reject the hypothesis that conversion rates are equal and assume the second group has a higher rate.

## Test of means

Comparing non-fractional values that follow a normal distribution (e.g. Average Order Value, Time Spent on Page etc.) is done with a two-sample unpaired t-test. The recommended variant is the Welch t-test: its assumptions are quite flexible, and it can be used with unequal sample sizes and unequal variances in the two groups. Note that if your data follows a log-normal distribution you may need to apply a log transform first.
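For example (simulated data, not from any real campaign), a log-normal metric such as Average Order Value can be log-transformed before testing:

```r
set.seed(1)

# Simulated order values for two variants – log-normal, as revenue data often is
group_a <- rlnorm(500, meanlog = 4.8, sdlog = 0.6)
group_b <- rlnorm(500, meanlog = 4.9, sdlog = 0.6)

# Welch t-test on the log scale, where the data is approximately normal
t.test(log(group_a), log(group_b))
```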

Sample size can be computed using **pwr.t2n.test {pwr}** function:

```r
pwr.t2n.test(n1 = NULL, n2 = NULL, d = NULL,
             sig.level = 0.05, power = NULL,
             alternative = c("two.sided", "less", "greater"))
```

Parameters are similar to those of the functions for the Test of Proportion, but the effect size is given as the *d* parameter (Cohen's d). Use 0.2 for small, 0.5 for medium and 0.8 for large effects.
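If you have pilot data, *d* can be estimated rather than guessed: it is the difference in means divided by the pooled standard deviation. A minimal sketch (the helper below is our own illustration, not part of pwr):

```r
# Cohen's d from two samples: mean difference over pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

cohens_d(c(10, 12, 14, 16), c(13, 15, 17, 19))  # about -1.16, a large effect
```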

Sample call for computing the test sample size for a campaign with an estimated medium effect and 10000 customers in the control group:

```r
> pwr.t2n.test(n1 = 10000, d = 0.5, sig.level = 0.05, power = 0.90)

     t test power calculation

             n1 = 10000
             n2 = 42.21519
              d = 0.5
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
```

As you can see, if the expected effect is strong, only around 43 customers are needed in the test group.

The test itself is done with **t.test {stats}**:

```r
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
```

In the simplest scenario, just provide the *x* and *y* parameters to perform a Welch two-sample test:

```r
> t.test(group1, group2)

	Welch Two Sample t-test

data:  group1 and group2
t = -1.5631, df = 99.423, p-value = 0.1212
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.0219603  0.8334419
sample estimates:
mean of x mean of y
 125.0789  128.1731
```

This time, our p-value is greater than 5%, so we **cannot reject** the Null Hypothesis that the means are equal. A larger sample size may be necessary to detect a difference.

Read more on power analysis in the Quick-R guide.

I love this type of analysis, and although it is quite over my head, I believe it is relevant to clients of the type I have where a/b testing can be risky due to low traffic volumes on the clients’ websites.

I’m interested to know – and I apologize for possibly reducing the complexity too much here – if there is a reverse path that could be at all valid for cases where the client website really doesn’t have sufficient traffic. By reverse path, I mean: if I started with a sample size of X giving me a/b tests, what’s the most validity I could expect to avail of – or alternatively, how much difference in the results would I need to see with a given sample size in order to judge it significant?

Here’s a more concrete example of what I am trying to ask: If I have a brand new website with almost no traffic, but I want to identify any major issues it might have by bringing 50 random visitors (for example, via a PPC campaign), and 25 of those visitors see version a and 25 see version b, how much benefit might I be able to expect? Or maybe a better question could be, how much measured difference between the a group and the b group results would I need to detect in order for my sample size of 25 per group to become useful/valid/significant?

Hopefully I haven’t made a fool of myself here by taking the article in a reverse direction, but part of my question relates to identifying how much money it would cost in terms of paid visitors in order to get valid a/b tests happening on new websites.

Hi there,

Thanks for your message. Assuming you are testing for conversion rate (a test of proportion) and that your base conversion rate (say in variant A of your website) is around 2%, with 25 people in both groups you would need to see your conversion rate go up to 35% with variant B to judge it significant.

Detecting an increase in conversion rate from 2% to 3% would take 5120 samples in each group (a total of over 10,000 visitors). From 2% to 5% you would need 786 samples in each group (a total of almost 1,600 visitors). Multiply the number of visitors by an average Cost-per-Click and you have the total cost. You can find a nice report on average CPC’s per industry here:

http://www.adgooroo.com/resources/blog/adwords-cost-per-click-rises-26-between-2012-and-2014/

So if your average CPC is $1.50 (about 0.89 GBP) and you want to detect an increase from 2% to 5% in conversion rate, you would need to spend roughly $2,360 (about 1,400 GBP). With AdWords costs growing each year it may not be the best way to drive traffic, unless your website is in a real niche.
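For reference, the sample sizes quoted above can be reproduced with power.prop.test; they correspond to a 90% power assumption:

```r
# 2% -> 3% conversion rate at 5% significance, 90% power
power.prop.test(p1 = 0.02, p2 = 0.03, sig.level = 0.05, power = 0.9)  # n ≈ 5120 per group

# 2% -> 5% conversion rate at the same significance and power
power.prop.test(p1 = 0.02, p2 = 0.05, sig.level = 0.05, power = 0.9)  # n ≈ 787 per group
```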

Let me know if that answers your question.

Many thanks,

K

[…] Note: This post was heavily influenced by Marketing Distillery's A/B Tests in Marketing. […]