In the last post we looked at the statistics behind simple A/B testing (you can find the whole article here). What if you want to test more than two alternatives of a page? Working with multiple page layouts requires a statistically sound methodology to confirm the results.

A/B/n Testing (for lack of a better name) is an extension of A/B testing that allows more than two versions of the same page to be compared simultaneously. Your first thought might be to run pairwise A/B tests, but that quickly becomes inconvenient, and the statistics employed require a correction to account for the increased chance of Type I errors (False Positives) – read more on this problem here.

A simple statistical methodology for A/B/n Testing can be based on ANOVA – the Analysis of Variance. It is a generalization of the t-test that checks whether the means of several groups (page variants) are equal. The Null Hypothesis in ANOVA is that all group means are equal, i.e. all group samples are in fact drawn from the same population. Analysis of Variance has the following key assumptions:

- **Groups are independent** – each customer is presented with only one variant of the page.
- **Residuals are normal** – the metric is normally distributed in each group. This is because ANOVA utilizes linear regression under the hood.
- **Variance of the metric is equal between groups** – this property is also called *homoscedasticity* and is one of the most troublesome requirements of ANOVA.
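The normality assumption can be checked per group with the Shapiro-Wilk test. Below is a minimal sketch using base R's **shapiro.test**; the data frame here is simulated (all values hypothetical) and merely stands in for real test results:

```r
# Sketch: per-group normality check with the Shapiro-Wilk test.
# `data` is simulated here and stands in for real test results.
set.seed(42)
data <- data.frame(
  group       = rep(c("A", "B", "C", "D"), each = 100),
  order_value = rnorm(400, mean = 100, sd = 15)
)

# Run shapiro.test on each group's values; p > 0.05 means no evidence
# against normality in that group.
p_values <- tapply(data$order_value, data$group,
                   function(x) shapiro.test(x)$p.value)
print(round(p_values, 3))
```

In practice, replace the simulated data frame with your actual results; a small p-value in any group would argue against the normality assumption.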

## Sample size

We can use **pwr.anova.test** from the **pwr** package to determine the sample size needed for a statistically significant ANOVA test.

```r
pwr.anova.test(k = , n = , f = , sig.level = , power = )
```

where:

- **k** is the number of groups (variants).
- **n** is the common number of samples in each group (assuming groups of equal size).
- **f** indicates the expected effect size. Use 0.1 for a small effect, 0.25 for a medium effect and 0.4 for a large effect.
- **sig.level** – significance level (Type I error probability). Usually set to 5%.
- **power** – power of the test (1 minus Type II error probability). Usually larger than 90%.

The above function is used by omitting one of the parameters. For example, to calculate the number of samples needed to compare 4 groups with a small expected effect and significance level of 5% and power of test at 90%:

```r
> library(pwr)
> pwr.anova.test(k=4, f=0.1, sig.level=0.05, power=0.9)

     Balanced one-way analysis of variance power calculation

              k = 4
              n = 355.2657
              f = 0.1
      sig.level = 0.05
          power = 0.9

NOTE: n is number in each group
```

From the above, we would need at least 356 samples in each group. Read more on power analysis in R here.

## Homoscedasticity

Imagine you are testing 4 variants of an e-commerce website and you are interested in estimating the impact of each design on the Average Order Value. If homoscedasticity holds, the variance in each group should be identical and boxplots should show boxes and whiskers of similar length and symmetry.

If homoscedasticity does not hold (a condition called *heteroscedasticity*), the variance visible on the boxplot differs from group to group (compare with fig 2).
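Such a visual check is easy to produce with base R's boxplot. The sketch below simulates heteroscedastic data (the means and spreads are hypothetical, chosen only to illustrate the pattern):

```r
# Sketch: boxplots as a visual homoscedasticity check.
# Data are simulated; in practice use your real test results.
set.seed(1)
sds <- c(A = 10, B = 10, C = 25, D = 40)   # unequal spreads -> heteroscedastic
data <- data.frame(
  group       = rep(names(sds), each = 100),
  order_value = rnorm(400, mean = 100, sd = rep(sds, each = 100))
)
boxplot(order_value ~ group, data = data,
        xlab = "Variant", ylab = "Order value",
        main = "Unequal spread across groups")
```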

Before we continue with the ANOVA analysis, we should test for homoscedasticity. A standard option is Bartlett’s Test, available in base R as **bartlett.test**:

```r
bartlett.test(formula, data, subset, na.action, ...)
```

The Null Hypothesis in Bartlett’s test is that variances between groups are equal.

For example, let us assume the test results are stored in a *stacked* format – each observation in a separate row with a group indicator in a separate column.

```r
> head(data, 10)
   group order_value
1      D    95.19512
2      A   126.29439
3      D    85.68171
4      A   140.20701
5      B   160.07969
6      C   110.22377
7      D    90.75303
8      D    89.41973
9      C    71.10501
10     B   185.78189
```
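For readers who want to follow along, data in this stacked shape can be simulated. The group means and standard deviation below are hypothetical, chosen only to roughly resemble the example output:

```r
# Sketch: simulate stacked A/B/n test results (hypothetical parameters).
set.seed(123)
means <- c(A = 130, B = 131, C = 35, D = 59)  # assumed group means
data <- data.frame(
  group       = rep(names(means), each = 100),
  order_value = rnorm(400, mean = rep(means, each = 100), sd = 15)
)
data <- data[sample(nrow(data)), ]   # shuffle rows to mimic arrival order
rownames(data) <- NULL
head(data, 10)
```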

We can test for homogeneity of variance (homoscedasticity) with:

```r
> bartlett.test(order_value ~ group, data)

        Bartlett test of homogeneity of variances

data:  order_value by group
Bartlett's K-squared = 0.8274, df = 3, p-value = 0.8429
```

The p-value is much higher than 0.05, so we have no reason to reject the Null Hypothesis that the variances are equal between groups. This is exactly what we wanted to see – we can safely proceed with ANOVA.

Another test of this kind is Levene’s Test, available from the **car** package as **leveneTest**. It is less sensitive than Bartlett’s test to departures from normality (read more here and here). The call syntax is almost identical to bartlett.test:

```r
> leveneTest(order_value ~ group, data)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   3  0.2244 0.8794
      396
```

Again, the p-value is larger than 0.05 so we have no reason to reject the hypothesis that the variances are equal.

## ANOVA for homoscedastic data

Once we confirm (or at least have no reason to reject) homoscedasticity, we can perform a One-way ANOVA test using the **aov** function. “One-way” means that we are using a single grouping variable. The call is:

```r
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
```

where the most important parameters are:

- **formula**, which specifies the model.
- **data** – the data frame with *stacked* test results.

Note that the Null Hypothesis in an ANOVA test is that **ALL means are equal**. It is enough for just one mean to differ to provide enough evidence for rejecting the Null Hypothesis.

```r
> dataAOV <- aov(order_value ~ group, data)
> summary(dataAOV)
             Df Sum Sq Mean Sq F value Pr(>F)    
group         3 731273  243758    1075 <2e-16 ***
Residuals   396  89763     227                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

The p-value is very small, so we should reject the Null Hypothesis that all means are equal. To find out which groups have significantly different means, we perform a post-hoc Tukey HSD test, available in standard R as **TukeyHSD**. Some authors argue that the ANOVA step is in fact unnecessary and we could perform the Tukey HSD test alone. Nevertheless, the ANOVA + Tukey approach is considered standard in most books.

```r
TukeyHSD(x, which, ordered = FALSE, conf.level = 0.95, ...)
```

where

- **x** is the ANOVA model fitted with *aov*.
- **which** is the list of terms (grouping variables) to test. Defaults to all terms in the model.

Individual Null Hypotheses in Tukey HSD are that the means are equal between a given pair of groups. For our data the call would be:

```r
> TukeyHSD(dataAOV)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = order_value ~ group, data = data)

$group
          diff         lwr        upr     p adj
B-A   0.9255835   -4.567699   6.418866 0.9724604
C-A -95.5001628 -100.993446 -90.006880 0.0000000
D-A -71.0729367  -76.566220 -65.579654 0.0000000
C-B -96.4257463 -101.919029 -90.932463 0.0000000
D-B -71.9985202  -77.491803 -66.505237 0.0000000
D-C  24.4272261   18.933943  29.920509 0.0000000
```

The result contains a number of interesting values:

- **group indicator** in the form *group1-group2*.
- **diff** – the estimated difference between the group means.
- **lwr** and **upr** – the lower and upper bounds of the confidence interval for the difference.
- **p adj** – the p-value, adjusted for multiple comparisons, for the Null Hypothesis that the means are equal in the given pair of groups.

As you can see, the only pair where the means are plausibly equal is B-A (p adj ≈ 0.97). An inspection of boxplots seems to confirm that result.

[boxplot here]
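The Tukey intervals themselves can also be plotted directly, using base R's plot method for TukeyHSD objects. The sketch below is self-contained, with simulated data whose group means are hypothetical (chosen to roughly resemble the example):

```r
# Sketch: fit ANOVA on simulated data and plot Tukey HSD confidence intervals.
set.seed(7)
means <- c(A = 130, B = 131, C = 35, D = 59)  # hypothetical group means
data <- data.frame(
  group       = rep(names(means), each = 100),
  order_value = rnorm(400, mean = rep(means, each = 100), sd = 15)
)
dataAOV <- aov(order_value ~ group, data = data)
tukey   <- TukeyHSD(dataAOV)
plot(tukey)   # intervals crossing zero indicate pairs with similar means
```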

## Dealing with heteroscedastic data

If Bartlett’s or Levene’s test rejects homoscedasticity, there are two modified versions of the Analysis of Variance with relaxed requirements.

The **oneway.test** uses a Welch correction for data with non-homogeneous variances. Note that we still require normality of the value distributions. The calling method is almost identical to *aov*:

```r
oneway.test(formula, data, subset, na.action, var.equal = FALSE)
```
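A minimal sketch of such a call on simulated heteroscedastic data (the group means and spreads below are hypothetical):

```r
# Sketch: Welch's one-way test on data with unequal variances.
set.seed(11)
data <- data.frame(
  group       = rep(c("A", "B", "C"), each = 100),
  order_value = c(rnorm(100, mean = 100, sd = 10),   # variant A
                  rnorm(100, mean = 100, sd = 25),   # variant B: larger spread
                  rnorm(100, mean = 120, sd = 40))   # variant C: shifted mean
)
result <- oneway.test(order_value ~ group, data = data, var.equal = FALSE)
print(result)
```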

For non-normally distributed data, one can use the Kruskal-Wallis rank sum test, with the Null Hypothesis that the location parameters of the distribution are the same in each group. A basic call to **kruskal.test** is:

```r
kruskal.test(formula, data, subset, na.action, ...)
```
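A minimal sketch on simulated skewed (non-normal) data, with hypothetical rate parameters:

```r
# Sketch: Kruskal-Wallis test on skewed, non-normal data.
set.seed(5)
data <- data.frame(
  group       = rep(c("A", "B", "C"), each = 100),
  order_value = c(rexp(100, rate = 1/100),   # variant A
                  rexp(100, rate = 1/100),   # variant B: same distribution
                  rexp(100, rate = 1/150))   # variant C: shifted location
)
result <- kruskal.test(order_value ~ group, data = data)
print(result)
```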
