Most A/B Tests are Illusory
Category: Library
#Experiment #Data
Paper
Most Winning A/B Test Results Are Illusory, Martin Goodson (DPhil), January 2014
Summary
Demonstrates that standard statistical techniques are equally valid when applied to A/B testing, and how neglecting them can result in erroneous conclusions being drawn from A/B test results. Three pitfalls are covered:
- statistical power
- multiple testing
- regression to the mean
Statistical Power
Put simply, a larger sample increases the statistical power of the result, where power is the probability that the test detects a difference when a real difference actually exists.
For A/B testing this means you need to run an experiment long enough to gather a sample that can reliably detect the effect you care about. The paper includes a methodology for calculating the required sample size, as sketched below.
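A minimal sketch of such a calculation, using the standard normal-approximation sample-size formula for comparing two proportions (the function name and example figures are illustrative assumptions, not taken from the paper):

```python
import math

from scipy.stats import norm

def sample_size_per_variant(p_base, rel_uplift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative
    uplift over a baseline conversion rate, using a two-sided z-test."""
    p_new = p_base * (1 + rel_uplift)
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)          # 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    delta = p_new - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# e.g. 5% baseline conversion rate, aiming to detect a 10% relative uplift:
print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 per variant
```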
Multiple testing
• Performing many tests, not necessarily concurrently, multiplies the probability of encountering at least one false positive (see the calculation below).
• False positives increase if you stop a test as soon as you see a positive result.
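A back-of-the-envelope calculation of the first point, assuming k independent tests each run at a 5% significance level (the figures are illustrative, not from the paper):

```python
# Family-wise error rate: the chance of at least one false positive
# across k independent tests, each run at significance level alpha.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> {fwer:.0%} chance of at least one false positive")
```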
Regression to the mean
Over time, even purely random results regress to the mean. If you use a small time window you may identify early winners that are in fact random winners. Watch the trend over time: if an initial uplift in an A/B test fades away, you may be observing regression to the mean, as the simulation below illustrates.
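A hypothetical simulation of the effect: two variants with an identical true conversion rate can show a sizeable apparent uplift early on that shrinks as data accumulates (all figures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05  # both variants share the same true conversion rate
a = rng.random(20_000) < p  # conversions for variant A
b = rng.random(20_000) < p  # conversions for variant B

# Apparent uplift of B over A as visitors accumulate; early windows
# can show large spurious differences that fade with more data.
for n in (100, 1_000, 5_000, 20_000):
    uplift = b[:n].mean() - a[:n].mean()
    print(f"after {n:6d} visitors per variant: uplift = {uplift:+.3%}")
```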
Final quote
You can increase the robustness of your testing process by following standard statistical practice:
• Use a valid hypothesis - don’t use a scattergun approach
• Do a power calculation first to estimate sample size
• Do not stop the test early if you use ‘classical methods’ of testing
• Perform a second ‘validation’ test repeating your original test to check that the effect is real
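For the final point, a minimal two-proportion z-test one might use to check whether a validation run reproduces the original effect (the function name and figures are illustrative assumptions, not from the paper):

```python
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates;
    returns the p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# e.g. validation run: 1,600/31,000 vs 1,750/31,000 conversions
print(two_proportion_z_test(1600, 31_000, 1750, 31_000))  # ~0.008
```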