Most A/B Tests are Illusory
Category: Library
#Experiment #Data
Paper
Most Winning A/B Test Results Are Illusory, Martin Goodson (DPhil), January 2014
Summary
Demonstrates that standard statistical techniques are equally valid when applied to A/B testing, and how neglecting them can result in erroneous conclusions being drawn from A/B test results. Three pitfalls are covered:
- statistical power
- multiple testing
- regression to the mean
Statistical Power
Put simply, a larger sample increases the statistical power of the result, where power is the probability that the test detects a difference when a real difference actually exists.
For A/B testing this means you need to run an experiment long enough to gather a sample that can reliably detect the effect you care about. The paper includes a methodology for calculating the required sample size, as sketched below.
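A minimal sketch of such a calculation, using the standard normal-approximation sample-size formula for comparing two proportions (the function name and example figures are illustrative assumptions, not taken from the paper):

```python
import math

from scipy.stats import norm

def sample_size_per_variant(p_base, rel_uplift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative
    uplift over a baseline conversion rate, using a two-sided z-test."""
    p_new = p_base * (1 + rel_uplift)
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)          # 0.84 for 80% power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    delta = p_new - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# e.g. 5% baseline conversion rate, aiming to detect a 10% relative uplift:
print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 per variant
```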
Multiple testing
• Performing many tests, not necessarily concurrently, multiplies the probability of encountering at least one false positive (see the calculation below).
• False positives increase if you stop a test as soon as you see a positive result.
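A back-of-the-envelope calculation of the first point, assuming k independent tests each run at a 5% significance level (the figures are illustrative, not from the paper):

```python
# Family-wise error rate: the chance of at least one false positive
# across k independent tests, each run at significance level alpha.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> {fwer:.0%} chance of at least one false positive")
```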
Regression to the mean
Over time, even purely random results regress to the mean. If you use a small time window you may identify early winners that are in fact random winners. Watch the trend over time: if an initial uplift in an A/B test fades away, you may be observing regression to the mean, as the simulation below illustrates.
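A hypothetical simulation of the effect: two variants with an identical true conversion rate can show a sizeable apparent uplift early on that shrinks as data accumulates (all figures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05  # both variants share the same true conversion rate
a = rng.random(20_000) < p  # conversions for variant A
b = rng.random(20_000) < p  # conversions for variant B

# Apparent uplift of B over A as visitors accumulate; early windows
# can show large spurious differences that fade with more data.
for n in (100, 1_000, 5_000, 20_000):
    uplift = b[:n].mean() - a[:n].mean()
    print(f"after {n:6d} visitors per variant: uplift = {uplift:+.3%}")
```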
Final quote
You can increase the robustness of your testing process by following standard statistical practice:
• Use a valid hypothesis - don’t use a scattergun approach
• Do a power calculation first to estimate sample size
• Do not stop the test early if you use ‘classical methods’ of testing
• Perform a second ‘validation’ test repeating your original test to check that the effect is real
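For the final point, a minimal two-proportion z-test one might use to check whether a validation run reproduces the original effect (the function name and figures are illustrative assumptions, not from the paper):

```python
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates;
    returns the p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# e.g. validation run: 1,600/31,000 vs 1,750/31,000 conversions
print(two_proportion_z_test(1600, 31_000, 1750, 31_000))  # ~0.008
```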