Finding Statistical Significance in A/B Testing

Getting results from A/B testing

A/B testing is marketing speak for running a controlled experiment on email. The goal is to find the most effective way to communicate in order to achieve a desired result. When performing an A/B test, the first step is to generate a hypothesis. This is a test after all, and every test should have a hypothesis which you are trying to support or reject.

An example of a hypothesis in an A/B test could be, “If I include the recipient’s name in the subject line of my email, I will get a higher open rate.” To test this hypothesis, you need a control group and an experimental group. Group A can be the control group: they will receive an email with the same type of subject line you have been using all along. Group B can be the experimental group: they will receive an email with their name in the subject line.
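
The comparison is only fair if recipients are assigned to the two groups at random. As a minimal sketch, assuming your list is available as a plain Python list of addresses, a random split might look like the following. The `split_ab` helper and the example addresses are hypothetical and not part of any particular email tool.

```python
import random


def split_ab(recipients, seed=None):
    """Randomly split recipients into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a fixed seed makes the split repeatable
    shuffled = list(recipients)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical list of 500 addresses, used only for illustration.
emails = [f"person{i}@example.org" for i in range(500)]
group_a, group_b = split_ab(emails, seed=42)
print(len(group_a), len(group_b))
```

Shuffling the whole list before cutting it in half avoids accidental patterns, such as older subscribers all landing in one group.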

By comparing the open rates of the two emails, you can figure out which one won the test. But remember, this is a controlled experiment, and the difference between the groups needs to be statistically significant before you can treat it as a real effect rather than chance.

What is statistical significance?

Statistical significance is a way of saying that the results of a test are unlikely to be due to chance alone, and so probably reflect a real-world difference. Having statistically significant results is great! It means that what you tested, be it subject line, time of sending, etc., produced a difference you can reasonably act on.

The P-value is the numerical result of a statistical calculation. A low P-value indicates that a result is statistically significant; a high P-value indicates that it is not. The standard P-value cutoff for significance in an experiment is .05. A P-value of .05 means that, if your change really made no difference, you would still see a gap at least this large about five percent of the time through random chance alone. It does not mean there is a 95 percent chance that your hypothesis is true.

Bigger is better

What ends up being the largest factor which influences statistical significance? Size of the sample group.

Say you send an A/B test on subject lines to two groups of 250 each. Group A had 75 opens, an open rate of 30%. Group B had 87 opens, an open rate of about 35%. That’s a difference of 5 percentage points. Looking at the numbers alone, that seems like a pretty significant difference, right? Unfortunately, not really. Because that gap represents only 12 people, it is not large enough to rule out chance. Calculating the P-value, we find that a gap this size would show up about 12% of the time even if the subject line made no difference at all, well above the .05 cutoff.
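
For readers who want to check the arithmetic, here is a minimal sketch of one common way to get a P-value for two open rates: a one-sided, pooled two-proportion z-test. The choice of test is an assumption (the article does not say which calculation was used), and the function name is hypothetical. With the raw counts it gives roughly 0.13, close to the 12% quoted above, which matches the rounded 30% and 35% rates.

```python
from math import erfc, sqrt


def two_proportion_p_value(opens_a, sent_a, opens_b, sent_b):
    """One-sided P-value that B's open rate exceeds A's (pooled z-test)."""
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    # Pooled open rate under the assumption that A and B do not differ.
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * erfc(z / sqrt(2))


p = two_proportion_p_value(75, 250, 87, 250)
print(round(p, 2))
```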

Now, if 100 people from Group B had opened the email (an open rate of 40% against Group A’s 30%), the results change pretty dramatically. With these results the P-value drops below .01: a gap that large would appear by chance less than 1% of the time if the subject line made no difference. The difference in these open rates is large enough to be statistically significant. But most tests are not going to have such dramatic differences, and will only vary in open rates by a few percentage points.

If our initial test with the 30% and 35% open rates had been run on a larger sample, say 1,000 emails in each group, then those same open rates would yield a very significant result: a gap that size would turn up as a random fluke less than 1 time in 100.
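
To see how group size alone moves the result, the sketch below holds the open rates at 30% and 35% and varies only the number of emails per group. It assumes the same one-sided pooled z-test as a stand-in for whatever calculator you use. The function name and the list of group sizes are illustrative choices.

```python
from math import erfc, sqrt


def p_value_for_rates(rate_a, rate_b, group_size):
    """One-sided pooled z-test P-value for two equal-sized groups."""
    pooled = (rate_a + rate_b) / 2   # equal group sizes, so a simple average
    std_err = sqrt(pooled * (1 - pooled) * 2 / group_size)
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * erfc(z / sqrt(2))


# Same open rates every time; only the number of emails per group changes.
for size in (250, 500, 1000, 2000):
    print(size, round(p_value_for_rates(0.30, 0.35, size), 4))
```

Under these assumptions the P-value is about 0.12 at 250 per group and falls below 0.01 at 1,000 per group, even though the open rates never change.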

Want to try running these numbers yourself? This A/B significance calculator is a pretty good place to start.

But what if my list is too small to achieve statistically significant results?

One problem that smaller nonprofits are going to encounter is that their lists just aren’t big enough to generate statistically significant results. Does this mean that you shouldn’t be using A/B testing, or that the results are invalid? Not at all, but it does mean that there is an element of randomness in your results which needs to be acknowledged.

Smaller is better

One alternative is to focus less on A/B testing your entire email list, and instead focus on list segmentation. When sending one message to your entire list, you risk sending people irrelevant content. Recently, we sent out an email about a product upgrade that only affected a small number of our clients. That email had an open rate of 60%, well above that of a typical email blast. If we had sent that email out to our entire list, not only would few people have opened it, but we probably would have lost a lot of subscribers because we were not delivering relevant content.

By sending people content they want – and only the content they want – you can increase open rates and reduce unsubscribes.

Kathy A. July 13, 2015, 11:34 am

Just wondering how you are suggesting that people randomize the assignment to the two groups. Otherwise, a P-test is not valid.
- Marcos July 13, 2015, 12:57 pm
  
  Hi Kathy,
  
  Within our Databank software, there is a built-in A/B testing tool which will randomly assign the emails. There is also a random record selection tool which will create a random list of individual records if you are doing a direct mailing, for instance. The RAND function in Excel is an alternative, though more time-consuming, way to achieve the same result.
Georgi June 12, 2017, 6:06 am

Just out of curiosity: in the years since posting this, have you considered sequential testing approaches, which are generally much better suited to testing in an online marketing environment (including e-mail templates)? E.g. AGILE A/B Testing? It would allow you to run tests faster and with proper statistical guarantees in terms of both statistical significance and statistical power.
- Marcos June 15, 2017, 8:48 am
  
  Thanks for your question, Georgi.
  
  We do not have built-in sequential testing for A/B tests. It’s an interesting idea and one which we will keep in mind the next time we evaluate our A/B testing offerings.