A/B Testing Tool FAQ

Scroll down for more about the testing tool. You'll find info about:

  • General info about A/B testing
  • The Sample Finder
  • How to set up and run an A/B test with your email tool
  • The Significance Inspector

General Info About The Tools and A/B Testing

What is A/B testing?

A/B testing is an experiment that compares two versions of something, identical except for a single variable. For example, testing two versions of an email with the exact same content but different subject lines.

 

Who developed this tool and why?

This tool was developed by New Media Mentors. Through our work mentoring nonprofits we noticed that groups were struggling with the statistics side of A/B testing. We wanted to remove this barrier and make it easier for all organizations to make their communications more effective. Because our expertise is mentoring organizations (not statistics), we consulted folks with deeper statistics knowledge while building the tool.

If you're interested in being mentored by New Media Mentors, please don't hesitate to contact us.

 

I’m ready for a more advanced stats tool. Where can I find one?

If you're ready for a more advanced stats tool, congratulations! Check out ABBA. This tool will let you do multivariate testing, give you more options for changing your confidence interval, and give you results using statistics lingo.

 

Can this tool be used for multi-variate testing?

Nope. We specifically designed this tool for A/B testing only, to make it as easy as possible for people who aren't math experts to use. If you're ready for a more advanced stats tool, check out ABBA. It will let you do multivariate testing, give you more options for changing your confidence interval, and give you results using statistics lingo.

 

What is an open rate?

The open rate is the percentage of email recipients who opened an email. It is generally calculated by dividing the number of opens by the number of emails sent or delivered.

 

What is a click-through rate?

The click-through rate is the percentage of email recipients who clicked a link in an email. It is generally calculated by dividing the number of clicks by the number of emails sent or delivered.

 

What is an action rate?

The action rate is the percentage of email recipients who took the desired action from an email. It is generally calculated by dividing the number of actions by the number of emails sent or delivered.
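
All three rates come from the same simple division. Here's a quick illustration in Python using made-up numbers (whether you divide by emails sent or emails delivered is up to you; just be consistent):

    # Quick illustration of the three rates, using made-up numbers.
    emails_sent = 10_000   # or use emails delivered; just be consistent
    opens = 2_200
    clicks = 450
    actions = 120

    open_rate = opens / emails_sent            # 0.22  -> 22%
    click_through_rate = clicks / emails_sent  # 0.045 -> 4.5%
    action_rate = actions / emails_sent        # 0.012 -> 1.2%

    print(f"Open rate: {open_rate:.1%}")
    print(f"Click-through rate: {click_through_rate:.1%}")
    print(f"Action rate: {action_rate:.1%}")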

 

How can I report an issue with the tool?

Use this form to share your experience with the A/B testing tool or to report any issues.

 

Sample Finder

How does the Sample Finder work?

When running an email A/B test, sending to more people allows you to more accurately gauge an email's performance. If you send to 100 people and 25 respond (a 25% response rate), there's a decent chance that if you send to another 100 people, you could get 30 people responding (a 30% response rate). But if you send to 1,000,000 people and 250,000 people respond, it's extremely unlikely that sending to another 1,000,000 would get 300,000 people to respond, even though the response rates are still 25% and 30%, respectively. This difference is due to a principle known as the Law of Large Numbers, which explains that the more data you collect, the closer your results will be to the true performance level.

When comparing the performance of two emails in an A/B test, you want to be able to gauge performance fairly accurately in order to determine which one is actually better than the other. The Sample Finder uses equations from the field of statistics (based on Pearson's chi-squared test) to help you figure out how big an audience you actually need to be confident that one email is better than the other. There are three values you'll need to enter to find your necessary audience size.

  • Expected response rate
    What percentage of your recipients do you expect to respond to these emails? This is an important input for calculating the necessary audience size, because it's easier to detect performance differences for emails with higher response rates. You should reference your past emails to get a sense of what value to use here.
     
  • Expected difference in response rate between emails
    How much do you expect response rate might increase in this test? It's typical to see increases around 15% for subject line testing and increases around 25% for email draft testing. If you're testing a change to your email template or landing page, you might see a difference of only 5%.
     
  • Confidence level of your results
    How sure do you want to be that you'll be able to know which email version is better after the test? Depending on the type of test you're running, you may want to be more or less sure about picking the winning email:

Somewhat sure
Level of confidence is 90%. Good to use for subject line testing on small lists.

Pretty sure
Level of confidence is 95%. Good to use for email draft testing, and subject line testing on large lists.

Very sure
Level of confidence is 99%. Good to use for testing of new best practices.

Plug in the values for your email test to find out your recommended sample size, which is how many people should receive each of your email versions.
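
If you're curious what a calculation like this looks like in code, here's a rough sketch in Python using the statsmodels library. It uses a standard two-proportion power calculation rather than the Sample Finder's own equations, and it has to assume a statistical power level (80% below) that the Sample Finder doesn't ask you for, so its numbers won't necessarily match the tool's.

    # Rough sketch of a sample-size calculation for an email A/B test.
    # Uses a standard two-proportion power calculation from statsmodels,
    # not the Sample Finder's own equations, so treat the result as a ballpark.
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    expected_rate = 0.20   # expected response rate (e.g. a 20% open rate)
    expected_lift = 0.15   # expected difference: a 15% relative increase
    confidence = 0.95      # "Pretty sure"
    power = 0.80           # assumed; the Sample Finder doesn't ask for this

    variant_rate = expected_rate * (1 + expected_lift)
    effect_size = abs(proportion_effectsize(expected_rate, variant_rate))

    n_per_version = NormalIndPower().solve_power(
        effect_size=effect_size,
        alpha=1 - confidence,
        power=power,
        ratio=1.0,
        alternative="two-sided",
    )
    print(f"Recipients needed per email version: {round(n_per_version):,}")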

What kind of assumptions does the Sample Finder make?

When we developed the Sample Finder, we wanted to make it easy for non-statisticians to use. To accomplish this, we used non-mathy labels for fields and limited some choices. Here's how it all breaks down:

  • How sure do you need to be?
    This is our way of asking what your confidence level needs to be. To make things easy we limited the choices to Pretty Sure (95%) and Very Sure (99%). We're essentially asking whether you need to be 99% sure of your results or whether you're okay with being 95% sure.

  • Expected response rate
    This is our way of asking how you expect your list to respond (i.e., what the open, click-through, or action rate will be). You probably want to base this on past experience mixed with your best guess for this campaign.

  • How much of a difference do you expect this test to make?
    This is our way of asking how big a difference the test needs to be able to detect (related to what statisticians call the margin of error).

 

My suggested sample size is impractically big. What should I do?

If your suggested sample size is impractically big, you won't be able to conduct the test you're hoping to do and get statistically significant results. If you were hoping to test the action rate, consider testing the click-through rate instead. (Your click-through rate is likely bigger, so the test will work with smaller samples.) If the suggested sample size for your expected click-through rate is still too big, consider testing the open rate. It's definitely much better to test the click-through rate or action rate if you can, but testing open rates can still be valuable.

 

Running the Test

How do I run an email test with my email tool?

You'll need to turn to your tool provider for specifics on how to conduct an A/B test. Here are guides for some of the most common tools:

 

Significance Inspector

How does the Significance Inspector work?

The Significance Inspector takes the data that you enter and plugs the numbers into a statistical equation known as Pearson's chi-squared test. This test calculates how likely it would be to see results like yours if the two emails you're comparing actually had the same performance. That likelihood is expressed through a number called the p-value.

When the p-value is high, your results are consistent with the two emails performing the same, so you can't be confident that one is better than the other. When the p-value is low, though, you can be fairly sure that the email with the higher response rate actually is better than the other one. Lower p-values correspond to higher levels of confidence for identifying the better email:

P-value is 0.1 or less
Level of confidence is 90%. You'll be somewhat sure of your results.

P-value is 0.05 or less
Level of confidence is 95%. You'll be pretty sure of your results.

P-value is 0.01 or less
Level of confidence is 99%. You'll be very sure of your results.

When your p-value is low enough to meet your desired level of confidence, your results are considered statistically significant. This means you can safely proceed with the assumption that the higher-performing email is the better one.
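
For the curious, here's a rough sketch of this kind of check in Python, using SciPy's chi-squared test on a 2x2 table of responses and non-responses. The numbers are made up, and the Significance Inspector's own implementation may differ in details (for example, whether it applies a continuity correction), so treat this as an illustration rather than a replica of the tool.

    # Rough sketch of a chi-squared significance check for an email A/B test.
    # Made-up numbers; the Significance Inspector's own calculation may differ.
    from scipy.stats import chi2_contingency

    sent_a, responses_a = 5_000, 240   # Group A: emails sent, responses
    sent_b, responses_b = 5_000, 300   # Group B: emails sent, responses

    table = [
        [responses_a, sent_a - responses_a],
        [responses_b, sent_b - responses_b],
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"p-value: {p_value:.3f}")

    # Compare the p-value to your chosen confidence level:
    #   0.10 or less -> somewhat sure (90%)
    #   0.05 or less -> pretty sure (95%)
    #   0.01 or less -> very sure (99%)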

What kind of assumptions does the Significance Inspector make?

  • Group A & Group B
    These are your two testing groups.

  • Number of responses
    This is the number of opens, clicks, or actions (whichever you were testing).

  • Number of emails sent
    This is the number of emails sent to each testing group.

  • How sure do you need to be?
    This is our way of asking what your confidence level is. To keep things simple we limited the choices to Somewhat Sure (90%), Pretty Sure (95%), and Very Sure (99%).

 

My results weren’t statistically significant. Why not?

If your results weren't statistically significant, there could be a number of reasons for this:

  • Your testing groups were too small.
    It's possible that your testing groups were just too small, given how close your results were. Next time consider using larger testing groups. If you didn't use the Sample Finder to determine your testing group sizes, give it a shot. That could shed some light on the problem (see the sketch after this list for a quick illustration of how group size changes the result).

  • The results were just too close to call.
    It's possible that your results were just too close to call, especially if you needed to be "Very Sure" of your results (i.e., 99% confident). If you think you could live with only being "Pretty Sure" (i.e., 95% confident), try making that change and checking your results again. If your results still aren't statistically significant and you've already ruled out the sample size issue described in the previous bullet, you might want to run your test again.
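
As a quick illustration of the first point, here's a small Python sketch (with made-up numbers) showing that the same pair of response rates can fail to reach significance with small testing groups but pass comfortably with larger ones:

    # Same 4% vs. 5% response rates, tested with small and large groups.
    # Made-up numbers, using SciPy's chi-squared test as in the earlier sketch.
    from scipy.stats import chi2_contingency

    def p_value(sent_a, resp_a, sent_b, resp_b):
        table = [[resp_a, sent_a - resp_a], [resp_b, sent_b - resp_b]]
        return chi2_contingency(table)[1]

    # 1,000 recipients per group: the p-value is far above 0.05, so the
    # difference isn't statistically significant.
    print(p_value(1_000, 40, 1_000, 50))

    # 10,000 recipients per group: the p-value drops below 0.01, so the same
    # rates are significant even at the "Very Sure" (99%) level.
    print(p_value(10_000, 400, 10_000, 500))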