# Do It Yourself A/B Testing

I always start marketing interviews with a phone screen built around some variant of the following question:

“Let’s say this is your first day at Urbanspoon and I show you the following data. We’ve just launched an A/B test that I’d like you to evaluate. [The example can be almost anything you want to test – a search element, PPC campaigns, email subject lines, etc. In this case, I’m using a PPC example.] Imagine you are running two different ads on a campaign with 50 keywords. We’ve been running Ad A for a while and it has 17,235 impressions and 272 clicks. I started running Ad B last week, and that has received 41 clicks on 2,253 impressions. What would you do?”

I’m looking for an answer that goes beyond demonstration of pre-algebra skills and rudimentary familiarity with a calculator.

Obvious answers include splitting the 50 keywords into different groups, looking downstream for differences in conversion rates, and technical answers around Quality Score. But what I’m really looking for is a theoretical understanding of statistics and the interplay between sample sizes, variability and confidence intervals.

Answers to the above theoretical question usually fall into one of three buckets:

1. I’d run more of Ad B so our impressions are equal and then compare the click-through rates.  #FAIL
2. I’d run the ads longer; you need at least three weeks of data to make a decision.  #FAIL
3. Ad B is better because the click-through rate is higher.  #FAIL “and thanks for taking the time to talk with me, our HR department will be in touch . . .”

Turns out, you don’t need an equal number of impressions or a set amount of time to run this analysis. It’s actually a fairly simple concept that can then be mathematically defined:

The greater the difference between your A and B samples (drawn randomly from the same pool), the smaller your test needs to be in order to confidently assert that one performs better than the other. For example, if we wanted to test whether men were taller than women, and we measured 100 men and 100 women, and the men averaged 7 feet tall while the women averaged 4 feet tall, you’d be fairly confident saying that men are taller than women.

Conversely, if the difference was 3 inches instead of 3 feet, you’d probably want to measure more men and women before confidently asserting men are taller than women.

In fact, it’s possible that your sample was misleading – as a population, women are really taller than men, but your sample didn’t bear that out. This level of confidence can be mathematically expressed as a percentage – I’m 95% certain that A is better than B. (Meaning there is a 5% chance – 1 out of every 20 tests – that you’ll unwittingly pick the underperformer.) The greater the level of confidence you want, the larger the sample size you need.
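To make the height example concrete, here’s a minimal Python sketch of the underlying idea, using a one-sided normal approximation. The 3-inch standard deviation is a hypothetical value chosen for illustration, not something from the article:

```python
from statistics import NormalDist

def confidence_mean_diff(diff, sd, n_per_group):
    """Confidence that group A's true mean exceeds group B's, given an
    observed difference `diff`, a common standard deviation `sd`, and
    `n_per_group` measurements in each group (one-sided, normal approx.)."""
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the difference
    return NormalDist().cdf(diff / se)

# Hypothetical: men measure 3 inches taller on average, SD of 3 inches.
print(confidence_mean_diff(3, 3, 5))    # 5 people per group: ~94% confident
print(confidence_mean_diff(3, 3, 100))  # 100 per group: ~100% confident
```

Same observed 3-inch gap, but the larger sample turns “probably” into near-certainty – exactly the sample-size/confidence trade-off described above.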

All of this can be calculated with innumerable free online tools. Larger, sophisticated systems like AdWords and big ESPs build this statistical testing into their methodology – but it’s easy for do-it-yourselfers too.

I like a tool called AB Tester, which allows you to measure up to three alternatives compared to a benchmark:

In the results above, I’ve done the analysis for our question . . . The “Confidence” column tells me there’s a 79.19% chance that B is better than our control A.

Watch how this Confidence grows when we add a zero to each column – keeping the CTR the same but increasing the sample size:

By increasing the size of the test tenfold, there’s now only a 0.5% chance that A is really better than B.
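You can reproduce both of these figures without an online tool. A minimal Python sketch of a one-sided two-proportion z-test (the unpooled form, which approximately matches the ~79% figure above; AB Tester’s exact internals are an assumption here):

```python
from statistics import NormalDist

def confidence_b_beats_a(clicks_a, impr_a, clicks_b, impr_b):
    """Confidence that B's true click-through rate exceeds A's,
    via an unpooled, one-sided two-proportion z-test."""
    pa, pb = clicks_a / impr_a, clicks_b / impr_b
    se = (pa * (1 - pa) / impr_a + pb * (1 - pb) / impr_b) ** 0.5
    return NormalDist().cdf((pb - pa) / se)

# The interview data: Ad A = 272 clicks / 17,235 impressions,
#                     Ad B =  41 clicks /  2,253 impressions.
print(confidence_b_beats_a(272, 17235, 41, 2253))      # ~0.79

# Same CTRs with a zero added to every count.
print(confidence_b_beats_a(2720, 172350, 410, 22530))  # ~0.995
```

The click-through rates haven’t changed at all between the two calls; only the sample size has, and confidence rises from roughly 79% to roughly 99.5%.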

Let’s go from theoretical to real. Here are results from an email test we did for our Hawaiian getaway promotion to Ludobites 9. (It’s over now, sorry.)

The first data column is sends, then delivered, then opens, then clicks. Assume we want to test three different content types to three different cities (now admittedly this is not a random sample – maybe people in San Francisco respond differently to content . . . )

Take the data from the 3/6 send and plug it into A/B Tester. Note I’m comparing the CTR from opened emails to isolate content as the driver of click-through rate. Also note that while the sample sizes are similar, they don’t have to be the same.

My best performer here is the San Francisco content at a 5.5% CTR. I use that as a control and plug the other two into AB Tester:

This tells me there’s a 3.3% likelihood that the LA content might really outperform the winner (San Francisco). Additionally, there’s a 23.5% chance Seattle content is better than our “winner”. More testing necessary . . . .
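The same one-sided test extends naturally to several challengers against one control. The raw opens/clicks behind the screenshot aren’t reproduced in the text, so the counts below are hypothetical, chosen only to roughly echo the percentages quoted above:

```python
from statistics import NormalDist

def confidence_beats_control(clicks_c, n_c, clicks_v, n_v):
    """Confidence that a variant's true CTR exceeds the control's
    (unpooled, one-sided two-proportion z-test)."""
    pc, pv = clicks_c / n_c, clicks_v / n_v
    se = (pc * (1 - pc) / n_c + pv * (1 - pv) / n_v) ** 0.5
    return NormalDist().cdf((pv - pc) / se)

control = ("San Francisco", 55, 1000)   # hypothetical: 5.5% CTR on opens
variants = [("Los Angeles", 38, 1000),  # hypothetical counts
            ("Seattle", 48, 1000)]

_, cc, cn = control
for name, vc, vn in variants:
    print(f"{name}: {confidence_beats_control(cc, cn, vc, vn):.1%}")
```

With these stand-in numbers, Los Angeles comes out in the low single digits and Seattle in the low twenties – neither anywhere near the confidence you’d want before declaring a winner, which is the “more testing necessary” conclusion above.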

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Conrad is the founder of Atticus Marketing – a search agency dedicated exclusively to the legal profession. Prior to Atticus, Conrad ran marketing for Urbanspoon and the legal directory Avvo, which rose from concept to market leader under his watch.

## Comments

Good post!

• http://profile.yahoo.com/PEJPHQSQUGB2UV4XB4L2SJUKSE MikeV

so not to be mean, but other than the typo’s in this article, what was the answer you were looking for in the interview? Obviously, no one is going to go through that much of an in-depth answer and calculations in their head in an interiview

• logan9497

Conrad, this is a good post, but I have two questions:

1 – same as MikeV, what is the correct answer.
2 – How would this apply for A/B testing for organic search? Could the A/B tester be applied?

• http://www.alexanderlund.com Alexander Lund

#Fail Grammar everywhere

Important post.
Do you know of a similar tool for multivariate testing? (Say, for assessing the response rates of a combination of 2 experiments run on the same landing page)

2 way anova would do it

• http://profile.yahoo.com/FYXR4GTHUVVYSDX5AVLBVTAUVI EricM

Ok, are you going to let them go online and use this tool at the interview? Horrible.  Choosing B based on CTR still turns out to be the right answer.  #WIN

Mike and Logan – Almost everything you do with search can be A/B tested . . . as long as you have a control with which you can contrast any changes.  Any of the following scenarios apply:  you’ve done a link building campaign, or changed title tags for a portion of your site, altered your footer links (ahh -see if that one works), or maybe added something that increases page weight. Simply look at your overall search volume for a term (Google Keyword research tool or whatever your favorite flavor may be) and then review a specific set of relevant entry terms or pages for changes in inbound traffic.

To be more concrete: Let’s say I want to change title tags on part of Urbanspoon to include a city name – and for my “B” test I change those title tags for restaurants in Los Angeles. I look at expected searches for those pages (your denominator) – in our case, phrase match on the restaurant name and the restaurant name + Los Angeles, for example – and then compare inbound search traffic (your numerator). Your comparison can be chronological (i.e. a before and after) or against a control set on your site that you didn’t change. I’d recommend both.
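This before/after comparison is the same proportion test as the ad example, with expected searches as the denominator and inbound visits as the numerator. A minimal sketch, with entirely hypothetical traffic numbers:

```python
from statistics import NormalDist

def confidence_improved(visits_before, searches_before,
                        visits_after, searches_after):
    """Confidence that the after-change capture rate (inbound visits per
    expected search) beats the before-change rate, via an unpooled,
    one-sided two-proportion z-test."""
    p1 = visits_before / searches_before
    p2 = visits_after / searches_after
    se = (p1 * (1 - p1) / searches_before
          + p2 * (1 - p2) / searches_after) ** 0.5
    return NormalDist().cdf((p2 - p1) / se)

# Hypothetical: 540 visits on 18,000 expected searches before the
# title-tag change; 610 visits on 17,500 expected searches after.
print(confidence_improved(540, 18000, 610, 17500))
```

Note the two periods don’t need identical search volume – the rates, not the raw counts, are compared – though the time-period caveats raised elsewhere in this thread still apply.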

As for the answer to the interview question – as I mentioned in the article, I’m looking for the theoretical understanding of the interplay between sample size, variability and confidence levels.  If they can tactically tell me how to get there, that’s even better.

Hope this helps – oh and MikeV – “typo’s” is possessive, “typos” is plural.  A little bit of irony for my Thursday morning.  :)

• ABTesterNewbie

There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period). I don’t believe that this makes for a statistically valid test, so your confidence and z-scores may not be meaningful.

Could you please elaborate on this? How do you statistically account for this timeframe difference?

Newbie: “Take the data from the 3/6 send and plug it in to A/B Tester”. I wouldn’t be concerned about the five-minute time difference between 11:16 and 11:21 am.

• ABTesterNewbie

Sorry, should have made it clearer. I was referring to your interview question: “Imagine you are running two different ads on a campaign with 50 keywords. We’ve been running Ad A for a while and have 17,235 impressions and 272 clicks. I started running Ad B last week and that has received 41 clicks on 2,253 impressions.”

Sounds like the time difference here is quite significant (“for a while” vs. “last week”), not minutes. Am I missing something?

Got it.  (I’m glad you weren’t agonizing over a 5 minute interval.)  So this is where some common sense and first hand knowledge of your business comes in.  In general, I can’t imagine a huge difference over time in PPC performance.  But take Urbanspoon for example . . . search volume changes significantly on Friday and Saturdays (and before Valentine’s Day).  So, overall, your point is fair.

But since the OP brought up confidence levels and z-scores, I think his name isn’t fair.  :-)

• Mark Aitkin

This is why it is very important to run the two campaigns at the same point in time. Your sample sizes can vary – the confidence interval narrows with the standard error, which in turn depends on sample size – but the points in time should be the same. This will minimise the effect of confounding variables on the conversions.

• Александр Ироничен

basically it’s about counting the probabilities using a specified tool…too many words for simple idea

I agree with Mark.
Ignoring the time may result in wrong decision making.

• ChristopherSkyi

The question is straightforward for marketers who took, passed, and remember their basic statistics course, but those marketers probably sit 2 std. dev. above the mean (i.e., not too many! :).

For those who don’t have the background, it seems to me, this is not a fair question.

I have the background, and I could easily throw back at you: yeah, but demonstrate to me that your control and treatment populations are really normally distributed – if they are not, your estimates are biased (over or under) as a function of the unknown non-normal distributions. But that’s probably going over your head right now, yet it’s a very real possibility. In medical testing, you have to know the underlying distributions and make corrections to your test in the case of non-normality. In fact, the chances are the two populations are NOT normal, and you’d better hope their deviations are not so large that the estimates are significantly off.

The truth is, there’s way WAY more to this whole issue than just normalizing the data into z-scores and running a 1-tailed test. Most of the time you can get away with it, but if the stakes are high (i.e., if the cost of being wrong is high), you’d better know what your underlying distributions are. There are statistical tests and procedures to handle this, but at that point, you’ll need to hire an expert in statistics.
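For small samples where the normal approximation in a z-test is shaky, one standard escape hatch is an exact test, which makes no distributional assumption at all. A minimal sketch of a one-sided Fisher’s exact test in pure Python, on hypothetical 10-vs-10 conversion counts:

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table
        [[a, b],   # variant:  successes, failures
         [c, d]]   # control:  successes, failures
    Returns P(variant successes >= a) under the null of no difference,
    summing the hypergeometric tail exactly."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    k_max = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1)) / denom

# Hypothetical: variant converts 8/10 visitors, control 4/10.
print(fisher_exact_greater(8, 2, 4, 6))  # ~0.085: not significant at 5%
```

Despite an 80% vs. 40% observed conversion rate, ten observations per arm can’t clear the conventional 5% bar – which is the commenter’s point about sample size and distributional care in miniature.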

• ChristopherSkyi

I think Google Website Optimizer has a built-in/automatic estimators . . .

• ChristopherSkyi

“There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period).”

Definitely that’s a problem, a potentially big one. Ideally you’d never want to let time be an uncontrolled variable if you can help it. There’s no way, statistically, to control for that after the experiment. It’s a design flaw in the experiment, and it is a potential killer in terms of valid conclusions.

The only thing you could do would be to somehow come up with an independent estimate of the effects of those two time periods and then adjust all the original data by that factor. Good luck successfully doing that. It’s a case of closing the barn door after the horse gets out. Better – essential, in fact – to control critical variables up front, right off the bat.
