Do-It-Yourself A/B Testing

I always start marketing interviews with a phone screen that includes some variant of the following question:

“Let’s say this is your first day at Urbanspoon and I show you the following data. We’ve just launched an A/B test that I’d like you to evaluate. [The example can be almost anything you want to test – almost any search element, PPC campaigns, email subject lines, etc. In this case, I’m using a PPC example.] Imagine you are running two different ads on a campaign with 50 keywords. We’ve been running Ad A for a while and have 17,235 impressions and 272 clicks. I started running Ad B last week and that has received 41 clicks on 2,253 impressions. What would you do?”

I’m looking for an answer that goes beyond a demonstration of pre-algebra skills and rudimentary familiarity with a calculator.

Obvious answers include splitting the 50 keywords into different groups, looking downstream for differences in conversion rates, and technical answers around quality score. But what I’m really looking for is a theoretical understanding of statistics and the interplay between sample sizes, variability and confidence intervals.

Answers to the above theoretical question usually fall into one of three buckets:

  1. I’d run more of Ad B so our impressions are equal and then compare the click-through rates.  #FAIL
  2. I’d run the ads longer; you need at least three weeks of data to make a decision.  #FAIL
  3. Ad B is better because the click-through rate is higher.  #FAIL “and thanks for taking the time to talk with me, our HR department will be in touch . . .”

Turns out, you don’t need an equal number of impressions or a set amount of time to run this analysis. It’s actually a fairly simple concept that can then be mathematically defined:

The greater the difference between your A and B samples (drawn randomly from the same pool), the smaller your test needs to be in order to confidently assert that one performs better than the other. For example, say we wanted to test whether men are taller than women, so we measured 100 men and 100 women. If the men averaged 7 feet tall and the women averaged 4 feet, you’d be fairly confident saying that men are taller than women.

Conversely, if the difference was 3 inches instead of 3 feet, you’d probably want to measure more men and women before confidently asserting men are taller than women.

In fact, it’s possible that your sample was misleading – perhaps, as a population, women are really taller than men, but your sample didn’t bear that out. This level of confidence can be mathematically expressed as a percentage – I’m 95% certain that A is better than B. (Meaning there is a 5% chance, or 1 out of every 20 times, that you’ll unwittingly pick the underperformer.) The greater the level of confidence you want, the larger the sample size you need.

All of this can be calculated with innumerable free online tools. Larger, sophisticated systems like AdWords and the big ESPs build this statistical testing into their methodology – but it’s easy for do-it-yourselfers too.
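
If you want to skip the online tools entirely, the calculation is easy to script yourself. Here’s a minimal Python sketch (my own illustration of the standard normal approximation for comparing two proportions, not any particular tool’s code) that estimates the probability that Ad B’s true click-through rate beats Ad A’s, using the interview numbers:

```python
from math import sqrt, erf

def confidence_b_beats_a(impressions_a, clicks_a, impressions_b, clicks_b):
    """One-tailed probability, under a normal approximation, that B's true
    click-through rate is higher than A's, given the observed counts."""
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b
    # Standard error of the difference between the two observed proportions
    se = sqrt(ctr_a * (1 - ctr_a) / impressions_a +
              ctr_b * (1 - ctr_b) / impressions_b)
    z = (ctr_b - ctr_a) / se
    # Normal CDF expressed via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# The interview data: Ad A (17,235 impressions, 272 clicks) vs. Ad B (2,253 / 41)
print(confidence_b_beats_a(17235, 272, 2253, 41))  # ~0.79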

I like a tool called AB Tester, which allows you to measure up to three alternatives compared to a benchmark:

In the results above, I’ve done the analysis for our question . . . The “Confidence” column tells me there’s a 79.19% chance that B is better than our control A.

Watch how this Confidence grows when we add a zero to each column – keeping the CTR the same but increasing the sample size:

By increasing the size of the test tenfold, there’s now only a 0.5% chance that A is really better than B.
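
The DIY sketch shows the same effect: holding the CTRs constant while multiplying both samples by ten shrinks the standard error by a factor of the square root of ten, so the z-score grows and the leftover chance that A is really better collapses. (The helper function below simply repeats the earlier sketch so this snippet runs on its own.)

```python
from math import sqrt, erf

def confidence_b_beats_a(impressions_a, clicks_a, impressions_b, clicks_b):
    ctr_a, ctr_b = clicks_a / impressions_a, clicks_b / impressions_b
    se = sqrt(ctr_a * (1 - ctr_a) / impressions_a +
              ctr_b * (1 - ctr_b) / impressions_b)
    return 0.5 * (1 + erf((ctr_b - ctr_a) / se / sqrt(2)))

# Chance that A is actually the better ad, before and after a 10x sample
print(1 - confidence_b_beats_a(17235, 272, 2253, 41))      # ~0.21
print(1 - confidence_b_beats_a(172350, 2720, 22530, 410))  # ~0.005
```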

Let’s go from theoretical to real. Here are results from an email test we did for our Hawaiian getaway promotion to Ludobites 9. (It’s over now, sorry.)

The first data column is sends, then delivered, then opens, then clicks. Assume we want to test three different content types across three different cities. (Admittedly, this is not a random sample – maybe people in San Francisco respond differently to content . . . )

Take the data from the 3/6 send and plug it into A/B Tester. Note I’m comparing the CTR from opened emails to isolate content as the driver of click-through rate. Also note that while the sample sizes are similar, they don’t have to be the same.

My best performer here is the San Francisco content at a 5.5% CTR. I use that as a control and plug the other two into AB Tester:

This tells me there’s a 3.3% likelihood that the LA content really outperforms the winner (San Francisco). Additionally, there’s a 23.5% chance the Seattle content is better than our “winner.” More testing necessary . . .
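
The same DIY math applies here: treat the best performer as the control and compute, for each challenger, the probability that its true CTR-of-opens is higher. The per-city counts below are placeholders I’ve made up for illustration, since the real opens and clicks live in the screenshot above; swap in your own numbers.

```python
from math import sqrt, erf

def confidence_challenger_beats_control(opens_ctrl, clicks_ctrl, opens_chal, clicks_chal):
    """Normal-approximation probability that the challenger's true CTR-of-opens
    exceeds the control's."""
    p_c, p_t = clicks_ctrl / opens_ctrl, clicks_chal / opens_chal
    se = sqrt(p_c * (1 - p_c) / opens_ctrl + p_t * (1 - p_t) / opens_chal)
    return 0.5 * (1 + erf((p_t - p_c) / (se * sqrt(2))))

# Hypothetical per-city counts -- stand-ins for the screenshot data
sf_opens, sf_clicks = 1100, 60   # control (~5.5% CTR of opens)
la_opens, la_clicks = 1050, 40   # challenger 1
sea_opens, sea_clicks = 980, 48  # challenger 2

print(confidence_challenger_beats_control(sf_opens, sf_clicks, la_opens, la_clicks))
print(confidence_challenger_beats_control(sf_opens, sf_clicks, sea_opens, sea_clicks))
```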

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Conrad Saam is the founder of Atticus Marketing - a search agency dedicated exclusively to the legal profession. Prior to Atticus, Conrad ran marketing for Urbanspoon and the legal directory Avvo, which rose from concept to market leader under his watch.

  • http://www.facebook.com/jamesjhuman James Hu

    Good post!

  • http://profile.yahoo.com/PEJPHQSQUGB2UV4XB4L2SJUKSE MikeV

    so not to be mean, but other than the typo’s in this article, what was the answer you were looking for in the interview? Obviously, no one is going to go through that much of an in-depth answer and calculations in their head in an interiview

  • logan9497

    Conrad, this is a good post, but I have two questions:

    1 – same as MikeV, what is the correct answer.
    2 – How would this apply for A/B testing for organic search? Could the A/B tester be applied?

  • Pat Grady

    answer: delete B. dissect B for concepts (like ‘free shipping’), add to my ‘likely bad’ concepts list, marked down as x1 with date.  dissect A, add it concepts to ‘likely good’, count as x1, with date.  proven good, proven bad lists stay empty until x count rises.  write more ads with variants from ‘likely good’ theme list, recycle a ‘likely bad’ now and then, and add variants from ‘new ideas’ list.  rinse, lather, repeat – but slow iteration cycle as optimization curve flattens (and work on finding and overcoming bottlenecks elsewhere).  consider recycling proven concepts into tests for other campaigns and ad groups.  make test data and concepts lists available to SEO, PR and Branding team via Dropbox, use descriptive folder name so they can mine as needed.  don’t ask VP or engineering for their next ‘ad B’ idea to test, dive into Analytics and looks busy whenever they walk by.  in coffee room, do share brief data summary with smart intern who writes the actual blog posts for VP, mention that interviewer suggested we share our latest “win”.  at next monthly meeting that somehow could not be avoided (turns out my leg is not broken, just sprained), sing praises of interviewer’s critical insight when asked about KPIs used to reach new performance plateau.  deposit nice slice of bonus into self-directed IRA.  take cert exam on weekend, network via out-reach that is not a request for help, look for opening to escape working for the man.

  • http://www.alexanderlund.com Alexander Lund

    #Fail Grammar everywhere

  • http://www.facebook.com/eyal.josch Eyal Purvin-Josch

    Important post.
    Do you know of a similar tool for multivariate testing? (Say, for assessing the response rates of a combination of 2 experiments run on the same landing page)

  • http://www.facebook.com/people/Steve-Jones/564432613 Steve Jones

    2 way anova would do it

  • http://profile.yahoo.com/FYXR4GTHUVVYSDX5AVLBVTAUVI EricM

    Ok, are you going to let them go online and use this tool at the interview? Horrible.  Choosing B based on CTR still turns out to be the right answer.  #WIN

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

    Mike and Logan – Almost everything you do with search can be A/B tested . . . as long as you have a control with which you can contrast any changes.  Any of the following scenarios apply:  you’ve done a link building campaign, or changed title tags for a portion of your site, altered your footer links (ahh -see if that one works), or maybe added something that increases page weight. Simply look at your overall search volume for a term (Google Keyword research tool or whatever your favorite flavor may be) and then review a specific set of relevant entry terms or pages for changes in inbound traffic.  

To be more concrete:  Let’s say I want to change title tags on part of Urbanspoon to include a city name – and for my “B” test I change those title tags for restaurants in Los Angeles.  I look at expected searches for those pages (your denominator) (in our case phrase match on the restaurant name and the restaurant name + Los Angeles, for example) and then compare inbound search traffic (your numerator).  Your comparative can be chronological (i.e. a before and after) or against a control set on your site that you didn’t change.  I’d recommend both.

    As for the answer to the interview question – as I mentioned in the article, I’m looking for the theoretical understanding of the interplay between sample size, variability and confidence levels.  If they can tactically tell me how to get there, that’s even better.  

    Hope this helps – oh and MikeV – “typo’s” is possessive, “typos” is plural.  A little bit of irony for my Thursday morning.  :)  

    -Conrad    

  • ABTesterNewbie

    There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period). I don’t believe that this makes for a statistically valid test, so your confidence and z-scores may not be meaningful.

    Could you please elaborate on this? How do you statistically account for this timeframe difference?

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

Newbie:  “Take the data from the 3/6 send and plug it into A/B Tester”.  I wouldn’t be concerned about the five-minute time difference between 11:16 and 11:21 am.

    -Conrad

  • ABTesterNewbie

Sorry, should have made it clearer. I was referring to your interview question:  “Imagine you are running two different ads on a campaign with 50 keywords. We’ve been running Ad A for a while and have 17,235 impressions and 272 clicks. I started running Ad B last week and that has received 41 clicks on 2,253 impressions.”

    Sounds like the time difference here is quite significant (“for a while” vs. “last week”), not minutes. Am I missing something?

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

    Got it.  (I’m glad you weren’t agonizing over a 5 minute interval.)  So this is where some common sense and first hand knowledge of your business comes in.  In general, I can’t imagine a huge difference over time in PPC performance.  But take Urbanspoon for example . . . search volume changes significantly on Friday and Saturdays (and before Valentine’s Day).  So, overall, your point is fair.

    -Conrad

  • Pat Grady

    but since the OP brought up confidence level and z-scores, i think his name isn’t fair.  :-)

  • Mark Aitkin

    This is why it is very important to run the two campaigns at the same point in time. While your sample sizes can vary as the confidence interval reduces by a factor of your standard error which in part involves calculation with the sample size; the points in time should be the same. This will minimise the effect of confounding variables on the conversions.

  • Александр Ироничен

    basically it’s about counting the probabilities using a specified tool…too many words for simple idea

  • http://www.facebook.com/eyal.josch Eyal Purvin-Josch

    I agree with Mark.
    Ignoring the time may result in wrong decision making. 
    See Simpson’s paradox:
    http://lesswrong.com/lw/3q3/simpsons_paradox/

  • ChristopherSkyi

    The question is straightforward for marketers who took, passed, and remember their basic statistic course, but you could probably put these marketers into the 2 std. dev. above the mean (i.e., not too many! :). 

    For those who don’t have the background, it seems to me, this is not a fair question. 

    I have the background, and I could easily throw back at you, yeah, but demonstrate to me that your control and treatment populations are really normally distributed — if they are not, your  estimates are biased (over or under ) as a function unknown non-normal distributions. But, that’s probably going over your head right now, yet that’s a very real possibility.  In medical testing, you have to know the underlying distributions and make corrections to your test in the case of non-normality. In fact, the chances are the two populations are NOT normal and you better hope their deviations are not so much that the estimates are significantly off. 

    The truth is, there’s way WAY more to this whole issue then just normalizing the data into z scores and running a 1-tailed test.  Most of the time you can get away with it, but if the stakes are high (i.e., if the cost of being wrong is high), you better know what your underlying distributions are. There are statistical tests and procedures to handle this, but at that point, you’ll need to hire an expert in statistics. 

  • ChristopherSkyi

    I think Google Website Optimizer has a built-in/automatic estimators . . .

  • ChristopherSkyi

    “There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period).”

    Definitely that’s a problem, a potentially big one. Ideally you’d never want to let time be an uncontrolled variable if you can help it. There’s no way, statistically, to control for that after the experiment. It’s a design flaw in the experiment, and it is a potential killer in terms of valid conclusions.  

    The only thing you could do would be to somehow come up w/an independent estimate of the effects of those two time periods and then modify all the original data by that factor.  Good luck successfully doing that. Its a case of closing the barn door after the horse gets out. Better, essential in fact, to control critical variables up front, right off the bat.

 
