Pitfalls Of A/B Ad Testing, Part 3

Over the past two months in this column, I've discussed some of the pitfalls of A/B ad testing. In this third and final installment, I'll discuss a new PPC ad optimization model I've been working on, which I've lovingly titled the Van Wagner Ad Sets Optimization Model.

The model is completely new and thoroughly untested, but early feedback from readers and colleagues provides strong anecdotal support that it can become an important asset in any PPC campaign manager's toolkit. I'll be presenting the Ad Sets Optimization Model more formally at SMX Advanced in Seattle next month, but will give you a general overview here today.

Before jumping into the Ad Sets Optimization Model, there's some unfinished business from the previous two columns, Pitfalls of A/B Split Testing, Part 1 and Part 2. Last month, I offered the incentive of a chance at a lobster dinner to readers who provided feedback and critique on this ad testing discussion. My grateful thanks to all who chimed in, and congratulations to the Search Engine Land reader whose handle is MMantlo. Please email me and I will arrange to have that lobster dinner delivered to your door, courtesy of Lobster.com.

Ad Sets Optimization Model

The Van Wagner Ad Sets Optimization Model is based on the premise that an ad group containing a set of well-performing ads can outperform an ad group that contains only the single best ad in that ad group.

Some readers and colleagues have confirmed that after completing rounds of A/B testing and settling on one champion ad, they’ve seen unexplained declines in ad group CTR, clicks, and conversions, even after addressing the most common cause of this phenomenon, keyword match types bringing in unfocused search traffic.

How can a set of ads in an ad group outperform the best ad in that ad group?

Or, as one colleague posed the question, “If none of the ads in the set outperforms the champion individually, how can the entire set? This strikes us as analogous to claiming that ten fast men together can be faster than the fastest man in the world.”

Yes, the idea is counterintuitive, but only if you narrowly focus on the problem of finding the best-performing ad rather than on optimizing your ad group. Instead of the fastest-runner analogy, a better analogy may be the Tour de France bike race, where a peloton of ten good riders working together can beat the fastest single rider.

So how can sub-optimal ads work together with the best ad to increase the yield of the ad group they belong to? The simple answer is that they enable the ad group to connect with a wider range of audience needs and desires than a single ad.

To demonstrate how this works, let’s take an example of an ad group with a single exact match keyword, “blue widgets,” and two finalist ads from our A/B testing.

Here are the ad headlines and performance metrics. Note that while I am mentioning clicks and click-through rates here, the analysis works the same way using conversions and conversion rates.

Ad A: [Save 20% on Blue Widgets] 350 clicks / 5,000 impressions = 7.0% CTR
Ad B: [ECO Friendly Blue Widgets] 450 clicks / 4,000 impressions = 11.3% CTR

According to Verster, the sample is statistically significant, and it's pretty much a slam dunk: you can be 99% confident that ad B will continue to beat ad A.
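If you want to check that kind of claim yourself, here is a minimal sketch using a standard pooled two-proportion z-test on the numbers above. This is generic statistics, not necessarily the calculation Verster performs:

```python
import math

def confidence_b_beats_a(clicks_a, imps_a, clicks_b, imps_b):
    """One-sided confidence that ad B's true CTR exceeds ad A's,
    using a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = clicks_a / imps_a, clicks_b / imps_b
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (p_b - p_a) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF of z

conf = confidence_b_beats_a(350, 5000, 450, 4000)
# conf comes out far above 0.99: B's lead is well outside sampling noise
```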

But wait, before you declare a winner, think about the audience population your ads are tailored to. In this case, the two ads touch on very different consumer desires. Ad A is designed to appeal to bargain hunters, and ad B is meant to appeal to people interested in green lifestyles. These may be almost mutually exclusive audiences. To cheapskates, eco-friendly generally doesn’t mean cheap, and vice-versa for eco-consumers.

If these ads appeal to non-overlapping audience segments, what happens when you take ad A offline? You lose the entire audience for ad A, which in this case represents more than half of your target audience. That would be a very bad decision!

Instead of making a decision that bisects your audience, consider running the two ads as a set, with the campaign set to even ad rotation. The basic tenet of the Ad Sets Optimization Model is that when more of your good ads are seen by more of your target audience, your ad group's yield will improve, even if some ads are better than others.

The Ad Sets Optimization Model relies on two things. First, people search on a given search term more than once, anywhere from 2 to 20 times, on their way to a decision. Second, search engines will rotate ads in a way that earns them the most revenue.

With two ads in your ad set and campaigns set to even rotation, you can use coin-toss probability to describe the likelihood that a given one of your ads (say, the one tailored to that particular searcher) has been shown at least once. Note that any single search shows only one of the two ads, so the figures below are for one specific ad being seen:

  • On 1 search, there's a 50% chance they will see it.
  • With 2 searches, the probability rises to 75%.
  • After 3 searches, the probability becomes 87.5%.
  • After 4 searches, the probability reaches 93.75%.
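These figures follow directly from the coin-toss formula: with even rotation a given ad has a 1/2 chance of being the one served on each search, so the chance it has appeared at least once after n searches is 1 - (1/2)^n. A quick sketch:

```python
def p_ad_seen(n_searches: int) -> float:
    """Chance that a given one of two evenly rotated ads has been shown
    at least once after n searches: 1 minus the chance that every single
    search served the other ad."""
    return 1 - 0.5 ** n_searches

for n in range(1, 5):
    print(f"{n} search(es): {p_ad_seen(n):.2%}")
# 1 search(es): 50.00%
# 2 search(es): 75.00%
# 3 search(es): 87.50%
# 4 search(es): 93.75%
```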

As this simple probability table suggests, your ads are very likely to be seen by your two target audiences even after just a few searches, giving you a very good shot at getting the click and the conversion.

On the other hand, if you blindly follow A/B ad split testing to its logical conclusion and take ad A offline, you have a 0% chance of getting a click from your cost-conscious audience.

We don’t know exactly how the engines rotate ads, but I think it is reasonable to assume that a new searcher will probably see ad B first, because it has the higher CTR. The engine may even present the same ad on that user’s second query even if they did not click on it the first time. However, after showing a user ad B twice in a row without getting a click, I’d imagine the search engine would be much more inclined to show ad A to see if that will attract a click from this searcher. If that is indeed how it works, then the coin-flip probabilities shown above tilt even more towards both ads being seen during their search session.
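That guess can be sketched as a toy simulation. The adaptive rule here (start with the higher-CTR ad, switch after two unclicked impressions in a row of the same ad) is purely an assumption, but it illustrates how such a rule would raise the odds that a non-clicking searcher sees both ads:

```python
import random

def both_seen_within(n_searches, adaptive, rng):
    """True if a non-clicking searcher sees both ads within n searches.

    adaptive=False: even rotation, ad A or B with probability 1/2 each.
    adaptive=True: the assumed heuristic above, starting with the
    higher-CTR ad (B) and switching after two unclicked impressions
    in a row of the same ad.
    """
    seen = set()
    current, streak = "B", 0
    for _ in range(n_searches):
        if adaptive:
            if streak == 2:              # two misses in a row: switch ads
                current = "A" if current == "B" else "B"
                streak = 0
            shown = current
            streak += 1
        else:
            shown = rng.choice("AB")
        seen.add(shown)
    return len(seen) == 2

rng = random.Random(0)
trials = 20_000
p_even = sum(both_seen_within(3, False, rng) for _ in range(trials)) / trials
p_adaptive = sum(both_seen_within(3, True, rng) for _ in range(trials)) / trials
# p_even lands near 0.75 (the coin-flip value, 1 - 2 * (1/2)**3), while
# p_adaptive is 1.0: the assumed rule guarantees both ads by the 3rd search
```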

The Van Wagner Ad Sets Optimization Model can be used in conjunction with existing A/B ad testing procedures to produce higher performing ad groups. To learn more about it, come to the Test That Ad! session at SMX Advanced in Seattle on Tuesday, June 8th.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Matt Van Wagner is President and founder of Find Me Faster, a search engine marketing firm based in Nashua, NH. He is a member of SEMNE (Search Engine Marketing New England) and of SEMPO, the Search Engine Marketing Professionals Organization, for which he is a contributing courseware developer at the SEMPO Institute. Matt writes occasionally on internet, search engine and technology topics for iMedia, The NH Business Review and other publications.

Connect with the author via: Email | Twitter | LinkedIn



  • skydive.ny

    Interesting concept. Thinking back to my DOX statistics course: which premise is being violated here, in effect violating the process of randomization? Independence, I believe?

    In other words, the trials are not independent?

    On 1 search, there's a 50% chance they will see it.
    With 2 searches, the probability rises to 75%.
    After 3 searches, the probability becomes 87.5%.
    After 4 searches, the probability reaches 93.75%.


  • Stupidscript

    To put it as non-confrontationally as I can: How can any Google ad test be a valid A/B test when Google modifies ad delivery depending on their projected return?

    Examine your log files and you'll also note that many times your ad is shown as a result of an irrelevant query … and clicked on! There is so much noise in Google clicks that it is almost impossible to trust the A/B ad data Google provides.

    For example, if I get 5 clicks for my “widget coloring” business, I might see queries something like this in my logs:

    1) coloring book
    2) widget coloring
    3) hair coloring
    4) customized widgets
    5) “why bother coloring widgets”

    Of those, how many should actually be considered to be providing valid data for my little A/B test? Right. 3 out of 5 = 60% irrelevant clicks. Useful for the A/B ad test? Not very, unless you're testing for irrelevant clicks. Included in the A/B statistics? Yep, without a whisper.

    All I’m saying is that any system that purports to generate statistical data while at the same time manipulating the activities that produce the data can not be trusted to provide more than the faintest hint of a trend. So in addition to your excellent thesis about why multiple ads perform better than single ads, don’t forget to toss in that *all* statistics gathered from Google should be considered to be preliminary, at best, and not accurate without additional analysis on your part.

  • http://www.findmefaster.com Matt Van Wagner

    Thank you, Jeff. This is a great question, and I'll admit I don't yet have a great answer.

    We don't know what the ad rotation rules are; only Google, Yahoo and Microsoft have this information. However, we do know that quality score influences the rotation, preventing it from being strictly random, and that other economic factors mean the trials (ad impressions) are not truly independent either. So you are 100% correct that there has to be a better way to describe the probability of whether or not an ad will be shown over the course of multiple search sessions. For now, I am assuming randomness is a good starting point for guessing at the probabilities, and that adding in some Bayesian logic will help tune our guesses.
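    To sketch what I mean by that starting point, here's a plain Beta-Binomial update of each ad's CTR. It still assumes independent trials, which ad rotation only approximates, so treat it as a first approximation only:

```python
import random

def posterior_p_b_beats_a(clicks_a, imps_a, clicks_b, imps_b,
                          draws=20_000, seed=0):
    """Monte Carlo estimate of P(true CTR of B > true CTR of A) under
    uniform Beta(1, 1) priors on each ad's CTR."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        ctr_a = rng.betavariate(1 + clicks_a, 1 + imps_a - clicks_a)
        ctr_b = rng.betavariate(1 + clicks_b, 1 + imps_b - clicks_b)
        wins += ctr_b > ctr_a
    return wins / draws

p = posterior_p_b_beats_a(350, 5000, 450, 4000)
# with the numbers from the column, p comes out essentially 1.0
```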

    If you have any alternative ideas/theories/formulas on how to describe this probability, please write back. Would be most grateful for your thoughts.


  • http://www.findmefaster.com Matt Van Wagner

    Yep, Stupidscript, match types definitely pollute A/B test results and make evaluating them perilous. It's dicey even if you use only exact match. Using broad match can render A/B tests useless, as you politely point out.

    Thanks – hope other readers take your comment seriously and start looking at their logs.


  • Andrew Goodman

    Interesting debate.

    Pollution aside, it’s possible to assess the principles somewhat in isolation, without match types overly entering into the equation. One way would be to run an ad group on an exact match or a phrase match without too much skew in the profiles of search queries from one to the other.

    In terms of ad delivery, Google’s choices used to include “rotate evenly” along with the CTR/QS-happy “optimize” setting. Today, they’ve essentially admitted to wresting some testing control from our hands, as that option in the interface now reads “… *more* evenly”. More evenly? Ouch!

    Look, I don’t see that as detracting from the general principle, which is that you can run A/B (I don’t generally advocate these as the most precise way, BTW; I believe more ads is better, MV testing if possible on mature accounts) tests and often find statistically significant winners.

    Long story short, even though he gives us stuff, I don’t know if I’m buying what my friend Matt is selling re: winning ads “degrading” or the superiority of rotating multiple ads that might appeal to different personas. The data seem sketchy to me. We’d ideally show the right ad to the right persona, of course, and arguably, Google is working on that type of thing with Conversion Optimizer.

    In current practice, I don’t believe entirely in this “ad sets” principle, but I do see the logic of the ad set in theory, as it would offer a more fertile, genetically diverse if you will (or multiple-persona-friendly) set of ads to potentially pair up with the “right” searcher. (That might point to a tactic if you were using an extremely smart Conversion Optimizer type of product: don’t manually reduce your ‘ad set’ to one or two winning ads, when the technology would do a better job of pairing up the right ad with the right searcher, if you maintained a sufficiently diverse, larger set of highly effective ads… the one or two ad formula might harm performance. So yes Lobster Man, I can see how the logic works.)

    IMHO, that’s currently mostly pie in the sky and needs to take a back seat to more rigorous and creative testing in general.

    We do know this much: ads do “win and lose” and we need to choose winners and cull losers. The vast majority of folks in the industry have virtually no strategy and no insight into this process, either on the creation or the analysis side. Fortunately for many, it’s not too far off the mark to go with the best-CPA ad creatives when there are major disparities in CPA.

    Through all of this, one thing we’ve learned from experience is (whether you owe it to the “degradation effect,” randomness, or other reasons that could be explained with a slightly more sophisticated version of the classroom statistics we try to hang on this as naive theorists) – you need a lot more volume to confirm statistically significant, real impacts than you think, especially if you’re not controlling for all kinds of volatile variables, or variables as simple as ad position or competitor messaging. Significant differences in performance should speak to you; small differences tend to say very little and are often reversed.

  • http://www.findmefaster.com Matt Van Wagner

    Hi Andrew,

    Thanks for stopping back in to your old neighborhood, and weighing in with your trademark well-considered opinion.

    (For those of you new to this space, Andrew Goodman authored this monthly column for the past 3 years, and is now sharing his PPC experience biweekly over at ClickZ in his new column, Paid Search Strategies.)

    The test you propose is worth a try, though I wonder how much ad group-level QS would skew the results toward the ad group with the solo winning ad, since it would have a higher CTR. Worth testing, certainly.

    I am not sure how much Conversion Optimizer would resolve the problem of divergent search populations. More than persona matching, it would have to match to actual real people, and how much does Conversion Optimizer know about who is searching? Does it know man from woman, rich from poor, emotional from rational searchers?

    Further to your point on Conversion Optimizer, as far as I am aware, the Ad Optimizer algorithms and the Conversion Optimizer algorithms work independently. If you have Ad Optimizer turned on, it does its job of selecting the next ad up before Conversion Optimizer decides where to place the ad on the page.

    (Anyone from Google care to weigh in on this?)

    I’ll admit that there’s more work to do on the Ad Sets Optimization Model and I am still laying the foundations for it using hypothetical data and scenarios until we can find a way to test it practically.

    My invitation still stands to any and all to help test and poke at the theory with real-world tests and data that demonstrate how it works (or doesn't). Would love to hear from you.

    Thanks again for stopping by, Andrew!

    (P.S. Your first book is what got me started in this business in the first place, and I continue to appreciate your pleasantly provocative musings.)

