The Pitfalls Of A/B Ad Split Testing, Part 2

In last month’s column, Pitfalls of Paid Search, Part 1, I raised the question of whether A/B Ad split testing, taken to the extreme, can be detrimental to ad group performance. Using the example of over-breeding of show dogs as an analogy, I suggested that our quest to discover the best ad of all time through rounds of A/B testing may be thinning the genetic diversity of our marketing messages and causing the unintended and undesirable consequence of degrading our PPC campaign performance.

I also proposed a theory that an ad group with an “ad set” (a couple of strong, well-tested ads) can outperform an ad group that has been optimized down to a single ad: the winning champion from multiple rounds of A/B split testing. Based on the conversations and online exchanges I’ve had with many of you, there seems to be general support for the idea that an ad group containing a strong ad set should be capable of outperforming an ad group with just a single A/B test champ, at least in theory.

So, let’s dig into these theories a bit more deeply as we examine some of the pitfalls in A/B split ad testing.

Ad over-optimization—theory or reality?

If over-optimization through A/B testing really exists, the most obvious place to look for evidence would be in the before-and-after test results. If, after running a good and valid A/B ad test, you delete the losing ads, and stick with your champion ad, and then observe that your champion ad suddenly takes a dive, this could be good evidence that ad over-optimization is indeed real.

Let’s take a look at how we expect an A/B test to work. For the sake of simplicity, I’ll refer only to click-through rate (CTR) as the performance metric in this example. Ideally, of course, you want to measure using conversion rates and values.

Let’s assume you are split-testing the last two championship ads in your portfolio and you have achieved these results:

  • Ad A: 100 Clicks / 1000 Impressions = 10.0% CTR
  • Ad B: 50 Clicks / 1000 Impressions = 5.0% CTR

The ad group performance, therefore, looks like this:

[Ad group]: 150 Clicks / 2000 Impressions = 7.5% CTR

Using A/B split-test methodology, you would pause Ad B and continue running with just the champ, expecting perhaps that your results would look like this after accumulating another 2000 impressions:

Ad A: 200 Clicks / 2000 Impressions = 10.0% CTR
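To make the arithmetic easy to check, here is a quick sketch of the CTR calculations used in this example (a simple illustration; the function name is mine, not any ad platform's API):

```python
def ctr(clicks, impressions):
    """Click-through rate as a percentage."""
    return 100.0 * clicks / impressions

# The two ads under test
print(ctr(100, 1000))  # Ad A: 10.0
print(ctr(50, 1000))   # Ad B: 5.0

# The combined ad group during the test
print(ctr(150, 2000))  # 7.5

# The naive expectation after pausing Ad B for another 2000 impressions
print(ctr(200, 2000))  # 10.0
```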

This was the sort of result that Jesse223, one of our readers who commented last month, was expecting but did not achieve. Instead, he said he “definitely noticed the phenomena of pausing lower performing ads and seeing a subsequent decrease for the well-performing ad.”

How could that happen? How could a winning ad go bad right after being declared the champion in a valid series of A/B tests? Jesse223 chalked it up to a lack of reliable data, or to not running the tests long enough, but I believe there are at least two reasonable explanations for why his champion ad performed poorly, both rooted in faulty assumptions about the effectiveness of A/B testing for search engine ads.

A/B ad rotation is even, not random

Proper A/B tests, and the statistical tests used to measure them, require that the ads under test rotate randomly. Setting your ads to even rotation in your campaigns does not mean they rotate randomly; it means only that, over time, the search engines will attempt to give each ad the same number of impressions. The statistical tests used for A/B testing assume random, independent samples, so your results are naturally going to be suspect. It doesn’t matter whether you calculate your own chi-squares or use one of the many websites that make the calculation for you: if your test is not truly a random test of independent variables, your stats are not going to be reliable predictors.
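If you do calculate your own chi-squares, the 2x2 test on the sample data above looks like the sketch below (clicks vs. non-click impressions for each ad). As noted, the statistic is only trustworthy if the randomness assumption actually holds:

```python
def chi_square_2x2(clicks_a, imps_a, clicks_b, imps_b):
    """Pearson chi-square statistic for a 2x2 table of
    (click, no-click) outcomes for two ads."""
    table = [
        [clicks_a, imps_a - clicks_a],
        [clicks_b, imps_b - clicks_b],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Sample data from the article: Ad A 100/1000, Ad B 50/1000
stat = chi_square_2x2(100, 1000, 50, 1000)
print(round(stat, 2))  # 18.02 -- well above 3.84, the 95% critical value for 1 df
```

On this data the difference looks decisively "significant," which is exactly the trap: the formula has no way of knowing the impressions were not randomly and independently assigned.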

Whether or not you choose to believe your ads are rotated evenly enough for the purposes of the statistical tests, there is no avoiding the fact that there is probably very little randomness in how search engines select which of your two A/B ads to present in the search results. The factors include the relevance and Quality Score of the keyword/ad pair, and the history of your own searches, among other factors completely beyond your control. Craig Danuloff of Click Equations makes this point effectively in his response to Alan Mitchell’s recent article, “Rotating Ads vs. Optimizing Ads: Which Is Better?” on the WordStream blog.

Ad impressions do not equal searchers

Another bad assumption about the results you obtain from A/B split testing is that the audience that sees your ads is sufficiently random and independent. In reality, this is simply not the case. We all know that users make multiple searches during their search sessions, whether the session is 20 minutes long or spread out over 3 weeks. We also know that users often type in the exact same search term they did on their last search. Just last week, Yusuf Mehdi, Senior Vice President at Microsoft, commented that Microsoft’s research shows that “50% of all searches for any given user are repeat searches.”

A preponderance of multiple searches by individuals really throws a monkey wrench into the workings of A/B split tests and results in dramatically different outcomes than your tests would predict.

I’ll demonstrate this using the same sample data we used above that led us to declare ad A the champion. To make the math easier to digest, I’ll make a few simple assumptions about the people conducting multiple searches. First, I’ll assume that these people make exactly two searches, and second, I’ll assume that the ads are so completely different that a person inclined to click on ad B would not find ad A interesting enough to click on.

In this scenario, taking the losing ad B offline and running the campaign for another 2000 impressions could result in performance like this:

Ad A: 100 Clicks / 2000 Impressions = 5.0% CTR

Surprising, isn’t it? In spite of the fact that the exact same number of people clicked on ad A (100), the CTR dropped in half to 5% because ad A was now credited with all of the impressions that ad B would normally have received. Because ad A was not appealing to people who liked ad B, ad A simply accumulated more impressions but no additional clicks, resulting in an alarming and unexpected drop in CTR. While this hypothetical example shows the effect in an extreme scenario, it does present a plausible explanation for the divergent performance that Jesse223 described after concluding his A/B split tests.
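The same arithmetic can be written as a toy model under this section's assumptions (every searcher searches exactly twice, and people who liked ad B never click ad A); the function and its name are just an illustration:

```python
def ctr_after_pausing_b(clicks_a, imps_a, imps_b):
    """Toy model: after Ad B is paused, Ad A keeps its own clicks
    but also absorbs all the impressions Ad B would have received,
    because Ad B's fans still search but never click Ad A."""
    return 100.0 * clicks_a / (imps_a + imps_b)

# Sample data from the article: Ad A had 100 clicks on 1000 impressions,
# and Ad B's 1000 impressions now land on Ad A with zero extra clicks.
print(ctr_after_pausing_b(100, 1000, 1000))  # 5.0 -- half the predicted 10%
```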

It is also interesting to note that if this multiple-search behavior is reasonably commonplace, and if Google and Yahoo’s ad optimization routines are designed to favor impressions for the best ad based on CTR, then these algorithms may do a great job of getting clicks for your best ad even as your ad group suffers performance death by degrees.

Try an ad test experiment—win a lobster dinner for two

Although I have given some evidence and possible explanations for pitfalls inherent to A/B ad split testing, I have not yet been able to conclusively prove that over-optimization actually exists or that ad sets can outperform the best ad from your A/B testing. The problem is that I can’t find a way to simultaneously test an ad set that includes the A/B ad champ in it against an ad group that has only the A/B ad champ.

However, just because I haven’t thought of a way to prove these theories doesn’t mean that it can’t be done. I know there are many great PPC scientists out there who may have an idea of how to go about it and I am willing to offer a reward to help prove or disprove any of these theories.

The reward? A delicious Maine Lobster Bake for two, courtesy of our sponsor. Take any angle you’d like to contribute to this discussion, and support or debunk any of the theories in this column.

  • Can you prove that Google and Yahoo’s ad optimizers are great for selecting the best ad, but lousy for ad group performance?
  • Do you have your own horror story or data that shows how a perfectly valid A/B ad test winner tanked after the test concluded?
  • Can you prove that A/B ad split test over-optimization is a real phenomenon?
  • Do you have data or other evidence that supports the hypothesis that ad groups containing multiple strong ads, i.e. ad sets, will outperform the most highly tuned A/B split test champ?

If so, contact me with your experiments, your data, or your counter-theories. Everyone who sends me any contribution to this discussion will be eligible to win the lobster dinner, which can be shipped to you anywhere in the U.S.

I’ll pick the winner from the entries submitted, and announce the winner in next month’s column. I look forward to hearing from you!

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Matt Van Wagner is President and founder of Find Me Faster, a search engine marketing firm based in Nashua, NH. He is a member of SEMNE (Search Engine Marketing New England) and of SEMPO, the Search Engine Marketing Professionals Organization, for which he is a contributing courseware developer at the SEMPO Institute. Matt writes occasionally on internet, search engine and technology topics for IMedia, The NH Business Review and other publications.






  • Craig Danuloff

    Matt – Great Article (Both of them):

    Another key factor is the search query variation. If the ad group has multiple keywords, and especially if those keywords are not all exact match, the ad group will attract a wide range of different search queries. Finding the ‘winning’ ad for the group is an average – it could be that for one search query one ad kills and the other fails, but for other search queries the opposite is true. Only by looking at query-to-ad specific metrics can you see this pattern. By killing an ‘underperforming’ ad, you may be killing the only ad that performs for certain queries.

    This issue suggests that your idea of keeping several ads running is a good one, but with a slightly different rational – to find which queries react to which copy and then break the keywords apart into different ad groups so that each query gets the ad that is best for it.

    It’s not a keyword world, it’s a query world.

    BTW: Your articles are exactly the kind of deeper look at the complex reality that the market really needs, so we aren’t all fooled by false simplicity.

  • Matt Van Wagner

    Thank you, Craig.

    I agree with you 100%, and meant to mention this influence on ad group performance. I took note of this point at your presentation last week at SES in NY, where you mentioned seeing as many as 220+ variations of queries that were triggered by phrase matches.

    I suspect that even when exact match is used, ad set optimization will work because of multiple searches by the same people, and also because of audience skews. For example, if you know your bigger audience is women, but you also sell to men, focusing on ads that only appeal to women will be to the detriment of almost half your audience. Assuming the ad set theory works, you could write ads that appeal to, and will likely be seen by, your other audiences.

    BTW – you are officially in the hunt to win the lobster dinner!

  • Clement

    Can’t agree more with Craig – With “Expanded Broad Match”, the keyword definitely isn’t equivalent to the “search query”. In the same vein, it is also worth noting that the same ad can be triggered by the same *query* but in a very different context, and still count as the “same” impression. Google’s search network in general, and “AdSense for search” in particular, can create a lot of noise in the data, especially since Google’s QS only takes Google’s properties into account.
    Ad A/B testing relies on the assumption that the user intention stays relatively “stable” throughout the experiment, but those different websites can come and go and they can have different user intention on the exact same query (e.g. someone typing “ipad” on engadget vs. someone typing “ipad” on Google).

  • Rebel SEO

    I suggest that you use advanced filtering in Google Analytics to display the exact search queries, then cross-reference those with the ad text to find which combinations of ad/keywords are the best, and then separate out the top performers into their own ad groups.
    So in other words, find the keywords that the “winning ad” is winning the biggest with, and create a new ad group with only those keywords and only that ad. Remove the keywords from the original ad group but leave the original ad. Then see what happens!

    Another thing you can do is duplicate the winning ad 2x so you have 4 ads total, with your “winning ad” showing 75% of the time and the “loser” showing only 25% of the time. When I do this, I often find that performance varies greatly between the three copies of the winning ad.

    One last thing… I have noticed that when I first launch an ad group with two very different ads, Google will sometimes show one of the ads exclusively at first. Eventually they will give the other ad some love and it evens out 50/50, but perhaps that initial favoritism has to do with Google matching the most relevant ad to the specific query. So does each keyword really have a hidden quality score for each ad in its ad group? hmmm

  • Stupidscript

    A quick question about random v. even ad rotation:

    How can results be reliable if the test itself is truly random? If they are *truly* random, there is a 50/50 chance that only one ad will *ever* be displayed. But aside from that, randomness would seem to be detrimental to ad A/B testing given the broad array of other factors in determining why CTR moves the way it does.

    Testing until 42 total impressions is reached:

    Test 1 Random Ad Displays (ad #A and #B):

    Test 1 Average Results:
    Ad A averaged 50% CTR out of 34 impressions
    Ad B averaged 50% CTR out of 8 impressions

    Test 2 Even Ad Displays:

    Test 2 Average Results:
    Ad A averaged 50% CTR out of 21 impressions
    Ad B averaged 50% CTR out of 21 impressions

    It is true that one would wait until the impressions threshold (say, 21) is reached before concluding a test; however, as news events swirl and the focus of public activity shifts, how can one have confidence in the results of any randomized A/B test? By the time the threshold is reached, ad B may have become irrelevant due to the changing face of public interest.

    Why isn’t the consistency of even ad rotation of benefit to an A/B split test? Am I missing a basic statistical principle? Thanks for the insight.

  • Matt Van Wagner

    Great points, Clement, thank you.

    Even if you had 100% control over ad placements on the search partner sites (or content network sites), you still have diversity in the demography and shifting motivations of your audience, so A/B testing falls short there as an optimization tool.

    You’re now in the hunt for a lobster dinner, too!

  • Matt Van Wagner

    Thank you for weighing in, Rebel SEO.

    Focusing on queries, not just keywords, is a great idea, and amplifies Craig’s point.

    Even if you tighten down your ad groups so you know the exact queries that drive traffic, you are still left with a heterogeneous audience that a single ad cannot fully address, so this is where ad set optimization may boost your ad group performance.

    Interesting data from your launch. I am curious to know if the impressions eventually evened out. Also, did you notice whether the fast starter in that ad group became the eventual winner? I would love it if you could share the daily CTR and impression history of those two ads, either by posting another comment or by emailing me your data. I don’t need to see your actual ads.

    My point about even rotation, which Stupidscript also commented on below, is that Google does not guarantee a predefined rotation scheme; they only say they will attempt to give each ad a similar amount of impressions.

    Count yourself in on the lobster dinner bounty for your contributions to this discussion.

  • Matt Van Wagner

    Good questions, Stupidscript.

    This may be a semantic problem, where I characterized even rotation incorrectly. The rotation is not literally even, as in your ABABAB example. Google defines rotation this way:

    ” Rotated ad serving delivers ads more evenly into the auction, even when one ad has a lower CTR than another. The impression statistics and ad served percentages of the ads in the ad group will be more similar to each other than if you had selected the optimization option. However, these statistics still may differ from each other, since ad position may vary based on Quality Score and CPC.”

    Source: AdWords Help.
    AdWords › Help articles › Your Ad Performance › Improving Your Performance › Ad rotation settings

    Notice that they are careful to explain that the ad rotation will only serve ‘similar percentages’, and that other factors such as position and CTR take away the randomness of the rotation.

    My point is that any assumption of random sampling in the test is not supportable. A truly random sample could behave as you suggest, but that is not likely over a reasonable sample size. Dust off your old stats books and take a look at the coin-flip example. To your point, a sufficiently large sample is needed for a valid test.
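    A quick coin-flip simulation makes the point (a sketch of truly random 50/50 serving, not Google's actual selection logic):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def random_split(n_impressions):
    """Assign each impression to Ad A or Ad B by a fair coin flip."""
    a = sum(1 for _ in range(n_impressions) if random.random() < 0.5)
    return a, n_impressions - a

a, b = random_split(2000)
print(a, b)  # lands close to 1000/1000; a heavily lopsided split is
             # vanishingly unlikely once the sample size is reasonable
```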

    Thank you for taking the time to contribute to the discussion. You are now eligible to win the lobster dinner, too. Check back next month to see…

  • MMantlo

    Are we thinking about ad copy too simplistically here? How can you measure the performance of an ad without taking into account conversion rate, and cost per acquisition or ROI? Isn’t that the bottom line of achievement anyway?

    CTR is great, but if your ad with the highest CTR also has the poorest conversion rate, or costs too much to be beneficial to your account, those are the instances in which you eliminate an ad in favor of another ad or group of ads. This helps qualify your traffic, and gives you the ability to do what Google says: “win customers, not just visitors.” Ads that pre-qualify traffic are essential in the marketplace because they help the user connect with what they are looking for before they cost the marketer money.

    I like a blend of ads in my ad groups. I try to select one or two with a high conversion rate and pair them with one or two with an efficient CPA or ROI, trying to balance the best of both worlds. The idea is not to go for strict A/B testing but cluster testing, in which you weed out ads that don’t meet your predetermined goals for achievement.

  • Matt Van Wagner

    Hi MMantlo,

    Your point is absolutely correct: conversion rate is the right place to measure. We limited the discussion to CTR just for simplicity in this article, and because it precedes conversion actions in the funnel.

    Your remarks also seem to indicate a preference for optimizing to a set of ads.

    A couple of questions for you:

    – Have you tested your ad cluster (which I call an ad set) against a solo ad?

    – In your experience, do the ads which remain online represent a variety of copy styles, offers, or calls to action?

    My model seems to suggest that wildly divergent messages could outperform single messages that are simply minor variations in text.

    Please let us know. You can reply here, or if you want, we can take this offline and discuss via email or phone.


    (… and, as a contributor to the discussion, you also have a shot at a lobster dinner delivered to your door, thanks to our sponsor.)

  • MMantlo

    Hi Matt,

    Sorry it took so long to get back to you on this, but after I last wrote I plunged myself into case studies of my current account for several days.
    With a base requirement for my examples of A/B testing being substantial traffic, and ad copy testing resulting in a single ad, the results were as follows: 13 ad groups studied over 13 months yielded 42 instances of an ad group running only one ad copy (just about 25% of the time). In 64% of those instances conversions fell, while at the same time 60% of them saw an increase in conversion rate. This all led me to wonder if we were sacrificing conversions to the conversion rate. All of this added up to a rather inconclusive set of case studies.
    For that reason I decided to conduct a larger study across the account over the data from the past 4 months. The findings: having one ad often produced the best CTRs and CPCs, but having 3 ads produced the best conversion rates and CPAs while having CPCs and CTRs that were in the top 3 (out of 8).
    However, this is the example of one account. It would be interesting to see an agency do a study like this to see if the results would remain consistent, or to see if there is variation among verticals, or clients using different metrics to measure success.
    More importantly we have to acknowledge that search does not exist in a vacuum. Bid changes, position changes and landscape fluctuations all have a dramatic impact not only on our accounts, but on the outcome of our testing.
    I’ve heard it said that people who work in SEM walk around with blinders on with regards to what goes on in the advertising world. But I find that most people tend to have a myopic view of everything we do in search. I’ve seen people destroy order volume in search of lower CPCs. In the quest to prevent keyword mapping I’ve seen people give up on broad match regardless of whether or not it converts for them. And, in search of the perfect ad people lower overall conversion rates and increase CPAs. Maybe we need to stop throwing the baby out with the bathwater, and start viewing our accounts with a more holistic eye.

  • Matt Van Wagner

    Wow – thanks for digging in deep!

    Your data seem to support the idea that ad groups with multiple ads can outperform an ad group with a single champion ad.

    I am hoping more people will examine this same issue in their accounts, look for similar results, and get curious about the concept of ad set optimization.

    Looking at data in hindsight is one thing, but creating a test that proves this conclusively is difficult, because it requires you to simultaneously test an ad group that has only your best ad against an ad group that has multiple ads, including your best ad.

    I am still looking for ways to set up a true set of experiments that would demonstrate the result one way or the other. I’ve got some ideas on how to isolate some variables, but latency to sale, seasonality, ad ranking variations, and other factors make this very difficult.

    Great analysis – thanks!

