The Pitfalls Of A/B Ad Split Testing, Part 2

Matt Van Wagner on April 5, 2010 at 11:49 pm | Reading time: 8 minutes

In last month’s column, Pitfalls of Paid Search, Part 1, I raised the question of whether A/B Ad split testing, taken to the extreme, can be detrimental to ad group performance. Using the example of over-breeding of show dogs as an analogy, I suggested that our quest to discover the best ad of all time through rounds of A/B testing may be thinning the genetic diversity of our marketing messages and causing the unintended and undesirable consequence of degrading our PPC campaign performance.

I also proposed a theory that an ad group with an “ad set” (a couple of strong, well-tested ads) can outperform an ad group that has been optimized to contain only a single ad: the winning champion from multiple rounds of A/B split testing. Based on the conversations and online exchanges I’ve had with many of you, there seems to be general support for the idea that an ad group that contains a strong ad set should be capable of outperforming an ad group with just a single A/B test champ, at least in theory.

So, let’s dig into these theories a bit more deeply as we examine some of the pitfalls in A/B split ad testing.

Ad over-optimization—theory or reality?

If over-optimization through A/B testing really exists, the most obvious place to look for evidence is in the before-and-after test results. Suppose you run a good and valid A/B ad test, delete the losing ads, and stick with your champion ad. If the champion then suddenly takes a dive, that could be good evidence that ad over-optimization is indeed real.

Let’s take a look at how we expect an A/B test to work. For the sake of simplicity, I’ll refer only to click-through rate (CTR) as the performance metric in this example. Ideally, of course, you want to measure using conversion rates and values.

Let’s assume you are split-testing the last two championship ads in your portfolio and you have achieved these results:

Ad A: 100 Clicks / 1000 Impressions = 10.0% CTR
Ad B: 50 Clicks / 1000 Impressions = 5.0% CTR

The ad group performance, therefore, looks like this:

[Ad group]: 150 Clicks / 2000 Impressions = 7.5% CTR

Using A/B split-test methodology, you would pause Ad B and continue running with just the champ, expecting perhaps that your results would look like this after accumulating another 2000 impressions:

Ad A: 200 Clicks / 2000 Impressions = 10.0% CTR
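
The arithmetic behind that expectation can be sketched in a few lines of Python. This is only an illustration using the example figures above; the function name ctr and the variable names are my own, and the projection simply assumes the test-period CTR carries over unchanged, which is exactly the assumption questioned below.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate as a fraction; 0.0 when there are no impressions."""
    return clicks / impressions if impressions else 0.0


# Test-period figures from the example: (clicks, impressions) per ad.
test_results = {"Ad A": (100, 1000), "Ad B": (50, 1000)}

for name, (clicks, impressions) in test_results.items():
    print(f"{name}: {ctr(clicks, impressions):.1%} CTR")

# Ad group CTR is total clicks over total impressions.
group_clicks = sum(c for c, _ in test_results.values())
group_impressions = sum(i for _, i in test_results.values())
group_ctr = ctr(group_clicks, group_impressions)
print(f"Ad group: {group_ctr:.1%} CTR")

# Naive projection: assume Ad A keeps its test-period CTR
# over the next 2000 impressions once Ad B is paused.
next_impressions = 2000
projected_clicks = round(ctr(*test_results["Ad A"]) * next_impressions)
print(f"Projected Ad A: {projected_clicks} clicks / {next_impressions} impressions")
```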

This was the sort of result that Jesse223, one of our readers who commented last month, was expecting but did not achieve. Instead, he said he “definitely noticed the phenomena of pausing lower performing ads and seeing a subsequent decrease for the well-performing ad.”

How could that happen? How could a winning ad go bad right after being declared the champion in a valid series of A/B tests? Jesse223 chalked it up to a lack of reliable data or not running the tests long enough, but I believe there are at least two reasonable explanations for why his champion ad performed poorly, both of which may stem from faulty assumptions about the effectiveness of A/B testing of search engine ads.

A/B ad rotation is even, not random

Proper A/B tests and the statistical tests used to measure them require that the ads under test rotate randomly. Setting your ads to even rotation in your campaigns does not mean they rotate randomly. It means only that, over time, the search engines will attempt to give each ad the same amount of ad impressions. The statistical tests used for A/B testing require random, independent variables under test, so your results are naturally going to be suspect. It doesn’t matter if you calculate your own chi-squares or use one of the many websites that make the calculation for you. If your test is not truly a random test of independent variables your stats are not going to be reliable predictors, either.
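
For readers who have not run the calculation themselves, here is a minimal sketch of the kind of chi-square test those websites perform, applied to the example figures above. It is Pearson's chi-square for a 2x2 table with one degree of freedom and no continuity correction; the function name chi_square_2x2 is my own, and it assumes both ads have at least one click and one non-click between them. A tiny p-value here is only as trustworthy as the assumption that impressions are random and independent.

```python
import math


def chi_square_2x2(clicks_a, imps_a, clicks_b, imps_b):
    """Pearson chi-square statistic and p-value for two ads (df = 1)."""
    observed = [
        [clicks_a, imps_a - clicks_a],  # Ad A: clicks, non-clicks
        [clicks_b, imps_b - clicks_b],  # Ad B: clicks, non-clicks
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in range(2)]
    total = sum(row_totals)

    statistic = 0.0
    for r in range(2):
        for c in range(2):
            # Expected count if both ads shared one underlying CTR.
            expected = row_totals[r] * col_totals[c] / total
            statistic += (observed[r][c] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = chi_square_2x2(100, 1000, 50, 1000)
print(f"chi-square = {statistic:.3f}, p = {p_value:.6f}")
```

On paper, a result like this would declare Ad A the winner with very high confidence, which is why the independence assumption matters so much.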

Whether or not you choose to believe your ads are evenly rotated enough for the purposes of the statistical tests, there is no avoiding the fact that there is probably very little randomness in how search engines select which of your two A/B ads to present in the search results. The factors include the relevance and quality score of the keyword/ad pair and the history of your own searches, among other factors that are completely beyond your control. Craig Danuloff of Click Equations makes this point effectively in his response to Alan Mitchell’s recent article, “Rotating Ads vs. Optimizing Ads: Which Is Better?” on the Wordstream blog.

Ad Impressions do not equal searchers

Another bad assumption about the results you obtain from A/B split testing is that the audience that sees your ads is sufficiently random and independent. In reality, this is simply not the case. We all know that users make multiple searches during their search sessions, whether the session is 20 minutes long or spread out over 3 weeks. We also know that users often type in the exact same search term they used on their last search. Just last week, Yusuf Mehdi, Senior Vice President at Microsoft, commented that Microsoft’s research shows that “50% of all searches for any given user are repeat searches.”

A preponderance of multiple searches by individuals really throws a monkey wrench into the workings of A/B split tests and can result in dramatically different outcomes than your tests would predict.

I’ll demonstrate this using the same sample data we used above that led us to declare ad A the champion. To make the math easier to digest, I’ll make a few simple assumptions about the people conducting multiple searches. First, I’ll assume that these people make exactly two searches, and second, I’ll assume that the ads are so completely different that a person inclined to click on ad B would not find ad A interesting enough to click on.

In this scenario, taking the losing ad B offline and running the campaign for another 2000 impressions could result in performance like this:

Ad A: 100 Clicks / 2000 Impressions = 5.0% CTR

Surprising, isn’t it? Even though exactly the same number of people clicked on ad A (100), the CTR dropped by half to 5% because ad A was now credited with all of the ad impressions that ad B would normally have received. Because ad A did not appeal to people who liked ad B, it simply accumulated more impressions without any more clicks, resulting in an alarming and unexpected drop in CTR. While this hypothetical example shows the effect in an extreme scenario, it does present a plausible explanation for the divergent performance that Jesse223 described after concluding his A/B split tests.
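
The scenario can be reproduced with a small Python sketch. Everything in it is an illustrative assumption rather than real data: 1000 searchers who each search exactly twice, 100 who respond only to ad A, 50 who respond only to ad B, and 850 who click nothing. I also assume that even rotation shows each person each ad once during the test, and that a person clicks their preferred ad at most once. The names simulate, searchers, and rotation are hypothetical.

```python
from collections import Counter


def simulate(searchers, rotation):
    """Tally clicks and impressions per ad.

    searchers: each person's preferred ad ("A", "B", or None for no interest).
    rotation:  the ads shown on that person's successive searches.
    Assumption: a person clicks only their preferred ad, and only once.
    """
    clicks, impressions = Counter(), Counter()
    for preferred in searchers:
        has_clicked = False
        for ad in rotation:
            impressions[ad] += 1
            if ad == preferred and not has_clicked:
                clicks[ad] += 1
                has_clicked = True
    return clicks, impressions


# Hypothetical audience: 100 like ad A, 50 like ad B, 850 click nothing.
searchers = ["A"] * 100 + ["B"] * 50 + [None] * 850

# During the test, each person's two searches show ad A once and ad B once.
test_clicks, test_imps = simulate(searchers, rotation=("A", "B"))
# After pausing ad B, both searches show ad A.
solo_clicks, solo_imps = simulate(searchers, rotation=("A", "A"))

runs = (("A/B test", test_clicks, test_imps), ("Ad A only", solo_clicks, solo_imps))
for label, clicks, imps in runs:
    for ad in sorted(imps):
        rate = clicks[ad] / imps[ad]
        print(f"{label}, ad {ad}: {clicks[ad]} clicks / {imps[ad]} impressions = {rate:.1%} CTR")
```

Under these assumptions the ad group drops from 150 clicks to 100 clicks on the same 2000 impressions, because the 50 people who would have clicked ad B no longer have anything to click.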

It is also interesting to note that if this multiple-search behavior is reasonably commonplace, and if Google and Yahoo’s ad optimization routines are designed to favor impressions for the best ad based on CTR, then these algorithms may do a great job of getting clicks for your best ad even as your ad group suffers performance death by degrees.

Try an ad test experiment—win a lobster dinner for two

Although I have given some evidence and possible explanations for pitfalls inherent to A/B ad split testing, I have not yet been able to conclusively prove that over-optimization actually exists or that ad sets can outperform the best ad from your A/B testing. The problem is that I can’t find a way to simultaneously test an ad set that includes the A/B ad champ in it against an ad group that has only the A/B ad champ.

However, just because I haven’t thought of a way to prove these theories doesn’t mean that it can’t be done. I know there are many great PPC scientists out there who may have an idea of how to go about it and I am willing to offer a reward to help prove or disprove any of these theories.

The reward? A delicious Maine Lobster Bake for 2, courtesy of Lobster.com. Take any angle you’d like to contribute to this discussion and support or debunk any theories in this column.

Can you prove that Google and Yahoo’s ad optimizers are great for selecting the best ad, but lousy for ad group performance?
Do you have your own horror story or data that shows how a perfectly valid A/B ad test winner tanked after the test concluded?
Can you prove that A/B ad split test over-optimization is a real phenomenon?
Do you have data or other evidence that supports the hypothesis that ad groups containing multiple strong ads, i.e. ad sets, will outperform the most highly tuned A/B split test champ?

If so, contact me with your experiments, your data, or your counter-theories. Everyone who sends me any contribution to this discussion will be eligible to win the lobster dinner, which can be shipped to you anywhere in the U.S.

I’ll pick the winner from the entries submitted, and announce the winner in next month’s column. I look forward to hearing from you!
