Statistical Significance: Not Just For Geeks Anymore

“Statistical significance” is probably one of the most misunderstood concepts in search marketing. People sometimes ask me to assess whether the difference between two clickthrough rates is “statistically significant” with the same look on their faces as if they were asking whether a particular rash looks infected.

“The clickthrough rate (CTR) was 2% on Friday, but 3% on Saturday. That’s a 50% increase. 50% is a lot, right?” they ask. Well, it certainly is for income tax rates, but not necessarily for differences in clickthroughs. What if Friday saw 2 clicks from 100 impressions and Saturday saw 3 clicks from 100 impressions? Doesn’t sound so impressive anymore, does it?

Part of the problem is that it’s simply impossible to tell from so few impressions whether both days have an inherent CTR of 2.5% (and you just happened to see 2 clicks for one and 3 clicks for the other) or whether they legitimately have different underlying CTRs.
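To put rough numbers on that, here is a minimal sketch that treats each impression as an independent coin flip with the same click probability (a binomial model, which is my assumption and not something the data can confirm). The function name binomial_pmf is just an illustrative label.

```python
from math import comb

def binomial_pmf(clicks, impressions, ctr):
    """Probability of exactly `clicks` clicks in `impressions` independent impressions."""
    return comb(impressions, clicks) * ctr**clicks * (1 - ctr)**(impressions - clicks)

# Hypothetical shared "true" CTR of 2.5% for both days.
p_two = binomial_pmf(2, 100, 0.025)
p_three = binomial_pmf(3, 100, 0.025)

print(f"P(exactly 2 clicks in 100 impressions) = {p_two:.3f}")
print(f"P(exactly 3 clicks in 100 impressions) = {p_three:.3f}")
```

Under this model both outcomes land between 20% and 30%, so seeing 2 clicks one day and 3 the next is entirely unremarkable for a single underlying 2.5% CTR.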

Imagine a more extreme case: one ad has a CTR of 2% and one has a CTR of 100%. We see four impressions and all get clicks. How likely is this to be the data for the 2%-CTR ad?

Well, if it is the 2%-CTR ad’s data, then there’s a 2% chance that the first impression would generate a click. That’s about the same as the chance of randomly drawing the ace of spades from a well-shuffled deck of cards. And there’s a 2% chance that the next impression will generate a click, which is about the same as reshuffling that deck and then randomly drawing the ace of spades again (without any sleight-of-hand trickery).

So, the chance of seeing 4 clicks from 4 impressions for a 2%-CTR ad must be very, very small, but (please take a minute to convince yourself of this, if you need to) it’s not absolutely zero. Even an ad with only a 2% CTR still might possibly generate 4 clicks from 4 impressions. It’s improbable, but not impossible.
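If you want to see just how small, the arithmetic is a one-liner, assuming the four impressions are independent of each other:

```python
ctr = 0.02

# Four independent impressions, each clicked with probability 2%.
p_four_clicks = ctr ** 4

# The card analogy: drawing the ace of spades from a reshuffled deck four times running.
p_four_aces = (1 / 52) ** 4

print(f"4 clicks from 4 impressions at 2% CTR: {p_four_clicks:.2e}")
print(f"4 aces of spades in a row:             {p_four_aces:.2e}")
```

The first figure works out to 0.02 × 0.02 × 0.02 × 0.02 = 0.00000016, or 16 in 100 million. Tiny, but not zero.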

That is why statisticians rarely seem to give a straight answer to whether two ads’ CTRs are different or not. “Statistical significance” is not really a yes-or-no matter; it reflects how improbable it is that a certain sequence of events (like drawing the ace of spades four times in a row) would happen purely by chance. Every new impression increases the certainty in our answer, but there is no specific amount of information that seals the deal.

By convention, statisticians often set an arbitrary cut-off of “5% chance of being explained purely by randomness” for classifying a difference as “statistically significant.” That’s why when a magician declares that he’ll pull a certain card from a deck, and then actually does so, the average geek in your life will joyously exclaim, “That’s statistically improbable!” We know that the chance of that card appearing purely by luck (1 in 52, or about 2%) is below the 5% cut-off.

Imagine now that we have two ads, Ad A, for which we have observed a 2% CTR, and Ad B, whose observed CTR is shown on the x-axis of the graph below. The graph shows the number of impressions (per ad) we must see to be 95% certain that the two ads have different CTRs.


If Ad A has seen 2 clicks from 100 impressions (2% CTR) and B has seen 14 clicks from 100 impressions (14% CTR), then we can be more than 95% certain that Ad B’s CTR is higher than A’s. If the observed CTR of Ad B is only 3%, then we actually need nearly 4000 impressions each to be 95% certain that Ad B performs better. That’s why the difference in observed CTRs between the Friday and Saturday ad performance wouldn’t look so impressive if they only had 200 impressions between them.

As the CTR of Ad B approaches 2%, it takes staggeringly more data to differentiate the two ads. Trying to tell a 2.00% CTR ad from one with a CTR of 1.95% (or 2.05%) takes more than a million impressions each. And, if the two ads perform identically, with exactly a 2% CTR, obviously even an infinite amount of data couldn’t tell them apart.
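For readers who want to reproduce numbers of this size, here is a sketch using a standard textbook sample-size formula for comparing two proportions (normal approximation). The 80% power setting and the function name impressions_needed are my assumptions for illustration, not a description of how the graph was produced, so expect figures in the same ballpark as those quoted above, not identical ones.

```python
from math import ceil
from statistics import NormalDist

def impressions_needed(ctr_a, ctr_b, alpha=0.05, power=0.80):
    """Approximate impressions per ad to detect ctr_a vs ctr_b.

    Two-sided test at significance `alpha`; `power` is an assumed
    chance of detecting the difference when it really exists.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = ctr_a * (1 - ctr_a) + ctr_b * (1 - ctr_b)
    return ceil((z_alpha + z_power) ** 2 * variance / (ctr_a - ctr_b) ** 2)

for ctr_b in (0.14, 0.03, 0.0205, 0.0195):
    n = impressions_needed(0.02, ctr_b)
    print(f"Ad B at {ctr_b:.2%}: about {n:,} impressions per ad")
```

Under these assumptions, 14% versus 2% needs fewer than 100 impressions per ad, 3% versus 2% needs a little under 4000, and 1.95% or 2.05% versus 2% needs over a million. Passing two identical CTRs makes the function divide by zero, which is the formula’s way of saying no finite sample will do.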

Though the concepts I’ve described above are (hopefully) now very clear, unfortunately some of the web-based tools for differentiating CTRs seem to have disregarded them completely.

For example, if one ad got 1 click with a 25% CTR (that is, 4 impressions) and a second ad got 2 clicks with a 100% CTR (that is, 2 impressions), the calculator by Brian Teasley and Perry Marshall says: “You are approximately 99% confident that the ads will have different long term response rates.” 99% confident from just 6 impressions?! No, I’m not. If I flip one coin 4 times and get 1 “heads” and another coin 2 times and get 2 “heads,” I wouldn’t be 99% certain that either coin’s per-flip chance deviates from 50% at all.

A similar site by Dr. Glenn Livingston (I presume) has similar deficiencies. For the case of Ad A, with 4 impressions, 1 click (25% CTR) and 1 conversion (100% CR), and Ad B with 2 impressions, 2 clicks (100% CTR) and 1 conversion (50% CR), the site tells me both that “Ad B has a higher CTR than ad A (99% Confidence Level)” and that “Ad A has a higher conversion rate than ad B (80% Confidence Level).”
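For comparison, here is what a test designed for very small samples says about the same numbers. This is a minimal sketch of a two-sided Fisher’s exact test, which is my choice of method for illustration and not necessarily what any of these tools should use; the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(hits_a, trials_a, hits_b, trials_b):
    """Two-sided Fisher's exact test p-value for two small samples."""
    total_trials = trials_a + trials_b
    total_hits = hits_a + hits_b
    denom = comb(total_trials, total_hits)

    def prob(k):
        # Chance that sample A holds exactly k of the hits if the hits
        # were scattered at random across all trials (hypergeometric).
        return comb(trials_a, k) * comb(trials_b, total_hits - k) / denom

    lowest = max(0, total_hits - trials_b)
    highest = min(trials_a, total_hits)
    observed = prob(hits_a)
    # Sum every outcome at least as unlikely as the one observed
    # (small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= observed * (1 + 1e-9))

# CTR example: 1 click from 4 impressions vs. 2 clicks from 2 impressions.
p_ctr = fisher_exact_two_sided(1, 4, 2, 2)

# Conversion example: 1 conversion from 1 click vs. 1 conversion from 2 clicks.
p_conv = fisher_exact_two_sided(1, 1, 1, 2)

print(f"CTR difference p-value:             {p_ctr:.2f}")
print(f"Conversion-rate difference p-value: {p_conv:.2f}")
```

The CTR comparison gives a p-value of 0.40 and the conversion-rate comparison gives 1.00, both nowhere near the conventional 0.05 cut-off.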

Frankly, the only thing I have 99% confidence about is that Teasley, Marshall and Livingston should have a second look at their computer code to see what’s going wrong.

In the McKinsey Quarterly, Google’s chief economist Dr. Hal Varian said: “I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”

He’s absolutely right. In 1990, only a handful of geeks knew what a “homepage” or an “email” was. Ten years later, few people didn’t know. Likewise, in search marketing, even basic concepts like determining a confidence interval to identify statistical significance can still seem esoteric. But the industry is quickly realizing that being able to do these calculations is not just for geeks anymore.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: is Manager, AdMax R&D at The Search Agency and a frequent contributor to The Search Agents blog.

  • Brian Lam

    Hi Bradd. For the Teasley calculator at least, it looks like it uses t-values assuming infinite degrees of freedom, which would definitely not apply in the example you described with only 4 impressions. As was mentioned in the first part of the article, you should have a fair amount of responses before evaluating the results or you could get misled into thinking one ad/homepage/headline is awesome when it really isn’t.

  • JulieFB

    Hi Bradd. This is a great article. I am a statistician and I couldn’t agree more that there is a fundamental flaw with the tools you mentioned. I’ve written a PPC ad split testing tool that does take low sample size into consideration. You should check it out. In your scenario with only four clicks this tool will not provide any recommendations and will tell you to collect more data.

  • smec

    Hi Bradd,
    full ack – there are a lot of naïve implementations out there – inappropriate use of statistical tests, which require more samples (e.g. chi square test would be inappropriate here, but even a chi sq. test would not pause any ad in your example).
    Your example CTRs: our own ecommerce optimization software would spit out a p-value of 0.4 => far away from the common 0.05 (or 95% prob.) threshold.

  • Bradd Libby

    JulieFB, your ‘PPC Ad Split Testing Tool’ looks and seems to work great. The FAQ is very informative, the ‘Show me details…’ link provides useful information when a test is successful, and the explanation when the test fails is informative.

    One recommendation: You should set up your testing tool (and your company’s other tools) on a separate domain with a simple, catchy name. ‘smec’ is right; there are a lot of low-value tools available and yours look like they are worth being seen by a lot more people in the industry.

  • Terry Whalen

    Brad, great article – I’m going to link to this from my blog. Thank you.

    Now, it’s time to check out JulieFB’s ad testing tool(!)

  • Terry Whalen

    Julie, the tool is great. The FAQ is very thorough – I wonder if you may want to add the importance that ads being evaluated should have been run over the same period of time, with equal impression rotation. If folks take 2 ads where 1 of the ads had most of the action in week 1 and the other ad had most of the action in week 2, statistical differences may lead to wrong conclusions based on other things going on with bids and keywords during the 2 different time frames.

    Also, as a side note, in order to make sure ads get equal rotation for accurate testing, remember to set ad-serving to rotate (settings tab in adwords)!
