Canonical Form: The Hidden Keywords In Paid Search

In this post, let’s look at the Canonical Form that Search Engines use behind the scenes when matching our paid keywords to actual user queries. What is it? Why do they do it? So what? Or, more importantly, how can we use it to our advantage? We will answer each of those in turn. First up: What is it?

Canonical Form

The canonical form of a keyword refers to the form of the keyword that Paid Search Engines use behind the scenes to match keywords to actual search queries. It is sometimes referred to as Normal Form (Normalized Form) or Equivalent Form. For this article, let’s call this Canonical Form, or Canonicalization.

Wikipedia has a good canonicalization reference, in case you are curious about the origin or use of the word. Every Search Engine does this a bit differently, but the basic principles are similar. So let’s cover this a bit theoretically, without dwelling on the details or the particular differences between Search Engines. We can start with case (e.g.: upper-case vs. lower-case letters).

Upper Case vs. Lower Case

Case is insignificant in paid search (at least, from a keyword matching perspective). Search engine canonicalization will match a user query for “nasa” with an exact-match paid-keyword “NASA.”

Search Engines regard the canonical form of “NASA” to be “nasa,” and they both are considered to match the user query exactly. For that matter, “NaSa” would also be an exact match, as well as every other combination of upper and lower letters. Similar things happen for punctuation.

Punctuation

In general, the rule is that punctuation is replaced with a space to translate to the canonical form. For example, you may have noticed that searches for “bikes com” will match your exact-match paid-keyword “bikes.com” and vice-versa. Likewise, leading, trailing, and double-spaces are all insignificant.

A user-query for “bicycle store” will match a paid-keyword entered with a leading space or with a double space between the two words, and vice-versa. AdCenter provides a list of extraneous characters on their help site. AdWords provides a list of ignored symbols on their help site.
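The case, punctuation, and whitespace rules described so far can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function name `basic_canonical_form` is hypothetical, and treating every non-alphanumeric character as punctuation is a simplification, since each engine publishes its own list of ignored symbols.

```python
import re


def basic_canonical_form(keyword: str) -> str:
    """Approximate the low-level rules: case, punctuation, whitespace.

    Hypothetical helper for illustration; real engines use their own
    lists of ignored characters.
    """
    lowered = keyword.lower()
    # Assumption: anything that is not a letter, digit, or whitespace
    # counts as punctuation and becomes a space.
    no_punctuation = re.sub(r"[^\w\s]|_", " ", lowered)
    # Drop leading/trailing whitespace and collapse internal runs.
    return " ".join(no_punctuation.split())


# The examples from the text collapse to the same form.
print(basic_canonical_form("NaSa"))
print(basic_canonical_form("bikes.com") == basic_canonical_form("bikes com"))
print(basic_canonical_form(" bicycle  store") == basic_canonical_form("bicycle store"))
```

Two strings that produce the same output here would, under these simplified rules, be treated as the same keyword.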

Possessives

AdCenter addresses most of the high-volume and regular possessives directly (but not all of them). For example, the search query “Mike’s Bike” is equivalent to the canonical form “mike bike.”

In AdWords, it would be “mike s bike.” In adCenter’s parlance, adCenter normalizes the possessive form of words, such as Mike’s to Mike.
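To make the difference between the two engines concrete, here is a rough Python sketch of both behaviors. The function names are hypothetical, and the adCenter-style rule is deliberately naive: it strips every trailing “’s,” whereas the real engine handles most, but not all, possessives.

```python
import re


def adwords_style(keyword: str) -> str:
    """Apostrophe is treated like any other punctuation: it becomes a space."""
    text = re.sub(r"[^\w\s]", " ", keyword.lower())
    return " ".join(text.split())


def adcenter_style(keyword: str) -> str:
    """Assumed rule: drop a trailing 's (straight or curly apostrophe)."""
    text = re.sub(r"['’]s\b", "", keyword.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


print(adcenter_style("Mike's Bike"))  # mike bike
print(adwords_style("Mike's Bike"))   # mike s bike
```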

Plurals

Canonical form can collapse plurals together (but will not always do so). A user-query for “bikes” could match an exact-match paid-keyword “bike.” (Please note: I am aware this example is in direct contrast to the information provided via the link below with regards to plurals of the word “bike.” It is just an example for illustration. Check your own user query report to find examples where plurals are treated as equivalent and delivered as exact-match.)

Likewise for non-standard plurals, like “battery” and “batteries.” They may be treated as equivalent. Of the canonicalizations covered so far, this one seems to be the most inconsistently applied across search engines and over time.

Noise Words

Canonicalization can remove “noise words” from the mix as well. For example, AdCenter will canonicalize a paid keyword “bike for the beach” to be “bike beach.”

The noise words “for” and “the” are not considered when AdCenter matches the canonical form of your paid-keyword to the user-query. AdCenter provides a list of extraneous words on their help site. (I didn’t find an equivalent list on AdWords help – maybe the community will add it to the comments, below?)
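Noise-word removal is easy to picture as a token filter. The sketch below is illustrative only: `NOISE_WORDS` contains just the two words named in the example above, not adCenter’s actual list, and `strip_noise_words` is a hypothetical helper name.

```python
# Only the two noise words from the example; the real list lives on
# the engine's help site and is longer.
NOISE_WORDS = frozenset({"for", "the"})


def strip_noise_words(keyword: str, noise_words=NOISE_WORDS) -> str:
    """Remove noise words from a lower-cased, whitespace-split keyword."""
    tokens = keyword.lower().split()
    return " ".join(token for token in tokens if token not in noise_words)


print(strip_noise_words("bike for the beach"))  # bike beach
```

Note that “bike for the beach” and “bike beach” become indistinguishable after this step, even though a reader would see them as different queries.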

So Far…

So far we have covered letter case, punctuation, whitespace, possessives, plurals, and noise words.

Did you notice that we have crossed into territory where canonicalization might start to modify the intent of the original search query? “Bike for the beach” implies a different user intent than “bike beach.” The former is quite clearly a search for a bike, while the latter is most likely a search for a place. And it does not stop here.

Misspellings & Closely-Related Words

Taking this one step further, canonicalization will sometimes collapse misspellings, and even seemingly different words to be the same. I am going to use theoretical, illustrative examples here, without claiming that either engine actually canonicalizes these particular keywords in this exact way.

Consider, for example, a paid-keyword “bike mart.” Canonicalization could collapse misspellings like “bikemarte” to be equivalent. Similarly, synonym substitutions can be made: “cycle mart” could conceivably be canonicalized to “bike mart.” (Again, this is an example meant to be illustrative. I don’t think the search engines have ever actually canonicalized “cycle” to “bike.”)

These canonicalizations happen in particular with brands that happen to be slight misspellings, and also as we reach into the tail for more specific keywords.

AdWords Specific Notes: “site:” & Broad-Match Modifier In Negatives

AdWords will remove “site:” words from your keyword as part of canonicalization. For example, if you add “site:SearchEngineLand.com Crosby” as a keyword, AdWords will consider that equivalent to a keyword “crosby.” It will ignore the rest.

Likewise, if you use “+” either accidentally or in an attempt to trigger broad-match-modifier functionality in a negative keyword, the “+” is ignored as an extraneous symbol. It has no effect.

When & Where is Canonicalization Happening?

Canonicalization applies to negatives and to all match types. It happens before matching by match type, acting as a pre-filter for comparing keywords and user queries. It is always on; you can’t turn it off.
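One way to picture the “pre-filter” idea is a matcher that canonicalizes both sides before it ever looks at the match type. The Python below is a toy model, not either engine’s algorithm: `canonical`, `matches`, and `blocked_by_negative` are hypothetical names, the canonical form covers only case, punctuation, and whitespace, and negatives are assumed to behave like phrase match for simplicity.

```python
import re


def canonical(text: str) -> str:
    """Simplified canonical form: lower case, punctuation to spaces, trimmed."""
    text = re.sub(r"[^\w\s]|_", " ", text.lower())
    return " ".join(text.split())


def matches(paid_keyword: str, user_query: str, match_type: str) -> bool:
    """Compare canonical forms first, then apply a toy match-type rule."""
    keyword_tokens = canonical(paid_keyword).split()
    query_tokens = canonical(user_query).split()
    if not keyword_tokens:
        return False
    if match_type == "exact":
        return keyword_tokens == query_tokens
    if match_type == "phrase":
        # Keyword tokens must appear in order, contiguously, in the query.
        n = len(keyword_tokens)
        return any(
            query_tokens[i:i + n] == keyword_tokens
            for i in range(len(query_tokens) - n + 1)
        )
    raise ValueError(f"unsupported match type: {match_type}")


def blocked_by_negative(negative_keyword: str, user_query: str) -> bool:
    """Assumption: a negative behaves like a phrase match on canonical forms."""
    return matches(negative_keyword, user_query, "phrase")


print(matches("NASA", "nasa", "exact"))
print(matches("bikes.com", "cheap bikes com", "phrase"))
print(blocked_by_negative("+free", "Free bikes"))
```

Because the canonical step runs first, the “+” in the negative keyword disappears before any matching happens, which is why it has no effect.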

Gather Your Own Data

Don’t take my word for it. You can gather your own evidence. Pull a search query report from a Search Engine that includes both the paid-keyword and paid-match-type, and the user-query it matched. Better yet, pull it from your own analytics source. You may be surprised at what you find.
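If your report exports to CSV, a short script can surface the exact-match rows where the user-query is not literally the keyword you bid on. This is a sketch under assumptions: the column names `keyword`, `match_type`, and `query` and the sample rows are invented for illustration, so adjust them to whatever your report actually contains.

```python
import csv
import io

# Invented sample data; real reports use engine-specific column names.
SAMPLE_REPORT = """keyword,match_type,query
NASA,exact,nasa
bikes.com,exact,bikes com
bike,exact,bike
bike,broad,beach cruiser bikes
"""


def non_literal_exact_matches(report_csv: str) -> list:
    """Return exact-match rows whose query differs from the raw keyword."""
    rows = csv.DictReader(io.StringIO(report_csv))
    return [
        row
        for row in rows
        if row["match_type"].strip().lower() == "exact"
        and row["keyword"] != row["query"]
    ]


for row in non_literal_exact_matches(SAMPLE_REPORT):
    print(f'{row["keyword"]!r} was served as exact match for {row["query"]!r}')
```

Every row this returns is a place where some canonicalization step made two different strings equivalent.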

Why?

Paid Search Engines are businesses (and that is a good thing, believe it or not.) As businesses, they monetize searches by collecting fees from advertisers who pay-per-click in a competitive auction market for each keyword. They are motivated to generate the most value from those searches.

In an admittedly simplistic view, they may seek to “maximize profit,” “maximize user value,” or “maximize advertiser value,” or some combination of all three. Let’s consider the “keyword market” for each user query the Search Engine receives.

On one hand, Search Engines could provide literal interpretation of the user-queries, and require advertisers to discover and manage all of the various forms of punctuation, capitalization, etc. to match each user-query literally.

In our example above, this would require an advertiser to run 2^4 = 16 variations of “NASA” to cover the various ways people could search for “NASA” using different capitalization (e.g., “Nasa,” “nASA,” etc.). Clearly, this is way too granular, provides minimal incremental value, and would be quite burdensome on the advertisers. Advertisers would stop short of full coverage because it just wouldn’t be worth it. So advertiser burdens would detract from user value and, ultimately, Search Engine value.
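To see how quickly literal matching would get out of hand, the capitalization variants can be enumerated directly. The helper name `case_variants` is hypothetical; the count follows from each letter having two possible cases.

```python
from itertools import product


def case_variants(word: str) -> list:
    """Every upper/lower-case combination of the letters in a word."""
    choices = [(ch.lower(), ch.upper()) for ch in word]
    # A set guards against duplicates from characters that have no case.
    return sorted({"".join(combo) for combo in product(*choices)})


variants = case_variants("nasa")
print(len(variants))  # 16
print(variants)
```

A four-letter keyword needs 16 entries for capitalization alone, and the count doubles with each additional letter, before punctuation or spacing variants are even considered.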

On the other extreme, Search Engines could collapse everything. Advertisers would have one thing to manage, and would be eligible to appear on every SERP (Search Engine Results Page) for any user-query. Selling travel? Bid $5.25 for “run of site” on Google.com. Selling bird feeders? Bid $5.15 for run of site on Google.com…

Obviously, that would not provide anywhere near the value generated by breaking up the keyword markets in a more granular way. We need to draw a line somewhere. That is the game Search Engines play, and thankfully they play as rational businesses.

In this context, the low-level canonicalizations of case, punctuation, etc. are readily explained. But what about the more interesting cases? Now that we have set the stage, let’s consider a more interesting example: “bike” and “cycle” (theoretically, of course).

Let’s say that searches for “bike” monetize for the Search Engines at $.15 CPC, and searches for “cycle” monetize at $.10. If we could collapse the two keywords, we’d be looking at an incremental $.05 per click every time a user clicks on an ad after searching for “cycle.” Granted, this gets complicated fast, as we could argue that the value is diminished, so the advertisers would adjust their bids down, which would reduce the effective CPC and mitigate the expected gains. Yes, they probably would.

We could also consider CTR, ad relevance, etc. They would all be impacted. It is a moving target, to be sure. The point is that the Search Engine has a mechanism for collapsing keyword markets (or leaving them distinct). They play this game according to whatever their goals and values are, and just as with most human endeavors, they play it imperfectly.

So What?

This is the fun part. What can you, the discerning PPC Advertiser that you are, do about all of this? You can use it to your advantage to save time and to optimize your accounts.

For starters, you are already reaping the rewards of matching all the different combinations of capitalization, punctuation, misspellings, and other variations on your keywords that just don’t matter. Now that you know why and how, there are also some things you may start to notice, and some things you can do more actively.

For example, have you ever wondered why adCenter Desktop is kicking out words as duplicates, when they don’t appear to actually be duplicates? AdCenter added a canonicalization filter to the Desktop Editor. It stops words from being uploaded before they even make it to adCenter. The same thing would happen if you tried to add them via the Web interface. AdWords tends to allow you to add them regardless, and then sorts it out later by dividing up the traffic between them. While adCenter can be a bit obtrusive in this process, I personally like knowing that every keyword is a unique keyword in adCenter. This brings us to our next opportunity.

You can also save yourself the effort of adding all the different variations of “YourSite.com”, “YourSite com”, “www.YourSite.com”, “www YourSite com”, etc. Just because AdWords or adCenter lets you add them doesn’t mean they are adding coverage or doing good things to your account. A generalized best practice is to manage all of your keywords in lower case, replacing all punctuation with a single space, and trimming all leading, trailing and double spaces.

If you want to be really complete, you could even remove all the extraneous noise words; this helps you make sure you are not bloating your account with effective duplicates. One possible exception would be if you are using Dynamic Keyword Insertion and have a word like “NASA” that should appear in all caps. In this case, you would of course want to add the keyword with all caps.

Let’s take that a step further and actively remove effective duplicates from your account (e.g.: words that you have been able to add, but that have equivalent canonical forms). If you have them in your account now, you are effectively dividing your traffic arbitrarily between them.

You have an opportunity to collapse that data down into one keyword, removing bloat and giving you more direct control over bids, ads, destination URLs, etc. For the coders out there, adCenter provides an API call, the GetNormalizedStrings service function, to assist with this process.

Here is an Excel formula that does much of the basic canonicalization work for you:

=TRIM((SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(CLEAN(LOWER(A1)),"'"," "),"."," "),","," "),"-"," ")))

You could safely use this on the majority of your keyword and negative-keyword operations and improve the manageability of your accounts.
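For those working outside a spreadsheet, the same steps can be ported to Python and used to flag effective duplicates in a keyword list. This is a sketch, not an engine-accurate normalizer: `formula_canonical` mirrors the Excel formula step by step (LOWER, CLEAN, four SUBSTITUTEs, TRIM), and both function names and the sample keywords are made up for illustration.

```python
from collections import defaultdict


def formula_canonical(keyword: str) -> str:
    """Python port of the Excel formula above, step by step."""
    text = keyword.lower()                               # LOWER
    text = "".join(ch for ch in text if ord(ch) >= 32)   # CLEAN: drop control chars
    for mark in ("'", ".", ",", "-"):                    # the four SUBSTITUTE calls
        text = text.replace(mark, " ")
    # TRIM: strip the ends and collapse runs of spaces to one.
    return " ".join(part for part in text.split(" ") if part)


def effective_duplicates(keywords) -> dict:
    """Group keywords by canonical form; keep groups with more than one member."""
    groups = defaultdict(list)
    for keyword in keywords:
        groups[formula_canonical(keyword)].append(keyword)
    return {form: members for form, members in groups.items() if len(members) > 1}


account = ["YourSite.com", "yoursite com", "www.YourSite.com", "NASA", "nasa "]
for form, members in effective_duplicates(account).items():
    print(f"{form!r}: {members}")
```

Like the formula, this handles only the straight apostrophe, period, comma, and hyphen, so other punctuation and noise words would need additional rules.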

Here is one last handy trick (and if you have read this far, you deserve some gold stars). You can reset AdWords Quality Score on a keyword by adding it with different capitalization. Try it out in your account.

Go find a keyword with a terrible Quality Score (4 or lower), then add that keyword with different capitalization. You should start out with a default (hopefully higher) Quality Score. Here is your chance to breathe new life into that dying keyword! Now make sure you have the best ads possible, great negatives, and a healthy bid to get this one back on the starting lineup.

Good Luck out there, and Happy Holidays!


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.


About the author

Crosby Grant
Contributor