Google Lets You Tell Them Which URL Parameters To Ignore
A new feature has appeared in the Site Configuration Settings Sections of Google Webmaster Tools. The setting, called Parameter Handling, enables site owners to specify up to 15 parameters that Google should ignore when crawling and indexing the site.
Google lists the parameters they’ve found in the URLs on your site and indicates whether or not they think those parameters are extraneous (with a suggested “Ignore” or “Don’t ignore”). You can confirm or reject those suggestions and can add parameters that aren’t listed.
So what does this mean for site owners?
The primary value of the feature is to improve how a site is canonicalized in Google’s index when duplicate content exists. Canonicalization issues occur when multiple URLs load the same content. This scenario can be problematic for a number of reasons (for instance, it can skew analytics data), but from a search perspective, canonicalization issues can cause:
- Crawl efficiency problems: if search engine bots crawl the same page via multiple URLs, they may not have resources to crawl as many unique pages on the site
- PageRank dilution that can lead to lowered search rankings: if external sites link to multiple versions of a page, each version has less PageRank value than if all links pointed to one version
- Display and branding problems: search engines display only one version of the URL; you ideally want the canonical version of a URL to display (mysite.com/goldfish) rather than a version with extraneous parameters (mysite.com/goldfish?adid=1205123&sid=452006&sort=high-rating&loc=sea)
A number of canonicalization solutions exist, including several that are Google-specific, so why did they launch this new feature? Yahoo! has included a similar feature as part of its Site Explorer webmaster product for some time and site owners have been asking for a similar feature from Google for a while (certainly at least since I was working on Webmaster Central).
Below is a rundown of the various canonicalization options and how this one differs.
Google Webmaster Tools Parameter Handling: When URLs Can Contain Optional Parameters
This new option only helps with canonicalization issues that are caused by optional parameters that are in a standard key-value pair format and that you specify. In other words, it can only be an exclusionary list (don’t crawl parameters x, y, and z) rather than inclusionary (only crawl parameters a and b).
Wouldn’t you always know the complete list of potential parameters? Hopefully. But some canonicalization issues happen because a URL can take any parameters at all. Ideally, you want to ensure your server isn’t set up this way, but if you need this configuration (for instance, another team or outside agency needs the ability to use any custom tracking code without waiting for that parameter code to be added to the server setup), then you’re better off using the meta canonical tag.
The two most common reasons for optional parameters, both of which this feature handles well, are:
- Tracking codes used for analytics data (in this case, you may not want to implement a 301 redirect from the long version of the URL to the canonical one since you could lose the data)
- Page layout changes, such as sort orders (in this case, the code on the page uses the parameter to change the layout of the page, but from a search engine perspective the content on each version is the same, just in a different order)
Why use this canonicalization option over the others? The biggest benefit is likely in the increase in crawl efficiency. When Google discovers a new URL, they can check the included parameters against the parameter handling list and remove any optional ones before crawling it (but still credit any found links to the page). This could substantially reduce the crawling overhead on a site and could free up considerable bandwidth for getting other pages of the site crawled.
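To make the idea concrete, here is a minimal Python sketch of what dropping ignored parameters before crawling could look like. It is an illustration only, not Google’s implementation: the ignore list, the function name, and the sample URLs are assumptions borrowed from the goldfish example above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical ignore list; the names mirror the article's example URL.
IGNORED_PARAMS = {"adid", "sid", "sort", "loc"}

def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with every ignored key=value pair removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    # An empty query string is dropped entirely by urlunsplit.
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Three discovered URLs (made up for illustration).
discovered = [
    "http://mysite.com/goldfish?adid=1205123&sid=452006&sort=high-rating&loc=sea",
    "http://mysite.com/goldfish?sort=price",
    "http://mysite.com/goldfish?page=2&sid=9",
]

# Only the distinct URLs left after stripping would need to be crawled.
to_crawl = {strip_ignored_params(url) for url in discovered}
```

In this sketch the three discovered URLs collapse to two, because `page` is not on the ignore list and so is treated as changing the content.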
It’s also fairly simple to use. Just scan the list of suggested parameters and click the ones that are optional. In some organizations, it can be difficult to get source code added to web pages, making the implementation of the canonical tag difficult and time consuming. With this option, if you have verified webmaster tools access, you don’t need to involve IT at all.
What are the drawbacks to this option? The most obvious issue with this option is that it only works for Google. In the past, you could use this setting and the corresponding one in Yahoo! Site Explorer and not worry about other engines. But with Microsoft Bing’s impending (likely) replacement of Yahoo’s search index, it’s quite possible that Yahoo’s feature will go the way of its index, and if Microsoft doesn’t offer something similar, then a search index with 25%+ market share could be getting your URLs wrong.
You could also shoot yourself in the foot, metaphorically speaking. You could accidentally tell Google to ignore important parameters that, if dropped from the index, could wipe out large portions of your site. As Google adds more of these types of features to webmaster tools, it becomes more important to ensure that anyone who has access to them knows what they’re doing.
In reality, Google likely has safeguards in place that at least partially protect against such accidental destruction. That’s undoubtedly why they say that “While Google takes suggestions into account, we don’t guarantee that we’ll follow them in every case.” They don’t want large portions of their index disappearing either.
Unlike accidental blocking with robots.txt, which search engines follow as a directive, this feature (and many of the others) is a signal only. If the other signals already in place strongly contradict it (for instance, the content seems to be vastly different), it likely won’t be used.
But even though Google has safeguards like this one in place, you may not want to chance it if you’re not confident of which parameters are really optional (all the time, since this is a site-wide setting).
This option also won’t work if your canonicalization issues aren’t related to parameters or if the parameters aren’t in standard key-value pair format.
Meta canonical attribute
The canonical attribute is a page-level tag (a link element with rel="canonical" placed in the head of the page) that specifies the canonical version for the page. This can be useful because no matter what optional parameters are added to the version of the URL that renders the page, search engines can always know the canonical version. You can find detailed information about this tag in my article about its launch.
Why use this canonicalization option over the others? You just specify the canonical version of a page once, and no matter what parameters are added to the URL, search engines are always provided with the canonical version.
Since this meta data is on the page itself, any search engine can read it, and in fact, Google, Yahoo!, and Microsoft have all announced support for it. As of yet though, only Google seems to be actively using it.
What are the drawbacks to this option? Unlike with the parameter handling feature, search engines have to crawl the page before they can read the tag, so some crawl efficiency is lost. This tag should promote long-term efficiency, however, since theoretically, once the bot has crawled the non-canonical version of the URL and read the tag, it shouldn’t have to crawl that version of the URL again.
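As a rough illustration of why the page has to be fetched first, the sketch below shows how a crawler might read the canonical URL out of HTML it has already downloaded. The tag is written as a link element with rel="canonical" in the page head; the class name, function name, and sample page are assumptions made up for this example, and real crawlers are certainly more involved.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# A made-up page served at a URL with extra parameters.
page = """<html><head>
<title>Goldfish</title>
<link rel="canonical" href="http://mysite.com/goldfish">
</head><body>...</body></html>"""

canonical_url = find_canonical(page)
```

The crawler only learns the canonical URL after downloading and parsing the page, which is the crawl cost described above.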
As already noted, implementation requires modification of the page source code, which isn’t always easy within some organizations.
As with parameter handling, it’s possible to implement this tag incorrectly. For instance, it’s been discovered that some sites have accidentally set the canonical version of every page to the home page. As with the parameter handling feature, search engines consider the tag a “strong hint” as a precaution against these types of mistakes and won’t use the data when it strongly contradicts their other signals. In the case of Google, the only search engine that is actively using the tag so far, this has proven to be the case.
301 redirects
It’s universally agreed that (other than not having multiple versions of a URL at all) the best way to canonicalize URLs is to redirect all versions to the canonical one using a 301 redirect. This implementation sends all users and search engines to the canonical version, effectively consolidates all links to the page, and ensures only the canonical one is indexed and ranked.
Why use this canonicalization option over the others? It’s understood and followed by all major search engines and it provides the best user experience (visitors have one URL to access, bookmark, and share). In most cases, search engines consolidate all links to the redirect target and rank the canonical one.
This option is the best choice when you are moving content (for instance, changing your URL structure or changing domains) and to indicate whether you want content indexed under the www or non-www version of the domain.
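For the www versus non-www case, the redirect logic can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions: the host names and the function name are hypothetical, and in practice the rule usually lives in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical hosts: the www version is treated as canonical here.
PREFERRED_HOST = "www.mysite.com"
ALTERNATE_HOSTS = {"mysite.com"}

def redirect_for(url):
    """Return (301, target) if the URL is on a non-canonical host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        # Path and query are kept as-is; only the host changes.
        # (Any explicit port is dropped in this simplified sketch.)
        target = urlunsplit(parts._replace(netloc=PREFERRED_HOST))
        return 301, target
    return None
```

Because the path and query string are carried over unchanged, every non-www URL maps to exactly one www URL, which is the URL visitors then see and copy.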
Also keep in mind that if you redirect to the canonical version you’re more likely to get links to the right version, since most visitors will simply copy and paste what they see in the address bar.
What are the drawbacks to this option? When you are using parameters for sort orders or tracking, a redirect may negate those parameters. You can generally configure your analytics program to handle this properly, but it probably won’t work out of the box.
Redirects can also slow down crawl efficiency, particularly due to redirect chains. Ideally, search engines crawl the redirect then eventually stop crawling the origination URL, but if the bot encounters links to the original URL, it will continue crawling both versions (or more, if the page has moved multiple times).
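One way to keep chains short is to point every old URL directly at its final destination. The sketch below assumes the redirects are available as a simple source-to-target mapping; the paths and function names are made up for illustration.

```python
def resolve(url, redirects, max_hops=10):
    """Follow the redirect map and return (final_url, number_of_hops)."""
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError("redirect loop or chain too long")
        seen.add(url)
    return url, hops

def flatten(redirects):
    """Rewrite the map so every source points straight at its final target."""
    return {source: resolve(source, redirects)[0] for source in redirects}

# A page that has moved twice (hypothetical paths).
redirects = {
    "/fish": "/pets/fish",
    "/pets/fish": "/pets/goldfish",
}

final_url, hops = resolve("/fish", redirects)   # two hops before flattening
flat = flatten(redirects)                        # one hop for every old URL
```

After flattening, a bot that still finds links to either old URL reaches the current page in a single redirect.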
Google webmaster tools change address feature
This feature enables you to tell Google when you’re changing domains. You have to verify ownership of both the old domain and the new domain and then you can specify a move from one to the other. You can find more information about this feature xx.
Why use this canonicalization option over the others? The best use of this feature is when you are changing domains and you aren’t able to implement a 301 redirect from the old domain to the new. (This is the case, for instance, with blogspot.com sites.) Even if you are able to implement the redirect, it can’t hurt to let Google know as well!
What are the drawbacks to this option? You can only use this option to move from one domain to the other. And as with the other Google webmaster tools features, it only works for Google.
Google webmaster tools preferred domain feature
The preferred domain feature enables you to tell Google whether you want your domain indexed with the www subdomain or without it. Since most sites resolve either way, a complete duplicate set of your site’s content will exist if you don’t set www/non-www canonicalization. Why is this a problem? Ideally it’s not, and search engines consolidate the content correctly. But often, search engines find links to both versions and end up crawling both, indexing both, and crediting the links to the versions separately.
Why use this canonicalization option over the others? You may as well always use this option, although you should implement a 301 redirect as well, if you can. Google initially implemented this feature for those sites that weren’t able to do so.
What are the drawbacks to this option? Again, this option works only for Google. And it doesn’t provide as much of a guarantee as a 301 redirect.
Blocking duplicate content with a robots directive
The traditional advice for avoiding duplicate content has been to block the duplicates with robots.txt (or a robots meta tag) to ensure the correct version is indexed. It can be important that the right version be indexed vs. the version intended for print, for instance.
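As a small illustration of the robots.txt approach, the sketch below uses Python’s standard urllib.robotparser module to check which URLs a compliant bot would fetch. The /print/ path, the bot name, and the sample URLs are assumptions chosen to match the print-version example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the print versions of pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, agent="ExampleBot"):
    """Return True if the rules above allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print_allowed = may_crawl("http://mysite.com/print/goldfish")
regular_allowed = may_crawl("http://mysite.com/goldfish")
```

Under these rules the print version is never fetched, so a bot has no chance to see a canonical tag or redirect on that page.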
Why use this canonicalization option over the others? Generally speaking, you shouldn’t, now that the canonical meta tag is available. The scenarios for which you wouldn’t want to redirect (such as the print version example) can be more easily solved with the canonical tag, and the scenarios for which you’re worried about crawl efficiency issues that would leave large portions of your site uncrawled (such as large-scale optional parameters) can now more easily be solved with Google’s parameter handling feature.
What are the drawbacks to this option? The primary drawback to this option is the loss of link credit. Any links to blocked pages fall into a black hole and can’t be credited to the canonical version of the page, as happens with the other options.
The parameter handling feature can also provide insight on how Google sees your site
For some time, Google has been attempting to canonicalize URLs and show the canonical version in the results, even when a site owner hasn’t implemented any of these canonicalization options. For instance, they may determine that several pages contain the same content and algorithmically consolidate them and associate them with the one Google determines is canonical. They haven’t described exactly how they determine the canonical version, but they might, for instance, choose the URL with the fewest number of parameters or the shortest version of the URL.
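Purely to illustrate that kind of guess, a heuristic that prefers the fewest parameters and then the shortest URL might look like the sketch below. This is not Google’s algorithm, which hasn’t been described; the function name and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def pick_canonical(urls):
    """Pick one URL from a group of duplicates: fewest parameters, then shortest."""
    def sort_key(url):
        query = urlsplit(url).query
        parameter_count = len(parse_qsl(query, keep_blank_values=True))
        # The URL itself is the last tie-breaker so the result is deterministic.
        return (parameter_count, len(url), url)
    return min(urls, key=sort_key)

# A made-up group of URLs assumed to load the same content.
duplicates = [
    "http://mysite.com/goldfish?adid=1205123&sid=452006",
    "http://mysite.com/goldfish?sort=high-rating",
    "http://mysite.com/goldfish",
]

chosen = pick_canonical(duplicates)
```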
Last year, they started letting webmasters know when they encountered URLs that they thought were extraneous and were causing crawling problems. It’s likely that Google is using a similar source to generate the list of parameters it suggests should be ignored.
In this way, the parameter handling feature provides insight into how Google perceives the site. If you see many parameters listed that aren’t optional, take a look at the content on the URLs that use those parameters.
This could signify a larger problem. It could be that Google doesn’t see enough unique content on them (this can happen, for instance, with pages that list part numbers, contain mostly images and item codes, or list little information outside of a login). You may want to look for ways to differentiate the pages a bit more.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.