Understanding Search Engine Duplicate Content Issues
I admit it. I am a search engine geek. Because I have a passion for understanding search usability, one of my particular interests is duplicate content filtering. If you want to really irritate searchers, present the same content to them in all or most of the top 10 positions in search results.
In the past, before search engines became effective at name results clustering, many search engine optimization (SEO) professionals, including myself, considered it quite the accomplishment to help client sites appear in the majority of the top 30 search results. I remember when one of my client sites held 24 of the top 30 positions. The client thought I was the greatest invention since the light bulb. However, having analyzed the search data and Web analytics data, I realized that having all of those top positions did not necessarily mean top conversions. So I am happy to see search engines becoming increasingly effective at filtering out duplicate content.
At the SMX Advanced conference in June 2007, there were a few takeaways that I thought were very important for SEO professionals to keep in mind: multiple duplicate content filters, and knowing when to apply 301 redirects.
One common misconception about duplicate content filtering is that there is only one main duplicate filter. In fact, there are multiple duplicate filters, and they are applied throughout the three main parts of the search engine process:
- Spidering or crawling
- Indexing
- Query processing
Some duplicate content filters weed out content before Web pages are added to the index. A Web page cannot rank until it is in a search engine index, so any URL that a crawl-time filter excludes from the index will never be displayed in search results.
Some duplicate content filters are applied after pages are added to the search engine index. Web pages are available to rank, but they might not display in search engine results pages (SERPs) as Web site owners might like them to appear. For example, no one wants their content to appear in the dreaded Supplemental Index.
Another common misconception is that if a listing appears in Google’s Supplemental Index, the site has been penalized. Duplicate content does not cause a site to be placed in the Supplemental Index. From Vanessa Fox’s blog:
If you have pages that are duplicates or very similar, then your backlinks are likely distributed among those pages, so your PageRank may be more diluted than if you had one consolidated page that all the backlinks pointed to. And lower PageRank may cause pages to be supplemental.
And from Matt Cutts’ blog:
Having urls in the supplemental results doesn’t mean that you have some sort of penalty at all; the main determinant of whether a url is in our main web index or in the supplemental index is PageRank. If you used to have pages in our main web index and now they’re in the supplemental results, a good hypothesis is that we might not be counting links to your pages with the same weight as we have in the past. The approach I’d recommend in that case is to use solid white-hat SEO to get high-quality links (e.g. editorially given by other sites on the basis of merit).
301 redirects vs. robots exclusion
Remember when meta-tag content used to be the “secret weapon” for getting top rankings in Infoseek? Lately, it seems that search engine optimization professionals regard 301 redirects as the secret weapon for getting and preserving link development, especially when redundant/duplicate content is involved.
For those of you who do not know what a 301 redirect is, I like to use this analogy. Have any of you ever moved and had to fill out one of those change of address cards at the post office? Basically, when you fill out a change of address card, you are telling the U.S. Postal Service that you have moved permanently to a new address. I like to think of a 301 as a change of address card for computers. The status code tells search engines that the content at a specific URL (Web address) has permanently moved to another URL.
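To make the change of address analogy concrete, here is a minimal sketch of what a server sends back for a permanent redirect: the 301 status code plus a `Location` header naming the new URL. The function name `permanent_redirect` and the example.com address are assumptions for illustration only, not part of any particular server product.

```python
from http import HTTPStatus
from typing import Dict, Tuple


def permanent_redirect(new_url: str) -> Tuple[int, Dict[str, str]]:
    """Build the status code and headers for a permanent (301) redirect.

    The Location header plays the role of the new address on the
    change of address card.
    """
    status = int(HTTPStatus.MOVED_PERMANENTLY)  # numeric value 301
    headers = {"Location": new_url}
    return status, headers


# Hypothetical new address, used only for illustration.
status, headers = permanent_redirect("https://www.example.com/new-page")
print(status, headers["Location"])
```

In practice the redirect is usually configured in the Web server or CMS rather than written by hand, but the response a search engine spider receives has this shape.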
There are times when using a 301 redirect is appropriate and times when it is not. For example, let’s use a home page. The following home page URLs typically lead to the same content:
In this situation, it is best to implement a 301 redirect so that the most appropriate URL leads to the home page content. Search engines utilize canonicalization, which is the process of selecting the most appropriate URL when there are several choices. Be proactive. Don’t let the search engines determine the most appropriate URL to crawl and to display in search results. As the Web site owner, you should select the URL that is best for your business and target audience.
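One way to picture home page canonicalization is as a small lookup: the site owner picks one preferred URL, and every other variant that serves the same home page is redirected to it with a 301. The sketch below assumes a hypothetical site at example.com; the preferred URL, the host names, and the paths are all illustrative assumptions, and a real site would substitute its own variants.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical preferred home page URL chosen by the site owner.
CANONICAL_HOME = "https://www.example.com/"

# Hypothetical host and path variants that serve the same home page content.
HOME_HOSTS = {"example.com", "www.example.com"}
HOME_PATHS = {"", "/", "/index.html"}


def redirect_target(url: str) -> Optional[str]:
    """Return the canonical URL to 301-redirect to, or None if no redirect applies."""
    if url == CANONICAL_HOME:
        return None  # already the preferred URL
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in HOME_HOSTS and parts.path in HOME_PATHS and not parts.query:
        return CANONICAL_HOME  # a home page variant: consolidate it
    return None  # not a home page variant


for candidate in ("http://example.com", "http://www.example.com/index.html"):
    print(candidate, "->", redirect_target(candidate))
```

The point of the design is that the choice of `CANONICAL_HOME` is made by the site owner, not left for the search engine to infer.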
Implementing 301 redirects is not the solution for every instance of duplicate content, in spite of what many SEO professionals might claim. The robots exclusion protocol is often far more appropriate.
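As a rough illustration of the robots exclusion alternative, the sketch below uses Python's standard `urllib.robotparser` module to check which URLs a well-behaved spider may fetch under a given robots.txt. The robots.txt rules, the `/print/` directory of duplicate pages, the `ExampleBot` user agent, and the example.com URLs are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps spiders out of a directory of
# duplicate (for example, printer-friendly) pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical duplicate and primary versions of the same article.
duplicate_url = "https://www.example.com/print/article-1"
primary_url = "https://www.example.com/articles/article-1"

duplicate_allowed = parser.can_fetch("ExampleBot", duplicate_url)
primary_allowed = parser.can_fetch("ExampleBot", primary_url)

print("duplicate crawlable:", duplicate_allowed)
print("primary crawlable:", primary_allowed)
```

Unlike a 301, this leaves both URLs working for human visitors; it only asks compliant spiders not to crawl the duplicate version.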
Here is an example. Suppose a Web site owner has purchased and implemented a new content management system (CMS), and, as a result, the URL structure changed. During the site redesign, the Web site owner has eliminated content that has not converted well or is outdated. Should the Web site owner implement 301 redirects for the eliminated content?
Many SEO professionals often state that 301 redirects should be implemented to preserve the “link juice” to the expired content. In this situation, if a searcher clicks on a link to the expired content, he/she will typically be redirected to the home page. How does this benefit the search experience? The searcher expects to be delivered to specific content. Instead, he/she is redirected to a home page to begin searching for the desired content. It is a futile process, as the content has been removed. The result is a negative search experience and a negative user experience.
If content is removed, then delivering a custom 404 page is more appropriate, in spite of the “link juice” theory.
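The distinction between moved and removed content can be sketched as a simple decision: content that still exists at a new URL gets a 301 to that URL, while content that has been eliminated gets a 404 with a helpful custom page. The paths and the `resolve` function below are hypothetical and stand in for whatever URL mapping a real CMS migration would produce.

```python
from http import HTTPStatus
from typing import Optional, Tuple

# Hypothetical URL mapping after a CMS migration.
LIVE = {"/", "/products"}                 # pages that exist at these paths
MOVED = {"/old-products": "/products"}    # same content, new URL
REMOVED = {"/2005-holiday-sale"}          # content deliberately eliminated


def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Return (status code, redirect target or None) for a requested path."""
    if path in LIVE:
        return int(HTTPStatus.OK), None
    if path in MOVED:
        # The content still exists, so send visitors to its new address.
        return int(HTTPStatus.MOVED_PERMANENTLY), MOVED[path]
    # Removed or unknown content: serve the custom 404 page rather than
    # redirecting the visitor to the home page.
    return int(HTTPStatus.NOT_FOUND), None


for requested in ("/old-products", "/2005-holiday-sale"):
    print(requested, resolve(requested))
```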
Search usability is not a term that applies only to Web search engines, and it does not address only querying behavior. It also addresses other search behaviors (browsing, scanning, etc.). Duplicate content delivery often has a negative impact on a site’s overall search usability, both before site visitors arrive at your site and after they arrive. By understanding how the commercial Web search engines filter out and display duplicate content, Web site owners can obtain greater search engine visibility and a better user experience.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.