Today, Google Webmaster Tools launched a new message alert that lets site owners know when a particular URL doesn’t appear in search results because Google sees it as a duplicate of a URL on a different domain. In the blog post announcing the feature and in an in-depth help topic, Google provides details on how it identifies clusters of duplicate content and chooses a “canonical” version from each cluster to display in search results.
“When we discover a group of pages with duplicate content, Google uses algorithms to select one representative URL for that content. A group of pages may contain URLs from the same site or from different sites.”
When Google chooses a representative URL from a different domain, it calls this “cross-domain URL selection”.
In cases where multiple URLs contain the same content (for instance, due to infrastructure configuration, optional parameters, syndication, or internationalization), many options exist for site owners to indicate to Google which version is canonical.
However, in some cases the site owner doesn’t use these options to specify a preferred version, or Google selects a different version than the one the site owner specified.
This new feature alerts site owners when Google’s “algorithms select an external URL instead of one from their website”. According to Google, common reasons for this include:
- Site owner-specified – if you’ve moved your domain or have implemented the rel=canonical attribute to indicate that a page on another domain is canonical, then this alert is simply confirmation that Google is indexing as you’ve specified.
- Regional sites – if you have the same content on multiple regional sites (for instance, the same English content on a .com (for US), a .co.uk, and a .com.au), Google may cluster pages with identical content across sites and use relevance signals to determine which to display per query.
- Incorrect canonicalization – in this case, a page may inadvertently use the rel=canonical attribute to specify a page on a different domain as canonical.
- Misconfigured server – a hosting misconfiguration (this happens in particular with shared hosting) may cause two different domains to display the same content.
- Hacked site – sites are sometimes hacked to point to other domains.
- Scraped content – the blog post says that “in rare situations”, Google may select a URL from a site that has scraped your content.
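
Several of the causes above (site owner-specified, incorrect canonicalization, and some hacks) come down to a page declaring a rel=canonical target on another domain. As a rough way to audit your own pages for this, here is a minimal Python sketch using only the standard library. It is not part of Webmaster Tools or any Google API; the function name `cross_domain_canonicals`, the class `CanonicalFinder`, and the example URLs are hypothetical, and the check assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens; values may be None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])


def cross_domain_canonicals(page_url, html):
    """Return canonical targets in `html` whose host differs from page_url's.

    Hypothetical helper: hosts are compared exactly, so www.example.com and
    example.com count as different and would be flagged.
    """
    finder = CanonicalFinder()
    finder.feed(html)
    page_host = (urlparse(page_url).hostname or "").lower()
    flagged = []
    for href in finder.canonicals:
        target = urljoin(page_url, href)  # resolve relative hrefs
        target_host = (urlparse(target).hostname or "").lower()
        if target_host != page_host:
            flagged.append(target)
    return flagged


# Illustrative input only; both URLs are made up.
sample = '<html><head><link rel="canonical" href="http://other.example/page"></head></html>'
print(cross_domain_canonicals("http://www.example.com/page", sample))
```

An empty result means the page declares no canonical on a different host. A non-empty result is not necessarily an error: if you have moved domains on purpose, it simply matches the "site owner-specified" case.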