5 ways to avoid duplicate content and indexing issues on your e-commerce site

Before a page can rank well, it needs to be crawled and indexed. Contributor Manish Dudharejia shares five tips to give your pages the best chance of getting indexed in the search results.


More than any other type of site, e-commerce sites are notorious for developing URL structures that create crawling and indexing issues with the search engines. It’s important to keep this under control in order to avoid duplicate content and crawl budget complications.

Here are five ways to keep your e-commerce site’s indexation optimal.

1. Know what’s in Google’s index

To begin with, it’s important to regularly check how many of your pages Google reports as indexed. You can do this by running a “site:example.com” search on Google to see how many pages from your domain Google is aware of.


While Google webmaster trends analyst Gary Illyes has mentioned this number is only an estimate, it is the easiest way to identify whether or not something is seriously off with your site’s indexing.

Regarding the number of pages in its index, Bing’s Stefan Weitz has also admitted that Bing

…guesstimates the number, which is usually wrong…I think Google has had it for so long that people expect to see it up there

The page counts reported by your content management system (CMS) or e-commerce platform, your sitemap, and your server files should match almost perfectly; any discrepancies should be identified and explained. Those numbers, in turn, should roughly line up with what a Google site: operator search returns. Smart on-site SEO helps here: a site developed with SEO in mind avoids much of the duplicate content and many of the structural problems that create indexing issues.

Too few results in the index can be an issue, but so can too many, since this can mean you have duplicate content in the search results. Illyes has confirmed that there is no “duplicate content penalty,” but duplicate content still hurts your crawl budget and can dilute the authority of your pages across the duplicates.


If Google returns too few results:

  • Identify which pages from your sitemap are not showing up in your Google Analytics organic search traffic. (Use a long date range.)
  • Search for a representative sample of these pages in Google to identify which are actually missing from the index. (You don’t need to do this for every page.)
  • Identify patterns in the pages that are not indexing and address those systematically across your site to increase the chances of those pages getting indexed. Patterns to look for include duplicate content issues, a lack of inbound internal links, non-inclusion in the XML sitemap, unintentional noindexing or canonicalization, and HTML with serious validation errors.
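
The first step above, comparing the sitemap against pages that actually receive organic traffic, can be sketched in a few lines of Python. This is only an illustration: the sample sitemap, the `organic_landing_pages` set (standing in for an analytics export), and the function names are all assumptions rather than part of any particular tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> set[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_organic(sitemap_xml: str, organic_landing_pages: set[str]) -> list[str]:
    """List sitemap URLs that never appear as organic landing pages."""
    return sorted(sitemap_urls(sitemap_xml) - organic_landing_pages)


# Hypothetical sample data for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/shoes/</loc></url>
  <url><loc>https://example.com/shoes/red-sneaker</loc></url>
</urlset>"""

if __name__ == "__main__":
    organic = {"https://example.com/", "https://example.com/shoes/"}
    for url in missing_from_organic(SAMPLE_SITEMAP, organic):
        print("No organic traffic recorded:", url)
```

The resulting list is only a set of candidates. A page with no organic traffic may still be indexed, which is why the second step samples these URLs in Google directly.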

If Google is returning too many results:

  • Run a site crawl with ScreamingFrog, DeepCrawl, SiteBulb, or a similar tool and identify pages with duplicate titles, since these typically have duplicate content.
  • Determine what is causing the duplicates and remove them. There are various causes and solutions and those will make up much of the rest of this post.
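
If your crawler can export its results as CSV, grouping URLs by title is straightforward. The sketch below assumes a hypothetical export with `url` and `title` columns; real tools use their own column names, so adjust the `url_col` and `title_col` arguments to match.

```python
import csv
import io
from collections import defaultdict


def duplicate_titles(crawl_csv: str, url_col: str = "url", title_col: str = "title") -> dict[str, list[str]]:
    """Group crawled URLs by normalized title and keep only titles shared by 2+ URLs."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(crawl_csv)):
        # Collapse whitespace and ignore case so trivially different titles still match.
        title = " ".join(row[title_col].split()).casefold()
        groups[title].append(row[url_col])
    return {title: urls for title, urls in groups.items() if len(urls) > 1}


# Hypothetical crawl export for illustration only.
SAMPLE_CRAWL = """url,title
https://example.com/shoes,Shoes | Example
https://example.com/shoes?sort=price,Shoes | Example
https://example.com/hats,Hats | Example
"""

if __name__ == "__main__":
    for title, urls in duplicate_titles(SAMPLE_CRAWL).items():
        print(title, "->", urls)
```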

2. Optimize sitemaps, robots.txt, and navigation links

These three elements are fundamental to strong indexation and have been covered in depth elsewhere, but I would be remiss if I did not mention them here.

I cannot stress enough how important a comprehensive sitemap is. In fact, we seem to have reached the point where it is even more important than your internal links. Gary Illyes recently confirmed that even search results for “head” keywords (as opposed to long-tail keywords) can include pages with no inbound links, not even internal links. The only way Google could have known about these pages is through the sitemap.

It is important to note that Google’s and Bing’s guidelines still say pages should be reachable from at least one link, and sitemaps by no means diminish the importance of this.

It’s equally important to make sure your robots.txt file is functional, isn’t blocking Google from any parts of your site you want indexed, and declares the location of your sitemap(s). A functional robots.txt file is very important: according to Illyes, if it is down, Google can stop indexing your site altogether.
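
Both robots.txt checks (nothing important is blocked, and the sitemap location is declared) can be automated with Python’s standard `urllib.robotparser` module. The following is a minimal sketch; the robots.txt content and the URLs are hypothetical, and `check_robots` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, must_be_crawlable: list[str], user_agent: str = "Googlebot"):
    """Return (blocked_urls, declared_sitemaps) for the given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [url for url in must_be_crawlable if not parser.can_fetch(user_agent, url)]
    sitemaps = parser.site_maps() or []  # site_maps() returns None when none are declared
    return blocked, sitemaps


# Hypothetical robots.txt for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /cart/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    pages = ["https://example.com/shoes/", "https://example.com/cart/checkout"]
    blocked, sitemaps = check_robots(SAMPLE_ROBOTS, pages)
    print("Blocked:", blocked)
    print("Sitemaps:", sitemaps)
```

In practice you would pass in the URLs you expect to be indexed, so that anything in the blocked list deserves a closer look. Note that `site_maps()` requires Python 3.8 or later.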

Finally, an intuitive and logical navigational link structure is a must for good indexation. Apart from the fact that every page you hope to get indexed should be reachable from at least one link on your site, good UX practices are essential. Categorization is central to this.

For example, research by psychologist George Miller, cited by the Interaction Design Foundation, suggests the human mind can only hold about seven chunks of information in short-term memory at a time.

I recommend your navigational structure be designed around this limitation, and in fact, maybe even limit your menu to no more than five categories to make it even easier for people to use. Five categories per menu section and five subcategories per drop-down may be easier to navigate.

Here are some important points that Google representatives have made regarding navigation and indexation:

Bing recommends the following:

  • Keyword-rich URLs that avoid session variables and docIDs.
  • A highly functional site structure that encourages internal linking.
  • An organized content hierarchy.

3. Get a handle on URL parameters

URL parameters are a very common cause of “infinite spaces” and duplicate content, both of which can severely strain crawl budget and dilute signals. They are variables added to your URL structure that carry server instructions used to do things like:

  • Sort items.
  • Store user session information.
  • Filter items.
  • Customize page appearance.
  • Return in-site search results.
  • Track ad campaigns or signal information to Google Analytics.

If you use Screaming Frog, you can identify URL parameters in the URI tab by selecting “Parameters” from the “Filter” drop-down menu.
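
If you would rather work from a plain list of crawled URLs, a rough inventory of the parameters in use can be built with the standard library. This is a sketch under the assumption that you already have the URLs in a list; `parameter_inventory` is an illustrative name.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_inventory(urls: list[str]) -> Counter:
    """Count how many URLs use each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # A set ensures each parameter is counted once per URL.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


if __name__ == "__main__":
    # Hypothetical crawl output for illustration only.
    crawled = [
        "https://example.com/shoes?sort=price&color=red",
        "https://example.com/shoes?sort=name",
        "https://example.com/hats",
    ]
    for name, count in parameter_inventory(crawled).most_common():
        print(name, count)
```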

Examine the different types of URL parameters at play. Any URL parameters that do not significantly impact the content, such as ad campaign tags, sorting, filtering, and personalizing, should be dealt with using a noindex directive or canonicalization (and never both). More on this later.
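
For the canonicalization route, the canonical URL is typically the same URL with the non-content parameters removed. The sketch below shows one way to compute it. The `IGNORABLE_PARAMS` set is a hypothetical example; which parameters count as non-content is a decision you have to make for your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change page content.
IGNORABLE_PARAMS = frozenset(
    {"utm_source", "utm_medium", "utm_campaign", "sort", "view", "sessionid"}
)


def canonical_url(url: str, ignorable: frozenset = IGNORABLE_PARAMS) -> str:
    """Drop ignorable parameters and the fragment; sort the rest for a stable URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignorable
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(canonical_url("https://example.com/item?utm_source=news&product_id=42&sort=price"))
```

Sorting the remaining parameters means that two URLs differing only in parameter order resolve to the same canonical form.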

Bing also offers a handy tool to ignore select URL parameters within the Configure My Site section of Bing Webmaster Tools.

If the parameters significantly impact the content in a way that creates pages which are not duplicates, here are some of Google’s recommendations on proper implementation:

  • Use standard URL encoding, in the “?key=value&” format. Do not use non-standard encodings such as brackets or commas.
  • You should use parameters, never file paths, to list values that have no significant impact on the page content.
  • User-generated values that don’t significantly impact the content should be placed in a filtering directory that can be hidden with robots.txt, or otherwise dealt with using some form of noindexing or canonicalization.
  • Use cookies rather than extraneous parameters if a large number of them are necessary for user sessions, to eliminate content duplication that taxes web crawlers.
  • Do not generate parameters for user filters that produce no results, so empty pages do not get indexed or tax web crawlers.
  • Only allow pages to be crawled if they produce new content for the search engines.
  • Do not allow links to be clicked for categories or filters that feature no products.
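
The last few recommendations amount to never emitting a link for a filter value that matches no products. A minimal sketch of that idea follows, assuming your platform can supply a product count per filter value; the `filter_links` helper and the sample counts are hypothetical.

```python
from urllib.parse import urlencode


def filter_links(base_url: str, filter_name: str, value_counts: dict[str, int]) -> list[str]:
    """Build filter URLs only for values that match at least one product."""
    return [
        f"{base_url}?{urlencode({filter_name: value})}"
        for value, count in value_counts.items()
        if count > 0
    ]


if __name__ == "__main__":
    # Hypothetical product counts per color for illustration only.
    counts = {"red": 12, "blue": 0, "dark green": 3}
    for link in filter_links("https://example.com/shoes", "color", counts):
        print(link)
```

Using `urlencode` also keeps the output in the standard “?key=value” encoding recommended above.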

4. Good and bad filters

When should a filter be crawlable by the search engines, and when should it be noindexed or canonicalized? My rule of thumb, influenced by Google’s recommendations above, is that “good” filters:

I feel these are, or should be, indexed. “Bad” filters, in my opinion:

  • Reorganize the content without otherwise changing it, such as sorting by price or popularity.
  • Keep user preferences that change the layout or design but don’t affect the content.

These types of filters should not be indexed, and should instead be addressed with AJAX, noindex directives, or canonicalization.

Bing warns webmasters not to use the AJAX pushState function to create URLs with duplicate content, since this defeats the purpose.

5. Proper use of noindex and canonicalization

Noindexing tells the search engines not to index a page, while canonicalization tells the search engines that two or more URLs are actually the same page, but one is the “official” canonical page.

For duplicates or near-duplicates, canonicalization is preferred in most cases since it preserves SEO authority, but it is not always possible. In some circumstances, you don’t want any version of the page indexed, in which case noindex should be used.

Do not use noindex and canonicalization at the same time. John Mueller has warned against this because it could potentially tell the search engines to noindex the canonical page as well as the duplicates, although he said that Google would most likely treat the canonical tag as a mistake.
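
A quick way to audit for this is to scan each page’s HTML for a robots noindex directive alongside a canonical link. The sketch below uses Python’s built-in `html.parser`. It makes one assumption worth stating: it only flags pages whose canonical points to a different URL, treating a self-referencing canonical as harmless. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class IndexSignalParser(HTMLParser):
    """Collect the robots noindex directive and the canonical link from a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {key: (value or "") for key, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def has_conflicting_signals(html: str, page_url: str) -> bool:
    """True if the page is noindexed and also canonicalized to a different URL."""
    parser = IndexSignalParser()
    parser.feed(html)
    return bool(
        parser.noindex
        and parser.canonical is not None
        and parser.canonical != page_url
    )


if __name__ == "__main__":
    # Hypothetical page markup for illustration only.
    html = (
        '<head><meta name="robots" content="noindex, follow">'
        '<link rel="canonical" href="https://example.com/shoes"></head>'
    )
    print(has_conflicting_signals(html, "https://example.com/shoes?sort=price"))
```

This covers only the meta tag and link element. Directives sent through HTTP headers would need a separate check.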

Here are things that should be canonicalized:

Here are things that I recommend be noindexed:

  • Any membership areas or staff login pages.
  • Any shopping cart and thank you pages.
  • Internal search result pages. Illyes has said, “Generally, they are not that useful for users and we do have some algorithms which try to get rid of them…”
  • Any duplicate pages that cannot be canonicalized.
  • Narrow product categories that aren’t sufficiently unique from their parent categories.

As an alternative to canonicalization, Bing recommends using their URL normalization feature, found within Bing Webmaster Tools. This limits the amount of crawling necessary and allows your freshest content to be easily indexed.
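
Several of the noindex candidates above (membership areas, carts, thank-you pages, internal search) usually live under predictable paths, so the decision can be made by URL section. The sketch below is a rough illustration; the paths in `NOINDEX_SECTIONS` are hypothetical and would need to match your own site’s structure.

```python
from urllib.parse import urlsplit

# Hypothetical site sections that should carry a noindex directive.
NOINDEX_SECTIONS = ("/account", "/staff", "/cart", "/thank-you", "/search")


def should_noindex(url: str) -> bool:
    """True if the URL's path is one of the noindex sections or sits beneath one."""
    path = urlsplit(url).path.rstrip("/")
    return any(
        path == section or path.startswith(section + "/")
        for section in NOINDEX_SECTIONS
    )


if __name__ == "__main__":
    for url in (
        "https://example.com/cart/",
        "https://example.com/search?q=red+shoes",
        "https://example.com/cartoons/tom",
    ):
        print(url, should_noindex(url))
```

Matching on whole path segments, rather than a bare prefix, avoids accidentally noindexing unrelated pages whose paths merely start with the same letters.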

Contributing authors are invited to create content for Search Engine Land and are chosen for their expertise and contribution to the search community. Our contributors work under the oversight of the editorial staff and contributions are checked for quality and relevance to our readers. The opinions they express are their own.


About the author

Manish Dudharejia
Contributor
Manish Dudharejia is cofounder and president of E2M Solutions Inc., a full-service digital agency specializing in eCommerce SEO and website design and development. Manish has over 10 years of hands-on experience dealing with technical SEO for eCommerce sites in several different niches.
