A Lesson From the Indexing of Google Translate: Blocking Search Results From Search Results

Last year, Google published an SEO Report Card of 100 Google properties. In it, they rated themselves on how well the sites were optimized for search. Google’s Matt Cutts presented the results at SMX West 2010 in Ignite format. He noted that not every Googler is an expert in search and search engine optimization. Googlers who don’t work in search don’t get preferential treatment from those who do, and, just as on any other site on the internet, sometimes things aren’t implemented correctly. Just because a site is owned by Google doesn’t mean it’s the best example of what to do in terms of SEO.

This morning Rishi Lakhani tweeted about Google Translate pages appearing in Google search results. As you can see in the example below, pages with individual translation requests have been indexed.

Google Translate Search Results

All of the URLs that include a parameter seem to be individual translations. For instance, http://translate.google.com/?q=ART# displays as follows:

Google Translate Example

The problem with these types of pages being indexed in search engines is twofold:

A site owner might also want to block these types of pages from being crawled and indexed to increase crawl efficiency and ensure the most valuable pages on the site are being crawled and indexed instead.

I asked Google about this and they confirmed that indeed it was simply a matter of the Google Translate team not being aware of the issue and said they would resolve it.

Blocking Autogenerated Search Pages From Being Indexed

In the case of Google Translate, the ideal scenario is that the main page and any secondary pages (such as this tools page) be indexed, but that any pages from translation requests not be indexed.

Using robots.txt

The best way to do this would be to add a disallow line in the robots.txt file for the site that blocks indexing based on a pattern match of the URL query parameter. For instance:

Disallow: /*q=

This pattern would prevent search engines from indexing any URLs containing q=. (The * before the q= means that the q= can appear anywhere in the URL.)
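To make the wildcard behavior concrete, here is a minimal Python sketch of how a crawler might test a URL against a pattern such as /*q=. It assumes the wildcard rules that the major search engines document (* matches any run of characters, a trailing $ anchors the end of the URL). The function names are hypothetical, and this is not any search engine's actual implementation.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the first character
    of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url: str, pattern: str) -> bool:
    """Return True if the URL's path plus query string matches the pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return pattern_to_regex(pattern).match(target) is not None


if __name__ == "__main__":
    for url in ("http://translate.google.com/?q=ART#",
                "http://translate.google.com/"):
        print(url, is_disallowed(url, "/*q="))
```

Because the pattern is effectively a substring match, it would also catch any parameter whose name merely ends in q (a hypothetical seq=1, for example), so it is worth checking a site's real URLs against a pattern before deploying it.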

In the case of translate.google.com (and all related TLDs), the robots.txt file that exists for the subdomains seems to be copied from www.google.com. Remember that search engines obey the robots.txt file for each subdomain separately. Using the same robots.txt file for a subdomain that’s used for the www variation of the domain could have unintended consequences because the subdomain likely has an entirely different folder and URL structure. (You can always check the behavior of your robots.txt file using Google Webmaster Tools.)

Adding the disallow pattern shown above to the www.google.com/robots.txt file would not work, as search engines wouldn’t check that file when crawling the translate subdomain. It would instead cause search engines not to index URLs that match the pattern on www.google.com.

translate.google.com (and all google.com subdomains) should have their own robots.txt file that’s customized for that subdomain.
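The per-subdomain rule can be illustrated with a short Python sketch that works out which robots.txt file applies to a given page. The helper name is hypothetical. The only behavior it assumes is the standard one: a crawler looks for robots.txt at the root of the same scheme and host as the page it wants to fetch.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # robots.txt lives at the root of the same scheme and host (including
    # any port); the page's path, query, and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://translate.google.com/?q=ART#"))
    print(robots_txt_url("http://www.google.com/search?q=ART"))
```

The two calls return different files, which is why a rule added to www.google.com/robots.txt has no effect on translate.google.com.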

Using the meta robots tag

If Google isn’t able to create a separate robots.txt file for the translate subdomain, they should first remove the file that’s there (and from other subdomains as well, as it could be causing unexpected indexing results for those subdomains). Then, they should use the meta robots tag on the individual pages they want blocked. Since the pages in question are dynamically generated, the way to do this would be to add logic to the code that generates these pages so that it writes the robots meta tag to the page as it’s created. This tag belongs in the <head> section of the page and looks as follows:

<meta name="robots" content="noindex">
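As a rough illustration of the page-generation logic described above, the following Python sketch writes the tag only when the request looks like an individual translation. It assumes that a q query parameter is what marks such a page, as in the example URL earlier. The function and constant names are hypothetical and do not reflect how Google Translate is actually built.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex">'


def render_head(request_url: str, title: str) -> str:
    """Build the <head> markup, adding noindex for translation-request URLs."""
    query = parse_qs(urlsplit(request_url).query, keep_blank_values=True)
    lines = ["<head>", f"<title>{escape(title)}</title>"]
    # Assumption: a 'q' parameter marks a page generated from a
    # translation request, which should stay out of the index.
    if "q" in query:
        lines.append(NOINDEX_TAG)
    lines.append("</head>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_head("http://translate.google.com/?q=ART#", "Google Translate"))
    print(render_head("http://translate.google.com/", "Google Translate"))
```

Keep in mind that a crawler has to fetch a page to see its meta tag, so this approach only works if the same URLs are not also disallowed in robots.txt.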

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.
