Are Your Language Detection Methods Blocking Search Engine Crawlers?


At a recent international search marketing conference in London, the most frequent question asked by the audience was “How do I get my content found and indexed by global and local search engines?”

During the breaks I talked to a few people who indicated that little or none of their local-market content was being indexed by the major engines. Close examination of these sites revealed that they were all using some form of language detection. In two cases, they had implemented language detection simply because they saw Google doing it and assumed it was the best approach.

However, if you are like me and travel a lot to other countries, you know that assumption can lead to a big problem: just because I am physically located in a particular country doesn’t necessarily mean I want to see content in its native language. Other factors come into play, such as my browser’s default language, the language I use for queries, and so on.

Let’s take a deeper look at dynamic language and location detection and explore some of the things you should do to make the process work better.

What is your default dynamic language response?

Browser-level language detection is the most popular method of determining a language preference. Your web server simply reads the visitor’s language preference, submitted via the Accept-Language header, and then locates and serves content matching that language code. For example, a person who downloads the French-language version of Firefox will typically have their default language preference set to “French” or “French_France.” When they visit your site, the server reads the preference and automatically redirects the visitor to the French version of the site.
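To make this concrete, here is a minimal sketch of the approach, assuming a Flask application with hypothetical /en/, /fr/ and /de/ site sections; the supported-language list and URL structure are illustrative, not a prescription.

```python
# A minimal sketch of browser-level language detection. Assumptions: a Flask
# app, and hypothetical /en/, /fr/ and /de/ site sections.
from flask import Flask, redirect, request

app = Flask(__name__)
SUPPORTED_LANGUAGES = ["en", "fr", "de"]  # illustrative, not exhaustive

@app.route("/")
def detect_language():
    # Werkzeug parses the Accept-Language header into request.accept_languages;
    # best_match() picks the closest supported code (e.g. "fr-FR" matches "fr").
    lang = request.accept_languages.best_match(SUPPORTED_LANGUAGES, default="en")
    return redirect(f"/{lang}/")
```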

While the Accept-Language header can be a good starting point for determining the language of the user, it is often misused to “assume” their location. There are real advantages to detecting a searcher’s language preference, such as serving local content, setting the local currency, or formatting phone numbers appropriately, but there are also potentially catastrophic implications for your search marketing program if you default to the browser language preference without considering other factors.

The problem for search marketers is that most search engine crawlers do not send an Accept-Language header, and therefore express no language preference at all. Because crawlers send no preference when requesting pages, they are served the “default” language of the server. Do you know your server’s default language? Many web servers, especially those running IIS and .NET, serve English by default, meaning search engine crawlers will only ever see the English-language version of your site, no matter how much content you have in other languages.
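You can see the effect in a few lines of Python, using Werkzeug (the library underneath Flask) to simulate a crawler request that carries no Accept-Language header; the match always falls back to the default:

```python
# Simulating a crawler request with Werkzeug: no Accept-Language header means
# the match always falls back to the default. Assumption: "en" stands in for
# your server's configured default language.
from werkzeug.datastructures import LanguageAccept
from werkzeug.http import parse_accept_header

crawler_header = ""  # most crawlers send no Accept-Language header at all
langs = parse_accept_header(crawler_header, LanguageAccept)
print(langs.best_match(["fr", "de", "en"], default="en"))  # -> "en", every time
```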

Is your IP detection’s default location keeping crawlers from finding your local content?

All of the problem sites I encountered at the conference were using IP location detection scripts. Essentially, these scripts look up the IP address of the site visitor and serve predetermined content based on the country and/or city where the visitor has connected to the Internet. For example, I am writing this article in Berlin, Germany. When I go to Google.com, the server detects that I am in Germany, routes me to Google.de, and presents the home page in German. As a native English speaker, I have to take extra steps to counter this and select Google in English to get to the content I need. This is a major problem for search engine crawlers.
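Here is a sketch of what such a script typically does, using MaxMind’s geoip2 library as a stand-in; the database file, country-to-URL mapping, and example.com domain are all hypothetical, and the vendors mentioned later in this article have their own implementations.

```python
# A sketch of a typical IP-detection script using MaxMind's geoip2 library.
# Assumptions: the GeoLite2 database file, the country-to-URL mapping, and
# example.com are all hypothetical; real sites may use other vendors.
import geoip2.database
import geoip2.errors

COUNTRY_SITES = {"DE": "https://example.com/de/", "FR": "https://example.com/fr/"}
DEFAULT_SITE = "https://example.com/en/"

def route_by_ip(ip_address: str) -> str:
    try:
        with geoip2.database.Reader("GeoLite2-Country.mmdb") as reader:
            country = reader.country(ip_address).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        country = None
    # A Googlebot request from Mountain View resolves to "US", which is not in
    # the mapping, so the crawler is routed to the default English site.
    return COUNTRY_SITES.get(country, DEFAULT_SITE)
```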

The problem is that most search engine crawlers crawl from a specific country. While I have occasionally seen Google crawlers come from Zurich, they primarily crawl from the main data center in Mountain View, California. Because of these crawler locations, no matter how your site is targeted, the detection scripts will only ever route the crawlers to English or US-centric content, again making content in other languages and countries invisible to them.

Testing the defaults and making exceptions

To truly understand what your servers are doing, you need to test them so that you are confident they serve the right content in every situation, especially to the crawlers. Simply asking your IT team what is happening is never sufficient proof that the settings are right.
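One quick way to supplement human testing is a scripted smoke test. The sketch below assumes the requests library and uses example.com as a stand-in for your own site; it compares what a French browser, a Googlebot-like client, and a completely header-less request each receive.

```python
# A scripted smoke test of the live defaults. Assumptions: the requests
# library, and example.com standing in for your own site.
import requests

URL = "https://example.com/"
CASES = {
    "French browser": {"Accept-Language": "fr-FR,fr;q=0.9"},
    "Googlebot-like": {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    "No headers at all": {},
}

for label, headers in CASES.items():
    # allow_redirects=False exposes the redirect itself instead of following it
    resp = requests.get(URL, headers=headers, allow_redirects=False)
    print(f"{label}: {resp.status_code} -> {resp.headers.get('Location', 'no redirect')}")
```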

The most reliable test is to have co-workers or customers visit the sites from various locations with different language settings turned on and off to see what happens in each situation. You should check not only the default settings of your server but also any subscription services your company may deploy, such as Akamai’s Global Traffic Management IP Intelligence or cyScape’s CountryHawk IP detection solution. For both of these tools, as well as any other scripts found on the web, you need to ensure that you have added exceptions to the redirection rules that recognize the user-agent names of the major search crawlers and allow them to access the page they requested.
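A crawler exception can be as simple as checking the user-agent string before any redirect fires. This sketch assumes a Flask before_request hook, and the token list is illustrative rather than an exhaustive inventory of crawler user agents.

```python
# A minimal crawler exception: check the user agent before any redirect fires.
# Assumptions: a Flask before_request hook; the token list is illustrative,
# not an exhaustive inventory of crawler user agents.
from flask import Flask, request

app = Flask(__name__)
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp", "baiduspider", "yandex")

@app.before_request
def skip_redirect_for_crawlers():
    ua = request.headers.get("User-Agent", "").lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return None  # serve the page as requested; no language/IP redirect
    # ...otherwise fall through to the normal detection and redirect logic
```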

Finally, you should develop local language/country XML sitemaps for each local version of the site and register them with the engines so that the crawlers have a direct way to access the pages and index your content.
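As a final illustration, here is a small sketch that writes one sitemap per locale; the per-locale URL lists are hypothetical, and a real site would generate them from its CMS before registering each file through the engines’ webmaster tools.

```python
# A small sketch that writes one XML sitemap per locale. Assumptions: the
# per-locale URL lists are hypothetical; a real site would generate them from
# its CMS and then register each file with the engines' webmaster tools.
from xml.sax.saxutils import escape

LOCALE_URLS = {
    "fr": ["https://example.com/fr/", "https://example.com/fr/produits/"],
    "de": ["https://example.com/de/", "https://example.com/de/produkte/"],
}

for locale, urls in LOCALE_URLS.items():
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
    with open(f"sitemap-{locale}.xml", "w", encoding="utf-8") as f:
        f.write(sitemap)
```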




About the author

Bill Hunt
Contributor
Bill Hunt is currently the President of Back Azimuth Consulting and co-author of Search Engine Marketing Inc. His personal blog is whunt.com.
