
The ultimate guide to bot herding and spider wrangling — Part 3

In this third and final installment, contributor Stephan Spencer outlines common coding, mobile and localization issues and offers workarounds to make sure your code provides consistent cues.

Stephan Spencer on June 28, 2018 at 10:00 am | Reading time: 8 minutes

In parts one and two of this series, we learned what bots are and why crawl budgets are important. In the third and final segment, we’ll review common coding, mobile and localization issues bots may encounter on their journey to let the search engines know what’s important on your site.

Common coding issues

Good, clean code is important if you want organic rankings. Unfortunately, small mistakes can confuse crawlers and lead to serious handicaps in search results.

Here are a few basic ones to look out for:

1. Infinite spaces (also known as spider traps). Poor coding can unintentionally create “infinite spaces,” or “spider traps”: parts of a site where a crawler can keep finding new URLs without ever reaching new content.

Some issues can cause the spider to get stuck in a loop that can quickly exhaust your crawl budget. These include endless uniform resource locators (URLs) pointing to the same content; pages with the same information presented in a number of ways (e.g., dozens of ways to sort a list of products); or calendars that contain an infinity of different dates.

Mistakenly returning a 200 status code in the hypertext transfer protocol (HTTP) response header for pages that should return a 404 error is another way to present bots with a website that has no finite boundaries. Relying on Googlebot to correctly identify all of these “soft 404s” is a dangerous game to play with your crawl budget.
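
One way to spot-check for soft 404s is to request a URL that cannot exist and see which status code comes back. The following is a minimal Python sketch rather than a full audit tool. The function names and the made-up `no-such-page-` path prefix are assumptions for illustration, and the `fetch` parameter exists only so the check can be tried without touching a live site.

```python
import uuid
import urllib.error
import urllib.request
from urllib.parse import urljoin


def fetch_status(url):
    """Return the final HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is what we want.
        return err.code


def serves_soft_404(site_root, fetch=fetch_status):
    """True if a URL that should not exist is still answered with a 200."""
    # A random path (hypothetical prefix) that no real page should match.
    bogus_url = urljoin(site_root, f"/no-such-page-{uuid.uuid4().hex}")
    return fetch(bogus_url) == 200
```

Because `urlopen` follows redirects, a site that quietly redirects missing pages to the home page will also be flagged, which is usually what you want to know.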

When a bot hits large amounts of thin or duplicate content, it will eventually give up, which can mean it never gets to your best content, and you wind up with a stack of useless pages in the index.

Finding spider traps can sometimes be difficult, but using the aforementioned log analyzers or a third-party crawler like Deep Crawl is a good place to start.

What you’re looking for are bot visits that shouldn’t be happening, URLs that shouldn’t exist or substrings that don’t make any sense. Another clue may be URLs with infinitely repeating elements, like:

example.com/shop/shop/shop/shop/shop/shop/shop/shop/shop/…
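
If you have a list of crawled URLs exported from your logs, a few lines of Python can flag the repeating pattern shown above. This is only a rough sketch: the function names and the threshold of three repeats are assumptions chosen for illustration, and a real site may need a different cutoff.

```python
from urllib.parse import urlsplit


def longest_segment_run(url):
    """Length of the longest run of identical consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = run = 0
    previous = None
    for segment in segments:
        # Extend the current run if the segment repeats, otherwise restart.
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def looks_like_spider_trap(url, threshold=3):
    """Flag URLs whose path repeats one segment threshold or more times in a row."""
    return longest_segment_run(url) >= threshold
```

Running `looks_like_spider_trap` over every URL a bot requested gives you a short list of suspects to investigate by hand.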

2. Embedded content. If you want your site crawled effectively, it’s best to keep things simple. Bots often have trouble with JavaScript, frames, Flash and asynchronous JavaScript and XML (AJAX).

Even though Google is getting better at crawling formats like JavaScript and AJAX, it’s safest to stick to old-fashioned hypertext markup language (HTML) where you can.

One common example of this is sites that use infinite scroll. While it might improve your usability, it can make it difficult for search engines to properly crawl and index your content. Ensure that each of your article or product pages has a unique URL and is connected via a traditional linking structure, even if it is presented in a scrolling format.

Mobile sites

Google’s announcement of mobile-first indexing in November 2016 sent shockwaves through the search engine optimization (SEO) community. It’s not really surprising when you think about it, since the majority of searches are conducted from mobile devices, and mobile is the future of computing. Google is squarely focused on the mobile versions of pages rather than the desktop versions when it comes to analysis and ranking. This means that bots are looking at your mobile pages before they look at your desktop pages.

1. Optimize for mobile users first. Gone are the days when a mobile site could be a simplified version of your desktop site. Instead, start by considering the mobile user (and search engine bots) first, and work backward.

2. Mobile/desktop consistency. Although most mobile sites are now responsive, if you have a separate mobile version of your site, ensure that it has the same internal linking structure, and link bi-directionally between the two sites using rel=alternate and rel=canonical link elements.

Point to the desktop version from the mobile site using rel=canonical and point to the mobile site from the desktop site with rel=alternate. Note that this is an interim solution until you move to responsive design, which is the preferred approach, according to Google.
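
To make the two directions concrete, here is a small Python sketch that builds the pair of link elements. The `www.example.com` and `m.example.com` hostnames in the usage note, the function names and the default `media` query are all assumptions for illustration; use whatever breakpoint and URLs match your own templates.

```python
from html import escape


def desktop_head_link(mobile_url, media="only screen and (max-width: 640px)"):
    """Link element for the desktop page, pointing at its mobile twin."""
    return (
        f'<link rel="alternate" media="{escape(media)}" '
        f'href="{escape(mobile_url)}">'
    )


def mobile_head_link(desktop_url):
    """Link element for the mobile page, pointing back at the desktop URL."""
    return f'<link rel="canonical" href="{escape(desktop_url)}">'
```

For a page pair such as `https://www.example.com/page-1` and `https://m.example.com/page-1`, the first function's output goes in the desktop page's head and the second function's output goes in the mobile page's head.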

3. Accelerated mobile pages. Accelerated mobile pages (AMP) are one of Google’s more controversial inventions, and many webmasters are still hesitant to use them, since it means letting Google host a cached version of your pages on its own domain.

Google’s rationale is that accelerated mobile pages allow them to serve content up more quickly to users, which is vitally important with mobile. While it’s not clear whether Google actually prioritizes accelerated mobile pages over other types of mobile pages in search results, the faster load time could contribute to a higher ranking.

Point to the AMP version of a page using rel=amphtml and point back to the canonical URL from the AMP page using rel=canonical. Note that even though accelerated mobile pages are hosted on a Google URL, they still use up your crawl budget.
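
A quick way to confirm that the two annotations agree is to parse both pages and compare what each one points to. The Python sketch below uses only the standard library. The class and function names are made up for this example, and it assumes you already have the HTML of both pages as strings.

```python
from html.parser import HTMLParser


class LinkRelCollector(HTMLParser):
    """Collects the first href seen for each rel value on <link> elements."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        href = attributes.get("href")
        if rel and href:
            self.links.setdefault(rel, href)


def link_rels(html_text):
    """Map rel value to href for the link elements in html_text."""
    collector = LinkRelCollector()
    collector.feed(html_text)
    return collector.links


def amp_pair_is_consistent(canonical_url, canonical_html, amp_url, amp_html):
    """True if the canonical page names the AMP URL and the AMP page points back."""
    return (
        link_rels(canonical_html).get("amphtml") == amp_url
        and link_rels(amp_html).get("canonical") == canonical_url
    )
```

The comparison is an exact string match, so a trailing slash or an http/https mismatch between the two pages will show up as an inconsistency.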

Should you block bad bots?

Unfortunately, it’s not only search engines that use bots. They come in all shapes and sizes… and intentions, including those designed to hack, spy, spam and generally do nasty stuff to your website.

Unlike friendly search engine bots, these spiders are more likely to ignore all your instructions and go straight for the jugular. There are still some hacks you can use to keep bad bots out. Be warned, these hacks can be time-consuming, so it might be worth consulting your hosting company on their security solutions if you’re really struggling.

1. Using .htaccess to block internet protocol (IP) addresses. Blocking bad bots can be as simple as adding a “deny” rule to your .htaccess file for each bot you want to block. The tricky part here, of course, is actually figuring out what IP address the bot is using.

Some bots use several different IPs, meaning you need to block a range of addresses. You also want to make sure you don’t block legitimate IP addresses. Unless you have a list of known IPs to block from a trusted source, or you know which page the bot accessed along with the approximate time or geographical location of the server, you’re likely to spend hours searching through your log files.
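
As a rough illustration, the Python sketch below turns a list of addresses and ranges into "deny" lines and lets you test whether a known-good address would be caught by mistake. The function names and the sample addresses (taken from the ranges reserved for documentation) are assumptions. The directives use the older Order/Allow/Deny syntax; servers running Apache 2.4 may expect the newer Require syntax instead, so check with your host before pasting anything in.

```python
import ipaddress


def deny_rules(blocked):
    """Build .htaccess lines that deny each address or CIDR range in blocked."""
    lines = ["Order Allow,Deny", "Allow from all"]
    for entry in blocked:
        # strict=False normalizes input such as 198.51.100.17/24 to its network.
        network = ipaddress.ip_network(entry, strict=False)
        if network.num_addresses == 1:
            target = str(network.network_address)
        else:
            target = str(network)
        lines.append(f"Deny from {target}")
    return "\n".join(lines)


def would_block(address, blocked):
    """True if address falls inside any blocked address or range."""
    ip = ipaddress.ip_address(address)
    return any(
        ip in ipaddress.ip_network(entry, strict=False) for entry in blocked
    )
```

Calling `would_block` with the addresses of your own office, monitoring service or a verified search engine crawler before deploying the rules is a cheap way to avoid locking out legitimate traffic.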

2. Using .htaccess to block user agent strings. Another option is to set up a “deny” rule for a specific user agent string. Again, you’ll need a list from a trusted source, or you’ll be sorting through your log files to identify a particular bot and then adding the information to your .htaccess file.
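
If you do end up digging through logs, a short script can at least do the counting for you. This Python sketch assumes your server writes the common "combined" log format, where the user agent is the last quoted field on each line. The function names, the `bad_bot` variable name and the "BadBot" string are hypothetical, and the generated rules use the older Order/Allow/Deny syntax together with SetEnvIfNoCase.

```python
import re
from collections import Counter

# The user agent is assumed to be the final quoted field on the line.
AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count requests per user agent string."""
    counts = Counter()
    for line in log_lines:
        match = AGENT_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def user_agent_deny_rules(substrings):
    """Build .htaccess lines denying any user agent containing a substring."""
    lines = []
    for text in substrings:
        if '"' in text:
            raise ValueError("double quotes would break the directive")
        # re.escape keeps characters such as '.' or '(' from acting as regex syntax.
        lines.append(f'SetEnvIfNoCase User-Agent "{re.escape(text)}" bad_bot')
    lines += ["Order Allow,Deny", "Allow from all", "Deny from env=bad_bot"]
    return "\n".join(lines)
```

`count_user_agents(open("access.log")).most_common(20)` (with your own log path) lists the busiest agents, which is a reasonable starting point for deciding what to block. Keep in mind that user agent strings are trivial to fake, so this only stops bots that identify themselves honestly.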

Localization

Since bots need to understand what country/regional version of a search engine you want your pages to appear in, you need to make sure your code and content provide consistent cues about where your sites should be indexed.

1. Hreflang. The hreflang tag (which is actually a type of rel=alternate link element) tells bots what language and region your page is targeting (e.g., en-ca or en-au).

This sounds simple enough, but it can cause a number of headaches. If you have two versions of the same page in different languages, you will need to provide one hreflang tag for each. Those two hreflang tags will need to be included in both pages. If you mess this up, your language targeting could be considered invalid, and your pages might trip the duplicate content filter or not be indexed in the right country version of Google.
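
Since the most common mistake is a missing return link, it can help to generate the tags from one shared list and then verify that every page links back. The Python sketch below is one possible approach, not an official tool. The function names and the example.com URLs in the usage note are assumptions for illustration.

```python
from html import escape


def hreflang_links(variants):
    """Build the full set of hreflang link elements for a group of pages.

    variants maps a language-region code (e.g. "en-ca") to that version's URL.
    The same full set, including the self-reference, goes on every version.
    """
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}">'
        for code, url in sorted(variants.items())
    ]


def missing_return_links(annotations):
    """Find (page, target) pairs where page links to target but not vice versa.

    annotations maps each page URL to the code-to-URL mapping found on that page.
    """
    problems = []
    for page, links in annotations.items():
        for target in links.values():
            if target == page:
                continue  # self-references need no return link
            if page not in annotations.get(target, {}).values():
                problems.append((page, target))
    return sorted(problems)
```

For example, passing `{"en-ca": "https://example.com/ca/widget", "en-au": "https://example.com/au/widget"}` to `hreflang_links` yields two link elements, and both of them belong in the head of both pages. An empty list from `missing_return_links` means every annotation is reciprocated.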

2. Local spellings. While hreflang tags are important, bots are also looking for other clues that guide them on how they should index your site. One thing to be careful of is local spellings. If your page is targeted at a US audience, yet you use UK spellings, it could result in being listed in the wrong country version of Google.

3. Top-level domains, subdomains or subdirectories for different locations. If you want to make it even clearer to bots that your content is targeted to a specific region, you can use country code top-level domains (ccTLDs), subdomains or subdirectories. For example, the following are various ways to indicate content targeted at Canadian users:

example.ca/category/widget

ca.example.com/category/widget

example.com/ca/category/widget

Conclusion

While many website owners and even some SEOs may think they can wing it with good content and quality backlinks alone, I want to emphasize that many of these small tweaks can have a significant impact on your rankings.

If your site’s not crawled — or crawled badly — your rankings, traffic and sales will ultimately suffer.

