Is AJAX Killing Your Crawl Budget?
Columnist Gene McKenna of Groupon discusses options for how to handle JSON files when optimizing your crawl budget.
Numerous folks have written about optimizing your crawl budget. It’s a good idea — keep Google focused on the right stuff on your site, and prevent it from needlessly crawling the wrong stuff (or from crawling the right stuff 1,000 times over in slightly different ways).
3 Standard Tips For Crawl Budget Management
- Use Google Search Console (formerly Webmaster Tools) to tell Google which URL parameters to ignore. If the “a” parameter has no impact on the content of a page, you can tell Google to ignore the “a” parameter. This indicates that URLs like mypage.html?a=foo and mypage.html?a=bar should be considered the same page, so Google knows it doesn’t have to crawl every version it finds (e.g., mypage.html?a=x, ?a=y, ?a=z).
- Achieve the same thing using canonical tags. The page at mypage.html?a=foo might say that the canonical URL for the page is just mypage.html. Google can be slow to learn this (days to weeks) — but once it does, it will generally stop or reduce crawling of non-canonical variants.
- Make smart use of the nofollow attribute and robots.txt rules. Don’t let Google crawl stuff it shouldn’t.
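As a rough illustration of the third tip, the sketch below uses Python's standard urllib.robotparser module to check which URLs a set of robots.txt rules would block before you deploy them. The rules, the /internal-search/ path and the example.com URLs are all hypothetical. This parser does simple prefix matching and may not mirror how Googlebot handles wildcard patterns, so treat it as a sanity check rather than a definitive answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of an internal search path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal-search/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/mypage.html",
        "https://www.example.com/internal-search/results?q=shoes",
    ):
        print(url, is_crawlable(ROBOTS_TXT, "Googlebot", url))
```

Running a list of URLs you do want crawled through the same check is a cheap way to catch a rule that blocks more than intended.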
Crawl Budget & AJAX
AJAX-heavy sites probably raise many new issues to consider, or old issues to reconsider with respect to new file types. This article focuses on just one of those: JSON files.
An AJAX call often initiates a .json request to get data that will be dynamically inserted into the page. That means if mypage.html includes mypage.json as a data resource, you will start to see requests by Googlebot for those .json files in your Web logs.
Depending on how your JSON resource requests are formed, this can also create a lot of duplicate URLs, or URL variations that you don’t necessarily want Google accessing. For example, a request to mypage.html?a=foo might result in a request for mypage.json?a=foo. Just as you may not consider ?a=foo to provide content different from ?a=bar, requests for mypage.json?a=foo and mypage.json?a=bar may return the same thing.
And if you use JSONP, a common variant of JSON, the URL will typically have two parameters added to every request: callback and _. The details may vary depending on the library used to initiate the .json request, but these parameters are designed specifically to have unique values each time they are used. With jQuery, for example, the date and time are embedded into the value of callback.
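To make the duplication concrete, here is a minimal Python sketch that strips the two JSONP parameters from request URLs and shows that otherwise distinct URLs collapse into one. The example.com URLs and the callback and _ values are made up for illustration. They are not taken from real logs, and the exact values your library generates will look different.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed not to affect the response body.
IGNORED_PARAMS = frozenset({"callback", "_"})


def strip_ignored_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return url with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Made-up values shaped like what a JSONP library might generate.
    requests = [
        "https://www.example.com/mypage.json?a=foo&callback=cb_1400000000001&_=1400000000002",
        "https://www.example.com/mypage.json?a=foo&callback=cb_1400000000101&_=1400000000102",
    ]
    unique = {strip_ignored_params(url) for url in requests}
    print(len(requests), "requests,", len(unique), "unique after stripping")
```

From a crawler's point of view, every fresh pair of values is a new URL, even though the data returned is identical.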
Unfortunately, not all of the above crawl control options are available for JSON files. You can block *.json or specific .json paths in robots.txt, but if you need Google to see the content contained in that JSON file, you don’t want to block it. You can’t put a canonical into a JSON file, nor can you use a noindex tag. JSON files are for data.
So, in many cases you have to rely on telling Google to ignore specific parameters.
Testing that we’ve done has shown two interesting things:
- If you rely on Google Search Console to surface all commonly used parameters it finds on your site, it may not surface these. It didn’t for us (though you may have a different experience). Maybe we already had so many other parameters identified that Google just hadn’t gotten around to these yet. However, you can always add them to the list manually and then tell Google to ignore them. We also considered the possibility that Google already knows about these special parameters because they are so common across the Web, and that they were missing from our list because Google was silently ignoring them by default. That turned out not to be the case: we observed Google crawling a single page more than 2,000 times in 10 days with multiple combinations of five parameters. Those five included three that Google had been told to ignore months prior, plus the two JSONP parameters (callback and _), which Google hadn’t yet been told to ignore.
- Once we added “callback” and “_” to the list of parameters Google should ignore, the crawl rate of .json files dropped dramatically.
Some Final Advice
- If your .json call doesn’t need to pass along all the parameters that the containing page was called with, don’t pass them.
- If you don’t have to use JSONP, don’t. Many sites use it to get around the browser’s same-origin policy (and it’s a bit dodgy at that). If a page at www.domain.com calls a data service providing JSON data at xxx.domain.com or www.otherdomain.com, that is a cross-origin request, and many browsers will complain with warning messages or simply not allow it. JSONP was a way around this. We realized we didn’t need JSONP, since all pages at www.groupon.com made requests for JSON resources also at www.groupon.com. Plain old JSON would do just as well for us.
- If you are using JSONP, make sure that the parameters “callback” and “_” are in your list of parameters to ignore.
- Check your logs often for other parameters on URLs that Google is crawling. If they are not necessary for a page to get the correct content, block them. This advice now includes checking specifically for .json requests because canonicals you may be using for regular page parameters won’t work for .json requests.
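One way to act on the last tip is to tally which query parameters appear on the .json URLs that Googlebot requests. The Python sketch below assumes access logs in the common or combined format and identifies Googlebot by a simple user-agent substring match, which does not prove a request really came from Google. The sample log lines are fabricated for illustration.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Matches the request line of a common/combined-format access log entry.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def count_json_params(log_lines) -> Counter:
    """Count query parameter names on .json URLs requested by Googlebot."""
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if not match:
            continue
        parts = urlsplit(match.group(1))
        if not parts.path.endswith(".json"):
            continue
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            counts[name] += 1
    return counts


if __name__ == "__main__":
    googlebot = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    # Fabricated sample lines, for illustration only.
    sample = [
        f'66.249.66.1 - - [01/Jan/2015:00:00:01 +0000] "GET /mypage.json?a=foo&callback=cb_1&_=1 HTTP/1.1" 200 512 "-" {googlebot}',
        f'66.249.66.1 - - [01/Jan/2015:00:00:05 +0000] "GET /mypage.json?a=bar&callback=cb_2&_=2 HTTP/1.1" 200 512 "-" {googlebot}',
        '203.0.113.9 - - [01/Jan/2015:00:00:09 +0000] "GET /mypage.json?a=foo HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    ]
    for name, total in count_json_params(sample).most_common():
        print(name, total)
```

Any parameter that shows up often here but is missing from your ignore list is a candidate for review.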
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.