Last week, the three major search engines came together to say how they agree — and disagree — over the Robots Exclusion Protocol. It’s such an important standard, one every webmaster should understand. To help, Vanessa Fox has compiled an extensive and outstanding overview of it at Jane & Robot in her Managing Robot’s Access To Your Website post.
The tutorial takes you through key areas such as:
- A nice chart showing what you can block using either robots.txt or the
meta robots tag for each major search engine. It also covers other things
like reverse DNS lookup to verify a crawler’s identity.
- Types of content you want private from search engines versus public.
Rather than private versus public, "not listed" versus "listed" might be
better terms. Anything that really should be private ought to be kept
behind a password barrier. The tutorial does cover this, but it’s worth
stressing that no one should think robots exclusion is a method to keep
private or personally identifiable information out of search engines.
There is other information, though, that you might want "private" only in
the sense of not being listed, such as printer-friendly pages, as the
tutorial also explains.
- How to block search engines, such as on a site-wide basis using
robots.txt, along with tips like using wildcards and specifying particular
search engines by crawler name. Page-level blocking (with meta tags) is
also covered. There are lots of examples.
- Common mistakes and myths are addressed, such as the idea that using nofollow alone will keep pages from being indexed. Methods of testing implementation are also covered.
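To give a feel for the kind of robots.txt blocking and testing the tutorial describes, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is not taken from the tutorial: the rules, the `/print/` directory, the `example.com` URLs, the `SomeOtherBot` crawler name and the `build_parser` helper are all made up for illustration. As far as I know, Python's parser matches plain path prefixes and does not interpret wildcard patterns, so the sketch sticks to simple prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot may crawl everything except the
# printer-friendly pages under /print/; every other crawler is
# blocked site-wide.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /print/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a given crawler name may fetch a given URL.
article = "https://example.com/articles/robots.html"
printable = "https://example.com/print/robots.html"

print(parser.can_fetch("Googlebot", article))      # True
print(parser.can_fetch("Googlebot", printable))    # False
print(parser.can_fetch("SomeOtherBot", article))   # False
```

A check like this only tells you how one parser reads the file. Each search engine applies its own matching rules, so the engines' own testing tools remain the real authority. And as noted above, a Disallow line keeps well-behaved crawlers away but does nothing to protect content that should truly be private.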
Bookmark the guide — it’s one you’ll want to come back to time and again.