Understanding the difference between the robots.txt file and Robots <META> Tag is critical for search engine optimization and security. It can have a profound impact on the privacy of your website and customers as well. The first thing to know is what robots.txt files and Robots <META> Tags are.

Robots.txt

Robots.txt is a file you place in your website’s top level directory, the same folder in which a static homepage would go. Inside robots.txt, you can instruct search engines not to crawl content by disallowing file names or directories. A robots.txt record has two parts: a user-agent line and one or more disallow instructions.

The user-agent specifies one or all Web crawlers or spiders. When we think of Web crawlers we tend to think Google and Bing; however, a spider can come from anywhere, not just search engines, and there are many of them crawling the Internet.

Here is a simple robots.txt file telling all Web crawlers that it is okay to spider every page:

User-agent: *
Disallow:

To disallow all search engines from crawling an entire website, use:

User-agent: *
Disallow: /

The difference is the forward slash after Disallow:, signifying the root folder and everything in it, including sub-folders and files.
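
To see the effect of that single slash, here is a minimal sketch using Python's standard-library urllib.robotparser module. The crawler name ExampleBot and the example.com URL are made up for illustration; they are not from any real site.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt files shown above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler would be allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/products/widget.html"  # hypothetical URL
    print(may_crawl(ALLOW_ALL, page))  # empty Disallow: nothing is blocked
    print(may_crawl(BLOCK_ALL, page))  # Disallow: / blocks the whole site
```

The same page is crawlable under the first file and blocked under the second, which is the whole difference the slash makes.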

Robots.txt is versatile. You can disallow entire sub-folders or individual files. You can disallow specific search engine spiders like Googlebot and Bingbot. The search engines even extended robots.txt to include an Allow directive, file or folder name pattern matching, and XML sitemap locations.
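
As a rough illustration of per-spider rules, sub-folder and single-file disallows, and sitemap lines, the sketch below parses a made-up robots.txt with Python's urllib.robotparser. Every path, the example.com host, and the OtherBot name are assumptions for the example. It sticks to plain path prefixes, because the standard-library parser matches on prefixes and does not implement the pattern matching extension. The site_maps() call needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
Disallow: /drafts/secret.html

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a path on the hypothetical example.com host for one crawler."""
    return parser.can_fetch(agent, "https://www.example.com" + path)


# Googlebot follows only its own group, so /tmp/ is not blocked for it.
print(allowed("Googlebot", "/private/report.html"))
print(allowed("Googlebot", "/tmp/cache.html"))

# Any other crawler falls back to the "*" group.
print(allowed("OtherBot", "/tmp/cache.html"))
print(allowed("OtherBot", "/drafts/secret.html"))
print(allowed("OtherBot", "/drafts/public.html"))

# Sitemap lines are independent of the user-agent groups.
print(parser.site_maps())
```

Note that a crawler with its own group ignores the rules under the asterisk, so anything you want blocked for Googlebot has to be repeated in Googlebot's group.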

Here is a beautifully executed robots.txt file from SEOmoz:

#Nothing interesting to see here, but there is a dance party
#happening over here: http://www.youtube.com/watch?v=9vwZ5FQEUFg

User-agent: *
Disallow: /api/user?*
Disallow:

Sitemap: http://www.seomoz.org/blog-sitemap.xml
Sitemap: http://www.seomoz.org/ugc-sitemap.xml
Sitemap: http://www.seomoz.org/profiles-sitemap.xml
Sitemap: http://app.wistia.com/sitemaps/2.xml

 


What robots.txt does not do is keep files out of the search engine indexes. It only instructs search engine spiders not to crawl pages. Keep in mind that discovery and crawling are separate. Discovery occurs as search engines find links in documents. When search engines discover pages, they may or may not add them to their indexes, so a disallowed URL that is linked from elsewhere can still show up in search results even though its content was never crawled.

Robots.txt Does Not Keep Files Out Of The Search Index!

See for yourself at site:permanent.access.gpo.gov.

[Screenshot: Google search results for site:permanent.access.gpo.gov]

Is Robots.txt A Security Or Privacy Risk?

Using robots.txt to hide sensitive or private files is a security risk. Not only might search engines index disallowed files, but robots.txt itself is publicly readable, so listing private paths in it is like giving a treasure map to pirates. Take a look for yourself and see what you learn.

Here is Search Engine Land’s robots.txt file.

User-Agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
Disallow: /gkd/
Disallow: /figz/wp-admin/
Disallow: /figz/wp-content/plugins/
Disallow: /figs/wp-includes/
Disallow: /images/20/
Disallow: /css/
Disallow: /*/feed
Disallow: /*/feed/rss
Disallow: /*?

I used it to search for inurl:http://searchengineland.com/figz. As you can see, I found a few files I am probably not supposed to know about.

[Screenshot: Google search results for inurl:http://searchengineland.com/figz]

Don’t worry; if I had seen something risky or sensitive on Search Engine Land, I would never have shared this example. Can you say the same about your website or online application?
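
To show how little effort the treasure map takes to read, here is a small sketch that pulls every disallowed path out of a robots.txt file. It uses the Search Engine Land file quoted above as input; the function name disallowed_paths is my own, and the parsing is deliberately simplistic (it ignores which user-agent group a rule belongs to).

```python
# The robots.txt file quoted in the article.
SEL_ROBOTS_TXT = """\
User-Agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
Disallow: /gkd/
Disallow: /figz/wp-admin/
Disallow: /figz/wp-content/plugins/
Disallow: /figs/wp-includes/
Disallow: /images/20/
Disallow: /css/
Disallow: /*/feed
Disallow: /*/feed/rss
Disallow: /*?
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """List every non-empty Disallow value, regardless of user-agent group."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SEL_ROBOTS_TXT):
        print(path)
```

Anyone can run something like this against any public robots.txt, which is why the file is the wrong place to list anything you want kept private.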

Use Robots <META> Tag To Keep Files Out Of The Search Index

Because robots.txt does not exclude files from the search indexes, Google and Bing honor a protocol that does exactly that: the Robots <META> tag.

<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>

The robots <META> tag provides two instructions:

  1. index or noindex
  2. follow or nofollow

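Index or noindex instructs search engines whether or not to index a page. When you select index, they may or may not choose to include a webpage in the index. If you select noindex, search engines that honor the tag will not include it, provided they are allowed to crawl the page and read the tag. A page that is also disallowed in robots.txt never has its noindex seen.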

Follow or nofollow instructs Web crawlers whether or not to follow the links on a page. Nofollow is like adding a rel="nofollow" attribute to every link on a page. Nofollow evaporates PageRank, the raw search engine ranking authority passed from page to page via links. Even if you noindex a page, it is probably a bad idea to nofollow it. Let PageRank flow through to its final conclusion. Otherwise, you could be pouring perfectly good link juice down the drain.

When you want to exclude a page from the search engine indexes, do this:

<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>
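
If you want to check which robots directives a page actually carries, the sketch below reads them with Python's standard-library html.parser. It illustrates how the tag can be read and is not how any search engine implements it; the class and function names are my own.

```python
from html.parser import HTMLParser

# The snippet from the article, used here as test input.
PAGE = """\
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</head>
"""


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, not attribute values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set[str]:
    """Return the lowercased robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    found = robots_directives(PAGE)
    print(sorted(found))
    print("indexable:", "noindex" not in found)
```

Running a check like this across a site is a quick way to confirm that the pages you meant to exclude really carry noindex, and that no page you want ranked picked it up by accident.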

There’s No Stopping Bad Behavior

A problem you will have with both robots.txt and the robots <META> tag is that neither can enforce its directives. Google and Bing will respect your instructions, but someone using Screaming Frog, Xenu, or their own custom site crawler can simply ignore disallow and noindex directives.
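
The point is easier to see in code. In the sketch below, obeying robots.txt is an ordinary parameter that the crawler's author sets; nothing on the server side can force it to stay on. The function name, the obey_robots flag, ExampleBot, and the example.com URLs are all hypothetical, and this does not describe how any particular crawling tool is built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tries to keep crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def urls_to_fetch(urls, robots_txt, agent="ExampleBot", obey_robots=True):
    """Return the URLs a crawler would request.

    obey_robots is a choice made by whoever writes the crawler.
    """
    if not obey_robots:
        # An impolite crawler skips the check and requests everything.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    urls = [
        "https://www.example.com/",
        "https://www.example.com/private/plan.html",
    ]
    print(urls_to_fetch(urls, ROBOTS_TXT))                     # polite crawler
    print(urls_to_fetch(urls, ROBOTS_TXT, obey_robots=False))  # impolite crawler
```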

The only real security is to lock private content behind a login. If your business is in a competitive space, it will get crawled from time to time and there are few things you can do to stop or impede it.

One last note: I am not letting any cats out of the bag here. Pirates and hackers know all of this. They have known for years. Now you do, too.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.


About The Author: Tom is a longtime Internet marketing analyst and consultant specializing in inbound marketing, social media and SEO. He enjoys helping enterprise brands organize their Web presence and grow search engine and referral traffic. Tom began Internet marketing in 1996. You can read more of Tom's musings at http://inboundbound.com.

  • http://twitter.com/ChaseSEO Chase Anderson

    Great article – I think it’s a very common misconception to think robots.txt will remove files from the index. Thank you for spreading the good word and helping keep everyone up on the less than admirable techniques that white hats might not be aware of.

  • Stefano Piotto

    Good, only one clarification. Most webmasters use both robots.txt and meta robots, thinking this gives double security, but robots.txt blocks the spider, which then cannot read the meta robots noindex instruction: it’s a possible own goal.

  • http://www.rimmkaufman.com/ George Michie

    Important article, Tom, and well written, too. This is a very common misunderstanding.

  • http://www.clippingpathindia.com/ clipping path service provider

    This is very important article.

  • http://www.michaelcropper.co.uk/ Michael Cropper

    If people believe that using robots.txt keeps files secure, they should choose a new profession. Anything that is private should be kept behind a login – it’s as simple as that.

  • http://gallardomark.com/ Mark Gallardo

    a lot are still confused of these two ;) glad you explained it very well. i could share this to them ;)

  • स्वप्निल कुलकर्णी.

    I think, by-default it is allowed to access(crawl) all the web pages of your website and when you are using robots.txt then you are specifically disallowing pages of your website. Then why Google.com/robots.txt is using following line in robots.txt file ?

    Allow: /news/directory

    Is there any reason??

  • http://twitter.com/sharithurow sharithurow

    Great article and examples!

    I like the Google thing (and I mean that sarcastically) that if you robots.txt a page, then they can’t read the meta tag. It’s one way of wanting to choose what does and does not go into the index, taking away control from website owners.

    I get it. It’s their index and their search results. But I think website owners should ultimately pick. They probably know more about their specific group of users than a search engine does. They have better context (at least I hope they do).

  • http://about.me/mohammedalami Meding44

    I agree totally with you, @tom. Clients are not aware of how the Google index works, so they think that because they changed their robots.txt everything is solved. Google doesn’t care about that while refreshing its index. The only way is to add noindex on the page, and sometimes we have to clear the cache via GWT, which is very time consuming. I hope the standard will evolve to match our new understanding of search engines.

  • http://www.linkworxseo.com/ Link Worx Seo

    What I have read and found is that the allow: is not needed. Stick to the disallow and forget about using allow: If you search the net for robots.txt file programs and research the proper usage and layout of one, you will find that the allow: is not actually needed when done properly.

  • Usha Ghosh

    Hey Tim! Its a very good article :)

    But, please note down the broken links in your article!!

    http://www.google.com/robots.txt

    http://www.bing.com/robots.txt

    http://searchengineland.com/robots.txt

  • robthespy

    I think most of us knew this. But it’s important nonetheless. And I’m sure there are many people who will benefit from this.

    Well done, Tom!

  • http://www.facebook.com/therealbenguest Ben Guest

    Anything that needs to be kept private, needs to be kept off the internet.

  • http://www.irishwonder.com IrishWonder

    Robots meta tag is a good solution but if you’ve got a directory on your server with .pdf files or images or something else non-HTML that you do not want indexed or visible or don’t want anyone to know about you’re out of luck with robots meta. The only thing you can do in this case is password protect the directory.

 
