Removing Pages From Google: A Comprehensive Guide For Content Owners

As a site owner, you generally want Google to index as many pages of your site as possible. But there are certainly times when you find that you’ve accidentally let Google index confidential content or other information you don’t want published, and you want to get it removed as quickly as possible. Read on for all the details of how to get content you own, published on your own site, successfully removed from Google’s search results.

NOTE: Want to remove information about you from Google when it appears on sites that you do not control? In some limited cases, this is possible. We have a separate guide for that situation: Removing Your Personal Information From Google.

Keeping Content Out Of Google’s Search Results From the Start

Ideally, information that you don’t want in Google’s index won’t end up there at all. The best ways to ensure this are:

  • Require a login to access the information – This method, of course, not only keeps Google out, but ensures that only those you want to view the content are able to. You would use this method, for instance, to keep personal information like credit card and social security numbers private and to manage access to premium content.
  • Use the Robots Exclusion Protocol to block search engines from crawling and/or indexing the content – You can block content using a robots.txt file, a robots meta tag, or an X-Robots-Tag HTTP response header. Using a Disallow statement in the robots.txt file keeps Googlebot from crawling the page, although the URL itself may still end up indexed. Using a noindex robots meta tag on the page allows Googlebot to crawl the page, but keeps Google from indexing the page contents or displaying the URL in the search results. (Note that while Google has unofficially followed the Noindex directive in robots.txt, it hasn’t officially stated support for this directive, so it’s not an ideal way to ensure content remains out of Google search results.)
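
To make the difference between the blocking methods above concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The /private/ path and the example.com URLs are assumptions for illustration only, and Python's parser only approximates how Googlebot matches rules, so treat this as a sanity check rather than proof of what Google will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /private/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Page-level alternatives that allow crawling but ask for no indexing:
# a robots meta tag in the HTML, or the equivalent HTTP response header.
NOINDEX_META_TAG = '<meta name="robots" content="noindex">'
NOINDEX_HEADER = ("X-Robots-Tag", "noindex")
```

With these rules, a URL under /private/ is reported as blocked from crawling while other URLs remain crawlable. Remember that a Disallow rule alone does not keep the URL itself out of the index.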

Google Webmaster Trends Analyst John Mueller recently suggested that you update your robots.txt file to block content a day before you add that content since Google caches a site’s robots.txt file for 24 hours.
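
The timing advice above is simple date arithmetic. As a rough sketch, assuming the 24-hour cache window mentioned above and a hypothetical helper name, you could compute the earliest time at which it is reasonable to publish the new content:

```python
from datetime import datetime, timedelta

# Cache lifetime for robots.txt as described in the text above.
ROBOTS_CACHE_WINDOW = timedelta(hours=24)


def earliest_safe_publish_time(robots_updated_at: datetime) -> datetime:
    """Return the time after which a cached robots.txt should have expired."""
    return robots_updated_at + ROBOTS_CACHE_WINDOW
```

For example, if robots.txt was updated at 9:00 on January 1, the helper returns 9:00 on January 2.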

Methods For Keeping Content Out of Google’s Search Results That Don’t Work

Content owners try lots of other methods for keeping content out of Google’s search results that don’t actually work:

  • Not linking to pages – The fact that a URL has no links pointing to it is no guarantee that Google won’t crawl it. And of course, just because you don’t link to a page on your own site doesn’t mean that other sites won’t link to it.
  • Using the nofollow attribute on links to pages – Although Google won’t follow a link that includes the nofollow attribute, this is no guarantee that Google won’t crawl the page linked to (as described in the above bullet point).
  • Putting links to pages in JavaScript or Flash – Googlebot is getting better at crawling these types of formats, so you can’t rely on them to prevent Google from seeing the links.
  • Placing content behind forms – Google has been experimenting with crawling forms for at least two years.

Removing Content On Your Site That’s Been Indexed

Despite the methods available to keep content out of Google’s search results, sometimes content you don’t want indexed ends up there anyway. Just try a quick search for terms such as “for internal use only”, “embargoed”, “do not distribute”, and “this document contains proprietary information”. How can you get this content removed quickly? The first thing to remember is that Google shows search results based on what’s available on the web. So you can’t just ask Google to remove content that would be added right back in the next time Google crawled it.

You first have to either block the content with robots.txt or a robots meta tag or you have to remove the content from your site and return a 404 or 410 status code for the URL.
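
Before filing a request, it can help to confirm that the URL actually meets one of these conditions. The sketch below is a hypothetical pre-check, not part of any Google tool: it takes a status code, response headers, and HTML that you have already fetched, and reports whether the page returns a 404/410 or carries a noindex signal. It does not cover robots.txt blocking, which would need a separate check.

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def is_ready_for_removal(status: int, headers: dict, html: str = "") -> bool:
    """Return True if the response is gone (404/410) or marked noindex."""
    if status in (404, 410):
        return True
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

A page that still returns 200 with no noindex signal fails this check, which matches the most common reason a removal request is denied.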

Once you’ve done that, you can just wait for Google to recrawl the page and the content will drop out automatically.

Don’t want to wait? You can use Google’s URL Removal tool to request that the content be removed right away. Simply access your verified site in Google Webmaster Tools, then click “Site configuration > Crawler access > Remove URL.” (Can’t verify ownership of the site? Use Google’s public removal tool as described in our guide on removing your personal information from Google.)

You’ll see a dashboard that lets you manage your URL removal requests. Click “New removal request” and enter the URL that you’d like to remove. Then choose “Remove the page from search results and cache” and select the “page returns a 404/410 or has been blocked by robots.txt or a noindex meta tag” checkbox. Once you’ve done that, you’ll see the request show up in the Crawler access dashboard with a status of “Pending.”

URL Removal: Pending Request

Once Google has processed the request (which can take up to 48 hours), the status will change to either “Removed” or “Denied.” If the request was denied, click “Learn more” to find out why. Generally, this happens when the URL still exists on the site and isn’t blocked from Google.

You can remove an entire directory (or your entire site) from Google in the same way you remove a URL. However, even if you’ve removed the content from your site and it returns a 404/410, you’ll still need to block the directory in robots.txt. Once you’ve done this, choose “Remove directory” when using the tool.

Removing a Directory From Google's Search Results

Removing The Cache of Content On Your Site That’s Changed

Sometimes you don’t want to remove the URLs from Google entirely; you just don’t want old content that you’ve removed to show up in Google’s cache. Once you’ve changed the content on the page, you can wait for Google to recrawl and reindex it, or you can request that Google remove the cache until the page is recrawled.

Once it’s recrawled, Google will once again show the cache with the updated content. You can also add a meta noarchive tag to the page, which will keep the page from being cached permanently (or until you remove the tag).

To request the cache removal once you’ve changed the page, start a new removal request and choose “Remove page from cache only.”
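
For the noarchive option mentioned above, the tag itself is a one-line addition to the page's head. As a small sketch (the helper name is hypothetical), a template function could generate the robots meta tag from whichever directives you need:

```python
from html import escape


def robots_meta_tag(*directives: str) -> str:
    """Build a robots meta tag from one or more directives."""
    content = ", ".join(d.strip().lower() for d in directives if d.strip())
    if not content:
        raise ValueError("at least one directive is required")
    return f'<meta name="robots" content="{escape(content, quote=True)}">'
```

Calling `robots_meta_tag("noarchive")` produces the tag that keeps the page from being cached, and the same helper can combine directives such as noindex and noarchive.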

Removing Copyrighted Content

If another site has infringed on your copyright and your content on their site is appearing in Google search results, you can file a Digital Millennium Copyright Act notice to request the content be taken down. YouTube has a similar policy.

A Step By Step Recap

  • If you want content you own removed entirely:
    1. Block the content with robots.txt or robots meta tag or remove the content and return a 404 or 410 status code.
    2. Request removal via Google Webmaster Tools (if you’re a verified site owner) or the public removal tool (if you’re not a verified site owner).
  • If you want the cache with the old content to be removed:
    1. Modify the page (and/or add a noarchive meta tag to the page if you’re the site owner).
    2. Request cache removal via Google Webmaster Tools (if you’re a verified site owner) or the public removal tool (if you’re not a verified site owner).
  • If you don’t own the content you want removed:
    1. Contact the site owner and ask that the content be removed or modified.
    2. Request removal using the public removal tool.

What If the Content Shows It Was Successfully Removed But Still Shows Up?

What if you change, remove, or block the page, request removal of the page or cache, see a message that the request was successful, but then do a search for the removed phrase and you still see the result from your site?

First, check the URL that shows up in the search results and compare it to the one that you had removed. Likely, you’ll find that the URLs are different. This most commonly happens when several URLs lead to duplicate content due to canonicalization issues.

For instance, I recently helped a site owner who made changes to the page, requested a cache removal of the canonical version of the URL and later found that the non-canonical versions of the URL were still indexed (but they hadn’t been showing up for search results, because the canonical was ranking). Because they were non-canonical, they were crawled less frequently and so the cache still showed the old version of the content. This meant that when you searched for the old content, one of these non-canonical URLs showed up in search results.

In cases like this, you may have to request removal of all versions of the URL, and depending on the reason for the duplication issues, you may want to 301 redirect the non-canonical versions to the canonical one. (Find out more about canonicalization issues and solutions.)
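
One way to reason about the 301 approach is as a function that maps every variant of a URL to its single canonical form. The sketch below assumes a hypothetical canonical host (www.example.com), an HTTPS site, and a short list of default document names; your own duplication patterns may differ, so adapt the rules rather than copying them.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical canonical hostname
DEFAULT_DOCUMENTS = ("index.html", "index.php")  # assumed duplicate file names


def canonical_url(url: str) -> str:
    """Map a URL variant to its canonical form (scheme, host, and path)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    for name in DEFAULT_DOCUMENTS:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # /widgets/index.html -> /widgets/
            break
    return urlunsplit(("https", CANONICAL_HOST, path, parts.query, ""))


def redirect_target(url: str):
    """Return the URL to 301 redirect to, or None if url is already canonical."""
    target = canonical_url(url)
    return None if target == url else target
```

Running your indexed URL variants through `redirect_target` gives you both the redirect rules to implement and the list of non-canonical URLs that may need their own removal requests.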

If Someone Else Has Requested Removal Of Your Content

In Google Webmaster Tools, you can view the URLs that were removed via the public removal tool by those who aren’t verified site owners. Note that these requests will be successful only if you’ve removed or blocked the content (in cases of full removal) or modified the content (in cases of cache removal).

Canceling A Removal Request

If you’ve requested removal of content but want it reincluded, simply go back into the removal tool and click the Reinclude link beside the URL. Google processes these requests within three to five days. URLs reappear and retain all previous data about them (such as PageRank).

When Removal Requests Expire

Removal requests expire after 90 days. If the content is still blocked, removed or modified, then it won’t reappear in Google search results even after the removal expiration. This is because within those 90 days, Google will have recrawled and either reindexed the page with the new content or recorded the new URL status (removed or blocked).

When Not To Use the URL Removal Tool

A common mistake is to try to use the URL Removal Tool to fix canonicalization issues. For instance, if both www.site.com and site.com are indexed, it may be tempting to remove one of those versions. However, this tool isn’t intended for those types of uses. If you have URL duplication issues, resolve them using a method best suited for the issue.

Another misuse of the tool is during site moves. Some site owners use the tool to remove the old versions of URLs, but again, this is not an intended use. Instead, 301 redirect the old URLs to the new ones. The old URLs may still appear in the index for a period of time while Google crawls the original URLs, the redirects, and the new URLs. The transition period may take a while, depending on how long it takes to comprehensively crawl everything. Once you have tested that the redirects have been implemented correctly and that Googlebot is able to follow them (check your server logs and the Google Webmaster Tools crawl errors), the other tool needed for a successful migration is sometimes patience.
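
Testing the redirects can be partly automated. The sketch below is a hypothetical checker that takes what you observed for an old URL (the status code and Location header, gathered with whatever HTTP client or log data you have) and compares it with the new URL you expected. Fetching the responses is deliberately left out.

```python
from urllib.parse import urljoin


def check_redirect(old_url: str, status: int, location, expected_url: str) -> str:
    """Return "ok" or a short description of what is wrong with the redirect."""
    if status != 301:
        return f"{old_url}: expected 301, got {status}"
    if not location:
        return f"{old_url}: 301 without a Location header"
    # Location may be relative, so resolve it against the old URL first.
    resolved = urljoin(old_url, location)
    if resolved != expected_url:
        return f"{old_url}: redirects to {resolved}, expected {expected_url}"
    return "ok"
```

A 302, a missing Location header, or a redirect that lands on the wrong page each produces a distinct message, which makes it easier to spot patterns across a large list of moved URLs.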

More Resources

If you’re trying to remove content that’s on a site you don’t control, be sure to see Removing Your Personal Information From Google.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.


About the author

Vanessa Fox
Contributor
Vanessa Fox is a Contributing Editor at Search Engine Land. She built Google Webmaster Central and went on to found software and consulting company Nine By Blue and create Blueprint Search Analytics, which she later sold. Her book, Marketing in the Age of Google (updated edition, May 2012), provides a foundation for incorporating search strategy into organizations of all levels. Follow her on Twitter at @vanessafox.