Since Yahoo rolled out a new Delete URL feature this week, a number of questions have come up on how exactly it works. I had time yesterday with some of the Yahoo Site Explorer team to gather answers. Thanks to Priyank Garg and Amit Kumar, who, along with Tim Mayer, went through the inner workings.
It’s probably most important to understand the difference between how pages have traditionally been kept out of Yahoo versus what Delete URL does. Traditionally, Yahoo is told to keep pages out using either a robots.txt file, which stops them from being spidered at all, or a meta robots tag that uses the "noindex" setting. Here’s some more about how those options work, versus the new Delete URL feature:
- Robots.txt: Yahoo checks your robots.txt file on a regular basis to see what pages it is forbidden from crawling. Block a page using robots.txt, and Yahoo will stop visiting that page. If the page isn’t crawled, then it doesn’t appear within the index, or it gets dropped if it was previously listed. Remove the block from robots.txt, and Yahoo will start crawling the page again, causing it to return to the index.
- Meta Robots (set to NOINDEX): If Yahoo isn’t blocked by robots.txt from crawling a page, then it looks on the page itself to see if there’s a meta robots tag in place. If so, and if that tag is set to noindex, then the page will not be added to the listings, or it will be dropped if it is already in the index. The page will continue to get crawled! Meta robots does not block crawling. However, the page will not be included in the index as long as the meta robots tag continues to say noindex.
- Delete URL: Delete URL works independently of the other two options. Use it, and pages will continue to be crawled. However, similar to the meta robots tag using noindex, they won’t get indexed.
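To make the difference between the first two options concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not Yahoo’s actual logic: the robots.txt rules, the `/private/` path, the example.com URLs and the sample HTML are all made up, and the user-agent token `Slurp` is the name of Yahoo’s crawler. `may_crawl` answers the robots.txt question (can the page be fetched at all?), while `has_noindex` answers the meta robots question (the page was fetched, but should it be listed?).

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /private/ path is made up.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /private/
"""

# Hypothetical page carrying a meta robots tag set to noindex.
PAGE_HTML = """\
<html><head>
<meta name="robots" content="noindex">
<title>Example</title>
</head><body>Hello</body></html>
"""


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def may_crawl(robots_txt, user_agent, url):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def has_noindex(html):
    """Return True if the page asks not to be indexed via meta robots."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

With these sample inputs, `may_crawl(ROBOTS_TXT, "Slurp", "http://example.com/private/report.html")` is false, while a page outside `/private/` can still be fetched. A crawler only ever gets to call `has_noindex` on pages it was allowed to fetch, which is why meta robots cannot save bandwidth.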
The chart below provides some further at-a-glance guidance on what to use and how each blocking feature operates:
| |Robots.txt|Meta Robots|Delete URL|
|---|---|---|---|
|Stops Index Inclusion|Yes|Yes|Yes|
|Stops Link Only Listing|No|No|Yes|
|Why Use?|Easy to block many pages at once|Can’t access root domain|Don’t even want URL to appear or need page out fast|
To expand a bit on the chart, some people don’t want the major search engines to spider certain pages in order to reduce bandwidth load. That means blocking crawling. Only robots.txt will do this for you. It also will keep the pages out of the index.
Unfortunately, robots.txt will only work at the root level of a domain. That is, it has to be at domain.com/robots.txt rather than domain.com/subarea/robots.txt. Some people have their web sites deep within other domains and can’t access the root, so the meta robots tag (using noindex; in all future references, I mean meta robots using the noindex setting) is a way to keep pages out. The pages will continue to be crawled, but they won’t show up.
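The root-level rule can be shown with a few lines of Python. This is only a sketch of where a crawler would look for the file; the function name `robots_txt_url` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the only robots.txt location that applies to page_url.

    Whatever directory the page lives in, the file is looked up at the
    root of the host, never inside a subarea.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

For a page at `http://domain.com/subarea/page.html`, this returns `http://domain.com/robots.txt`. A file placed at domain.com/subarea/robots.txt is never consulted.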
With both robots.txt and meta robots, it’s still possible that a URL will appear in the listings. This is because Yahoo may still list a URL if it knows of other people linking to it. For example, perhaps you have some confidential report you put online. You might prevent Yahoo from crawling or indexing the content of the report. However, if other people are linking to it, then the report’s URL might still come up. Yahoo won’t know about anything inside of it, but sometimes just links alone can make a page relevant for terms. Delete URL stops even these link-only listings.
Delete URL is also potentially faster than using robots.txt or meta robots. Both of those depend on Yahoo revisiting the site, seeing the restriction and acting on it. It might take Yahoo several days or longer to get back to some sites. Delete URL tells Yahoo to speed up the process. It acts as a virtual meta robots tag, and Yahoo says pages should be removed in 24 to 48 hours.
The virtual meta robots tag concept is important. No, you do not have to have an actual meta robots tag set to noindex on the pages you want to remove. Nor do you need to have a robots.txt file blocking pages. Delete URL will work instead of either of these to keep pages out. It will also work in addition to them.
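Putting the three mechanisms side by side, the behavior described so far can be summarized in a small Python model. This is my own reading of the explanations above, not code from Yahoo, and the names `Outcome` and `yahoo_outcome` are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    crawled: bool           # does Yahoo still fetch the page?
    indexed: bool           # is the page content listed?
    link_only_listing: bool  # can the bare URL still show up via links?


def yahoo_outcome(robots_txt_block, meta_noindex, delete_url):
    """Model how the three options combine, as described in this article."""
    # Only robots.txt stops crawling.
    crawled = not robots_txt_block
    # Any one of the three is enough to keep the content out of the index.
    indexed = not (robots_txt_block or meta_noindex or delete_url)
    # A page kept out by robots.txt or meta robots may still appear as a
    # bare URL because of inbound links; Delete URL suppresses that too.
    link_only_listing = (not indexed) and not delete_url
    return Outcome(crawled, indexed, link_only_listing)
```

For example, `yahoo_outcome(False, False, True)` gives a page that is still crawled, not indexed, and not shown as a link-only listing, which is the "virtual meta robots tag" case with the extra suppression that Delete URL adds.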
For extra security, it might be nice if Delete URL only worked if people ALSO had one of the traditional methods in place. But I understand Yahoo’s view that they want a third alternative to work for those who can’t use the other two systems.
After the feature came out, Andy Beal over at Marketing Pilgrim had the fear-inspiring headline of "Yahoo Delete URL Feature Disaster Waiting to Happen." He wrote:
It is literally a disaster waiting to happen. There is zero verification other than being logged into the proper Yahoo account to delete an entire site from the Yahoo index.
With Google you are required to upload a robots.txt file to the webserver that verifies the same information being requested through the Google delete URL/Site tool. With Yahoo, you just log in, click delete, click confirm, and it’s gone.
Until they fix this issue I recommend to everyone that you don’t authenticate any domain to Yahoo Site Explorer and if you have previously authenticated a site that you remove the authentication file or meta tag.
Well gosh, then you might as well not have a robots.txt file on your domain. I mean, it’s a disaster waiting to happen. All you need is for someone to figure out your username and password to your site, install that puppy and out goes your site.
I like Andy, so I’m poking at him in good fun. But I do think we need some perspective. Let’s say Andy does authenticate his site with Yahoo. Now I’ve got to figure out what his Yahoo username and password is for that particular site. Is he andy_beal? andybeal45? marketingpilgrim? andyexpat342? Just knowing what username he might use with that site is the first challenge. Then I’ve got to guess the password.
If I do guess the password and get in, bam! Site wiped out! Not really. First, the URLs will go into a processing queue, and that’s going to take up to 24 hours to happen. Look, here I deleted a page from my site yesterday, about 12 hours ago:
As you can see, the status is "Pending Delete" — the URL has yet to get removed. I still have time to prevent it from happening.
Let’s say pages do get wiped out. They’re actually still in the index. Delete URL simply suppresses them from appearing. This means Yahoo can quickly get them back in 1 to 2 days, if need be (though for some rare "low priority" URLs, Yahoo says this might take up to a month).
Of course, I can understand the concern here. There are two other things that might help. First, perhaps site owners who are really worried could set up a special authentication password or PIN to use to authorize a delete. So if someone did get both your username and password, perhaps the delete can’t happen unless they also know your PIN. Second, perhaps an RSS feed or email notice could go out to keep the account holder alerted to any major pending action. For its part, Yahoo says they are considering additional safeguards.
Another issue that’s come up is that you can only do up to five active deletes per site at a time. In other words, you can do five delete actions. When those are processed, you can then do more. This is Yahoo being conservative, so the limit might get raised in the future. But five deletes is not the same as five pages. You can delete many more pages than that.
If you delete a root URL like this:
Then all pages below that domain will get removed, such as:
One delete — but many, many pages gone. You can also delete all pages in a particular directory or subarea of your site. So find a page like this:
And all pages in the /subarea1/ section will go.
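The difference between five deletes and five pages can be sketched in Python. This is a toy model built on assumptions: the class name `DeleteQueue`, the simple string-prefix matching and the example URLs are all made up, and Yahoo’s real matching rules may differ.

```python
MAX_ACTIVE_DELETES = 5  # the per-site limit described in this article


class DeleteQueue:
    """Toy model: each delete is a URL prefix that covers many pages."""

    def __init__(self):
        self.pending = []    # requested, still "Pending Delete"
        self.processed = []  # processed, now suppressing matching URLs

    def request_delete(self, url_prefix):
        if len(self.pending) >= MAX_ACTIVE_DELETES:
            raise RuntimeError("five deletes are already pending for this site")
        self.pending.append(url_prefix)

    def process_all(self):
        """Simulate Yahoo working through the queue, freeing all five slots."""
        self.processed.extend(self.pending)
        self.pending.clear()

    def is_suppressed(self, url):
        # Assumption: a processed delete hides every URL beginning with it.
        return any(url.startswith(prefix) for prefix in self.processed)
```

In this model, one request for a hypothetical `http://domain.com/subarea1/` prefix uses a single slot yet, once processed, suppresses every page whose URL starts with it. Until `process_all` runs, nothing is hidden, which mirrors the "Pending Delete" window.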
Keep in mind that while removal is fast, you could still be looking at two to three days in some cases. It takes up to 24 hours for authentication to be verified, though Yahoo says this may happen much sooner (for me, it took several hours yesterday). After that, you’re looking at 24 to 48 hours for most pages to go.
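The arithmetic behind that two-to-three-day figure is simple enough to write down. The sketch below just adds up the windows quoted above; the function name `removal_window` is made up, and the best case assumes authentication is verified almost immediately.

```python
from datetime import datetime, timedelta

AUTH_MAX = timedelta(hours=24)      # authentication can take up to 24 hours
REMOVAL_MIN = timedelta(hours=24)   # most pages go in 24 to 48 hours
REMOVAL_MAX = timedelta(hours=48)


def removal_window(start):
    """Return (earliest, latest) expected removal times for most pages."""
    earliest = start + REMOVAL_MIN            # near-instant authentication
    latest = start + AUTH_MAX + REMOVAL_MAX   # slowest case for both steps
    return earliest, latest
```

Starting from any moment, the latest time comes out 72 hours later, which is where the three-day worst case comes from.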
If it’s a real emergency with a legal component, such as copyrighted material that should be pulled under a DMCA action, Yahoo has instructions on that here.
Finally, more than one person can authenticate to manage your site. Want to keep tabs on them? Anyone authenticated for a site will see all Delete URL actions done by anyone else authenticated for that particular site.
What if you have an employee who establishes authentication and then goes bad after being fired? As long as you remove their unique authentication code from your web server, they can’t hurt you. Any deletion action will check to see that authentication for the person requesting it is in place. Authentication is also checked on a routine basis.
For more on removing material from Yahoo, some key help files to check out: