
Up Close With Yahoo's New Delete URL Feature

Since Yahoo rolled out a new Delete URL feature this week, a number of
questions have come up on how exactly it works. I had time yesterday with
some of the Yahoo Site Explorer team to gather answers. Thanks to Priyank
Garg and Amit Kumar, who, along with Tim Mayer, went through the inner
workings.

It’s probably most important to understand the difference between how pages
have traditionally been kept out of Yahoo versus what Delete URL does.
Traditionally, pages are kept out using either a robots.txt file or a meta
robots tag with the “noindex” setting. Here’s some more about how those
options work, versus the new Delete URL feature:

  • Robots.txt: Yahoo checks your robots.txt file on a regular basis to
    see what pages it is forbidden from crawling. Block a page using
    robots.txt, and Yahoo will stop visiting that page. If the page isn’t
    crawled, then it doesn’t appear within the index, or it gets dropped if
    it was previously listed. Remove the block from robots.txt, and Yahoo
    will start crawling the page again, causing it to return to the
    listings. (There are samples of both traditional methods after this
    list.)
  • Meta Robots (set to NOINDEX): If Yahoo isn’t blocked by robots.txt
    from crawling a page, then it looks on the page itself to see if there’s
    a meta robots tag in place. If so — and if that tag is set to noindex —
    then the page will not be added to the listings, or it will be dropped
    if it is already in the index. The page will continue to get crawled!
    Meta robots does not block crawling. However, the page will not be
    included in the index as long as the meta robots tag continues to say
    noindex.
  • Delete URL: Delete URL works independently of the other two options.
    Use it, and pages will continue to be crawled. However, similar to the meta
    robots tag using noindex, they won’t get indexed.
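
For reference, here is roughly what those two traditional blocks look like.
The domain and paths are hypothetical:

    # robots.txt (must live at the root, e.g. http://domain.com/robots.txt)
    # The * wildcard applies the rule to all crawlers, including Yahoo's Slurp
    User-agent: *
    Disallow: /subarea1/

    <!-- meta robots tag, placed in the <head> of each page to keep out -->
    <meta name="robots" content="noindex">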

The chart below provides some further at-a-glance guidance on what to use
and how each blocking feature operates:


System                    Robots.txt                  Meta Robots                 Delete URL
Stops Crawling            Yes                         No                          No
Stops Index Inclusion     Yes                         Yes                         Yes
Stops Link-Only Listing   No                          No                          Yes
Why Use?                  Easy to block many pages    Can’t access root domain    Don’t even want URL to appear,
                          at once                                                 or need page out fast

To expand a bit on the chart: some people don’t want the major search
engines to spider certain pages, in order to reduce bandwidth load. That
means blocking crawling. Only robots.txt will do this for you. It will also
keep the pages out of the index.

Unfortunately, robots.txt only works at the root level of a domain, i.e., it
has to be at domain.com/robots.txt rather than domain.com/subarea/robots.txt.
Some people have their web sites deep within other domains, so the meta robots
tag (using noindex — and in all future references, I mean meta robots using the
noindex setting) is a way to keep pages out. The pages will continue to be crawled, but they
won’t show up.

With both robots.txt and meta robots, it’s still possible that a URL will
appear in the listings. This is because Yahoo will still list a URL that it
knows about from other people linking to it. For example, perhaps you have
some confidential report you put online. You might prevent Yahoo from
crawling or indexing the report’s content. However, if other people are
linking to it, then the report might still come up. Yahoo won’t know about
anything inside of it, but sometimes links alone can make a page relevant
for terms.

Yahoo calls these “thin” listings (Google calls them “partially indexed”).
If you use Delete URL, you can remove all traces of the URL from Yahoo
search results. Even thin URLs will be gone.

Delete URL is also potentially faster than using robots.txt or meta robots. Both of those depend on Yahoo revisiting the site, seeing the restriction
and acting on it. It might take Yahoo several days or longer to get back to some
sites. Delete URL tells Yahoo to speed up the process. It acts as a virtual meta
robots tag, and Yahoo says pages should be removed in 24 to 48 hours.

The virtual meta robots tag concept is important. No, you do not have to have
an actual meta robots tag set to noindex on the pages you want to remove. Nor do you need to
have a robots.txt file blocking pages. Delete URL will work instead of either of these to keep
pages out. It will also work in addition to them.

For extra security, it might be nice if Delete URL only worked if people ALSO
had one of the traditional methods in place. But I understand Yahoo’s view
that they want a third alternative to work for those who can’t use the other two
systems.

After the feature came out, Andy Beal over at Marketing Pilgrim had the
fear-inspiring headline of Yahoo Delete URL Feature Disaster Waiting to
Happen. He wrote:

It is literally a disaster waiting to happen. There is zero verification
other than being logged into the proper Yahoo account to delete an entire site
from the Yahoo index.

With Google you are required to upload a robots.txt file to the webserver
that verifies the same information being requested through the Google delete
URL/Site tool. With Yahoo, you just log in, click delete, click confirm, and
it’s gone.

Until they fix this issue I recommend to everyone that you don’t
authenticate any domain to Yahoo Site Explorer and if you have previously
authenticated a site that you remove the authentication file or meta tag.

Well gosh, then you might as well not have a robots.txt file on your domain.
I mean, it’s a disaster waiting to happen. All you need is for someone to
figure out the username and password to your site, install that puppy, and
out goes your site.

I like Andy, so I’m poking at him in good fun. But I do think we need some
perspective. Let’s say Andy does authenticate his site with Yahoo. Now I’ve
got to figure out what his Yahoo username and password are for that
particular site. Is he andy_beal? andybeal45? marketingpilgrim?
andyexpat342? Just knowing what username he might use with that site is the
first challenge. Then I’ve got to guess the password.

If I do guess the password and get in, bam! Site wiped out! Not really.
First, the URLs will go into a processing queue, and that’s going to take up
to 24 hours to happen. I deleted a page from my site yesterday, about 12
hours ago, and the status is still “Pending Delete” — the URL has yet to be
removed. I still have time to prevent it from happening.

Let’s say pages do get wiped out. They’re actually still in the index. Delete
URL simply suppresses them from appearing. This means Yahoo can quickly
get them back in 1 to 2 days, if need be (though for some rare “low priority”
URLs, Yahoo says this might take up to a month).

Of course, I can understand the concern here. There are two other things
that might help. First, perhaps site owners who are really worried could set
up a special authentication password or PIN to use to authorize a delete. So
if someone did get both your username and password, perhaps the delete can’t
happen unless they also know your PIN. Second, perhaps an RSS feed or email
notice could go out to keep the account holder alerted to any major pending
action. For its part, Yahoo says they are considering additional safeguards.

Another issue that’s come up is that you can only do up to five active deletes per
site at a time. In other words, you can do five delete actions. When those are
processed, you can then do more. This is Yahoo being conservative, so the limit might
get raised in the future. But five deletes is not the same as five pages. You
can delete many more pages than that.

If you delete a root URL like this:

http://domain.com

Then all pages below that domain will get removed, such as:

http://domain.com/subarea1/page1.html
http://domain.com/subarea2/page45.html

One delete — but many, many pages gone. You can also delete all pages in a
particular directory or subarea of your site. So find a page like this:

http://domain.com/subarea1/

And all pages in the /subarea1/ section will go.
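
The scope rule is, roughly, prefix matching: one delete action covers every
URL at or below the deleted URL. Here’s a toy Python sketch of that
behavior, with hypothetical URLs:

    # Remember: the cap is five *active* delete actions, not five pages.
    deletes = ["http://domain.com/subarea1/"]  # a single delete action

    pages = [
        "http://domain.com/subarea1/page1.html",
        "http://domain.com/subarea1/deep/page2.html",
        "http://domain.com/subarea2/page45.html",
    ]

    for page in pages:
        removed = any(page.startswith(prefix) for prefix in deletes)
        print(("REMOVED  " if removed else "KEPT     ") + page)

    # REMOVED  http://domain.com/subarea1/page1.html
    # REMOVED  http://domain.com/subarea1/deep/page2.html
    # KEPT     http://domain.com/subarea2/page45.html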

Keep in mind that while removal is fast, you could still be looking at two to
three days in some cases. It takes up to 24 hours for authentication to be
verified, though Yahoo says this may happen much sooner (for me, it took several
hours yesterday). After that, you’re looking at 24 to 48 hours for most pages to go.

If it’s a real emergency with a legal component, such as copyrighted material
that should be pulled under a DMCA action, Yahoo has instructions on that
here.

Finally, more than one person can authenticate to manage your site. Want to
keep tabs on them? Anyone authenticated for a site will see all Delete URL
actions done by anyone else authenticated for that particular site.

What if you have an employee who establishes authentication and then goes
rogue after being fired? As long as you remove their unique authentication
code from your web server, they can’t hurt you. Any deletion action will
check to see that authentication for the person requesting it is in place.
Authentication is also checked on a routine basis.
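
Here’s a rough Python sketch of that revocation logic; the authentication
file naming scheme is my own guess for illustration, not Yahoo’s actual one:

    import urllib.error
    import urllib.request

    def still_authenticated(domain: str, auth_code: str) -> bool:
        """A delete request is honored only while the requester's unique
        authentication file is still reachable on the web server."""
        url = f"http://{domain}/{auth_code}.html"  # hypothetical naming
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.status == 200
        except urllib.error.URLError:
            return False

    # Pull a fired employee's file off the server and this check fails,
    # so their pending or future deletes can't go through.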

For more on removing material from Yahoo, some key help files to check out:

