Thursday, January 26, 2023

Meta Robots Tag 101: Blocking Spiders, Cached Pages & More

Last week, I covered a new command for the meta robots tag — one to prevent search engines from using Yahoo titles and descriptions. In doing that, a number of questions came up about the meta robots tag syntax itself. Google Webmaster Central has now posted “Using the robots meta tag,” providing some clarity from Google. In addition, both Yahoo and Microsoft have also sent me information on using the tag. I’ll run through what everyone says below, complete with charts for easy at-a-glance comparisons.

The meta robots tag was an open standard created over a decade ago and designed initially to allow page authors to prevent page indexing. Over the years, various search engines have added additional support to the tag.

Let me start off by saying that if you DO want your pages in search engines, then DO NOT use the tag. By default, the major search engines will index any page they find. Yes, there is a form of the meta robots tag you can use to explicitly tell search engines to index your pages. It looks like this:

<meta name="robots" content="index">

There’s also a form you can use that adds the command “follow,” which tells the search engines to index your page and also follow any links they find on that page to other pages, which they can then index. It looks like this:

<meta name="robots" content="index,follow">

You do NOT need to use either form if you DO want your pages in the search engines. Without either form, they’ll naturally index your pages and follow your links. That’s what they do.

I always joke that putting these forms of the meta robots tag on your web pages is like putting a Post-It note on your chest that says “breathe.” Hey, if you forget to look at that note, you’ll still breathe. That’s what you do, by default. And that’s what the major search engines do. By default, they inhale web pages without you putting up a meta tag telling them to do so.

Now if you DO NOT want your pages in a search engine, then it’s time to perhaps break out the meta robots tag, if for some reason the robots.txt alternative isn’t suitable. Want to keep a particular page out? Then put this on that page:

<meta name="robots" content="noindex">

See the “noindex” value? That tells the search engines that see this page not to include it in their listings. Remember, as I explained before, this will not prevent the page from being spidered. That’s because search engines have to keep revisiting the page in order to see if the tag is removed. The tag only keeps the page out of the listings. Here’s my earlier chart on that topic.

| System | Robots.txt | Meta Robots | Yahoo Delete URL Option |
|---|---|---|---|
| Stops Crawling | Yes | No | No |
| Stops Index Inclusion | Yes | Yes | Yes |
| Stops Link Only Listing | No | No (Yes, for Google) | Yes |
| Why Use? | Easy to block many pages at once | Can’t access root domain | Don’t even want URL to appear or need page out fast |
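
If you want to check whether a page is carrying the “noindex” value, a small script can do it. This is only a minimal sketch using Python’s standard html.parser module. The class name RobotsMetaFinder and the function has_noindex are made up for illustration, and it only looks at the all-robots tag, not at tags aimed at a specific spider.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.contents.append(attr.get("content") or "")


def has_noindex(html):
    """Return True if any all-robots meta tag on the page says noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    for content in finder.contents:
        values = [value.strip().lower() for value in content.split(",")]
        if "noindex" in values:
            return True
    return False


page = '<head><meta name="robots" content="noindex"></head>'
print(has_noindex(page))
```

Note that the script has to fetch and read the page to find the tag, which is the same reason a search engine still has to spider a page that uses it.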

What if you don’t want links followed? Sure, you can do this:

<meta name="robots" content="noindex,nofollow">

That extra command, “nofollow,” tells the search engines not to follow any links on that page. Google recently covered this option in more detail. But as Google also explained, pages linked from a page with this tag might still get crawled. That’s because if anyone else links to one of those pages WITHOUT a nofollow value, then the search engine will follow that link.

So far, I’ve covered all the commands that were originally created with the tag back in May 1996. Since then, more commands (also called values or attributes) have been added. For example, Google’s post today summarizes several options you can use. Quoting Google:

  • NOINDEX – prevents the page from being included in the index.
  • NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
  • NOARCHIVE – prevents a cached copy of this page from being available in the search results.
  • NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
  • NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.

At times, you may want to use more than one of these commands. I’ll get back to that. But first, how about another chart? I’ll cover the major commands you may want to use below:

| COMMAND | Ask | Google | Microsoft | Yahoo |
|---|---|---|---|---|
| NOINDEX | Yes | Yes | Yes | Yes |
| NOFOLLOW | Yes | Yes | Yes | Yes |
| NOARCHIVE | Yes | Yes | Yes | Yes |
| NOODP | No | Yes | Yes | Yes |
| NOYDIR | No | No | No | Yes |
| NOSNIPPET | No | Yes | No | No |
| Robot Name | TEOMA | GOOGLEBOT | MSNBOT | SLURP |
| Does Robot Specific Tag Override All Robots Tag? | ??? | No | No | No |

Several of these are already explained above, in what I quoted from Google. They work the same way for the other major search engines. I’ve also linked to help information from each search engine for more specific advice.

The NOYDIR command is fully explained in my previous Yahoo Provides NOYDIR Opt-Out Of Yahoo Directory Titles & Descriptions post. Only Yahoo supports this, but none of the other major search engines used Yahoo titles and descriptions for listings, so it doesn’t really matter for them.
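
If you’d rather look up support in a script than squint at a chart, the same information fits in a small lookup table. This is just a sketch that copies the Yes/No values from the chart above as they stood when I wrote this. The names SUPPORT, ALL_ENGINES and engines_ignoring are made up for illustration.

```python
# Which engines honor which command, copied from the chart above.
ALL_ENGINES = {"Ask", "Google", "Microsoft", "Yahoo"}

SUPPORT = {
    "NOINDEX": {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOFOLLOW": {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOARCHIVE": {"Ask", "Google", "Microsoft", "Yahoo"},
    "NOODP": {"Google", "Microsoft", "Yahoo"},
    "NOYDIR": {"Yahoo"},
    "NOSNIPPET": {"Google"},
}


def engines_ignoring(command):
    """Return the engines that, per the chart, do not support a command."""
    return sorted(ALL_ENGINES - SUPPORT[command.upper()])


print(engines_ignoring("NOODP"))   # ['Ask']
print(engines_ignoring("noydir"))  # ['Ask', 'Google', 'Microsoft']
```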

Now on to the topic of a meta robots tag having multiple values. What if you wanted to keep a page from being cached by all the major search engines and also ensure that neither Open Directory nor Yahoo Directory descriptions are used? First, you need the values of the commands to say this. From the table above, they are:

  • NOARCHIVE
  • NOODP
  • NOYDIR

Next, you need to decide what robots to target. We’ll keep it simple for now. To target ALL robots, you use this value:

  • ROBOTS

Now to the meta robots format. Without the values, it looks like this:

<meta name="NAME-OF-ROBOTS-TO-TARGET" content="COMMANDS">

We replace that NAME-OF-ROBOTS-TO-TARGET part with the name of the robots we’re, well, targeting. As explained, that’s ROBOTS, in order to target them all:

<meta name="ROBOTS" content="COMMANDS">

Now we put in the commands we want to give the robots, each separated by a comma. The order doesn’t matter:

<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">

Voila! Put that tag ANYWHERE inside the header area of a web page like this:

<HEAD>
<meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
</HEAD>

Then you will be telling all major search engines not to cache the page, nor to use Open Directory or Yahoo Directory titles or descriptions for your page listings.
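
If you generate pages from templates, the steps above boil down to joining your commands with commas and dropping them into the tag. Here’s a minimal sketch of that. The function name build_robots_meta is hypothetical, and it does no checking of whether the commands are ones a search engine actually understands.

```python
from html import escape


def build_robots_meta(commands, robot="ROBOTS"):
    """Build a meta robots tag from a robot name and a list of commands."""
    # Commands are joined with commas and no spaces, as in the example above.
    content = ",".join(commands)
    return '<meta name="{}" content="{}">'.format(
        escape(robot, quote=True), escape(content, quote=True)
    )


print(build_robots_meta(["NOARCHIVE", "NOODP", "NOYDIR"]))
# <meta name="ROBOTS" content="NOARCHIVE,NOODP,NOYDIR">
```

The robot argument defaults to ROBOTS, which targets all robots. Pass a specific robot name instead to target one search engine.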

Notice that in the tag above, there are no spaces between the commands. What if I did this?

<meta name="ROBOTS" content="NOARCHIVE, NOODP, NOYDIR">

Google writes today that spaces make no difference. Use them or don’t; the tag means the same thing. Microsoft tells me the same thing, as does Yahoo.

What if you did this, with no commas:

<meta name="ROBOTS" content="NOARCHIVE NOODP NOYDIR">

Microsoft tells me this is fine. I didn’t ask Yahoo about this, and Google says commas MUST be used. So use commas and don’t be a pain.
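
To see why commas are the safe choice, here’s a rough model of a strict reader that splits only on commas and ignores spaces around each command. It’s a sketch under my own assumptions, not how any search engine actually parses the tag, and the function name parse_commands is made up.

```python
def parse_commands(content):
    """Split a content value on commas, ignoring spaces around each command."""
    return [part.strip().upper() for part in content.split(",") if part.strip()]


# With or without spaces after the commas, the result is the same.
print(parse_commands("NOARCHIVE,NOODP,NOYDIR"))
print(parse_commands("NOARCHIVE, NOODP, NOYDIR"))
# ['NOARCHIVE', 'NOODP', 'NOYDIR'] both times

# With no commas at all, a comma-only reader sees one unrecognizable command.
print(parse_commands("NOARCHIVE NOODP NOYDIR"))
# ['NOARCHIVE NOODP NOYDIR']
```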

Now what if you want to tell search engines different things? Maybe you want Microsoft not to use the ODP descriptions, Google not to cache pages, Yahoo not to follow links on a page and Ask not to index the page at all. Maybe you want to get your head examined for being so strange, too. But aside from your mental health, it is possible to do all this.

You need to have a robots tag for each particular search engine you want to target. See that chart above? At the bottom there’s a “Robot Name” row. That shows you the name of each search engine’s “robot” or “spider” that you’ll issue a command to. With the robot names, we then give each of them their specific commands:

<meta name="TEOMA" content="NOINDEX">
<meta name="GOOGLEBOT" content="NOARCHIVE">
<meta name="MSNBOT" content="NOODP">
<meta name="SLURP" content="NOFOLLOW">

You could also tell all robots to do one thing — say not to follow links — while also issuing a second robots-specific command such as telling only Google not to cache the page:

<meta name="ROBOTS" content="NOFOLLOW">
<meta name="GOOGLEBOT" content="NOARCHIVE">

But wouldn’t a search engine only follow the specific tag written for it? In other words, if you target Google with a specific command in the “GOOGLEBOT” tag, then wouldn’t it follow only that tag and ignore the other?

Google, Microsoft and Yahoo say they will honor them both. I don’t know about Ask. That’s why you see “???” in that “Does Robot Specific Tag Override All Robots Tag?” section of the chart above. I’ll try to get that answered.

What if you had more than one “all” robots tag like this:

<meta name="ROBOTS" content="NOFOLLOW">
<meta name="ROBOTS" content="NOODP">

As explained, you could easily do this instead:

<meta name="ROBOTS" content="NOFOLLOW,NOODP">

But if for some reason you did do it the other way, Microsoft and Yahoo have told me that’s just fine. They honor the information in BOTH of the robots tags. Google’s post today says the same thing.
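
Put another way, Google, Microsoft and Yahoo describe treating multiple tags as one combined set of commands. Here’s a sketch of that “honor them all” behavior, assuming each tag has already been read into a (name, content) pair. The function name effective_commands is hypothetical, and since I don’t know how Ask handles this, don’t read the sketch as describing Ask.

```python
def effective_commands(tags, robot):
    """Combine commands from every all-robots tag and every tag for one robot."""
    wanted = {"robots", robot.lower()}
    combined = set()
    for name, content in tags:
        if name.lower() in wanted:
            for part in content.split(","):
                if part.strip():
                    combined.add(part.strip().lower())
    return combined


tags = [("ROBOTS", "NOFOLLOW"), ("GOOGLEBOT", "NOARCHIVE")]
print(sorted(effective_commands(tags, "googlebot")))  # ['noarchive', 'nofollow']
print(sorted(effective_commands(tags, "slurp")))      # ['nofollow']

# Two all-robots tags combine the same way as one tag with both commands.
split_tags = [("ROBOTS", "NOFOLLOW"), ("ROBOTS", "NOODP")]
print(sorted(effective_commands(split_tags, "msnbot")))  # ['nofollow', 'noodp']
```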

Finally, the Google post provides reassurance that capitalization doesn’t make a difference. I’ve shown things in various ways above, sometimes the commands in ALL CAPS, sometimes in lowercase. As Google says, case makes no difference. To quote their post:

Googlebot understands any combination of lowercase and uppercase. So each of these meta tags is interpreted in exactly the same way:

<meta name="ROBOTS" content="NOODP">
<meta name="robots" content="noodp">
<meta name="Robots" content="NoOdp">

Ah, but what about something like this:

<MeTa nAMe="RoBots" conTEnt="NooDP">

Well, Google didn’t go that far. But my experience over the past decade has been that meta tags are not case sensitive at all with the major search engines. So I think you’re safe in whatever case, for all the major search engines.
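
If you’re writing your own tool to read these tags, it makes sense to ignore case as well. Here’s a small sketch using Python’s standard html.parser module, which lowercases tag and attribute names on its own. The values are lowercased by hand. The names MetaCollector and normalized_metas are made up, and this only shows what Python’s parser does, not what happens inside any search engine.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from meta tags, lowercased."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values keep their case.
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").lower()
            self.metas.append((name, content))


def normalized_metas(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.metas


print(normalized_metas('<MeTa nAMe="RoBots" conTEnt="NooDP">'))
# [('robots', 'noodp')]
```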

