We recently discovered that our robots.txt file wasn’t configured properly, and pages we were expecting to be excluded from search engines were still being crawled and indexed.
In investigating the fix, I found several noteworthy resources:
robotstxt.org was quite helpful in explaining how robots.txt works.
Besides the robots.txt approach, you can also provide directives to robots via a <meta> tag.
InFrontDigital had a nice summary of the differences between using a
robots.txt and the meta tag:
In general terms, if you want to deindex a page or directory from Google’s Search Results then we suggest that you use a “Noindex” meta tag rather than a robots.txt directive as by using this method the next time your site is crawled your page will be deindexed, meaning that you won’t have to send a URL removal request. However, you can still use a robots.txt directive coupled with a Webmaster Tools page removal to accomplish this.
Using a meta robots tag also ensures that your link equity is not being lost, with the use of the ‘follow’ command.
Robots.txt files are best for disallowing a whole section of a site, such as a category whereas a meta tag is more efficient at disallowing single files and pages. You could choose to use both a meta robots tag and a robots.txt file as neither has authority over the other, but “noindex” always has authority over “index” requests.
On the other hand, if your concern is bandwidth, you should use the
robots.txt to prevent the robot from navigating to your site at all.
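For example, a minimal robots.txt that blocks all crawlers from one directory might look like the following (the path here is hypothetical):

```text
User-agent: *
Disallow: /private/
```

A crawler that honors the file will skip everything under /private/ without requesting those pages, which is what saves the bandwidth.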
If you use the
<meta> approach, there are a few ways to tailor it.
You can specify which robots to target. For example,
<meta name="robots"> is a generic tag meant to apply to all robots, while
<meta name="googlebot"> would apply to Google’s crawler only.
You can also supply various directives. Per MDN, the list of directives is:
| Directive | Description | Supported by |
| --- | --- | --- |
| index | Allows the robot to index the page (default). | All |
| noindex | Requests the robot to not index the page. | All |
| follow | Allows the robot to follow the links on the page (default). | All |
| nofollow | Requests the robot to not follow the links on the page. | All |
| all | Equivalent to index, follow. | |
| none | Equivalent to noindex, nofollow. | |
| noarchive | Requests the search engine not to cache the page content. | Google, Yahoo, Bing |
| nosnippet | Prevents displaying any description of the page in search engine results. | Google, Bing |
| noimageindex | Requests this page not to appear as the referring page of an indexed image. | |
| nocache | Synonym of noarchive. | Bing |
It’s worth noting that how any crawler responds to these directives is ultimately determined by the crawler, and using these directives is not a guarantee that a crawler will respect them. The same is true of a robots.txt file.
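As a sanity check, you can see how a well-behaved crawler would interpret a robots.txt file using Python’s standard-library urllib.robotparser (the rules and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch these
# from https://example.com/robots.txt instead of hard-coding them.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler would skip the disallowed path...
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
# ...but is free to fetch everything else.
print(rp.can_fetch("*", "https://example.com/index.html"))  # True
```

This only models a crawler that chooses to obey the rules; a misbehaving bot can ignore robots.txt entirely, which is exactly the caveat above.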
Google’s documented list of directives is found here.
You can provide multiple directives in a few different ways:
- Use multiple <meta> tags to specify different crawlers.
- Supply multiple comma-separated directives in a single tag.
For example, to prevent any robots from indexing while still allowing Google’s crawler to follow links, you could do:
<head>
  <meta name="robots" content="noindex" />
  <meta name="googlebot" content="follow" />
</head>
On the other hand, you can supply multiple directives simultaneously, for example, preventing indexing while allowing link following:
<head>
  <meta name="robots" content="noindex,follow" />
</head>
Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!