We recently discovered that our robots.txt file wasn’t configured properly, and pages we were expecting to be excluded from search engines were still being crawled and indexed.
In investigating the fix, I found several noteworthy resources:
robotstxt.org was quite helpful in explaining how robots.txt works.
Besides the robots.txt approach, you can also provide directives to robots via a <meta> tag.
InFrontDigital had a nice summary of the differences between using a
robots.txt and the meta tag:
In general terms, if you want to deindex a page or directory from Google’s Search Results then we suggest that you use a “Noindex” meta tag rather than a robots.txt directive as by using this method the next time your site is crawled your page will be deindexed, meaning that you won’t have to send a URL removal request. However, you can still use a robots.txt directive coupled with a Webmaster Tools page removal to accomplish this.
Using a meta robots tag also ensures that your link equity is not being lost, with the use of the ‘follow’ command.
Robots.txt files are best for disallowing a whole section of a site, such as a category whereas a meta tag is more efficient at disallowing single files and pages. You could choose to use both a meta robots tag and a robots.txt file as neither has authority over the other, but “noindex” always has authority over “index” requests.
On the other hand, if your concern is bandwidth, you should use the
robots.txt to prevent the robot from navigating to your site at all.
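For example, a minimal robots.txt that blocks all crawlers from one directory might look like the following (the path here is hypothetical):

```text
User-agent: *
Disallow: /private/
```

A crawler that honors the file will skip everything under /private/ without requesting those pages, which is what saves the bandwidth.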
If you use the
<meta> approach, there are a few ways to tailor it.
You can specify which robots to target. For example,
<meta name="robots"> is a generic tag meant to apply to all robots, while
<meta name="googlebot"> would apply to Google’s crawler only.
You can also supply various directives. Per MDN, the list of directives is:
| Directive | Description | Supported by |
| --- | --- | --- |
| index | Allows the robot to index the page (default). | All |
| noindex | Requests the robot to not index the page. | All |
| follow | Allows the robot to follow the links on the page (default). | All |
| nofollow | Requests the robot to not follow the links on the page. | All |
| all | Equivalent to index, follow. | |
| none | Equivalent to noindex, nofollow. | |
| noarchive | Requests the search engine not to cache the page content. | Google, Yahoo, Bing |
| nosnippet | Prevents displaying any description of the page in search engine results. | Google, Bing |
| noimageindex | Requests this page not to appear as the referring page of an indexed image. | |
| nocache | Synonym of noarchive. | Bing |
It’s worth noting that how any crawler responds to these directives is ultimately determined by the crawler, and using these directives is not a guarantee that a crawler will respect them. The same is true of a robots.txt file.
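As a sanity check, you can see how a well-behaved crawler would interpret a robots.txt file using Python’s standard-library urllib.robotparser (the rules and URLs below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch these
# from https://example.com/robots.txt instead of hard-coding them.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler would skip the disallowed path...
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
# ...but is free to fetch everything else.
print(rp.can_fetch("*", "https://example.com/index.html"))  # True
```

This only models a crawler that chooses to obey the rules; a misbehaving bot can ignore robots.txt entirely, which is exactly the caveat above.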
Google’s documented list of directives is found here.
You can provide multiple directives in a few different ways:
- Use multiple <meta> tags to specify different crawlers.
- Supply multiple comma-separated directives in a single tag.
For example, to prevent any robots from indexing while still allowing Google’s crawler to follow links, you could do:
<head>
  <meta name="robots" content="noindex" />
  <meta name="googlebot" content="follow" />
</head>
On the other hand, you can supply multiple directives simultaneously, for example, preventing indexing while allowing link following:
<head>
  <meta name="robots" content="noindex,follow" />
</head>
Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!