The availability of the Robots.txt NoIndex directive is little known among webmasters largely because few people talk about it. Matt Cutts discussed Google’s support for this directive back in 2008. More recently, Google’s John Mueller discussed it in this Google Webmaster Hangout. In addition, Deepcrawl wrote about it on their blog.
Given the unique capabilities of this directive, the IMEC team decided to run a test. We recruited 13 websites that would be willing to take pages on their site and attempt to remove them from the Google index using robots.txt NoIndex. Eight of these created new pages, and five of them offered up existing pages. We waited until all 13 pages were verified as being in the Google index, and then we had the webmasters add the NoIndex directive for that page to their Robots.txt file.
This post will tell you whether or not it works, explore how it gets implemented by Google, and help you decide whether or not you should use it.
Difference Between Robots Metatag and Robots.txt NoIndex
This is a point that confuses many, so I am going to take a moment to lay it out for you. When we talk about a Robots Metatag, we are talking about something that exists on a specific webpage. For example, if you don’t want your www.yourdomain.com/contact-us page in the Google index, you can put the following code in the head section of that webpage:
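```html
<meta name="robots" content="noindex">
```

(If you also want search engines to ignore the links on the page, you can use `content="noindex, nofollow"` instead.)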
For each page on your website that you don’t want indexed, you can use this directive. Once Google recrawls the page and sees the directive, it should remove the page from its index. However, implementing this directive, and having Google (or Bing) remove the page from the index, does not tell them to stop recrawling the page. In fact, they will continue to crawl the page on an ongoing basis, though search engines may, over time, choose to crawl that page somewhat less often.
A common mistake that many people make is implementing a Robots Metatag on a page while also blocking crawling of that page in Robots.txt. The problem with this approach is that search engines can’t read the Robots Metatag if they are told not to crawl the page.
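To make the trap concrete, here is what that broken combination looks like (the paths are placeholders):

```
# robots.txt
User-agent: *
Disallow: /contact-us/
```

```html
<!-- In the head of /contact-us/ — never seen, because the crawl is blocked -->
<meta name="robots" content="noindex">
```

The crawler obeys the Disallow line, never fetches the page, and therefore never discovers the NoIndex instruction sitting in its head section.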
In contrast, implementing a NoIndex directive in Robots.txt works a bit differently. If Google, in fact, supports this type of directive, it would allow you to combine the concept of blocking a crawl of a page and NoIndex-ing it at the same time. You would do that by implementing directive lines in Robots.txt similar to these two:
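```
Disallow: /page-to-remove/
Noindex: /page-to-remove/
```

(The /page-to-remove/ path is just a placeholder; substitute the path of the page you want kept out of the index.)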
Since reading the NoIndex instruction does not require loading the page, the search engine would be able to both keep it out of the index AND not crawl it. This is a powerful combination! There is one major downside that would remain though, which is that the page would still be able to accumulate PageRank, which it would not be able to pass on to other pages on your site (since crawling is blocked, Google can’t see the links on the page to pass the PageRank through).
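As an aside, you can see how nonstandard the NoIndex line is by running it through an ordinary robots.txt parser. This minimal sketch uses Python’s built-in parser, which honors the Disallow line but silently skips lines it doesn’t recognize, such as Noindex (the path is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body that combines Disallow with the
# nonstandard Noindex line. Generic parsers honor Disallow
# but simply ignore directives they don't recognize.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /page-to-remove/
Noindex: /page-to-remove/
""".splitlines())

print(rp.can_fetch("*", "https://www.yourdomain.com/page-to-remove/"))  # False: crawl blocked
print(rp.can_fetch("*", "https://www.yourdomain.com/"))                 # True: rest of site crawlable
```

Whether a given search engine does anything with the Noindex line is entirely up to that engine, which is exactly what this test set out to measure.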
On September 22, 2015, we asked the 13 sites to add the NoIndex directive to their Robots.txt file. All of them complied with this request, though one of them had a problem with it: implementing the directive caused their web server to hang. Since this server crash was unexpected, I tested this on PerficientDigital.com for a given page, and it also caused a problem for our server, resulting in this message:
Later on, I retested this, and the cause dawned on me: I had implemented the NoIndex directive in my .htaccess file instead of in robots.txt. Evidently, this will hang your server (or at least some servers), but it was operator error! I have since tested implementing it in Robots.txt without any problems. However, I discovered this only after the testing was completed, which means we had 12 sites in the test.
Our monitoring lasted for 31 days. Over this time, we tested each page every day to see if it remained in the index. Here are the results that we saw:
Now that’s interesting! 11 of the 12 pages tested did drop from the index, with the last two taking 26 days to finally drop out. This is clearly not a process that happens the moment that Google loads Robots.txt. So what’s at work here? To find out, I did a little more digging.
Speculation on What Google is Doing
My immediate speculation was that Google only executes the Robots.txt NoIndex directive at the time it recrawls the page. To test this, I decided to dig into the log files of some of the sites tested. The first thing you notice is that Googlebot is loading the Robots.txt files for these sites many times per day. What I did next was review the log files for two of the sites, starting with the day each site implemented the Robots.txt NoIndex and ending either on the day Google finally dropped the page from the index or, for the page that was never removed, at the end of our monitoring.
The goal was to test my theory, but what I saw surprised me. First was the site that never came out of the index during the time of our test. For that one, I was able to get the log files from September 30 through October 26. Here is what the log files showed me:
Remember that for this site the target page was never removed from the index. Next, let’s look at the data for one of the sites where the target page was removed from the index. Here is what we saw for that page:
Now that’s interesting. Robots.txt is regularly accessed as with the other site, but the target page was never crawled by Google, and yet it was removed from the index. So much for my theory!
So then, what is going on here? Frankly, it’s not clear. Google does not act on the NoIndex directive every time it reads a Robots.txt file, nor is it under any obligation to do so. That’s what led to my speculation that Google might wait until it next crawls the page and consider NoIndex-ing it then, but clearly that’s not the case.
After all, the two sets of log files I looked at both contradicted that theory. On the site that never had the target page removed from the index, the page was crawled five times during the test. For the other site, where the page was removed from the index, the target page was never crawled.
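For anyone who wants to repeat this kind of log review on their own site, here is a rough Python sketch. It assumes an Apache-style combined log format; the sample lines below are invented for illustration, and the regex will likely need adjusting for your server’s log layout:

```python
import re
from collections import Counter

# Sketch: tally Googlebot hits by requested path from an Apache-style
# combined access log. Adjust the regex if your log format differs.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_hits(log_lines):
    hits = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and "Googlebot" in m.group("ua"):
            hits[m.group("path")] += 1
    return hits

# Invented sample lines for illustration only.
sample = [
    '66.249.66.1 - - [30/Sep/2015:06:25:01 +0000] "GET /robots.txt HTTP/1.1" '
    '200 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [30/Sep/2015:06:27:44 +0000] "GET /target-page/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(googlebot_hits(sample))
```

Running something like this per day across the monitoring window is what surfaced the pattern above: frequent robots.txt fetches, with the target pages crawled rarely or not at all. (Note that user-agent strings can be spoofed; for a rigorous review you would also verify the requesting IPs via reverse DNS.)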
What we do know is that there is some conditional logic involved; we just don’t know what those conditions are.
Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index. That’s pretty useful in concept. However, our tests didn’t show 100 percent success, so it does not always work.
Further, bear in mind that even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice, if you prefer that term). And PageRank still matters: the latest Moz Ranking Factors results still weigh different aspects of links as the two most important factors in ranking.
In addition, don’t forget what John Mueller said, which was that you should not depend on this approach. Google may remove this functionality at some point in the future, and the official status for the feature is “unsupported.”
When to use it, then? Only in cases where 100 percent removal is not an absolute necessity, and where you don’t mind losing the PageRank that accumulates on the pages you remove this way.
Thanks to the IMEC board: Rand Fishkin, Mark Traphagen, Cyrus Shepard, and David Minchala. And, for completeness, here is my Twitter handle.
Thanks also to the publishers participating in this test. Per agreement among the IMEC team, participants remain anonymous, but you guys are awesome! You know who you are.