So here is the question: Does Google scan Gmail messages for the URLs shared within them, and does it then use these to discover new content? There are many who adamantly maintain that it does. So the IMEC Labs Test Group decided to put that to the test. In this post, we will report our results on whether or not Google is reading your gmails to see what you are linking to.
So here is your answer in a nutshell: Our tests showed no evidence of Google crawling URLs that were shared within Gmails. None. Want more detail on why we say that? Read on to get either the “Short Story”, or the “Full Story”, below.
The Short Story
We posted 4 pages in total (all on the PerficientDigital.com domain) in the test, and then asked different groups of users to email links to those pages to either Mark Traphagen, Cyrus Shepard, David Minchala, or myself (these are 4 of the 5 members of the IMEC board; the 5th is Rand, and you can read more about the board below). Sending the email required only two button clicks, and the email content was pre-configured, so the task was easy for participants.
We asked 20 to 22 people per article to send gmails sharing the link to that article. One group was asked to share article 1, a different group was asked to share article 2, and so forth. The goal was to see if Google would spot these links in the gmails, and then crawl and index those URLs. We had somewhat uneven levels of participation: article 1 got the most shares and article 4 the least. This simply reflected the people we asked to send out gmails following through at different levels. Here were the basic test results:
So as you can see, well, there was very little to see. The results were wholly unremarkable, and that’s the most remarkable thing about them!
The Long Story
This section goes into much greater detail on our methodology, as well as the data collected during the test. The basic concept of this test was as follows:
- We wrote 4 brand new articles for the test.
- These articles were mapped into hand-coded web pages that included NO Google code. No Google Analytics, no Google+ button, etc.
- We uploaded the 4 resulting HTML pages by FTP to PerficientDigital.com.
- We did not upload them through WordPress, which is the CMS for PerficientDigital.com.
- No links were implemented to the new web pages.
- After uploading them, I checked them out using Safari. I chose this browser as there was no Google toolbar installed in my Safari browser.
The reason for all these mechanics was to make sure that the pages were completely unknown to Google at the start of the test. We also forbade participants in the test from visiting the web pages. This was critical to the test, as different browsers, or browser plugins, can trigger discovery of content by Google. For example, it’s known that the Google+ +1 button will call the Google+ API on a page load, and this can trigger Google crawling a page.
Various SEO toolbars that people may install in their browser can be a problem too. For example, we had to abort one attempt at this test because a YouTube-related plugin in Firefox caused Googlebot Mobile visits, which invalidated that attempt (note that we are going to write this up in a separate article sometime soon!). However, in the final edition of the test, we were able to verify that there were no corrupting elements.
When we launched the test I am reporting on today, each participant was sent to a page with a set of “Send Test Mail Option x” buttons. All the participants needed to do was click on one of those buttons, which would pre-populate their gmail client with an email. They would then simply click send, and that was it.
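For readers who want to set up a similar two-click flow, here is a minimal sketch of how a pre-populated email link could be generated. It is not the actual test page: it assumes a standard `mailto:` link, and the recipient address, subject, body, and the `build_mailto` helper are all hypothetical placeholders.

```python
from urllib.parse import quote


def build_mailto(recipient, subject, body):
    """Build a mailto: link that opens a pre-filled draft in the user's mail client.

    Hypothetical helper for illustration; the real test page may have
    pre-populated the email differently.
    """
    return "mailto:{}?subject={}&body={}".format(
        quote(recipient, safe="@"),  # keep the @ readable in the address
        quote(subject, safe=""),     # percent-encode spaces and punctuation
        quote(body, safe=""),        # the shared URL is encoded as body text
    )


# Placeholder values, not the addresses or URLs used in the test.
link = build_mailto(
    "tester@example.com",
    "Test mail option 1",
    "See https://example.com/test/article-1.html",
)
print(link)
```

Placing a link like this behind each button means the participant never has to type or visit the test URL, which matters here because visiting the page could itself tip off Google.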
The test was launched on March 9th. We then monitored the pages for 12 days to see what transpired. This monitoring involved a number of steps:
- We tracked the gmails sent (each gmail was sent to one of 4 of the 5 members of the IMEC board, specifically Mark Traphagen, Cyrus Shepard, David Minchala, or myself)
- We checked the log files every day to see if Googlebot (or other Google programs) visited the page. This also allowed us to monitor that all of our participants followed our instructions and did not visit the pages.
- We used search queries such as [songwriter site:PerficientDigital.com] (without the []) to see if the pages appeared in the Google index. We chose queries like this so that the act of executing the query would not inform Google about the existence of the page.
- We monitored Open Site Explorer and Majestic SEO to see if the pages received any external links.
All of this monitoring was done daily to make sure we could verify that the test was as airtight as possible.
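The daily log check can be sketched in a few lines of Python. This is an illustration rather than the script we used: it assumes the server writes the common "combined" access log format, and the test page paths and sample log lines below are invented for the example.

```python
import re
from collections import Counter

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical paths standing in for the test pages.
TEST_PATHS = {"/test/article-1.html", "/test/article-2.html"}

# Substrings looked for in the user-agent string (matched case-insensitively).
CRAWLER_MARKERS = {"googlebot": "Googlebot", "bubing": "BUbiNG"}


def classify(agent):
    """Label a user-agent string as a known crawler or 'other'."""
    lowered = agent.lower()
    for marker, label in CRAWLER_MARKERS.items():
        if marker in lowered:
            return label
    return "other"


def summarize(lines, test_paths=TEST_PATHS):
    """Count hits on the test pages, keyed by (path, visitor label)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None or match.group("path") not in test_paths:
            continue  # skip unparseable lines and unrelated pages
        hits[(match.group("path"), classify(match.group("agent")))] += 1
    return hits


# Invented sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [01/Jan/2000:10:00:00 +0000] "GET /test/article-1.html HTTP/1.1" '
    '200 512 "-" "BUbiNG (+http://law.di.unimi.it/BUbiNG.html)"',
    '198.51.100.7 - - [01/Jan/2000:10:05:00 +0000] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for (path, visitor), count in summarize(sample).items():
    print(path, visitor, count)
```

Any row labeled "other" on a test page would point to a participant or an unknown client visiting the page. Keep in mind that user-agent strings can be spoofed, so a match on "Googlebot" alone does not prove the visit came from Google.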
Ultimately, the bottom line is that Googlebot never came to any of the test pages. Not even once. In addition, all of our test participants adhered to the instructions and never visited the pages, so we know that there were no corrupting influences. In any event, any corruption would have shown itself as a Googlebot visit, and since we had none, we can be confident in the results.
There were two curiosities in the test worthy of note:
- mail.google.com did visit one of the pages. Why this happened, we do not know. However, it did not lead to a Googlebot visit, or indexation of the impacted page.
- BUbiNG bot visited two of the pages on March 15th. This is a bot implemented by the University of Milan. It is not clear how they discovered the pages visited, but it seems likely that the emails were routed via servers they are monitoring.
However, neither of these curiosities changes the essential result, which is that none of the pages were visited by Googlebot, and none of them were indexed by Google.
Thanks to the IMEC board: Rand Fishkin, Mark Traphagen, Cyrus Shepard, and David Minchala, and the entire group of IMEC participants! And, for completeness, here is my Twitter Handle.
Nice test. I’ve always been curious about other Google tools too (Google Docs, Gchat, etc.) and whether those are crawled. However, this test makes me think probably not.
Nice to know that Google didn’t become as evil as some people think 😉
Great test! I’ve long argued that Googlebot DOES crawl URLs that are sent by gmail users, so it’s pretty clear that I was wrong.
As a followup, I’d be interested in seeing this tested again with a brand new domain that has never been crawled by Googlebot before.
One piece of information that would be useful to clarify… “Gmail” (i.e. a proper, pukka Gmail account) or “Google for work” (i.e. an account that might be branded Gmail, but has a bespoke domain).
Google Now doesn’t search Google for work email. It does search Gmail.
Secondly – that Milanese bot is by FAR the most concerning. If you sent an email from one Gmail account to another, it shouldn’t have gone anywhere NEAR a Milan server.
A long time ago, in an attempt to provide better ads, gmail would look at a URL in your mail and ask the search engine “has GoogleBot ever indexed this page?” If the answer was yes, then it would ask GoogleBot for the indexed content, and use it to generate more relevant ads. Unfortunately the ads were sometimes too relevant, in some cases giving away the punch lines to jokes and such. And sometimes it was just creepy. So they stopped.
As Eric Schmidt, the CEO of Google at the time, described it: “Google policy is to get right up to the creepy line and not cross it.”
I always suspect Google, as they have mostly done what they say they don’t and later on bring it in as an algorithm change. I remember your blog on Twitter indexation, where they looked at top profiles for indexing. Could that be the case with emails, as with those tweets? Not sure. But this study was a fabulous one. Thanks!
I was in the camp that thought they would have been using Gmail as a form of discovery similar to how they use Google+ and Tweets. A little relieved they are not.
I had thought they weren’t using it, but like you, it’s a bit of a relief to confirm it!
I would expect them to stay away from *obviously* using any urls that were shared in a private setting like Gmail, but I’d also think that Gmail, as a gigantic data source for new urls, would be extremely hard for them to ignore.
I wonder if these urls would still not be indexed in 6 months? Also, I wonder what would happen if these urls were emailed to 2000 people instead of 20?
Even though Google is not obviously crawling the mails it could actually be a ranking factor, when you see mails as a social tool. But on the other hand it would cross the line to personal privacy…
nice test, thx for sharing the insights!
Great test, thanks for sharing the results publicly! I was also in the camp that thought they would have been using Gmail as some form of discovery, at least for trends on a macro-level.
I can tell you that Gmail does follow links in emails… I just did the test with a custom web-app. Google keeps following the links contained in the verification email despite the error messages…
It’s 2020 now and I can confirm gmail visits some links in emails. It invalidated some of our unique one-time links and users weren’t able to log in.