As part of the IMEC Labs Test Group, we have been running some tests to see whether or not tweeting a link to a page that is otherwise invisible to Google can cause it to be crawled and indexed. The short story is this: yes, it might. [Tweet This!]
In our two test samples, one of the pages did get crawled and indexed, and the other one did not. This means that you can potentially use Twitter to get your page discovered and indexed by Google; however, there is no guarantee that it will work. To understand why this is significant, and how we tested it, please read on!
The Short Story
We had 3 total pages (all on the PerficientDigital.com domain) in the test, as follows:
Take note of the last column, as it shows how we tested the page. Basically, we had people publish tweets that included a link to the tested page, with different people being used for each of the two test pages. The goal was to see if these tweets would be enough to get the page crawled and indexed by Google. Here were the basic test results:
Note that the #singersongwriter page got crawled by Googlebot and indexed quite rapidly, but the #searchengines page was never crawled by Googlebot or indexed at all. The Control Page was never accessed by anyone at all, except by my Safari browser when I checked it after the initial upload.
So why did one page get indexed, and not the other? This table provides a summary of the differences between the two:
- Higher authority people tweeted the #singersongwriter page (we used Followerwonk Social Authority to determine this)
- Different hashtags were used for the 2 pages, though we thought the #searchengines tag was more likely to get picked up by Google
- Google indexed two of the tweets from the #singersongwriter group
- Google indexed different pages referencing the tweets, including Twitter profiles, the hashtag page itself, and tweet-replicating sites
One of the pages that Google indexed related to the #singersongwriter page was on hub.uberflip.com, a site that has a Moz Domain Authority of 64. Perhaps that triggered the indexation, but that was not in the equation until well after the page was already indexed.
Which one of these factors triggered the crawl and indexation? Given the extremely short time frame between the first tweet and the first crawl by Googlebot, we believe it’s extremely likely that Google saw the tweet on Twitter first, before it saw it on any third-party site (hub.uberflip.com or any other tweet-replicating site).
Even if this was not the driving factor, you can still conclude that exposing a new web page to the world for the first time via a tweet can lead to it being crawled and indexed.
The Long Story
This section goes into much greater detail on our methodology, as well as the data collected during the test. The basic concept of this test was as follows:
- We wrote 3 brand new articles for the test.
- We uploaded them by FTP to this web site, PerficientDigital.com.
- In other words, we did not upload this through WordPress, which is the CMS for PerficientDigital.com.
- No links were created pointing to the new web pages.
- After uploading them, I checked them out using Safari. I chose this browser as there was no Google toolbar installed in my Safari browser.
- One of the 3 test files was ignored after that, to act as a control to verify that the procedures outlined in the steps above were executed correctly.
These mechanics were used in order to minimize the chances of Google discovering the pages by any means other than our test. Once this was done, we asked a small number of IMEC panelists to participate in the test by tweeting a link to the web page. They were emailed and sent to this page for instructions. Then some of these panelists executed tweets, such as this one by Rand:
After the emails requesting the tweets, we monitored the process for a period of 8 days to see what would happen. This also involved a number of steps:
- We saw which of our participants tweeted as requested, and logged when their tweets occurred (we also kept links to each of their actual tweets)
- We checked every day to see if the pages got indexed. We did this using a query such as [songwriter site:PerficientDigital.com] (without the []), so that the act of executing the query would not inform Google about the existence of the page.
- We checked the log files for PerficientDigital.com to see what various user agents had come to the site. We looked at every single user agent to see if there were any “corrupting influences” prior to Googlebot first arriving at the page (a sketch of this kind of log check appears after this list).
- We monitored Open Site Explorer and Majestic SEO to see if the pages received any external links.
- We monitored Google itself to see what related to the test it was indexing using queries such as this one: [“This is a test by IMEC” singersongwriter] (without the []).
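For anyone who wants to run a similar check on their own site, here is a minimal sketch of that kind of user agent review. This is not the script we used; it assumes an Apache/Nginx combined log format, and the log path and page URL are placeholders.

```python
# Minimal sketch: tally every user agent that requested the test page.
# Assumes Apache/Nginx "combined" log format; path and URL are placeholders.
import re
from collections import Counter

LOG_PATH = "access.log"           # hypothetical log file location
TEST_PAGE = "/singersongwriter"   # hypothetical URL path of the test page

# In combined log format, the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if TEST_PAGE not in line:
            continue  # only count requests for the test page
        match = UA_PATTERN.search(line)
        if match:
            hits[match.group(1)] += 1

# List every distinct user agent, most frequent first, so any
# "corrupting influence" that arrived before Googlebot stands out.
for user_agent, count in hits.most_common():
    print(f"{count:4d}  {user_agent}")
```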
All of this monitoring was done daily to make sure we could verify that the test was as airtight as possible. Next, I will take a look at some of the details for the first test page (the #singersongwriter page), the one that got indexed. To start, here are all the log file accesses up until Googlebot’s first visit:
The Visitor Type column shows my notes on who the visitor to the page was, based on their user agent. If you look at this in detail you will see 5 different types of user agents:
- Browser User Agents – note that the first Safari access is by me after I uploaded the files to the Perficient Digital server
- Twitterbot, pretty much immediately after the first tweet
- Twitter scraper/replicators, such as Tweetmemebot, Tweetminster, InAGist, etc.
- Flipboard and GetPrismatic are in there, probably as a result of plugins in the IE browser that accessed the page seconds before they arrived
- Googlebot, arriving 7 minutes and 39 seconds after Twitterbot first arrived at the site.
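One caveat when reading logs this way: a “Googlebot” user agent string is easy to spoof. Google’s documented way to verify a genuine visit is a reverse DNS lookup followed by a forward lookup. A minimal sketch (the IP address shown is only an example):

```python
# Minimal sketch of the reverse/forward DNS check Google documents for
# verifying that a "Googlebot" log entry really came from Google.
import socket

def is_real_googlebot(ip_address: str) -> bool:
    try:
        # Reverse lookup: the host should be under googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip_address)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup: the host name must resolve back to the same IP.
        return socket.gethostbyname(host) == ip_address
    except (socket.herror, socket.gaierror):
        return False

# Example check against an address pulled from the logs (example IP):
print(is_real_googlebot("66.249.66.1"))
```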
Of the 6 people that tweeted the #singersongwriter page, 2 of their tweets were indexed. However, at the time that Googlebot first arrived at the site, only one of the tweets had been sent out, and that person’s tweet was not one of the ones that got indexed. In addition, that person’s Social Authority was actually the lowest one in that test group (they had a Followerwonk Social Authority of 50). Now isn’t that fun to think about?
What about the other sites/pages that Google indexed that referenced the tweets? Here is a table that shows what we found there:
There are a number of differences among the pages Google indexed that displayed the text of the test tweets in some fashion. Perhaps the most significant one was hub.uberflip.com, because of its Moz Domain Authority of 64. However, Googlebot had been to the #singersongwriter page long before hub.uberflip.com had any pages indexed.
We know this because Googlebot had already been at the #singersongwriter page within 7 minutes and 39 seconds of the first tweet, and that person’s tweet was never picked up by hub.uberflip.com. In fact, the tweets that were picked up by hub.uberflip.com were still more than an hour away from being tweeted at the time Googlebot made its first visit.
Final Thoughts
In summary, we believe it’s almost certainly the case that Google saw the initial tweet on Twitter, and that this caused the first visit by Googlebot to the #singersongwriter page. Given the Followerwonk Social Authority level of 54, this was not triggered by the highest-authority people who tweeted that page.
Even if Google did happen to first see the page on a site that replicates tweets, this still shows that it’s possible to get a page crawled, and then indexed, through Twitter promotion alone, even when that page has no initial links pointing to it.
Thanks to the IMEC board (Rand Fishkin, Mark Traphagen, Cyrus Shepard, and David Minchala), and to the entire group of IMEC participants! And, for completeness, here is my Twitter handle.
Eric:
Super post with great data screenshots.
And, nice use of Schema in your post’s HTML as well.
tnx,
chris
Thanks for your clear description of testing here – will use this in class as an example this spring : )
“One of the pages that Google indexed related to the #singersongwriter page was on hub.uberflip.com, a site that has a Moz Domain Authority of 64. Perhaps that triggered the indexation, but that was not in the equation until well after the page was already indexed.”
Domain Authority, of course, has absolutely no impact on Google’s crawl priorities but it can reflect the link pathways that might lead to a quick fetch from a search engine. On the other hand, low-value pages are easily indexed if they are PINGing the search services, and some of the aggregators may PING. Any blog that carries a widget displaying Tweets or Tweet aggregator content will certainly PING. PINGing triggers crawl.
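For reference, such a ping is just a small XML-RPC call using the weblogUpdates.ping method. A minimal sketch with Python’s standard library, pointed at the public Ping-O-Matic endpoint (the blog name and URL are placeholders):

```python
# Minimal sketch of a weblogUpdates "ping" via XML-RPC.
import xmlrpc.client

# Ping-O-Matic relays pings to multiple blog/search services.
server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")

# weblogUpdates.ping(site_name, site_url) -- both values are placeholders.
response = server.weblogUpdates.ping("Example Blog", "http://www.example.com/")
print(response)  # typically a dict like {'flerror': False, 'message': '...'}
```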
Google can extract and queue a link for crawling from a newly fetched page and therefore crawl the link before the first page appears in the index. Hence, you cannot use index times to gauge when a link was extracted.
“Which one of these factors triggered the crawl and indexation? Given the extremely short time frame between the first tweet and the first crawl by Googlebot, we believe it’s extremely likely that Google saw the tweet on Twitter first, before it saw it on any third-party site (hub.uberflip.com or any other tweet-replicating site).
“Even if this was not the driving factor, you can still conclude that exposing a new web page to the world for the first time via a tweet can lead to it being crawled and indexed.”
Twitter used to drive a lot of Web crawl until it started using “rel=’nofollow'” link attributes. In fact, it was all that crawl which most people in the SEO community mistook for “social signals” (it was simply that the links were being counted by Google).
Uberflip publishes the t.co URLs as plain text very soon after the Tweets are published, and one should ask if Google pulled the URL from Uberflip and then resolved the redirection very quickly. However, Twitter blocks all well-behaved bots on t.co via robots.txt.
So that means that either Google is ignoring both “rel=’nofollow'” and “robots.txt” or else something else triggered the crawl. It could, of course, interpret the redirection destination from content that includes both the t.co and the original destination link. If you can positively rule out all other possible sources of the link, your experiment may point toward a process by which Google passively fetches links from Tweets despite its claims to honor the “nofollow” attribute.
However, this is not a conclusive test since you cannot show a clear pathway for Google to “legally” get to the destination.
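For anyone who wants to see the robots.txt restriction for themselves, a minimal sketch using only Python’s standard library (the shortened t.co path is hypothetical):

```python
# Minimal sketch: check what t.co's robots.txt permits for a given bot.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://t.co/robots.txt")
parser.read()

# A shortened URL (hypothetical path), checked against the rules a
# well-behaved crawler such as Googlebot is expected to obey.
print(parser.can_fetch("Googlebot", "https://t.co/AbCdEf123"))
```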
Does using goo.gl as a link shortener make any difference?
Hello Eric Sir,
Such a great explanation you have provided to us in your content.
I think that if you don’t use a hashtag but your tweet is retweeted by 5 or 6 followers, then it will also get indexed. Is that true?
Thanks for sharing such a great case study with us 🙂
Mr. Enge – do me a favor and improve my UX: add your Twitter handle to your Click to Tweet CTA so that we, the users, don’t have to search for it…
Pretty please? 😉
We have done this for clients. We would submit the URL via Google Webmaster Tools and also tweet the new URL, and have seen those pages indexed within a few days…
It’s not clear whether or not the hashtag is important. Perhaps we will test it without the hashtag next time. Keep in mind, also, that one of our two tested pages did NOT get indexed, so there is no guarantee that this will help you get indexed. It MIGHT.
Not something we have tested yet!
Hey!
Found this on G+ thanks to Ana’s post. This case study is really detailed, and I had even thought the search engine hashtag would get indexed first!
Nice surprise, and thank you for sharing!
Thanks for your explanation; I will wait for your next post without the hashtag 🙂
Hi Eric,
Recent news about Google gaining access to Twitter’s data again made me think of your post. http://searchengineland.com/report-google-will-get-access-twitters-firehose-214220
Any thoughts on how this impacts the likelihood of tweeted links getting indexed?
Dylan
Dylan – I would bet that indexation will soar. The crawling effort is a non-trivial burden relative to the value per tweet. But if it comes in via the firehose, that’s a different matter altogether.
Thank you for sharing such an in-depth study.
I’m assuming that the new pages weren’t added automatically to your sitemap either?
That’s correct; Google had no way to know about them other than our tweets.
You can index any page from Google Webmaster Tools; there is a limit of 50.
Getting backlinks to an unindexed page is a stone-age method.
This is fascinating information and definitely food for thought! Keep us all posted on this topic please! Thank you and best wishes
Hi guys, really interesting article.
You mentioned that you emailed the tweet to the test panel as part of the experiment, which would include a live link to the new pages.
Therefore I was wondering if you had checked which email clients had been used as part of the sending process, if only to rule out the slight possibility of Google picking up these new links from, say, the Gmail client?
Ian – the email did not contain the link in the body of the email. It sent people to a web page where they would then click a button that automatically put together the tweet for them, using a service called Click to Tweet. So the link itself was never exposed in the email.
In addition, one of the pages was indexed and the other one was not, yet each panel had 3 Gmail users.
So my guess is that the Gmail usage did not impact the test.
It could of course be that Google is actually adhering to nofollow and robots.txt (which mostly applies to the t.co links) yet be following the data-expanded-url. Google doesn’t exactly inform people of its changes as much as people think and it wouldn’t be hard for them to adopt a process of scanning the data without violating existing terms they set forth. (there are no laws to robots.txt nor nofollow – google created these rules themselves and are not obligated to follow them).
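To make that concrete, here is a minimal sketch of pulling data-expanded-url values out of tweet markup. It uses the third-party BeautifulSoup library, and the HTML shown is a simplified stand-in for Twitter’s actual page markup:

```python
# Minimal sketch: read destination URLs from the data-expanded-url
# attribute without ever resolving the robots.txt-blocked t.co redirect.
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Simplified stand-in for fetched tweet markup (both URLs are placeholders).
html = """
<a href="https://t.co/AbCdEf123"
   data-expanded-url="http://www.example.com/new-page">t.co/AbCdEf123</a>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", attrs={"data-expanded-url": True}):
    print(link["data-expanded-url"])
```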
Does it work with other social media? (Facebook, Google+, LinkedIn)
That it works with Google+ has been known since its earliest days (and makes total sense). We haven’t tested other networks yet.
Could it be niche specific, perhaps in a similar way that links are (a spammy link profile in one niche may not be spammy in another)? SEO-related terms in general are much more competitive and likely to be gamed than singer-songwriter terms. Also, a side note: I am not sure that the results can be considered at all conclusive, or even suggestive of anything solid, considering this was run on only three pages in significantly different niches, with significantly different metrics, etc.
There is some truth to the notion that some algos may behave differently in one niche vs. another. With links, though, in a major market like the US, I think spam will not have much that’s defined differently. It’s not impossible that it will be treated differently in a market where very few quality links exist.