On May 5th, 2016 I had the opportunity to do a live video event with Google’s Gary Illyes. We focused on strategies for controlling page indexing, Google’s crawl, and overall site quality. Mark Traphagen then took us through a Q&A session with questions from the audience.
Some key areas where the discussion brought out some gems include:
- How NoIndex and rel=canonical affect crawling
- Why using NoFollow on internal links is simply a bad idea
- What happens when you implement non-reciprocal hreflang tags
- The presence of link signals other than PageRank
- What pages can you direct a rel=canonical to
- And, much more!
Below is the complete transcript, along with some specific notes from me on key points. Or, you can watch the full video here:
Full Transcript
Eric: Thanks for joining us on today’s Virtual Keynote. Thanks so much for coming. Very excited to have Gary Illyes back with us today.
Eric: The hashtag for today’s event is #VirtualKeynote. Gary is a webmaster trends analyst. He’s tinkered with the algorithms in certain specific areas, does a lot of support for people through the Webmaster Central blog, and speaks all over the place. An amazing resource. Thanks so much for joining us, Gary.
Gary: Wait, are we live?
Eric: Yes, we are.
Gary: Oh, but I haven’t fixed my hair yet.
Eric: Well, then maybe we should just hang up and start over. Is that what you want to do?
Gary: No. Thank you for having me.
Eric: Awesome. Today we’re going to take on many topics related to robot control, indexation control, SEO tags, and managing overall site quality. We’re not going to spend any time on how to implement these specific tags, because you can just use Google and search and learn how to do that. Instead we’re going to focus on how they work and what they do and the right situations to use them in.
We’re going to go through a lot of real-world scenarios, including some advanced ones, and the impact they have on a website. So if you’ve got an eCommerce site, you’ve got to know this stuff cold, and I’m going to share some examples that show just how big an impact these tags can have.
And with us of course is Mark Traphagen. He’s going to be the master of the Q&A session, which is starting in about 30 minutes. Also he will be actively interacting with you in YouTube comments for the event. So you can actually start interacting with him now and asking him all sorts of questions.
Gary: So I want to mention one more thing. Another thing that we are not going to talk about is this guy, which is a penguin.
Mark: I do think the audience needs to know that Gary drew that by himself, that Penguin, right live in front of us in about three minutes, just seconds ago. So it’s awesome.
Gary: It was actually four minutes.
Eric: Important skills that are in Gary’s portfolio here. Alright. If we’re ready, what I’d like to do is just start going through the various solutions you can use for helping with robot control and indexation control and these kinds of things, just to make sure we’re clear on what each of them do, and then we’ll get into the advanced scenarios after that. Does that make sense?
Gary: Can I say no?
Eric: You can say no. Go ahead, say no.
Gary: No.
Eric: And then what would you like to do instead?
Gary: Oh, let’s just do it.
Eric: Okay, there you go. For the NoIndex, my understanding is that this is a directive, not a suggestion. Is that correct?
Gary: Yes. Pretty much everything in the robots exclusion protocol is a directive, not a suggestion.
Eric: Yes. If you use this tag on a page, Google will pull your page from the index. The page will still be crawled by Google, and it still accumulates PageRank.
Gary: Yes, that’s correct.
Eric: And it can pass PageRank through the links found on the page.
Gary: And that’s correct as well. Yes.
Eric: It’s fairly inefficient at passing PageRank in one sense, because if it’s a page you’ve decided not to have in the index, it just spews it through all the links on the page.
Gary: Yes, no. I don’t think that there is any dissipation in passing PageRank. Another thing that I want to mention: you are focusing on PageRank, but there are a lot of other signals that are passed on or that together play a very important role in ranking pages. For example, hreflang. I just want to say that passing signals is a better way to phrase this than just PageRank.
Eric: Right. Other signals other than hreflang that you can mention?
Gary: Let’s not go there now. Maybe if we have time at the end, then we will cover that.
Eric: Okay. So then the other question that people typically have about NoIndex is, does the crawl frequency of a NoIndex page decline over time?
Gary: Yes. Typically, it will decline for any page that we cannot index for whatever reason. Basically, we will try to crawl it a few more times to see if the NoIndex is gone or if that page recovered from a 500 or whatever. And if the NoIndex is still there, then we will slowly start to not crawl that page as often. We will still probe that page every now and then. Probably every two months, every three months, we will visit the page again to see if the NoIndex is still there, but we will very likely not crawl it that often.
Eric: Right. Yes. So it will decline, and you do get some robot control with NoIndex in some sense, because the crawl rate gets dropped down a bit. So that’s good to know.
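(Note: for reference, the NoIndex directive discussed above is normally set with a meta robots tag in the page head, or with an X-Robots-Tag HTTP header. A minimal, generic illustration:)

```html
<!-- Meta robots noindex: the page can still be crawled and pass signals,
     but it is kept out of the index. -->
<meta name="robots" content="noindex">
```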
Next up, rel=canonical. Basically the rel=canonical, as I understand it, says, “Hey, I’m not the page you need to be paying attention to, Google. The real page I want you to pay attention to is over here where I’m pointing. Pass all my signals over there.” And the assumption is that the page with the rel=canonical is a subset or a duplicate of the page it points its canonical to.
Gary: Yeah. It can be pretty much any page, actually. It doesn’t have to be a subset or anything. Basically, if you have a random page on a site and you don’t want to have that indexed anymore, then you can suggest to search engines with the rel=canonical that they pick up another page instead of that one, and they will pass on pretty much all signals, all relevant signals.
Eric: And for the most part, the page with the rel=canonical on it doesn’t end up in the index. It still gets crawled, it still accumulates signals, but it passes all those signals to the target of its canonical.
Gary: Correct.
Eric: And then presumably then the crawl frequency of that will also decline over time.
Gary: Yes. Since it isn’t the canonical, we will not spend that much time crawling it. It’s probably not worth it anyway. You probably had a good reason to point the rel=canonical somewhere else, and we would rather spend time crawling that other page instead, the one you pointed to with the rel=canonical.
Eric: Now for a fun fact, by the way: one of the world’s largest retailers canonicals all the facets in their navigation to their home page. Actually, there’s also a site that for a period of time this past fall, for three months…this is a really well-known site on the Internet, I won’t name them, but one really known for their SEO expertise…was doing exactly the same thing. They were canonicalizing all their facets to their home page. That seems like a broken implementation.
Gary: Well, it’s up to the webmaster, right? It’s up to you if you want your facets indexed. Probably it’s not the brightest thing to do. Probably it’s better to pick a few important pages rather than the home page where you rel=canonical. I do agree that you probably don’t want to have all the facets indexed.
It’s probably not worth it because some of them will just be duplicates of previously crawled and indexed facets. So I would probably rather just pick, or maybe create, a category page for the most important facets and rel=canonical there if you don’t want all the facets indexed.
Eric: Right, and what you’re sharing here today though, it sounds like relevancy isn’t as major an issue as most of us in the SEO community tend to believe.
Gary: In this case, no.
Eric: Yeah. But does it still make sense to start with the assumption that you probably want to canonical to something highly relevant, and just not get too hung up on being religious about that?
Gary: Yes.
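(Note: a rel=canonical is a link tag in the head of the non-preferred page that points at the URL you want indexed. A minimal sketch with a placeholder URL:)

```html
<!-- On the duplicate or variant page: suggest the preferred URL to search engines. -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
```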
Related: Do eCommerce Sites Completely Mess Up Their SEO? (study)
Eric: Okay, excellent. Alright, next up, robots.txt. Directive. Google won’t crawl the page. The page could still be indexed if other signals suggest it should be. And of course, it doesn’t pass signals because you can’t actually see what signals it’s trying to pass.
Gary: Yes. That’s correct.
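(Note: robots.txt directives live in a plain text file at the root of the site. A minimal sketch with placeholder paths:)

```
# Block all crawlers from these directories. The URLs can still end up indexed
# from other signals, but their content won't be read.
User-agent: *
Disallow: /checkout/
Disallow: /internal-search/
```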
Eric: Awesome. That was an easy one. Okay, now rel=”prev”/”next”. So this is used to identify a paginated sequence of pages. For example, you might have a hundred shoe products, show 10 shoes per page, and take 10 pages to show them all. You would normally have HTML links that users click to travel between those pages, but “prev”/”next” is a tagging system to help Google make sure it understands that this set of pages is a sequence.
Gary: Yes, that’s correct.
Eric: The pages aren’t removed from the index, however.
Gary: They become canonicals on their own as well, I think.
Eric: Yes. So the group of pages to some degree are treated by Google like one block.
Gary: Yes.
Eric: And does that mean that all the signals into any of the pages in that block accrue to the benefit of the whole block?
Gary: Yes. And that’s the whole point of the rel prev/next.
Eric: Correct. But you might still return a page other than the first in the sequence if a user query was more relevant to a specific product on the seventh page?
Gary: Exactly.
Eric: Anything that I left out with “prev”/”next”?
Gary: I don’t think so. I think you covered pretty much everything.
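(Note: rel=”prev”/”next” annotations go in the head of each page in the series. An illustration for page 2 of the shoe example above, with placeholder URLs:)

```html
<!-- On page 2 of the paginated series: point back to page 1 and forward to page 3.
     Page 1 would carry only a rel="next" tag; the last page only a rel="prev" tag. -->
<link rel="prev" href="https://www.example.com/shoes?page=1">
<link rel="next" href="https://www.example.com/shoes?page=3">
```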
Eric: Alright, excellent. So you’re the host of the show. No, never mind. Okay.
Gary: Let me give you a bit of background. I know that Eric has a ton of questions and I’m trying to be very short with my answers so we can cover as much as possible. And if he’s actually correct with what he’s saying, then I will not try to argue with that.
Eric: You’ll argue with me just for the sake of arguing with me, but that’s a wholly different…
Gary: Yes, that’s something different.
Eric: We’re trying to get through the basics and lay the foundation, because we do have a bunch of questions. So now, hreflang. Alright. So this is where we have different international versions of pages, and you implement a tag to tell Google where to find the French version of the page, or the French-for-Switzerland version of the page. Did I get that right?
Gary: Yes.
Eric: And handle different language cases so that it just helps Google find all the international versions. Yeah?
Gary: Yes. It’s not just helping find the pages but also rank the pages. So for example, if we know that there’s a French version for Swiss users and the query is in French or maybe even if the user interface is in French, then we will prefer to show first the French version of the page.
Eric: Right. So this helps sites with multiple languages and countries better target their audiences, and show the correct versions of the pages in the local search engines.
Gary: Yes.
Eric: One thing that people get tripped up on is, as I understand it, you can specify language only or language-and-country, but you can’t specify country only.
Gary: That’s correct.
Eric: Right. So the big question there, does tagging need to be 100% reciprocal?
Gary: Yes. If it’s not, then we will ignore those links from the cluster. So if A links to B but B doesn’t link back, then B will not get into the hreflang cluster.
Eric: A very common mistake we see with sites, by the way, is this one-way tagging. And you just heard it from Gary: that means the tag is ignored, so all the effort you put into putting that tag up there is now wasted.
Gary: Well, for that particular link. If the other links are correct, then we want to still use those.
Eric: Yes. So that handshaking or reciprocality…is that a word?
Gary: You just made one.
Eric: Well, I’ll call it reciprocal-ness instead. If I’m going to invent a word, I might as well make it harder to say. So that’s an essential part of this. Is this a problem if you’ve got 50 different language-country variations? That’s a lot of code to put on every page.
Gary: From that point of view…so from our point of view, processing-wise, I think we do limit the number of hreflang links that a page can have, but the number is something on the order of hundreds, because we believe that after a certain number of hreflang links, it’s probably just a mistake or spam or something like that.
Bytes-wise, probably you want to limit how much you put in the HTML or in the header, which is even worse. But you can definitely use sitemaps to specify your hreflang tags and then you don’t put additional payload in the HTML that can slow down the page.
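(Note: for reference, hreflang annotations can be placed in the head of each page, and as Gary explained above, every listed page has to carry the matching set pointing back or the one-way link is dropped from the cluster. As he mentions, they can also live in an XML sitemap instead, which keeps the HTML payload smaller. Both snippets below are generic illustrations with placeholder URLs and locales.)

```html
<!-- In the <head> of every language/country version; each version lists the full set,
     including itself. -->
<link rel="alternate" hreflang="en" href="https://www.example.com/page/">
<link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page/">
<link rel="alternate" hreflang="fr-ch" href="https://www.example.com/fr-ch/page/">
```

```xml
<!-- The same annotations in an XML sitemap: one <url> entry per version, each listing
     all alternates. The enclosing <urlset> needs xmlns:xhtml="http://www.w3.org/1999/xhtml". -->
<url>
  <loc>https://www.example.com/page/</loc>
  <xhtml:link rel="alternate" hreflang="en" href="https://www.example.com/page/"/>
  <xhtml:link rel="alternate" hreflang="fr" href="https://www.example.com/fr/page/"/>
  <xhtml:link rel="alternate" hreflang="fr-ch" href="https://www.example.com/fr-ch/page/"/>
</url>
```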
Eric: That’s important, that there are those alternative ways to put up hreflang. Cool. So now we start to get into some of the more specific questions.
Gary: Oi oi.
Eric: Yes, I knew you were excited for this part. So historically people have talked about Google having a crawl budget. Is that a correct notion? Like, Google comes and they’re going to take 327 pages from your site today?
Gary: It’s not by page, like how many pages we want to crawl. We have a notion internally which we call host-load, and that’s basically how much load we think a site can handle from us. But it’s not based on a number of pages. It’s more like, what’s the limit? Or what’s the threshold after which the server becomes slower?
I think what you are talking about is actually scheduling. Basically, how many pages do we ask, from the indexing side, to be crawled by Googlebot? That’s driven mainly by the importance of the pages on a site, not by the number of URLs or how many URLs we want to crawl. It doesn’t have anything to do with host-load. It’s more like, if…this is just an example…but for example, if this URL is in a sitemap, then we will probably want to crawl it sooner or more often, because you deem that page more important by putting it in a sitemap.
We can also learn that this might not be true when sitemaps are automatically generated, like when for every single URL there is an entry in the sitemap. And then we’ll use other signals. For example, high PageRank URLs…and now I did want to say PageRank…should probably be crawled more often. And we have a bunch of other signals that we use that I will not name, but basically the more important the URL is, the more often it will be re-crawled.
And once we re-crawl a bucket of URLs, high-importance URLs, then we will just stop. We will probably not go further. Every single…I will say day, but it’s probably not a day…we create a bucket of URLs that we want to crawl from a site, and we fill that bucket with URLs sorted by the signals that we use for scheduling, which is sitemaps, PageRank, whatever. And then from the top, we start crawling and crawling. And if we can finish the bucket, fine. If we see that the server slowed down, then we will stop.
Eric: Right. But you do have URLs that you probably touch multiple times a day and others you touch every day and others that rotate and you might have URLs that get touched only once a month and others that get touched once a week or something like that.
Gary: I teach a class internally called Life of a Query for new Google engineers, which is about how search really works, and there I explain this with an example. Like, take the home page of CNN and then take the About page of CNN or Turner. The About page changes probably once a year, maybe, if it changes at all. The home page changes every minute, let’s say. You have to crawl the home page much, much more often than the About page, because users typically want fresh pages or fresh content in the search results, so you will just prioritize the crawling of the CNN home page over the About page.
Eric: Right. So is it correct to say then…and this is the reason why I went down this discussion path…that if I have pages that are duplicates, or that I’m not allowing to be indexed by one means or another, and Google is spending time on your site crawling those pages, then they’re spending less time crawling pages on your site that might get indexed?
Gary: Yes, definitely.
Eric: So the general concept, the reason why you need to master all these tags…there are many reasons, but one of them is so that Google spends more time on the pages you actually want them to index, the ones that help you make money. Basically. So, cool.
Alright, some eCommerce basics. What would you say is the best way to deal with a page that has a sort order for various products? So the default might be lowest to highest price, and the user switches to highest to lowest price; there’s really no difference in the products listed on the page, just the order.
Gary: I would probably try to find or to pick one sort order and rel=canonical the other one to that one.
Eric: I agree. You’re right. I thought I’d say that.
Gary: Thank you.
Eric: What’s the best way then to deal with a filter?
Gary: Same thing.
Eric: Okay, excellent. And then how about minor facet variations, such as having different pages for each color and each size of the particular shirt for sale?
Gary: I guess the same thing. Is that correct?
Eric: Yes, it is. Actually, the reason I always give that advice is that, in principle, rel=canonical and NoIndex will both take the page out of the index, but rel=canonical gives you more control over where the signals get passed to. Right?
Gary: Yes. Also just a reminder that rel=canonical is not a directive. It’s a strong suggestion and it’s only one of the few dozen signals that we use for canonicalization. It’s a strong signal. It’s one of the strongest signals, but it’s still just a signal.
Eric: So, when do you choose not to respect a rel=canonical?
Gary: Probably a combination of other signals or something like, if the target URL is roboted or not indexable in some way, then we would probably not pick that one. Actually, that happened a few days ago. I was debugging a case where a rel=canonical was pointing to an HTTPS URL or its HTTPS counterpart, but the HTTPS counterpart had certificate issues which would render the page for the user…how to say that?
Eric: Insecure?
Gary: No, inaccessible. In Chrome, at least. So it doesn’t make sense to respect rel=canonical in that case, for example.
Eric: And what about the idea of using something like Ajax to actually just eliminate the creation of new URLs for sort orders and filters? So you have my…
Gary: I answered this question a while ago. You actually published an article about this.
Eric: I did publish an article about this, but I’m just asking you for the purposes of the…
Gary: Okay. That would probably work and we would just default to the page that we see on page load time.
Eric: Okay. So I’m going to jump through some questions here so we can make sure we don’t delay the Q&A with the audience too long.
Gary: This smells like a trap.
Eric: It could be a trap, but it isn’t. I’ll design a trap for next time, though, because now I feel obligated to do so.
Gary: Oh, great. My big mouth.
Eric: It’s just something to look forward to. So we haven’t talked about NoFollow, but last I understood about NoFollow, basically it said don’t pass signals through this link, but any such signals that aren’t passed through the link are essentially discarded. It’s not redistributed to the rest of the page or all the other links out from the page. That a fair summary?
Gary: Yes, yes.
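(Note: a NoFollow can be set on an individual link with the rel=”nofollow” attribute, or page-wide with a meta robots tag. A minimal illustration with a placeholder URL:)

```html
<!-- Signals are not passed through this link, and they are not redistributed
     to the page's other links either. -->
<a href="https://www.example.com/some-page/" rel="nofollow">anchor text</a>
```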
Eric: Okay. So in that light, does it ever make sense to have a NoFollow from a page on your site to another page on your site? Just so you know, this is one of my golden rules with clients: never NoFollow links to internal pages. If you have a page that you don’t value, then NoFollow isn’t the answer. It’s probably canonical or NoIndex.
Gary: Yes, NoFollow is probably never the answer, especially on your own site. I can think of scenarios where the target page would be roboted for whatever reason. And if it’s roboted but not indexed yet, and you don’t want to get that page indexed, then you probably don’t want to point to it with anchors.
Eric: Okay. Right, so that way you take a roboted page and minimize its chances of showing in the index by passing it fewer signals.
Gary: Yes.
Eric: There we go. Alright, we have a counterpoint to my rule. Excellent.
Gary: But quite honestly, I can’t think of other reasons. You probably can do PageRank sculpting, but…
Eric: But it doesn’t work because you lose the PageRank.
Gary: Well, as far as I remember, we frown upon that thing. Or at least Matt was frowning upon that thing. So we probably don’t want to do that.
Eric: Right, but my understanding was, because you just throw away…since you used the word PageRank, I’ll use it this time, too…you throw away the PageRank. It doesn’t get reallocated through the other links.
Gary: Yes.
Eric: So you don’t gain anything.
Gary: Okay. Fair enough.
Eric: I love that answer. “Okay, fair enough,” he says, because he doesn’t want to complete the thought that’s running through his brain right now.
Gary: I can’t comment on any details of any operation without the explicit approval of the director.
Eric: There you go. Okay, so another scenario for you. When you have a large number of user-generated content pages, does it make sense to do something like the following…again, lots of UGC, quality uneven, and in this case you only want to allow the things deemed to be quality content to be indexable…so imagine this is on an article site, for example…I know you don’t like those in general.
But it’s a how-to site. I’m just using it as a reference point. Or it could be a lot of reviews, pages of reviews. The idea is, you place the unreviewed submissions in a folder blocked in robots.txt and then when they are deemed of quality, move them at that point to a place where they are crawlable and indexable. So the idea is you’re accepting lots of user-generated content and you’re using this system to make sure you’re only showing the good content to your audience. Or to Google’s audience.
Gary: Well, it would really depend on the quality of those comments or reviews or whatever, user interactions. Because I know several sites where I do actually enjoy finding just questions in the search results because for example, if I see that on that particular site, there are lots of questions about robots txt, then I can just jump in and answer a few of them at least, providing content or giving content to the site.
I think it depends on the quality of the comments or user content. If it’s mostly, “Buy Viagra in a Canadian casino without prescription,” then probably you don’t want to get those indexed.
Eric: Unless you’re trying to rank for that.
Gary: Yes. But whoa, that would be tough.
Eric: Oh, come on, there’s not much competition for that kind of search phrase.
Gary: No, none at all. Yes, it depends on the content. And of course, moderation. That’s also a really good thing to do in many UGC cases. Pick a few trusted members from your community if you don’t have time to moderate comments and user-generated content in general, and let them be the authority, I guess. I think Stack Overflow, for example, does this amazingly well. Or the Stack Exchange sites in general. If they can do it, I don’t see why others wouldn’t be able to do it.
Eric: Alright, great.
Gary: But back to the how-to sites. I would categorize Lifehacker as a how-to site, but I do actually love that site. That guy publishes awesome ideas about how to hack random stuff and I think users also love it. It has a large community formed around the site. I don’t think there is a huge problem with how-to sites. It’s more about how you do the how-to sites.
Eric: And how you keep the quality level really high.
Gary: Yes.
Eric: Absolutely. For those of you waiting for the Q&A, we’re going to start in just a couple of minutes. I have two more questions to cover with Gary and then we’re going to dive right in.
Gary: If you remember, you said the same thing like 10 minutes ago, as well. But anyway just go on with those two questions.
Eric: Why, thank you. I appreciate your permission. So this is what all our conversations are like, by the way, when I have you out here. Very dry, almost inane sometimes. But anyway. Okay, so here’s a theory related to a very large site, tens of millions of pages.
And remember, we’ve talked a little bit about conserving our signals, of which PageRank is only one. And then we’ve talked about some of the issues with having too many pages for Google to crawl, so that it’s not effectively discovering the pages that you want it to value more, and about helping Google find those more easily.
So the conceptual idea is, at the very high levels of the site, where the pages have more (stronger) signals, including PageRank, because they’re closer to the home page and may well have more external links, to focus on the facets at that level, potentially using canonical, for the reasons we’ve talked about before.
But keeping in mind that we’re now talking about a site with tens of millions of pages, maybe lower in the hierarchy, where you have lots of incremental facets, which we’re going to stipulate here are of real value to users, maybe we just block those in robots.txt and worry less about the other signals and more about getting Google’s crawl to focus on the pages that are more important.
Gary: So what’s the question? Sorry.
Eric: How did that sound as a theory? That at one level it may be more important to preserve the signals, but as you get further and further down in a very large site, maybe you’re better off conserving, well, crawl. You know, crawl bandwidth. Or does it not matter?
Gary: You have to get really, really, really big to have problems with this. I don’t think that it’s worth it to focus on this unless you…I don’t even know what scale to tell you, like probably Amazon size. If you are on that scale, then you probably want to think about these things but otherwise we have lots of disk, we have lots of bandwidth to burn.
Eric: Sure.
Gary: Yes, I probably wouldn’t worry about this.
Eric: Yeah, I’ll just give you an idea. We had one site that we worked on a few years back. Seven hundred million pages in the Google index, and they had a tagging structure that allowed tags to be applied randomly, even in orders and combinations that were completely nonsensical, and it would still build the page. And we did a clean-up where we greatly constrained the way the tags could interact with each other.
And on that particular site…it was really interesting, over the first six months after we did that, traffic actually didn’t go up. The number of indexed pages went way down. But revenue went up significantly because they were bringing people in for queries that were much better matched to the content of the pages.
Gary: Okay, that makes sense.
Eric: And then we had another one that had a very large faceted navigation. Over a hundred million pages as well, and a lot of thin content pages, with way too minor variations in their facets getting indexed, and we cleaned that one up. And within three months, a 50% lift in traffic. I’m just giving these two examples as illustrations for people watching that doing this really well is a big deal.
Gary: Right, but you are not talking about crawler traffic. You are talking about search.
Eric: I’m talking about the application of all the tagging and robots.txt that we’ve been addressing…
Gary: So the thing is that as soon as you have more focused pages, it’s natural that you will get higher quality traffic from search engines in general.
Eric: That’s right.
Gary: So from that point of view, yes, narrow it down as much as you can. Don’t create low-quality, no-value-add pages. It’s just not worth it because one thing is that we don’t necessarily want to index those pages. We think that it’s a waste of resources. The other thing is that you will just not get quality traffic. And if you don’t get quality traffic, then why are you burning resources on your server?
Eric: Yeah, exactly. Okay, so now I’m going to keep my promise. It’s Q&A time and our Q&A master…
Gary: Wasn’t that just one question?
Eric: Sorry, what?
Gary: You said you have two questions.
Eric: I lied.
Gary: Oh, okay.
Eric: Yes.
Mark: That was the trap. He sprung a trap.
Eric: But yes, with that, Mr. Mark Traphagen, who I know has been busily interacting with people, is going to start bringing us questions from the audience. Many of these, by the way, are more broad than the topic of today’s show, but we will not talk about that red thing behind Gary’s head there. So we won’t be able to get the magic date for when Penguin’s coming out. It’s, oh, wait, I’m not supposed to say that. So with that…
Mark: Almost spilled it there. Almost spilled it.
Eric: Indeed.
Mark: We have a freight-load of questions this time, I’d say three to four times as many as we had last time. First of all, I want to thank everybody for the live interactions. It’s been terrific, both on the YouTube channel and on Twitter. The hashtag is #VirtualKeynote, keep that coming. So that is the hashtag for the show, #VirtualKeynote, but we’ve got more than enough questions here. We’re not even going to get through a fraction of them here.
Eric: Yes, but what we will do…if it’s okay with you Gary, if we could do what we did last time where we take the questions we don’t answer live and Tweet them at you.
Gary: Yes, sure. That sounds great.
Eric: So if you don’t get your question answered live, we are going to try to get it answered by Gary, with a personal guarantee of a 20% increase in traffic associated with that as well. Sorry, that last part I made up.
Gary: Mark, do you have any questions from Jesse McDonald?
Mark: Yes, I do. You want to go straight to…
Gary: Can we start with that one?
Mark: Sure.
Gary: Thank you.
Mark: Let’s see, let me find it here. I don’t have it sorted by people. You know what I’m going to do is, he asked the question in the…are you saying that because you have a specific question you knew he was going to ask?
Gary: No, we had an interaction a few weeks ago on Facebook, and I feel bad about the interaction.
Mark: Let me go to a question that he asked in the live chat here on YouTube. Okay, says, “Have you guys seen any issues with setting NoIndex tags using Yoast, the Yoast WordPress plugin? We’ve seen instances where Google will ignore this and index the page anyway.”
Gary: With NoIndex, if the NoIndex is correctly put in the page, then we will not ignore that.
Mark: So it sounds like he should check…not just depend on the plugin to do it right…but after using the plugin, check the actual code of the page and see if the NoIndex tag is correctly placed.
Gary: Correct.
Eric: The other thing to point out here that’s really important is that Google won’t apply the NoIndex until the next time they crawl the page and see it. So there can be a delay and that might be what you’re seeing.
Gary: Yes, the other thing is that you also have to allow crawling the page. Otherwise, we will not see the NoIndex. If we can’t crawl the page because it’s blocked by robots.txt or whatever, then we will not see the NoIndex and we can’t apply it.
Eric: Right. A very common mistake people make is that when they block pages in robots.txt, Google doesn’t read the page. They actually respect that correctly.
Mark: Alright. So there you go, Jesse. Don’t say we never did anything nice for you.
Eric: Well, you can say that, but we won’t believe you.
Mark: Here’s a question from Stephanie Salmon at U.S. News. She says, “What is the best way to deal with search filtering for a page? For example, the page should be indexed, but there’s a filter option using Ajax to narrow the results. Should that page have a rel=canonical back to the original page?”
Gary: I would have to see the specific scenario, but I think we talked about this with Eric. Basically if you have an Ajax filter, then probably we most likely will index the default content that’s loaded when we crawl and render the page and ignore pretty much everything that happens after.
Eric: Well, if you’re using Ajax, and you’re doing it at least the way that I think it’s being done here, you don’t have a new URL. So there isn’t a new thing to index. You’ve executed the Ajax and you just repainted part of the content in the page. So Google may or may not be able to see that content if they choose to go through the process of loading it, but as I’ve said at other times along the way recently, when it’s that kind of action, they generally speaking ignore it. Is that fair, Gary?
Gary: Yes.
Mark: Right. This may be a similar question, so if it’s already been answered, just let me know but this is from David Pratt at Carfax. He says, “Why would Google index a faceted page if it has a canonical to the root URL without the facet?”
Gary: Because rel=canonical is just a suggestion.
Eric: I think we did cover that a little bit. There are other signals that Google uses to help with canonicalization, although rel=canonical, if I remember this right, is a very strong one. It must be that some other signal is overriding it.
Gary: Yes.
Mark: Okay, fair enough. Avinash Conda at Shutterfly wants to know, “How do you compete with sites which are aggressively acquiring links, not following best practices, even using shady links, and still winning in search?” In other words, how do you compete with black hat SEO when it’s still working?
Gary: Report it. I don’t remember the number, but I think around 60% of the spam reports are acted on. If you are convinced that someone is using black hat techniques to get better rankings, then report it. Use the spam report form.
Mark: What’s the best way for people to report that? How do they do that?
Gary: We have a spam report form. I think it’s being used inside search results and you can use that. The vast majority of the spam reports are acted on.
Mark: Okay, very good. This is from Jeff Angel at Graco. He says, “When hreflang tags are used in a sitemap but the sitemap does not have all of its URLs indexed by Google, the hreflang tags pointing to unindexed URLs show ‘no return tags’ errors because Google hasn’t indexed all the URLs in the sitemap. How do you recommend implementing hreflang tags in sitemaps when we are unable to control the number of indexed URLs in sitemaps?”
Gary: You would get the same errors if you had the hreflang tags in the page’s HTML as well. With a sitemap, you can ask for the discovery of the pages, but it’s not a guarantee that we will index those pages. If we haven’t indexed those pages, then you probably have to figure out why we haven’t indexed them. There are a buttload of reasons why we wouldn’t index pages. Typically, it boils down to the quality of the page itself and of its parents in the URL hierarchy.
So for example, if you have example.com/vases/flowers/orchids, then if we know for sure the flowers’ parent has high-quality pages, then we are more likely to give the benefit of doubt to a newly discovered page under that parent. But if we know that the flowers’ parent typically doesn’t have high-quality content, then we are more reluctant about indexing URLs from under that parent.
Does it make sense?
Eric: Yes.
Mark: Okay, Clint Borrill, Balsam Brands, asked, “How important is schema markup to SEO going forward?”
Gary: I would say very. We use…well, not just us, but other search engines, as well…use schema or structured data in general to better understand the content of the pages. And I think in the future, we will rely more and more on that. Not just Google, but other search engines as well. Why? Because it’s easier to understand what’s happening on the page, what the page is about if we have structured information in general.
Eric: Right. So it’s a reminder to people really…we all want to think that the job that Google does is fantastic, but we tend to think that Google is nearly omnipotent in its ability to understand what’s happening on websites. Schema, at the very least, is a strong confirming signal when it is in sync with the content of the page, which of course is a must. We need to remember that.
And then in certain cases, it results in enhanced markup in the search results, which is a bonus.
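(Note: one common way to add structured data is a JSON-LD block in the page. The sketch below is a generic product example with placeholder values; the point is that it restates, in machine-readable form, what the page content already says.)

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Running Shoe",
  "description": "Lightweight running shoe available in several colors and sizes.",
  "offers": {
    "@type": "Offer",
    "price": "79.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>
```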
Mark: Doc Sheldon wants to know, “Gary, when you Tweeted most meta and link tags outside the HTML head section are ignored, should we assume that includes link rels?”
Gary: Oh, link rels, definitely.
Mark: Okay, simple answer. I’m going to jump down to this one because I think this is one that I hear from a lot of people. I get asked it a lot. This is from Laura Crest, “Is the prevailing wisdom that webmasters should tag all outbound links as NoFollow?”
Gary: No.
Mark: No, right? “Does the common WP plugin that opens links in a new window accomplish that, according to Googlebot?”
Gary: I’m really angry at news sites in general, for example, for two reasons. Well, at some news sites for three reasons, but the two major reasons are that they don’t link back anymore to smaller publications or blogs or whatever, because they are probably afraid of linking back, which is stupid.
And the other thing is that they would NoFollow everything and my grandmother, for whatever reason. Probably still because they are afraid of getting hit by something.
The Internet is built upon links. Links are essential for the Internet. I was reading a news article last night, and there was…I think it was on, no, I don’t remember what news site it was…but they mentioned a certain site several times, and they only used the name of the site. They never linked out to it.
And after I searched for the site, I realized that there are two companies with the same name. One is in the UK, one is in the U.S. And then I had to figure out which site they were talking about, which is stupid, because I could have avoided the whole searching situation by just clicking a link in the article itself.
Mark: It’s a personal frustration for us. We’ve had stats from our studies cited by major news sites with…maybe they mention our name at most, but no link. And to me, that’s a disservice to the user because how can the user vet that our study is even correct or used good methodology if they can’t see it?
Gary: I think you would use NoFollow when you really, really don’t trust an article or an external resource. Or if you think that the external resource is just really shady or whatever. I don’t even know when else you would use it. If you are a news site and you’re linking out, then it’s unlikely you would get hit for that.
Eric: So here’s an article I wrote a while back, by the way, about this topic (https://www.linkedin.com/pulse/20140915164532-219045-stop-the-nofollow-madness), because it really does…it drives me nuts, too. I get angry about it, too, because the idea is, if I write an article…I write for Search Engine Land or Moz, for example…hopefully it’s because they trust me.
And so if that’s the case, then they should show that trust not just by printing my article, but by sharing a link which is not NoFollow. There’s no reason for it. If you don’t trust it, that’s the criterion I have in mind…the common places people shove links, like blog comments or user-generated content, or it’s a purchased link. An ad, sorry. Not a purchased link, but an ad.
Mark: Alright, this is from Jose Sotelo Cohen, who asks, “If I NoIndex,” he says “all my pages” but I think this would apply to any NoIndexing, “If I NoIndex pages, and they don’t show up on Google for a while, if I remove the NoIndex tag, will I be able to recover the positions in a short time or will it start from zero?”
Gary: It generally depends how much is that short time…
Eric: (whispers) Depends on the crawl rate.
Gary: Oh, I think someone just…
Mark: Somebody jumped in there. Who was that?
Gary: Oh my God.
Eric: You were going a little slow in the answer, so I just thought I’d fill something in.
Gary: So if you NoIndex…okay, can you repeat the question? Sorry. I over-complicated it in my mind.
Mark: Oh, he said, “If I NoIndex pages and they don’t show up for a while,” so Google honored the NoIndex, “I remove the tag, will I recover my positions in short order or…”
Gary: So it depends how much is that “for a short time.” If it’s just a few days, then it’s very likely that the signals are still lingering around and you will just regain pretty much everything that you had. If it’s weeks, months, then you will pretty much start from the bottom.
Mark: So you do lose the benefit over time. It depends.
Gary: Yeah.
Mark: Okay. That’s good to know.
Gary: I mean, the links pointing to the page will still be there. But there are other signals that don’t stick around if we see that the URL doesn’t exist anymore.
Mark: Alright, I’m going to ask a RankBrain question here just because it’ll give you a chance I think for many of the people listening to clarify some of the misconceptions that are out there, so just deal with it quickly. But Frank Macmillan asked, “Does RankBrain mean that links will someday be irrelevant and Google’s assessment of the page’s authority will be strictly based on its content as interpreted by Google?”
Gary: I think that’s a general misconception of RankBrain. RankBrain helps in a certain subset of queries and it helps us better understand which pages work better for certain queries. I don’t know how many examples we gave for this, but my favorite one is, “Can I win a game of Super Mario without consulting a walk-through?”
Without RankBrain, it would be very hard for us to understand the negation in the query…the “without” part…and to rank the appropriate pages for this or actually retrieve even the page that should rank first for this. RankBrain helps with that and that’s it. Whether the links will go away eventually, I don’t see that happening anytime soon. I don’t think it was ever mentioned in our launch meetings. I doubt that anyone would bring it up anytime soon.
Eric: My limited dabbling in machine learning, of which I have done some, tells me that if you were to implement an algorithm to improve the way you look at links or something like that, that would be a different machine learning algorithm with different training signals. And maybe I’m wrong about that, but the training algorithm that handled one kind of thing…in this case, improved language understanding and query processing…is not really going to resemble the training algorithm to understand some other aspect of sites.
Gary: Yes, I will not offer more than you did.
Mark: We’ve got only a couple minutes left here, so I’m going to group together several questions and just ask, can you tell us something about…
Gary: Isn’t that like cheating?
Eric: Yes, it is. That’s what Mark does, he cheats.
Mark: That’s my job.
Eric: He’s our black hat.
Mark: The closer we get to the end of the show, the more I cheat. So AMP, A-M-P, accelerated mobile pages. What can you tell us about how that’s going? Adoption, plans for the future? How’s that working out for Google?
Gary: So I’m less involved with AMP. John is focusing a lot on AMP. I will give a few keynotes on AMP in the next couple months, I guess. But I still have to ramp myself up on it because I just wasn’t paying too much attention to it. I was focused on other things. How is it going for us? I think it’s going excellent.
Of course, there is lots to do around AMP. I think the specification turned out to be excellent and the proof of concept works really great. When you find an AMP search result and you click on it, you will understand very fast why we are pushing for it. Basically the content loads in very fast.
Eric: So we did some tests on our end, by the way. There are AMP pages available for the Perficient Digital blog, and we went and measured things like the page size. The original page size, before the AMP version, was just under 100 kilobytes, and the AMP version was around 28 kilobytes. That explains right away why it loads faster. We also ran the pages through PageSpeed Insights, and we got a rather dismal 42 for the native page. Then, with the AMP page loading out of the Google cache, we got an 88.
Gary: The cache makes a huge difference for loading. If your servers are in the United States and the user is in Japan, the data has to cross a trans-Pacific cable. That already takes time. It’s a long distance, it’s really a long distance, and there are lots of network components in the way that will slow down the data transfer.
Just a tiny bit, but enough that you will notice it. You can see this right away if you try to…or for me, it’s very visible if I try to load a page from Australia, for example, that’s hosted in Australia. It will be freaking slow.
If it’s cached somewhere near you…and CDNs can do this similarly…then it will load much, much faster. One thing to remember about AMP is that currently, it’s only available…or in the search results anyway…it’s only available for news sites.
I’m not 100% sure about what our plan is on this. I did hear a few things, but I would not say it out loud until I confirm it with my superiors, I guess. Or the AMP people, at least. I think it’s doing great. Stay tuned. There will be more announcements. And definitely watch I/O, not just because of AMP, but other things, as well.
Mark: I just want to emphasize to people that we have like 30 or 40 more questions. We will attempt to get to those and answer them for you personally. It might take a little time because we have so many, but we do thank everybody that submitted a question for today’s show.
Eric: Absolutely. And we’ll start working on those later today and tomorrow. Gary, you can expect to start seeing some Tweets tomorrow. So, unfortunately, we are at the end of our time.
Thanks, everybody. I am going to stop the broadcast now and have a great day.
Great post on “SEO Tags with Gary Illyes and Eric Enge,” of course. I appreciate the way it is explained. Thank you!
Hey Eric & Mark,
Love the interviews you’ve been doing with Gary (the man with the unpronounceable last name).
And this interview/post is quite timely as I’ve been trying to untangle use of nofollow attribute on internal links, and/or meta robots noindex, and/or rel canonical usage.
I essentially have 2 questions, and here’s the context and details:
I work with some extremely large and reputable sites with technical SEO messes that need correct solutions to duplicates, proper canonicalization, as well as properly handling the issue of crawl budget because these sites are huge.
Yes, you’ll note that I’ve not included robots.txt Disallow directives in my “menu” above because, in the real-world cases I’m referring to, it’s page-level tags that need to be right. We’ve got the robots.txt file(s) properly tuned to control crawl and restrict bots from specific areas they should stay out of, and for the real-world scenario(s) that prompt my questions, robots.txt (even advanced wildcard matching) is simply not a solution to fix the technical issues (and preserve crawl budget, and do proper canonicalization).
First issue: focusing on using the rel canonical tag in conjunction with the meta robots noindex tag on the same page, I asked John Mueller about this recently because Adam Audette wrote a post some time ago noting that those 2 tags are in conflict with each other (but Adam did not specify how or give examples).
An example I’m thinking of and see in the real world is where there’s a dupe page we don’t want in Google’s index, but we do want to use rel canonical to send the signals/PageRank of that dupe page to the canonical page. Assume that we can’t make the dupe(s) go away “at the root cause” and we’re stuck with them, hence the need for using tags instead.
John Mueller said to me (via Twitter) exactly this: “The rel=canonical says the 2 pages are equivalent. If one is noindex, & the other indexable, they’re very “unequivalent”.”
While waiting for John Mueller to reply, I asked essentially the same question via Twitter of Ian Lurie, Adam Audette, Rand Fishkin, and Scott Hendison, and the general consensus (to paraphrase) was “noindex probably trumps the rel canonical” (and also “worth testing” – but please kindly set aside testing for the moment; for me there are constraints preventing this).
The second issue / question is the use of the nofollow attribute on internal links (yes, I know this breaks Eric’s golden rule and Gary said there’s no reason to do so) in this case strictly for the purpose of controlling crawl (again, not via robots.txt, but page-level, and completely setting aside “PageRank sculpting”, I completely understand what Matt Cutts & Gary Illyes have said on the matter).
The Google Support page on this says at the very start:
“”Nofollow” provides a way for webmasters to tell search engines “Don’t follow links on this page” or “Don’t follow this specific link.”
That would seem to say that you can control crawl via nofollow attribute on internal hyperlinks, and I want to know if that’s the case or not – here’s more:
Later in the same support doc they of course say:
“How does Google handle nofollowed links? In general, we don’t follow them. This means that Google does not transfer PageRank or anchor text across these links.”
But it’s the “we don’t follow them” part that seems to again say that you can control crawl via nofollow attribute on internal hyperlinks.
Later in the same support doc they say:
“What are Google’s policies and some specific examples of nofollow usage?”
“Crawl prioritization: Search engine robots can’t sign in or register as a member on your forum, so there’s no reason to invite Googlebot to follow “register here” or “sign in” links. Using nofollow on these links enables Googlebot to crawl other pages you’d prefer to see in Google’s index. However, a solid information architecture — intuitive navigation, user- and search-engine-friendly URLs, and so on — is likely to be a far more productive use of resources than focusing on crawl prioritization via nofollowed links.”
The “enables Googlebot to crawl other pages you’d prefer to see in the index” further seems to support the idea that you can control crawl via nofollow attribute on internal hyperlinks.
So my 2 questions boil down to:
1. If you have a dupe page (or pages) and place the meta robots noindex in the head AND also rel canonical to the canonical version of the page, what’s the deal? Is John Mueller right when he says that with the rel canonical you’re saying the pages are equivalent, but if one page has noindex and one is indexable then those 2 pages are not equivalent (and by extension, you should NOT use those 2 tags on the same page)?
2. If you can’t use other solutions (don’t ask why, just go with me on this) and want to preserve crawl budget by not having Googlebot follow links & crawl pages you don’t want crawled (and to repeat, in this context, robots.txt or other solutions are just not feasible), then (setting aside PageRank loss), why can’t you just put the nofollow attribute on internal hyperlinks (if indeed Googlebot will NOT follow the nofollow as the support doc seems to say in several places) to preserve crawl budget?
Eric – in your video intro you say “SEOs need to know this stuff cold” and I could not agree more. I love technical SEO and don’t have solid, clear answers on all of this and would really appreciate it if you could help me get to the bottom of all of this / my burning questions.
Thanks guys!
David
Hi David,
Here’s my take on it …
Rel canonical should remove the page from the index, so no need to consider using a NoIndex tag there. NoIndex is very inefficient, as PageRank is passed through the page, but spread through all the links found on the page. In contrast, rel canonical says to send all the PageRank to my preferred target page. The chances that Google will ignore a rel=canonical are small in my experience, so this seems like it’s always the best course for things like sort orders and filters.
Important note: Gary specifically debunked the notion that rel=canonical needs to point to a page that is a dupe or strict super-set of the page with the canonical tag on it. He pretty explicitly said that you can point that tag at any target page you want. Now, I’m not sure that I’m comfortable with that extreme a view.
But to your specific question above, I would never place a rel=canonical and a NoIndex tag on the same page. The resulting behavior by Google is unpredictable at best.
As for the NoFollow issue, one important thing we confirmed in the Virtual Keynote is that crawl rates for pages with a rel=canonical tag, or a NoIndex tag, decline significantly over time. If you are concerned with time spent crawling, then those two tags should already help you, so no need to go with a NoFollow.
Keep in mind that the NoFollow tag cuts off the link juice and throws it away. Better to send that link juice through to a page that isn’t in the index (because of a rel=canonical), but that can then direct it back to somewhere else on the site.
Cheers,
Eric
Hi Eric,
Great interview. Just wondered what your thoughts were when Gary hesitated over PageRank sculpting. It seemed to me like he was suggesting this could actually work, which took you (and me!) by surprise.
I was also hoping that Gary would go into more detail about the crawl frequency of pages containing canonical tags.
The reason being that a duplicate URL could feasibly continue to gain links and authority even after the preferred page has been signalled to Google, but this new value would not be passed until the page and its canonical tag are re-crawled.
This could effectively render the increased value of this page as worthless for months.
What do you think?
In regards to rel=canonical for eCommerce, does it make sense to rel=canonical content pages (blog) about products or categories to the associated shopping pages?
I don’t think so. You want those content pages to show up for search queries that your eCommerce pages wouldn’t show up for.