In this study, we set out to see just how well some of the world’s top eCommerce sites use SEO tags and robots.txt to manage their faceted navigation. The results I am reporting here today will show you how often these sites get it right, and in some cases, just how horribly wrong they get it.
TL;DR
I’ll give it to you straight out – it’s bad out there. Only 38% of the URLs we looked at across 20 highly prominent e-tail sites were using SEO tags in what we consider an optimal manner. Even worse, 23% of the URLs we looked at were using overtly conflicting tags.
(Jump to the bottom of the page to watch a video summary of this study!)
What is Faceted Navigation?
There are three types of navigation that are typically considered faceted navigation:
- Sort orders: for example, showing products from highest to lowest price vs. from lowest to highest price.
- Filtered navigation: for example, showing only products that are under $100, or only products that are red.
- Pagination: for example, when you have 100 products and show only 10 per page, so the products get split across 10 pages.
The reason we want to use SEO tags on these pages is that they can easily be seen by search engines as thin, poor quality, or duplicate content. The correct SEO tagging strategy will instruct search engines on how to view the various facets, and reduce the chances of that becoming a problem.
Best Use of SEO Tags
The optimal way to use these tags is quite simple, and is as follows:

| Facet type | Expected tagging |
| --- | --- |
| Sort orders | rel=canonical pointing to the default (unsorted) version of the page |
| Filtered navigation | rel=canonical pointing to the parent category page |
| Pagination | rel=prev/next tags linking the pages in the series |
Key to all of these tagging schemes is that no other tagging should be done. That means no NoIndex, NoFollow, or Disallowing in robots.txt. These will only confuse the situation and potentially break your SEO for these pages altogether. Use ONLY the tagging outlined in the above table.
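To make that concrete, here is a minimal sketch of what the tagging from the table might look like in the head of the faceted pages. The store, URLs, and parameter names are hypothetical, not taken from any of the sites in the study:

```html
<!-- Hypothetical filtered or sorted page: https://www.example.com/jackets?color=black&sort=price_asc -->
<!-- A single rel=canonical pointing back to the parent category page; no noindex, no nofollow,
     and no robots.txt Disallow for this URL -->
<link rel="canonical" href="https://www.example.com/jackets" />

<!-- Hypothetical paginated page: https://www.example.com/jackets?page=2 -->
<!-- rel=prev/next tags tie the paginated series together -->
<link rel="prev" href="https://www.example.com/jackets?page=1" />
<link rel="next" href="https://www.example.com/jackets?page=3" />
```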
Categories Examined
Different eCommerce sites have different types of product attributes in need of faceted navigation. The attribute categories we examined are all pretty common ones, and we looked for them on all 20 sites included in this study. Not every category applied to every site; for example, a site might not have had a “Size” filter.
The Gross and the Stupid
Two of the sites we looked at used rel=canonical on their sorts and filters, but pointed the rel=canonical to their home page. This violates a basic concept of what a rel=canonical is supposed to do, which is to point to a page that contains a substantial portion of the content of the page carrying the canonical. Here is exactly what Google says about it: “A large portion of the duplicate page’s content should be present on the canonical version.”
The normal response of Google to seeing something like this is to ignore the canonical tag altogether, and it appears that Google has done so in the case of the two sites we saw with this problem.
The proper implementation is to point the canonical to a page that contains a superset of the page’s information. That is usually a nearby parent category page. You can learn more about how to implement a rel=canonical here.
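As a sketch, with made-up URLs, the difference between the broken implementation and the proper one looks like this:

```html
<!-- On a hypothetical filtered page: https://www.example.com/shoes?color=red -->

<!-- What we saw: canonical pointing at the home page, which shares little content with this page -->
<link rel="canonical" href="https://www.example.com/" />

<!-- Proper implementation: canonical pointing at the parent category page,
     which contains a superset of this page's content -->
<link rel="canonical" href="https://www.example.com/shoes" />
```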
And, just for fun, there is one other scenario we saw worth mentioning in this section. We saw one site that implemented rel=prev/next tags on pages that had no pagination!
Other Examples of Conflicting Tag Scenarios
Of course, implementing a rel=canonical to the home page of the site is not the only example of problems we saw. Other scenarios included pairing a rel=canonical with a NoFollow, and issues involving hashbang URLs.
Some of you may question why I saw the use of a canonical together with a NoFollow as a problem. Put simply, the NoFollow gives search engines instructions on how to handle PageRank flow from the page, essentially telling them to block all PageRank flow out of the page. Yet the rel=canonical says to pass all link value to the page that the canonical tag points to – hence the conflict.
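Here is a sketch of that conflicting combination on a hypothetical sort-order page (URLs invented for the example):

```html
<!-- Hypothetical sort-order page: https://www.example.com/shoes?sort=price_desc -->

<!-- The canonical asks the search engine to consolidate this page's link value into /shoes... -->
<link rel="canonical" href="https://www.example.com/shoes" />

<!-- ...while the page-level NoFollow tells it not to pass value through any of this page's links -->
<meta name="robots" content="nofollow" />
```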
Some of you may also be wondering what a “hashbang URL” is. Basically, it’s part of a method for allowing Google to crawl AJAX pages that was supported by Google for a number of years. This method has now officially been deprecated by Google, meaning that they no longer recommend it.
It’s likely that Google is deprecating this protocol because they don’t plan to index any content on a page that requires a user to click on something to expose it. You can read more about that here.
Detailed Data
So let’s take a closer look at the details! Note that we won’t be outing anyone for their mistaken SEO practices here. With that settled, here is how it broke out when we looked at what percentage of the SEO tags were conflicting, based on the criteria outlined above:
Ouch! 23% of the time, the sites in question provide two or more conflicting instructions to search engines on the target web pages. What’s a poor search engine to do? From the publisher’s perspective, the reason this is bad is that it leaves it up to the search engine to figure it out. And, as a result, there is a chance that the search engine will get it wrong – and the whole point of this exercise is to make it easier for them in the first place.
Next up, let’s look at sub-optimal use of tags. In addition to the conflicting tag scenarios, this includes pages where publishers implemented no tags, a NoIndex instead of a rel=canonical, and similar situations:
Double Ouch! Optimal use of tagging was done on these sites only 38% of the time. And, this is on some of the top eCommerce sites on the planet. That’s pretty frightening.
Important footnote here – I did count implementing NoIndex on sort orders and filters as an optimal solution. It does solve the problem of pulling those pages out of the index. I still consider the rel=canonical the superior solution (as outlined in the above table on expected tagging behavior) because it passes all its PageRank back to the target page, whereas NoIndex simply passes PageRank through all the links on the page, and that’s quite inefficient.
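For comparison, here is what the two approaches look like on a hypothetical sort-order page; both keep the page out of the index, but only the canonical consolidates its link value into the main category page (URL invented for the example):

```html
<!-- Acceptable: removes the sorted page from the index, but its PageRank only flows out through its links -->
<meta name="robots" content="noindex" />

<!-- Preferred: consolidates indexing and link signals into the main category page -->
<link rel="canonical" href="https://www.example.com/widgets" />
```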
Summary
The bottom line is that slightly less than two-thirds of the scenarios we examined were implemented incorrectly. Clearly, people are confused about how these tags work and how to use them. We saw similar things in a study we published about a year ago on the use of rel=author tags.
The reality is that figuring out how to use them is not hard (see the table above), but most developers appear not to be taking the time to learn their proper use. Put another way, getting this right is simply not getting enough priority from these major eCommerce sites.
It’s very important that you do take the time to get this right. Google and the other search engines defined these tags for a reason: to make sure they handle your pages properly. The very fact that the tags exist means there is a possibility they will get it wrong on their own. The last thing you want to do is make it even harder on them by misusing the tags.
Eric, another really common issue in eCommerce is a product having multiple URLs by virtue of being assigned to multiple categories. Magento is notorious for this issue.
You bring up layered navigation, a really common pain-point of mine when creating the IA for an eCommerce site. Is there anything I can do to help capture the long-tail of the layered navigation any better?
For instance, take a search term ‘Black leather jackets.’ I’m constantly struggling with the trade off of letting layered navigation in a leather jackets category handle it or creating a separate, static subcategory for black leather jackets. The creation of mid to long tail static pages like that is a slippery slope, any suggestions on how to help optimize for that long tail search with layered nav?
Definitely a complex issue! One way to handle that is to make sure that the leather jackets page has a user-visible list of the available colors, so that the page has a chance of ranking for the color-related phrases, and therefore you don’t need to have a separate black jackets page at all. If you go down the path of creating the page, as you know, you run the risk of creating too many pages. Unfortunately, no magic bullet for you there.
Thanks, I appreciate the feedback.
Great advice.
One other eCommerce SEO element I struggle with is whether to show “people who bought this also bought these things.”
Although better for sales, engagement, and overall user experience, I worry that it tends to make the page less *about* the thing itself – potentially affecting rankings and traffic, and maybe even making it a little more Panda-risky.
Have you guys ever studied that?
I have noticed Google is showing pages in the SERPs that contain multiple colors, even when the search is for a particular color. Try searching on “black lace trim” and check out the top 6-7 results. So if your version of “black lace trim” shows a less rich page that could be considered thin – definitely go for one that shows more and gives a broader choice to the user.
I am not using rel=canonical on any of the pages of my website – does that have an impact on SEO? My website has more than 20 pages. Please share your thoughts, Eric Enge.
This was really an interesting topic and I kinda agree with what you have mentioned here!
I am linking this comment to our issue tracker. Seems our stable theme release plus the new one we’re working on did not include a link in the header to take care of canonical links. Faceted navigation is currently available for product and category pages using feature sets, attributes and price.
Eric,
I had the same thought about the conflict between canonical and nofollow – so I asked John Mueller when I was looking into this issue previously.
He replied with:
“The nofollow doesn’t affect the rel=canonical (the nofollow is for links on the page, the canonical is a general page-level signal that we should focus on the other page, it’s not like a link).”
I did ask this over a year ago now, but unless something has changed, would this suggest there is no conflict?
You only need it if you have pages that are duplicates that you don’t want Google to index.
Eric, thanks for this great post and video. Very helpful and informative!
I’m not sure if you have talked about this in another post or not, but I will ask it here.
I’m currently building an eCommerce site and found that the SEO tags are in place; however, I have a problem with thin content.
Google will index the product pages, but will not rank them because of poor content. Some products end up with the same or very similar descriptions because the products themselves really are very similar.
Do you have any recommendations on how to handle these pages? Should I apply the noindex tag to products with the same description and optimize only one product page? What is your opinion?
Hi David – a bit of devil in the details here, but the current situation does not sound like a great one. As you suggested, I would consider NoIndex-ing some of those near duplicate pages. That said, if we have something like this scenario:
- widgets
  - blue widgets
  - red widgets
I’d rel=canonical the blue and red widget pages to the main widgets page, as that would be a better solution for that scenario.
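In that scenario, a minimal sketch of the tag on both color pages (with hypothetical URLs) would be:

```html
<!-- On https://www.example.com/widgets/blue and https://www.example.com/widgets/red -->
<link rel="canonical" href="https://www.example.com/widgets" />
```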
This article gives complete information on the best use of SEO tags on eCommerce sites. Faceted pages can easily be seen by search engines as poor quality or duplicate content.
Hi, Eric. I’m a bit confused by your comment regarding “hashbang URLs”. I developed my site using Wix, which implements this type of methodology. No content is hidden from users when visiting the pages, nor does it require a click or action to view.
My site (created in March 2016) is being indexed by Google, according to the results I see when searching for: “site:mke-eCommerce.com”.
According to Google Webmaster Central Blog: “we’ll generally crawl, render, and index the #! URLs.”
I was looking for some clarification or to get your thoughts on this. Thanks!
One problem is that the internal SEOs know this but can’t get the buy-in from other internal stakeholders. Or can’t justify the IT hours. Or have a project in place but there’s a 3 month code freeze. I can tell you I work at one of these big places and the best I can do is a possible March fix (now early December) – and have been working on it since September.
Ajax sites and pages will be indexed just fine, but Google won’t cache the HTML. You’ll need a Prerender solution or to get off of Wix.
Canonical to base PDP URL, make sure those other pages aren’t in the sitemaps, and generate some content via product reviews.
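As a minimal sketch of that approach, for a product reachable under multiple category paths (all URLs hypothetical):

```html
<!-- Duplicate paths such as https://www.example.com/jackets/leather-jacket
     and https://www.example.com/sale/leather-jacket both carry a canonical
     pointing at the base product detail page (PDP) URL -->
<link rel="canonical" href="https://www.example.com/leather-jacket" />
```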
A canonical is a strong suggestion – robots are directives.
No risk for Panda as long as those links and products are legit. This is a great user experience and unless somehow used to circumvent Google, there should be no concern with related products and personalization.
DJ – I removed the hostile aspects of your comment, because I agree with you, there are many reasons why organizations have a problem with basic SEO. We work with more than a dozen large enterprises at any given time, and the reasons why they can’t get these things fixed are many and varied. Just because the organization collectively does a bad job with SEO doesn’t mean that there is no one in the organization who knows better (such as their in-house SEO).
PS – if you object to the edits, LMK, and I’ll remove the comment altogether and add notes based on your observations directly to the article, because your observations are worth including.
Hi Eric,
Great article and not surprised at all. In your view, what are the best CMS/enterprise CMS platforms in terms of avoiding these SEO issues? It seems that most of them are not very oriented towards being SEO-friendly.
Hi Eric,
A couple years late here, but I just moved in-house with a distribution company that built its platform on ROCcommerce almost 2 years ago.
Besides a huge issue where all the facets link to http and then 302 to the https version (this will be fixed soon, believe you me), I was curious about your thoughts on crawl budget, relative to this post’s topic.
Doing a crawl, there are over 4M URLs from the faceted URLs alone on the PL (product listing) pages of the site, not including the redirects (which would double the number of URLs found).
All the facet pages canonicalize back to the root sub-category (e.g. domain.com/category/sub-category/).
Given the fact that there’s a ridiculous amount of opportunity to go down rabbit holes with this, my major concern is that crawl budget is being largely wasted on these parameterized pages and that we’re guilty of copious amounts of thin pages (as none of the PC/PL pages have any content on them besides global elements, faceted search and product grid list).
I started to analyze log files from the past few months and, upon a cursory analysis, have found over 75% of URLs crawled are http & https versions of the parameterized URLs (e.g. http 302 -> https 200).
Even though all of the https are canonicalized back to the main sub-cat URL, I don’t believe the parameters provide any unique value to bots, and therefore we have a ton of waste on our hands.
My initial thought process was that the canonicals have been in place since day one, so that shouldn’t be an issue (“that” meaning GG should have seen them before and understand what our intentions are with them), but the next step would be to block via robots.txt.
After reading through your thoughts here, I’m a little torn on that approach.
An alternative approach I’ve been considering is setting parameter controls via GSC to dictate how each attribute parameter should be handled, in addition to the current canonicalization.
My thought was to set the parameters to “No URLs”, since all the content which would be found via narrowing by facet is accessible via the main PC/PL page.
In any event, I’d love to hear your thoughts on this if you have any and would be willing to share them.
Thanks in advance and happy holidays!
Hey Kyle – Interesting data. As I understand it, Google will typically crawl pages less often when it sees either a rel=canonical to a different page or a NoIndex on them. But you’re seeing that 75% of the crawl is of those pages. On the other hand, if you were to count all of your pages that are not facets, how many would be left? If it’s only 50,000 or fewer pages, then I probably would leave it as is.
The other thing you could check is … during a 30 day period, what percentage of your PC/PL URLs are getting crawled? If that’s a large percentage of all your PC/PL pages, I’d definitely not touch this.
In general, I like canonicals because they redirect all the PageRank that those faceted pages receive back to the parent PC/PL page. If you block these in robots.txt, then this PageRank is lost.
Hey Eric,
Thanks so much for getting back, and on Christmas Eve no less – very much appreciated!
Your thought about GG crawling less with canonicals was my original thought as well, but then I got my hands on the log files and found out that it was spending most of its time on the parameter pages.
To answer your question about page count, the site itself is around 12,000 URLs all in, with static, PC/PL, and product detail pages. There are approximately 6,600 URLs indexed, found using a site: search in Google. This matches up with what is shown in GSC.
Doing crawls, however, I’m over 7M and counting.
In any event, I greatly appreciate your feedback here – it validated my initial thoughts. That said, I have a gut feeling there are still issues to be resolved relative to this matter, so I’m going to be running some tests on how things are set up (robots.txt, GSC parameters, etc.) and see what the outcomes of those tests are. I’ll let you know what I find.
Thanks again and happy holidays!
Cheers
Kyle