
Controlling Google’s Indexation of Your Site: Quality > Quantity

Have you ever wanted to see your site through Google’s eyes? There are ways to monitor your site’s indexation, identify and remove thin and duplicate content from Google’s index, and keep Google from crawling and indexing poor-quality pages in the future.
Our recent blog post regarding what to do when you cannot find your page in the index was a great overview of the tools available to encourage search engines to crawl and index your site. One underlying problem that can cause a delay in search engines’ detection of your new pages is allowing them to crawl and index every URL your site creates. Content management systems can easily spin off thousands or millions of URLs, which can flood the search engines with poor-quality pages if you do not stop them.
Besides commonly understood issues with diluted link equity and internal competition, having a large number of thin or duplicate pages in the index will ensure Google tires of your site quickly. This problem is referred to as “index bloat.” You want Google to focus on the content and pages that you find important and not waste time on pages that are empty, or pages that have no value to a user coming from search results. To make the best use of your “crawl budget,” you should only let Google crawl and index pages that serve a singular purpose and provide a good user experience.
Google offers many tools and guidelines to aid webmasters in controlling the crawling and indexation of their sites, which help to optimize Google’s view of your website. The sections below go in-depth into what type of content should be removed from the index, why having fewer pages crawled and indexed can be a good thing, and how to remove poor-quality content from Google’s index.

What Counts as Thin or Duplicate Content?

Google defines duplicate content as “substantive blocks of content within or across domains that either completely match other content or are appreciably similar.”  This can be as simple as when two or more pages of a website have a very similar URL or when two or more pages have a substantially similar product selection. If there is no unique content present to differentiate a page with a similar URL from the page you want to rank, it is likely considered duplicate. You can see how sites with faceted navigation might have trouble getting Google to understand which page they want to rank.
After understanding duplicate content, let’s discuss what makes content “thin.” Obvious examples of “thin” pages are those with little-to-no content, such as empty category pages, or broken pages that still serve 200 status codes. However, the definition of “thin” that the industry has gleaned from the Panda algorithm’s multiple iterations is broader: any low-quality content that does not answer a user’s query and should not be crawled or indexed. These thin pages can be identified in part through on-page user metrics, but it really starts with asking the following questions before a site is launched or during an audit of a new site:

  • Is this page helpful for a user that has never been to my site before?
  • Does this page add unique content to my site?
  • Is this page relevant to a specific set of keywords or keyword niche?

How to Identify Thin or Duplicate Content

In order to perform the following analysis, I encourage you to review Dr. Pete’s “25 Killer Combos for Google’s Site:Operator” article; I give him full credit for my ability to share these tips with you. Site:operator searches allow you to see what Google has indexed in its primary or secondary indices and effectively search only your site with Google’s algorithm to see which pages are considered most relevant. (A short sketch after the list below shows how to assemble these queries.)

  • Thin Product Pages – If you have an eCommerce site, go to a product listing page and find the verbiage that tells the user how many products are in that category. Then do a site:operator search with that verbiage, but with zero products. For example, if a product listing page says, “Found 206 products,” then the site search would be site:domain.com “Found 0 products.”
  • Search Result Pages – It is very common for on-site search pages to be indexed unless there were safeguards implemented to prevent this from happening. More often than not, these pages are both thin, due to having no optimized content, and duplicates of existing categories. To find out if your search result pages are indexed, simply do a search on your site and note the URL you land upon. For example, out-of-the-box Magento internal search result pages follow the /catalogsearch/result/ URL path. Perform a site:operator search for your search results page to determine if your site’s search pages are being indexed by Google.
  • Parameterized URLs – Does your site use a large number of filters or parameters to let users sort, narrow, or filter content? If your site has canonical tags to consolidate these filters, the parameterized URLs should not be indexed; however, that is not always the case. To determine if your parameterized URLs are indexed, go to Google Search Console and click on Crawl > URL Parameters to discover what parameters Google is monitoring on your site. From there, you can perform a site:operator search with each of those parameters to determine if those URLs are indexed.
  • Duplicate or Thin Subdomains – Do you know if your primary subdomains are the only ones indexed in Google? To find out, perform the following search to see what other subdomains are indexed besides the “www” version of your domain: site:domain.com -site:www.domain.com. From there, you can review the subdomains present and determine whether they should be indexed. The following are examples of subdomains that typically should not be indexed:
    • Non-www domain when the www is the primary.
    • Developer or staging site where changes are made and reviewed, and then pushed to the live site.
    • Mobile subdomains in desktop search results.
    • Vendor-specific subdomains.
    • Secure sites.

  • Duplicate Content Blocks – Analyzing your site for duplicate content blocks can be done from both an internal perspective and an external perspective.
    • For an external review, take a sentence or two of content from a page, put quotes around it, and search for the content in Google. This will tell you what other pages have that exact string of characters. The sort order indicates which page or domain is seen as the owner of the content. If your site comes up after another site, that indicates that Google does not see your site as the owner.
    • For internal review, simply use the site:operator and search the snippet of content. This will show you all the pages where that exact content appears on your site.
  • Semantic or Topical Duplicates – This analysis will tell you which page Google sees as the best match for a certain keyword or topic on your site. Simply add the keyword after your site:operator and observe the pages that come up in the results.
    • If your target page comes up first, that indicates Google agrees that is the most relevant page.
    • If your target page comes up second or further down the first page, this indicates Google does not see your target page as the most relevant page.
    • If your target page does not come up on the first page at all, this indicates that Google does not consider it relevant for the keyword.
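
The checks above are easy to run by hand; for repeat audits, a small script can assemble the queries for you. Below is a minimal sketch in Python, using a hypothetical example.com domain and the example phrasings and URL paths from the list above (swap in your own site’s verbiage and paths):

    # Minimal sketch: build the site: operator queries described above.
    # "example.com" and the query phrasings are placeholders; adjust to your site.
    domain = "example.com"

    queries = {
        "Thin product pages": f'site:{domain} "Found 0 products"',
        "Internal search result pages": f'site:{domain} inurl:catalogsearch/result',
        "Non-www subdomains": f'site:{domain} -site:www.{domain}',
    }

    for check, query in queries.items():
        print(f"{check}: {query}")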

Tools to Control Indexation

In addition to implementing canonical tags or 301 redirects where applicable, the following technical tools are your best bet for de-indexing poor-quality pages and preventing them from being indexed in the future:
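
As a quick reference, a canonical tag on a parameterized URL simply points back to the clean version of the page. The sketch below is a minimal illustration, assuming a hypothetical parameterized URL and that the parameter-free path is the version you want indexed:

    # Minimal sketch: emit a rel=canonical tag that points a parameterized URL
    # back to its clean, parameter-free version. The URL below is hypothetical.
    from urllib.parse import urlsplit, urlunsplit

    def canonical_link(url):
        parts = urlsplit(url)
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        return f'<link rel="canonical" href="{clean}">'

    print(canonical_link("https://www.example.com/shoes?color=red&sort=price"))
    # <link rel="canonical" href="https://www.example.com/shoes">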

Meta Robots “NOINDEX, FOLLOW”

The most reliable way to keep HTML pages out of the index is to add a meta robots “NOINDEX, FOLLOW” directive. Meta robots could be utilized to remedy all of the thin and duplicate content issues mentioned in the previous section. These tags can be added dynamically with the help of your dev team and a set of rules about when and where they should be added.
Note: Caution should be taken when implementing these tags on a grand scale to ensure valuable pages are not removed from the index.
Once you have determined what types of pages receive traffic on your site, what types of pages have the potential to receive traffic, and what types of pages have no value to the search engines, you are ready to determine your rules. For sites with faceted navigation, you might want to consider de-indexing every page with five or more filters applied, or every page with a price range filter applied in combination with another filter (assuming you have already determined these pages do not receive substantial traffic).
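As a rough illustration of that kind of rule (the filter names and thresholds here are hypothetical, and any real rules should be driven by your own traffic data), the template logic might look something like this:

    # Minimal sketch of a faceted-navigation rule: noindex any page with five or
    # more filters applied, or a price filter combined with any other filter.
    # Filter names and thresholds are placeholders; base real rules on traffic data.
    def meta_robots_tag(active_filters):
        too_many_filters = len(active_filters) >= 5
        price_plus_another = "price" in active_filters and len(active_filters) >= 2
        if too_many_filters or price_plus_another:
            return '<meta name="robots" content="noindex, follow">'
        return '<meta name="robots" content="index, follow">'

    print(meta_robots_tag(["brand", "price"]))  # noindex, follow
    print(meta_robots_tag(["brand"]))           # index, follow
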
After implementing “NOINDEX, FOLLOW” tags on a site with faceted navigation, we were able to reduce the number of pages indexed from over 1.8 million to just over 200,000 over the course of 15 months. Subsequently, this led to what I believe was a recovery from a Panda penalty in May 2014. See the chart below. The red line indicates the month the noindex tags were implemented.

Note: In order for the “NOINDEX, FOLLOW” tag to be recognized, the pages you are attempting to de-index must not be blocked by your robots.txt file; if crawlers cannot fetch the page, they will never see the tag.
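
One quick way to sanity-check this is Python’s built-in robots.txt parser: if the page is disallowed, Googlebot will never fetch it and the noindex tag will go unseen. The URLs below are hypothetical:

    # Minimal sketch: confirm a page you want de-indexed is NOT blocked by
    # robots.txt, so crawlers can still fetch it and see the noindex tag.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    url = "https://www.example.com/catalogsearch/result/?q=shoes"
    if rp.can_fetch("Googlebot", url):
        print("Crawlable: Googlebot can fetch the page and see the noindex tag.")
    else:
        print("Blocked by robots.txt: the noindex tag will never be seen.")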

URL Parameters Tool

As mentioned recently at SMX West, Google Search Console’s URL Parameters tool allows webmasters to instruct Google on how they would like Google’s crawlers to view and process the URL parameters utilized on their site. The URL parameters utilized on the site will automatically be discovered by Googlebot within the tool, or can be manually added for the webmaster to review and configure. From there, it is up to the webmaster to go down the list of the URL parameters and decide whether or not the parameters should be crawled. Per Google’s instruction, selecting “No URLs” should prevent Google’s crawlers from crawling those URLs when they are internally linked. However, it is unknown whether this will also block Google from crawling external links to these pages and passing link equity.
After implementing this on some of our clients’ sites, we have seen a dramatic decrease in the indexation of URLs with parameters. This is due to a decrease in what we call “index churn”: Google crawls a parameterized URL, indexes it, and then de-indexes it once the canonical tag is recognized. Google now no longer crawls or indexes those parameters, making its crawl of the site more efficient while cleaning up the index.
See the chart below. The blue dot indicates the week the URL parameters directives were set.

Remove URL Tool

Google Search Console also has a tool that allows webmasters to inform Google that a subdomain, subdirectory, or a certain page on a site should be removed completely from Google’s index. This will take effect within a few hours and will last 90 days. Beyond 90 days, Google will reindex the site or page, unless you have taken measures to block the URL. This tool is especially helpful when you have found that entire subdomains are indexed and they should be completely removed and blocked from indexation.
Note: This tool is very powerful and should be used with caution. However, if you accidentally request removal of your primary domain, Google will re-index it within 24 hours of your “Reinclude” request (assuming no measures were taken to block the URL).

Prevention & Maintenance

After you have identified and started to remove problematic pages from Google’s index, it is recommended that you benchmark the indexation of your site using the same site:operator searches that were utilized to identify the problem areas. This enables you to monitor the effectiveness of your de-indexation measures and the rate of removal. These searches should be done every two to four weeks to keep your finger on the pulse of Google’s indexation of your site. This will allow you to identify any spikes or dips in indexation much sooner than just monitoring Google Search Console’s “Index Status” report, which has a 30-day delay (some have also questioned its accuracy).
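If you want a lightweight way to track these benchmarks over time, appending each manually collected result count to a CSV is enough to chart the trend later. A minimal sketch, with a hypothetical file name and query:

    # Minimal sketch: append manually collected site: operator result counts
    # to a CSV so indexation trends are easy to chart between check-ins.
    import csv
    from datetime import date

    def log_index_count(path, query, result_count):
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([date.today().isoformat(), query, result_count])

    log_index_count("index-benchmark.csv", 'site:example.com "Found 0 products"', 120)
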
To prevent future index bloat, whenever a specific page, set of pages, or section of your site is launched, it is recommended to ask the questions mentioned earlier to determine whether that content should be indexed. If the answer to all three of those questions is “No,” consider adding the new content to an existing page instead of launching a new page, or adding a noindex meta tag to the new page so it is not indexed.

Conclusion

If Google is indexing too many pages on your site or showcasing the wrong pages in search results, you can use Google Search Console’s tool set to limit indexation, set specific guidelines for crawlers, request the removal of specific URLs, and ensure that your site is properly formatted to promote quality pages over “thin” or empty pages.


Joanna Cheney, Organic Search Strategist
