Duplicate content is one of the most perplexing problems in SEO. In this post, I am going to outline 15 things about how Google handles duplicate content, leaning heavily on interviews with Vanessa Fox and Adam Lasnik. If I leave something out, just let me know and I will add it to this post.
- Google’s standard response is to filter out duplicate pages, and only show one page with a given set of content in its search results.
- I have seen evidence in the SERPs that large media companies are able to show copies of press releases without getting filtered out.
- Google rarely penalizes sites for duplicate content. Their view is that it is usually inadvertent.
- There are cases where Google does penalize, but it takes an egregious act or a site that is seen as offering little end-user value. I have seen instances of algorithmically applied penalties on sites with large amounts of duplicate content.
- An example of a site that adds little value is a thin affiliate site: one that relies on copies of third-party content for the great majority of its pages and exists to get search traffic and promote affiliate programs. If this is your site, Google may well seek to penalize you.
- Google does a good job of handling foreign language versions of a site. They will most likely not see the Spanish and English language versions of a site as duplicates of one another.
- A tougher problem is US and UK variants of a site (“color” vs. “colour”). The best way to handle this is in-country hosting, which makes it easier for Google to detect that the two versions target different countries.
- Google recommends that you use NoIndex metatags or robots.txt to identify duplicate pages you don’t want indexed, for example, the “Print” versions of pages on your site (see the sample snippets after this list).
- Vanessa Fox indicated at the Duplicate Content Summit at SMX that Google will not punish a site for placing NoFollow on a large number of internal links. However, the recommendation is still that you use robots.txt or NoIndex metatags instead.
- When Google comes to your site, they have in mind a number of pages that they are going to crawl. One of the costs of duplicate content is that each time the crawler loads a duplicate page that they are not going to index, it does so instead of loading a page that they might index. This is a big downside if your site ends up less than fully indexed as a result.
- I also believe that duplicate pages cause internal bleeding of PageRank. In other words, link juice passed to duplicate pages is wasted and would be better passed on to other pages.
- Google finds it easy to detect certain types of duplicate content, such as print pages, archive pages in blogs, and thin affiliate sites. These are usually recognized as being inadvertent.
- They are still working on RSS feeds and the best way to keep them from showing up as duplicate content. The acquisition of FeedBurner will likely speed the resolution of that issue.
- One key signal they use to select which page to show from a group of duplicates is which page is linked to the most.
- Lastly, if you are doing a search and you DO want to see duplicate content results, just do your search, get the results, append the “&filter=0” parameter to the end of the results URL, and refresh the page (an example follows this list).
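To make the NoIndex and robots.txt recommendation above concrete, here is a minimal sketch of the two approaches. The /print/ directory is a hypothetical location for printer-friendly duplicates; substitute your own URLs.

```html
<!-- Place this in the <head> of a duplicate page (e.g., a "Print" version)
     to tell search engines not to index it: -->
<meta name="robots" content="noindex">
```

```
# robots.txt at the site root; keeps crawlers out of the hypothetical
# /print/ directory that holds duplicate printer-friendly pages.
User-agent: *
Disallow: /print/
```

Keep in mind that robots.txt keeps the pages from being crawled at all, while the NoIndex metatag lets them be crawled but keeps them out of the index.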
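And to illustrate the last point, assuming an ordinary Google search results URL (the query here is just a made-up example), appending the parameter looks like this:

```
Normal results:          https://www.google.com/search?q=acme+widgets
With duplicates shown:   https://www.google.com/search?q=acme+widgets&filter=0
```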
Here is a summary of Ways to Create Duplicate Content, and Adam Lasnik’s post on Deftly Dealing with Duplicate Content explains how to handle this problem on your site.