Duplicate content: what it is and how to avoid SEO issues

admin 26 February 2021

Today we are dealing with a particularly thorny SEO issue, duplicate content: put simply, content that is repeated in identical or broadly similar form on several web pages, within the same site or across different sites. The practice may have a deceptive intent, but it is usually the result of poor optimization or laziness, and it can lead to worse rankings for the pages involved and, more generally, make it harder for that content to rank. Here is all there is to know about what duplicate content is, how to detect it, how to correct the problem and how to keep it from reappearing.

SEO duplicate content, a knot to untangle

By way of definition, duplicate content is content that is reproduced as an identical or very similar copy in multiple locations on the Web, within a single website or across multiple domains; in other words, any content that can be found at more than one web address or URL.

More precisely, the Google guide explains that the expression refers to “blocks of important content within or between domains that are identical or very similar”, which can give rise to what we have referred to as a serious and frequent SEO error.

Portions of text repeated in different languages are not considered duplicate content, just as quotes (even full paragraphs) are not identified as an error, especially if we use the semantic <cite> markup within the source code.

Why duplicate content is an issue and an error

While it does not technically lead to a penalty, duplicate content can still sometimes negatively affect search engine rankings: when faced with multiple pieces of “significantly similar” content in more than one location on the Internet, Google has difficulty deciding which version is most relevant for a given search query.

In general, duplicate content is considered a problem to be solved because, basically, it adds no value to the user's experience on the pages of the site, which should be the focal point of every piece of content published. Thinking like a user: would we regularly visit a site that presents non-original articles, or would we try to read the original source of the information directly?

In addition to problems in organic search, duplicate content may also violate the Google AdSense publisher policies, which can prevent Google ads from being shown on sites with copied or copyrighted content: that is, pages that copy and republish content from others without adding any original wording or intrinsic value; pages that copy content from others with slight modifications (rewriting it manually, replacing some terms with simple synonyms, or using automated techniques); and sites dedicated to embedding content such as videos, images or other media from other sources, again without adding substantial value for the user.

The different types of duplicate content

Among the examples of non-malicious and non-deceptive duplicate content, Google cites:

Discussion forums, which can generate both regular and “abbreviated” pages associated with mobile devices.
Items from an online store displayed or linked via multiple separate URLs.
Printer-only versions of web pages.

In fact, it must be made clear that there are two broad categories of duplicate content, on the same site and across other websites, which of course are problems of a different order and scale.

Duplicate content on the web

Duplicate content on the Web, or external duplicate content, occurs when an entire piece of content or a portion of it (such as a paragraph) is repeated on several different sites (domain overlapping).

This error can result from a number of factors. It is common, for example, on e-commerce sites that republish without any changes the product sheets supplied by the original manufacturer of an item for sale, but sometimes it can also be a manipulative technique in an “attempt to control search engine rankings or acquire more traffic”.

This is a scenario that Google knows well and tries to punish (by lowering the ranking of the sites involved or even removing the sites from the Index), because “deceptive practices like this can cause an unsatisfactory user experience”, showing visitors “always the same contents repeated in a set of search results”.

Apart from this lack of diversity for users, external duplicate content also puts Googlebot in difficulty: faced with identical content at different URLs, it does not initially know which is the original source and is therefore forced to favor one page over the other, weighing elements such as the date of indexing, the authority of the site and so on.

Duplicate content on the same site

A different matter is duplicate content on the same site, also called internal duplicate content, which occurs within a single domain or host name.

In this case the damage is smaller and mainly concerns a reduced chance of the affected pages ranking well in the SERP, again because search engine crawlers struggle to determine which version to prefer and show users as a relevant answer to their query.

E-commerce duplicate content

This second type of problem is also frequent on e-commerce sites, for example when URL parameters and faceted navigation are badly managed, creating multiple pages with identical content reachable at different addresses and all indexed by search engines, or when tags are used carelessly and end up overlapping with the category pages.

The causes of duplicate content

We have mentioned some potential elements that lead to internal or external duplicate content on sites, but it is now time to list more systematically the five unintended technical causes of the problem.

Variants of the URL

URL parameters, such as click tracking and some analytics codes, can cause duplicate content issues, as can session IDs that store a different ID in the URL for each user visiting the site or, again, printable versions (when several versions of a page are indexed).

The advice in this case is to avoid adding URL parameters or alternative versions of URLs where possible, using scripts instead to pass along the information they contain.
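As a rough illustration of the idea, the sketch below removes parameters that do not change the page content, so that tracked and untracked variants collapse to one address. The function name and the list of parameter names are assumptions chosen for the example and should be adapted to the parameters a given site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters assumed never to change what the page shows.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid", "sessionid",
}


def strip_tracking_params(url: str) -> str:
    """Return the URL without tracking or session parameters."""
    parts = urlsplit(url)
    # Keep only the query pairs whose name is not in the tracking list.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params(
    "https://example.com/shoes?color=red&utm_source=news&sessionid=abc"
))
```

Parameters that do change the content (here, the hypothetical `color`) are left untouched, since removing them would merge pages that are genuinely different.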

Separate versions of pages

You may run into a duplicate content problem with a site that has separate versions with and without the www prefix, or one that has not completed the transition from http:// to https:// and keeps both versions active and visible to search engines. Other separate versions include pages with and without a trailing slash, URLs that differ only in upper and lower case, mobile-optimized URLs and AMP versions of pages.
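A minimal sketch of how these variants can be mapped onto one preferred form is shown below. It assumes, purely for illustration, that the preferred version is https, without www and without a trailing slash; the function name is hypothetical, and lowercasing the path is left optional because it is only safe when the server treats paths as case-insensitive.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_url(url: str, lowercase_path: bool = False) -> str:
    """Map protocol, www, trailing-slash and case variants to one form."""
    parts = urlsplit(url)
    # Host names are case-insensitive (assumes no user:password@ prefix).
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        # Drop the trailing slash, but never reduce the path to nothing.
        path = path.rstrip("/") or "/"
    if lowercase_path:
        path = path.lower()
    # Always emit https and discard the fragment, which servers never see.
    return urlunsplit(("https", host, path, parts.query, ""))


print(preferred_url("http://WWW.Example.com/Shoes/"))
print(preferred_url("http://WWW.Example.com/Shoes/", lowercase_path=True))
```

A function like this only tells us which address each variant should resolve to; the variants themselves still need a redirect or a canonical on the live site.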

Thin content

Thin content is content that is generally short and poorly written, with no added value for users and no originality, and which may repeat portions of the site already published at other URLs.

It also includes CMS archive pages such as tags, authors and dates, and especially pagination pages (the archives of post lists after the first page), when they are not properly optimized or blocked with a “noindex, follow” meta tag.

Boilerplate content

Boilerplate content can also generate duplicate content: this is the text in the header, footer and sidebar which, on some sites, can even make up most of the content on the page. Being present on all URLs, it can become a problem if not handled properly (for example, by varying it according to the section of the site the user is in).

Scraped or copied content

This covers not only plagiarism (which violates copyright law, and against which Google offers a specific procedure to request removal of the offending page from search results under the Digital Millennium Copyright Act, the US copyright law), but every circumstance in which pages contain material that has been scraped or explicitly copied.

The copied items are primarily blog posts and editorial content, but also product information pages, whose contents end up in multiple locations on the Web.

Negative consequences of duplicate content

Duplicate content is a problem at various levels for everyone on the Web (search engines, site owners and users), which already explains why it is important to take action to correct these cases and prevent them from appearing.

In detail, for search engines duplicate content can pose three main problems:

Inability to decide which versions to include or exclude from their indices.
Indecision whether to direct the link metrics (trust, authority, anchor text, link equity and so on) to a specific page or keep them separated between multiple versions.
Difficulty in deciding which version to rank for the different query results.

For site owners, on the other hand, duplicate content can lead to worse rankings and lost traffic, usually as a result of two main problems, both of which keep the content from achieving the visibility in Search that it might otherwise have:

A dilution of the visibility of each of the pages with duplicate content, because search engines rarely show multiple versions of the same content and are therefore forced to choose for themselves which version is most likely to be the best result.
A further dilution of link equity, because other sites will also have to choose between the duplicates, and so backlinks will not all point to a single piece of content.

When duplicate content causes fluctuations in SERP rankings, the cannibalization problem occurs: Google cannot work out which page offers the most relevant content for the query, and so it alternately tests the candidate URLs in search of the most relevant one.

Lastly, for users, duplicate content is not useful and does not offer any added value, since it is not unique.

How to solve issues with duplicate content

Generally speaking, solving duplicate content problems comes down to a single goal: specifying which of the duplicates is the “correct” one. Several interventions help avoid internal duplicate content and, more generally, help us get into the habit of always telling Google and other search engines which version of a page is preferred over any duplicates.

These are the operations to be carried out to “proactively solve duplicate content problems and be sure that visitors see the content intended for them”, as the aforementioned Google guide says.

Use a 301 redirect from the “duplicated” page to the original content page, set in the .htaccess file (on Apache servers), to redirect users, Googlebot and other spiders intelligently. When multiple pages with the potential to rank well are combined into a single page, they not only stop competing with each other, but also create greater relevance and a general signal of popularity. This will have a positive impact on the ability of the “correct” page to rank well.
Use rel=canonical to specify the official version of the page and tell Google to skip indexing any variants it finds while crawling the site (but be careful: Google can also choose a canonical page other than the one set).
Keep internal links consistent, always pointing to the same version of a URL.
Use top-level domains for country-specific content, to help Google serve the most appropriate version of a document.
Pay attention to how content is distributed on other sites, including syndication (where possible, use or ask the other site to use the noindex tag to prevent search engines from indexing the duplicate version of the content).
Minimize the repetition of boilerplate text.
Use the parameter management tool in Search Console to indicate how we would like Google to manage URL parameters.
Avoid publishing incomplete pages, such as those for which we do not yet have actual content (placeholder pages, which can be blocked with the noindex tag to prevent them from being indexed).
Get familiar with the content management system and the ways it displays content: for example, a blog entry can appear with the same label on the home page of a blog, on an archive page and on a page of other entries.
Minimize similar content, possibly by expanding the pages too similar or consolidating them all in one page. For example, the guide says, “if your travel site contains separate pages for two cities but the information is the same on both pages, you could merge the two pages into a single page covering both cities or expand each of them to present unique content on each city”.
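Two of the signals in the list above, rel=canonical and the noindex tag, live in the HTML of the page, so it can be useful to check what a page actually declares. The sketch below does this with Python's standard html.parser; the class and function names are hypothetical, and the HTML string is an invented example rather than a real page.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the canonical URL and the robots meta directive of a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()


def read_signals(html: str):
    """Return (canonical_url, robots_directive); each is None if absent."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.canonical, parser.robots


# Invented page used only to demonstrate the parser.
page = (
    '<html><head>'
    '<link rel="canonical" href="https://example.com/shoes">'
    '<meta name="robots" content="noindex, follow">'
    '</head><body></body></html>'
)
print(read_signals(page))
```

A page that declares a canonical pointing elsewhere while also being linked internally under several addresses is a good candidate for the consistency check mentioned in the list.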

How to verify the presence of duplicate content

To check whether there is duplicate content on our site we have various tools: staying within our suite, we can launch a scan with the SEO Spider, which highlights pages that share the same title tag, description or heading (a potential indicator of the problem) and also reports whether a canonical has been set correctly. From the same scan we can also view the list of site URLs and analyze them to verify that we have not used problematic parameters.
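The underlying check, grouping pages that share the same title, is simple enough to sketch by hand. The example below is only an illustration of the idea and not a description of how the SEO Spider works internally; the function name and the sample data are invented.

```python
from collections import defaultdict


def group_duplicate_titles(pages: dict) -> dict:
    """Map each title used by more than one URL to the sorted URLs using it.

    `pages` maps a URL to its title tag text. Titles are compared after
    lowercasing and collapsing whitespace, so trivial differences are ignored.
    """
    groups = defaultdict(list)
    for url, title in pages.items():
        key = " ".join(title.lower().split())
        groups[key].append(url)
    return {title: sorted(urls) for title, urls in groups.items() if len(urls) > 1}


# Invented crawl result used only for demonstration.
crawl = {
    "/shoes": "Red Shoes",
    "/shoes?sort=price": "red  shoes ",
    "/boots": "Blue Boots",
}
print(group_duplicate_titles(crawl))
```

The same grouping can be applied to meta descriptions or headings by passing those values in place of the titles.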

Searching for duplicate content outside the site is more complex: in this case, we can rely on specific tools such as Copyscape or run manual searches on Google. In practice, we select a “suspect” portion of text (one we think may have been copied) and paste it in quotes into the search bar to find out whether that content is actually duplicated on other sites.
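For anyone who repeats this manual check often, the quoted search can be prepared with a few lines of code. The sketch below only builds the search address for an exact-phrase query and does not contact Google; the function name is hypothetical and the default word cap is an assumption meant to keep the query short.

```python
from urllib.parse import urlencode


def exact_phrase_search_url(snippet: str, max_words: int = 32) -> str:
    """Build a Google search URL that looks for the snippet as an exact phrase."""
    # Remove quotes inside the snippet so they cannot break the phrase.
    words = snippet.replace('"', "").split()[:max_words]
    phrase = " ".join(words)
    return "https://www.google.com/search?" + urlencode({"q": f'"{phrase}"'})


print(exact_phrase_search_url("content that is repeated identical or broadly similar"))
```

Opening the resulting address in a browser is equivalent to typing the sentence in quotes into the search bar.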