December 14, 2010
Search engines today are adamant that duplicate content should not exist and that fresh, unique and relevant content is better for the user. Duplicate content is frowned upon because it is a common technique used by black hatters to manipulate the SERPs. However, this is not always the case. Duplicate content can be accidental and not aimed at manipulating the SERPs at all. Either way, it is a troublesome issue for search engines such as Google, which is why canonical URLs were introduced.
What is a Canonical URL?
A canonical URL tells search engines which of two or more URLs serving the same content is the preferred one, even when the URLs are syntactically different from each other. Search engines do possess the intelligence to analyse and conclude whether URL A and URL B are related and/or identical, but they cannot be sure whether the duplication is intentional.
Before canonical URLs were introduced and used in the SEO world, it was common practice that if you had duplicate content on URL A and URL B, you would 301 redirect one to the other to prevent any potential duplicate content issues. However, this isn’t always feasible, and content on URL A may end up duplicated on URL B, URL C, URL D and so on for a number of reasons.
- Session IDs – In programming, sessions are used to preserve data for each visitor for a limited time. Every unique visitor has their own Session ID. The problem arises when Google crawls a URL containing a Session ID and indexes it as a completely different URL, thus creating duplicate content.
URL with Session ID: http://www.myurl.com.au/index.php?PHPSESSID=183249374871234872314
To help Google and other search engines determine which URL is in fact the source, a canonical link element can be placed within the <head></head> section of the page. For example:
- <link rel="canonical" href="http://www.myurl.com.au/index.php" />
This way, whenever Google or another search engine crawls URLs carrying different session IDs, it knows that only one URL is the source (the real URL).
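As a rough illustration of the session ID case above, here is a minimal Python sketch that strips the session parameter from a URL and builds the matching canonical tag. The function names and the `SESSION_PARAMS` set are hypothetical and not part of any CMS or library; only `PHPSESSID` comes from the example URL, and a real site may use other parameter names.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only session parameter in use is PHPSESSID, as in the example URL.
SESSION_PARAMS = {"PHPSESSID"}


def canonical_url(url, session_params=SESSION_PARAMS):
    """Return the URL with session parameters (and any fragment) removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the session ones.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in session_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link> element to place inside <head></head>."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="{}" />'.format(href)


if __name__ == "__main__":
    print(canonical_link_tag(
        "http://www.myurl.com.au/index.php?PHPSESSID=183249374871234872314"
    ))
```

Other query parameters are deliberately left in place, since they may select genuinely different content.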
Another good example which I experience regularly is pages with search results. Let’s say we have a WordPress blog on the following URL:
We’ve already gone ahead and filled our blog up with various articles related to Electronics and added categories such as Televisions, Computers, Phones and Cameras. Now when a category is clicked by a user, they’re taken to a page with a list of articles within that category. For example:
Looking at our list of computer related articles is nice, but they’re only excerpts and there is no real information about what we are writing about or what computers are. So, we add 250-300 words of content, briefly describing what computers are and what we’re discussing. However, after a few weeks and over 40 articles added, the category has grown to span several pages. Google has come along, crawled the computers category and indexed all of the page URLs, each of which repeats the same introductory content, thus creating duplicate content.
Therefore, to prevent the possibility of duplicate content in this situation, we would add a canonical URL (like our first example) to the computers category within the <head></head> section of our website:
- URL: http://www.myblog.com.au/category/computers/
- <link rel="canonical" href="http://www.myblog.com.au/category/computers/" />
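The category example can be sketched the same way. The snippet below assumes the paginated category pages use WordPress-style pretty permalinks ending in `/page/N/`; that URL pattern and the function name `category_canonical` are assumptions for illustration, so adjust the pattern if your blog paginates differently.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: paginated category URLs end in /page/<number>/.
_PAGE_SUFFIX = re.compile(r"/page/\d+/?$")


def category_canonical(url):
    """Map any page of a category listing back to the category's first page."""
    parts = urlsplit(url)
    # Replace a trailing /page/N/ with a single slash; other paths are untouched.
    path = _PAGE_SUFFIX.sub("/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(category_canonical("http://www.myblog.com.au/category/computers/page/3/"))
```

Every page of the computers category would then carry the same canonical tag, pointing at the URL shown above.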
For more information on canonical URLs, including a list of commonly asked questions, you can visit Google’s Webmaster Central blog:
Alternatively, Wikipedia has an article on canonical URLs or more specifically, URL normalisation which includes a list of examples: