Google's duplicate content patent
This month, Google was granted a patent titled "Duplicate
document detection in a web crawler system." The patent explains
how the search engine's content filter can work with
a duplicate content server.
What is duplicate content?
The patent contains a definition of duplicate content:
"Duplicate documents are documents that have substantially
identical content, and in some embodiments wholly identical
content, but different document addresses."
The patent describes three scenarios in which duplicate documents
are encountered by a web crawler:
1. Two pages, comprising any combination of regular web page(s)
   and temporary redirect page(s), are duplicate documents if
   they share the same page content, but have different URLs.
2. Two temporary redirect pages are duplicate documents if they
   share the same target URL, but have different source URLs.
3. A regular web page and a temporary redirect page are duplicate
   documents if the URL of the regular web page is the target
   URL of the temporary redirect page or the content of the regular
   web page is the same as that of the temporary redirect page.
A permanent redirect page is not directly involved in duplicate
document detection because the crawlers are configured not
to download the content of the redirecting page.
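The three scenarios can be read as a simple pairwise check. The
following Python sketch is only an illustration of that reading,
not the patent's implementation: the Page class, its fields, and
the are_duplicates function are hypothetical names, and permanent
redirects are left out because their content is never downloaded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical model of a crawled page (not taken from the patent)."""
    url: str
    content: Optional[str] = None          # page content, if any was fetched
    redirect_target: Optional[str] = None  # set only for temporary redirects


def _redirects_to(redirect: Page, regular: Page) -> bool:
    """True if `redirect` is a temporary redirect pointing at a regular page."""
    return (
        redirect.redirect_target is not None
        and regular.redirect_target is None
        and redirect.redirect_target == regular.url
    )


def are_duplicates(a: Page, b: Page) -> bool:
    """Apply the three scenarios to a pair of pages."""
    if a.url == b.url:
        return False  # duplicates must have different addresses

    # Scenario 1 (and the content half of scenario 3):
    # same page content, different URLs.
    if a.content is not None and a.content == b.content:
        return True

    # Scenario 2: two temporary redirects with the same target URL.
    if a.redirect_target is not None and a.redirect_target == b.redirect_target:
        return True

    # Scenario 3: a regular page is the target of a temporary redirect.
    return _redirects_to(a, b) or _redirects_to(b, a)
```

For example, a temporary redirect at http://a.example/r1 that
points to http://a.example/ would be reported as a duplicate of
that page, while two pages with different content and no redirect
relationship would not.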
How does Google detect duplicate content?
According to the patent description, Google's web crawler
consults the duplicate content server to check whether a newly
found page is a copy of another document. The algorithm then
determines which version is the most important one.
Google can use different methods to detect duplicate content.
For example, Google might compute "content fingerprints" of
pages and compare them when a new web page is found.
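To make the fingerprint idea concrete, here is a minimal Python
sketch. It assumes a fingerprint is simply a SHA-256 hash of the
whitespace-normalized, lowercased text, which only catches wholly
identical content; the fingerprint function Google actually uses
is not described here. The names fingerprint and FingerprintIndex
are hypothetical.

```python
import hashlib
import re
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it.

    Assumption: an exact-match hash stands in for whatever
    fingerprint the real system computes.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintIndex:
    """Toy stand-in for a duplicate content server."""

    def __init__(self) -> None:
        self._urls_by_fingerprint: Dict[str, List[str]] = {}

    def add(self, url: str, text: str) -> List[str]:
        """Record a page and return earlier URLs with the same fingerprint."""
        seen = self._urls_by_fingerprint.setdefault(fingerprint(text), [])
        earlier = [u for u in seen if u != url]
        if url not in seen:
            seen.append(url)
        return earlier
```

Calling add() for each newly crawled page returns an empty list
for content that has not been seen before, and the list of earlier
URLs when the same content turns up under a different address.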
Interestingly, it's not always the page with the highest
PageRank that is chosen as the most important URL for the
content:
"In some embodiments, a canonical page of an equivalence
class is not necessarily the document that has the highest
score (e.g., the highest page rank or other query-independent
metric)."
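One way such a situation could arise, shown purely as an
illustration: if a system only replaces the current canonical page
when a newcomer's score is higher by some margin, the canonical
page can end up with a slightly lower score than another page in
its group. The rule, the choose_canonical function, and the margin
value below are assumptions and are not confirmed by the quoted
text.

```python
from typing import Dict, Optional


def choose_canonical(
    current: Optional[str],
    candidate: str,
    scores: Dict[str, float],
    margin: float = 0.1,  # hypothetical threshold, not from the patent
) -> str:
    """Keep the current canonical URL unless the candidate clearly beats it."""
    if current is None:
        return candidate  # first page seen in this equivalence class
    if scores[candidate] > scores[current] + margin:
        return candidate  # candidate wins by more than the margin
    return current        # small score differences do not change the choice


# Hypothetical query-independent scores for three duplicate pages.
scores = {"a": 0.50, "b": 0.55, "c": 0.70}
canonical = choose_canonical(None, "a", scores)
canonical = choose_canonical(canonical, "b", scores)
```

In this example "b" has a higher score than "a", yet "a" stays the
canonical page because the difference is smaller than the margin.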
How does this affect your website?
If you want to get high rankings, it is easier to do so with
unique content. Try to use as much original content as possible
on your web pages.
If your website must use the same content as another website,
make sure that your website has better inbound links than
the other websites that carry the same content. Your website
is then more likely to be chosen as the most important URL
for that content.
If your website has unique content, you don't have to worry
about potential duplicate content penalties. Optimize that
content for search engines and make sure that your website
has good inbound links. It's hard to outrank a website with
well-optimized content and many good inbound links.