New Duplicate Content and Mapping Patents from Google – January 2, 2007

Google was granted new patents this week on methods for estimating similarity between web pages and documents, which may help to filter duplicate content, and on a digital mapping system that appears to be the foundation for Google Maps and has a number of related map-based patent applications in its wake.

Finding Similarity in Pages and Objects

The patent on duplicate and near-duplicate content, Methods and apparatus for estimating similarity (US Patent 7,158,961), lists Moses Samson Charikar as its inventor. It was granted this morning and was originally filed on December 31, 2001.

Abstract

A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
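The steps in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the 64-bit sketch length, the use of SHA-256 to derive each "predetermined hashing vector," and the function names are all assumptions made for illustration.

```python
import hashlib

SKETCH_BITS = 64  # assumed sketch length; the abstract does not give one


def hashing_vector(feature, bits=SKETCH_BITS):
    """Return a fixed +1/-1 vector for a feature.

    Deriving it from SHA-256 is an assumption; the abstract only says the
    hashing vector is "predetermined".
    """
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest[: bits // 8], "big")
    return [1 if (value >> i) & 1 else -1 for i in range(bits)]


def sketch(weights, bits=SKETCH_BITS):
    """Build a compact sketch from a {feature: weight} object vector."""
    total = [0.0] * bits
    for feature, weight in weights.items():
        # Multiply the coordinate's weight by its hashing vector and
        # add the product vector to the running sum.
        for i, h in enumerate(hashing_vector(feature, bits)):
            total[i] += weight * h
    # Keep only the sign of each summed coordinate as one bit.
    result = 0
    for i, s in enumerate(total):
        if s > 0:
            result |= 1 << i
    return result


def similarity(a, b, bits=SKETCH_BITS):
    """Estimate similarity as the fraction of sketch bits that agree."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / bits
```

Two objects with the same weighted features always produce the same sketch, and comparing sketches only requires counting differing bits, which is much cheaper than comparing full documents.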

Between completing his PhD at Stanford University in the summer of 2000 and joining Princeton in the fall of 2001, Dr. Charikar worked in the research department at Google. He has continued to work on methods for determining similarity, and one of the latest papers that he has co-authored on the subject is Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data (pdf).

Some features of the similarity comparison process:

1. Similarity sketches for pages, created by the similarity engine, can be used to reduce the amount of redundant or nearly redundant documents crawled and returned in response to a user’s search query. They can also help a spidering program become more efficient by avoiding crawling the sites determined to be substantial duplicates.

2. Similarity sketches can be created for pages based upon the lists of hyperlinks in those documents.

3. Similarity of snippets at the time of serving can be compared, and snippets and/or search results that exceed a similarity threshold can be excluded.

4. Under this similarity comparison, different elements can be given different weights:

In FIG. 3A, each non-zero coordinate is given an equal vector weight (i.e., it has the value one). More generally, however, different elements can be given different weights. This is illustrated in FIG. 3B, in which words that are considered “more important” are more heavily weighted in object vector 302. For example, “years” and “score,” because they are less common words, may be given a higher weighting value, while “and” is given a low weight. In this manner, the object vector can emphasize certain elements in the similarity calculation while de-emphasizing others.
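To make the weighting and threshold ideas above concrete, here is a small sketch. It compares object vectors directly with cosine similarity rather than with the patent's sketches, and the weight values, the 0.9 threshold in the usage note, and the function names are hypothetical choices for illustration only.

```python
import math

# Hypothetical weights: less common words count more, "and" counts less.
WORD_WEIGHTS = {"years": 3.0, "score": 3.0, "and": 0.1}
DEFAULT_WEIGHT = 1.0


def object_vector(text, weights=None):
    """Map each distinct word to a weight (1.0 for every word if unweighted)."""
    vector = {}
    for word in text.lower().split():
        if weights is None:
            vector[word] = 1.0
        else:
            vector[word] = weights.get(word, DEFAULT_WEIGHT)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_near_duplicates(snippets, threshold, weights=None):
    """Keep a snippet only if it does not exceed the threshold against any kept one."""
    kept = []
    for snippet in snippets:
        vec = object_vector(snippet, weights)
        if all(cosine(vec, object_vector(k, weights)) <= threshold for k in kept):
            kept.append(snippet)
    return kept
```

With the snippets "four score and seven years ago" and "four score and seven years past" and a threshold of 0.9, the unweighted comparison keeps both, while the weighted comparison treats the second as a near duplicate because the shared heavily weighted words dominate the calculation.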

This isn’t the only granted patent that has been assigned to Google which discusses duplicate content. Detecting duplicate and near-duplicate files and Detecting query-specific duplicate documents also discuss methods to identify and filter duplicate and near duplicate documents.

Digital Mapping

The mapping patent, Digital mapping system (US Patent 7,158,878), is one of a number of patent filings from Google that describe many details which appear in Google Maps. The inventors listed are Jens Eilstrup Rasmussen, Lars Eilstrup Rasmussen, Bret Steven Taylor, James Christopher Norris, Stephen Ma, Andrew Robert Kirmse, Noel Phillip Gordon, and Seth Michael Laforge. It is listed as having been filed on February 5, 2005.

Abstract

Various methods, systems, and apparatus for implementing aspects of a digital mapping system are disclosed. One such method includes sending a location request from a client-side computing device to a map tile server, receiving a set of map tiles in response to the location request, assembling said received map tiles into a tile grid, aligning the tile grid relative to a clipping shape, and displaying the result as a map image. One apparatus according to aspects of the present invention includes means for sending a location request from a client-side computing device to a map tile server, means for receiving a set of map tiles in response to the location request, means for assembling said received map tiles into a tile grid, means for aligning the tile grid relative to a clipping shape, and means for displaying the result as a map image. Such an apparatus may further include direction control or zoom control objects as interactive overlays on the displayed map image, and may also include route or location overlays on the map image.
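The "assemble a tile grid and align it to a clipping shape" step in the abstract can be pictured with a rough sketch. It assumes square 256-pixel tiles, a rectangular viewport as the clipping shape, and world pixel coordinates for the map center; none of these details, nor the names used, come from the patent.

```python
from dataclasses import dataclass

TILE_SIZE = 256  # assumed tile edge length in pixels


@dataclass(frozen=True)
class TileGrid:
    first_col: int  # column index of the leftmost tile needed
    first_row: int  # row index of the topmost tile needed
    cols: int       # number of tile columns in the grid
    rows: int       # number of tile rows in the grid
    offset_x: int   # grid's left edge relative to the viewport's left edge
    offset_y: int   # grid's top edge relative to the viewport's top edge


def layout_tile_grid(center_x, center_y, view_width, view_height,
                     tile_size=TILE_SIZE):
    """Work out which tiles cover the viewport and how to align the grid."""
    # Top-left corner of the viewport in world pixel coordinates.
    left = center_x - view_width // 2
    top = center_y - view_height // 2
    # Tiles touched by the first and last visible pixel in each direction.
    first_col = left // tile_size
    first_row = top // tile_size
    last_col = (left + view_width - 1) // tile_size
    last_row = (top + view_height - 1) // tile_size
    return TileGrid(
        first_col=first_col,
        first_row=first_row,
        cols=last_col - first_col + 1,
        rows=last_row - first_row + 1,
        # Zero or negative: the grid starts at or before the viewport edge
        # and the viewport clips whatever hangs over.
        offset_x=first_col * tile_size - left,
        offset_y=first_row * tile_size - top,
    )
```

A client could request only the tiles in the returned range, place them in a grid, and shift that grid by the offsets so the clipping shape shows exactly the requested area. Panning would then amount to changing the offsets and fetching newly exposed tiles.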

This patent provides a framework for a digital mapping system which includes searches for locations, local searches, and driving directions. Some additional patent applications from Google that focus upon the display and functionality of Google Maps (implemented or not) include:

My standard disclaimer regarding patents. Patents are filed to protect ideas and methods developed as part of the intellectual property of a company, and may be used to exclude others from using the same, or similar processes, but the granting of a patent or publication of a patent application doesn’t necessarily mean that the processes involved have been fully developed, or will be in the future. Yet, the documents can provide some insight into the ideas that an organization is working upon, and may act as a starting point for more research.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.

About The Author: Bill Slawski is the Director of Search Marketing for Go Fish Digital and the editor of SEO by the Sea. He has been doing SEO and web promotion since the mid-90s, and was a legal and technical administrator in the highest level trial court in Delaware.

  • http://seowebmaster.com/ ★ ★ Search Engines WEB ★ ★

    Patent or no patent, Google had already been aggressively phasing in their de-duplication algos as of about two years ago.

    Most SEOs noticed it when the many directories using DMOZ data began disappearing, and websites that owed much of their link popularity success to their DMOZ listings – began a sudden, sharp drop in the SERPs (some even virtually disappeared).

    The second phase of Google’s de-duplication process appeared to drop or severely punish OBVIOUS links pages, and to drop links directories that were OBVIOUS duplications of automatic link uploading pages.

    Also, around that time, certain high profile automatic link exchange Web sites were banned.

    One very high profile one actually made a public acknowledgement to their customers and eventually changed their domain – thus starting all over, because even after changing their strategies they stayed permanently at a PR0.

  • http://www.aaronshear.com/blog/ Aaron Shear

    This patent seems to sound like the shingles conversation that was started a few months back. Relating to how a Shingle or in this case a Sketch can be viewed as similar or duplicative. Interesting concept makes it very difficult to scrape and succeed any longer.

    Great write up!

  • http://www.adscriptor.com Jean-Marie Le Ray

    Hi Bill,

    Nice post, as usual. Anyway, even if I know you’ll have to translate my post, what do you think a similarity engine could do in this case :
    http://adscriptum.blogspot.com/2007/01/10-tips-for-writing-profit-producing-ad.html
    and how to find the original author ?
    Jean-Marie

  • http://www.seobythesea.com Bill Slawski

    Hi Jean-Marie,

    Thanks. If I understand correctly, your question is more about which page might appear in search results when a search engine has determined that pages are duplicates or are very similar.

    The best description of how a search engine might behave when filtering out pages to be shown to a searcher is in a patent application from Microsoft – System and method for optimizing search results through equivalent results collapsing.

    I wrote about it at SEO by the Sea, and I’m not going to duplicate that here, so I’ll just point to it – Microsoft Explains Duplicate Content Results Filtering. Chances are very good that what Microsoft describes there is very similar to what Google and Yahoo are doing when deciding which pages to show.

    I don’t think that it makes a difference whether Google is using a similarity engine, as described in this patent, or one of the shingles methods from their other patents, or a phrase-based indexing method to identify duplicates, or some other method. Regardless of what method is being used to identify duplicates, the decision of which pages to show is likely independent of that.

  • http://www.adscriptor.com Jean-Marie Le Ray

    Hi Bill,

    “If I understand correctly, your question is more about which page might appear in search results when a search engine has determined that pages are duplicates or are very similar.”

    Yes and no. In this case, maybe it’s more about plagiarism than duplicate content, and I guess no similarity engine nor algorithm will be able to determine who is the original author (so the one and only result the search engine should show in SERPs), but a human validator.
    I think just this solution would be truly healthy for the Web ecosystem.

    Jean-Marie

  • http://www.adscriptor.com Jean-Marie Le Ray

    Bill, hi again

    a bit off-topic, but did you read that : http://www.ificlaims.com/press_release012007a.htm
    I’ve seen somewhere that Microsoft is not innovative enough, but it doesn’t seem so! 1463 patents in 2006, rank 12
    J-M

  • http://www.seobythesea.com Bill Slawski

    Searching through the granted patents and published patent applications every week, I do see a lot of patent filings from Microsoft.

    Some of them are innovative, and some of them maybe less so. I’m not sure that volume of patent filings by itself is a clear indication of innovation.

    But there is some interesting stuff amongst those patent applications.
