Google was granted new patents this week on methods for estimating similarity between web pages and documents, which may help to filter duplicate content, and on a digital mapping system that appears to be the foundation for Google Maps and has a number of related map-based patent applications in its wake.
Finding Similarity in Pages and Objects
The patent on duplicate and near-duplicate content, Methods and apparatus for estimating similarity (US Patent 7,158,961), lists Moses Samson Charikar as its inventor. It was granted this morning and was originally filed on December 31, 2001.
A similarity engine generates compact representations of objects called sketches. Sketches of different objects can be compared to determine the similarity between the two objects. The sketch for an object may be generated by creating a vector corresponding to the object, where each coordinate of the vector is associated with a corresponding weight. The weight associated with each coordinate in the vector is multiplied by a predetermined hashing vector to generate a product vector, and the product vectors are summed. The similarity engine may then generate a compact representation of the object based on the summed product vector.
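To make the abstract more concrete, here is a minimal Python sketch of the steps it describes: a weighted object vector, a fixed hashing vector per coordinate, a sum of the weighted hashing vectors, and a compact result. Several details are my assumptions rather than anything stated in the abstract: the 64-bit sketch length, deriving each hashing vector from SHA-256, reducing the summed vector to one bit per coordinate by keeping only its sign, and comparing sketches by the fraction of matching bits. The function names are hypothetical.

```python
import hashlib

SKETCH_BITS = 64  # assumed sketch length; the abstract does not fix one


def hashing_vector(term: str) -> list[int]:
    """Return a fixed vector of +1/-1 values for a term (assumed: derived from SHA-256)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    bits = int.from_bytes(digest[: SKETCH_BITS // 8], "big")
    return [1 if (bits >> i) & 1 else -1 for i in range(SKETCH_BITS)]


def sketch(weights: dict[str, float]) -> int:
    """Sum weight * hashing_vector over all terms, then keep one sign bit per coordinate."""
    totals = [0.0] * SKETCH_BITS
    for term, weight in weights.items():
        for i, sign in enumerate(hashing_vector(term)):
            totals[i] += weight * sign
    result = 0
    for i, total in enumerate(totals):
        if total > 0:  # assumed compaction step: positive sum -> bit set
            result |= 1 << i
    return result


def similarity(a: int, b: int) -> float:
    """Fraction of sketch bits on which two sketches agree (assumed comparison)."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / SKETCH_BITS


# Two objects that share most of their terms, each term weighted equally.
doc_a = {"four": 1.0, "score": 1.0, "and": 1.0, "seven": 1.0, "years": 1.0, "ago": 1.0}
doc_b = dict(doc_a, fathers=1.0)
print(similarity(sketch(doc_a), sketch(doc_b)))
```

Because only the sign of each summed coordinate survives, every object is reduced to the same small number of bits regardless of its length, which is what makes comparing many pages cheap.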
Between completing his PhD at Stanford University in the summer of 2000 and joining Princeton in the fall of 2001, Dr. Charikar worked in the research department at Google. He has continued to work on methods for determining similarity, and one of the latest papers that he has co-authored on the subject is Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data (pdf).
Some features of the similarity comparison process:
1. Similarity sketches for pages, created by the similarity engine, can be used to reduce the number of redundant or nearly redundant documents crawled and returned in response to a user’s search query. They can also help a spidering program become more efficient by avoiding crawling sites determined to be substantial duplicates.
2. Similarity sketches can be created for pages based upon lists of the hyperlinks in those documents.
3. Similarity of snippets at the time of serving can be compared, and snippets and/or search results that exceed a similarity threshold can be excluded.
4. Under this similarity comparison, different elements can be given different weights:
In FIG. 3A, each non-zero coordinate is given an equal vector weight (i.e., it has the value one). More generally, however, different elements can be given different weights. This is illustrated in FIG. 3B, in which words that are considered “more important” are more heavily weighted in object vector 302. For example, “years” and “score,” because they are less common words, may be given a higher weighting value, while “and” is given a low weight. In this manner, the object vector can emphasize certain elements in the similarity calculation while de-emphasizing others.
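Item 3 in the list above, excluding snippets or results that exceed a similarity threshold at serving time, can be illustrated with a short Python sketch. It assumes each result already has a fixed-length bit sketch stored as an integer, that similarity is the fraction of matching bits, and that results are kept greedily in ranked order. The function names, the 0.9 default threshold, and the 64-bit default length are hypothetical choices for illustration, not values taken from the patent.

```python
def bit_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of bit positions on which two sketches agree."""
    return 1.0 - bin(a ^ b).count("1") / bits


def filter_near_duplicates(results, threshold=0.9, bits=64):
    """Keep a result only if no already-kept result is more similar than the threshold.

    results: list of (label, sketch) pairs in ranked order.
    """
    kept = []
    for label, sk in results:
        if all(bit_similarity(sk, kept_sk, bits) <= threshold for _, kept_sk in kept):
            kept.append((label, sk))
    return kept


# Toy 8-bit sketches: "b" differs from "a" in one bit, "c" differs in all eight.
ranked = [("a", 0b11110000), ("b", 0b11110001), ("c", 0b00001111)]
print(filter_near_duplicates(ranked, threshold=0.8, bits=8))
# -> [('a', 240), ('c', 15)]
```

Here "b" scores 0.875 against "a", which exceeds the 0.8 threshold, so it is dropped, while "c" shares no bits with "a" and is kept.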
This isn’t the only granted patent assigned to Google that discusses duplicate content. Detecting duplicate and near-duplicate files and Detecting query-specific duplicate documents also discuss methods to identify and filter duplicate and near-duplicate documents.
The patent on mapping, Digital mapping system (US Patent 7,158,878), is one of a number of patent filings from Google that describe many details which appear in Google Maps. The inventors listed are Jens Eilstrup Rasmussen, Lars Eilstrup Rasmussen, Bret Steven Taylor, James Christopher Norris, Stephen Ma, Andrew Robert Kirmse, Noel Phillip Gordon, and Seth Michael Laforge. It is listed as having been filed on February 5, 2005.
Various methods, systems, and apparatus for implementing aspects of a digital mapping system are disclosed. One such method includes sending a location request from a client-side computing device to a map tile server, receiving a set of map tiles in response to the location request, assembling said received map tiles into a tile grid, aligning the tile grid relative to a clipping shape, and displaying the result as a map image. One apparatus according to aspects of the present invention includes means for sending a location request from a client-side computing device to a map tile server, means for receiving a set of map tiles in response to the location request, means for assembling said received map tiles into a tile grid, means for aligning the tile grid relative to a clipping shape, and means for displaying the result as a map image. Such an apparatus may further include direction control or zoom control objects as interactive overlays on the displayed map image, and may also include route or location overlays on the map image.
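The tile-grid steps in that abstract can be illustrated with a small Python sketch that works out which tiles cover a viewport and where the assembled grid sits relative to it. This is only an illustration under my own assumptions: square 256-pixel tiles, a rectangular clipping shape, tiles addressed by (column, row) in pixel space, and a hypothetical function name. The abstract does not specify any of these, and the code does not contact a tile server.

```python
TILE_SIZE = 256  # assumed tile edge in pixels; not stated in the abstract


def tile_grid(center_x, center_y, view_w, view_h, tile_size=TILE_SIZE):
    """Work out the tiles covering a rectangular viewport and the grid's offset.

    center_x, center_y: integer pixel position of the map center in the full map image.
    view_w, view_h: size of the rectangular clipping shape in pixels.
    Returns (tiles, offset_x, offset_y): rows of (col, row) tile addresses, plus
    the position of the grid's top-left corner relative to the viewport's top-left.
    """
    left = center_x - view_w // 2
    top = center_y - view_h // 2
    first_col, first_row = left // tile_size, top // tile_size
    last_col = (left + view_w - 1) // tile_size
    last_row = (top + view_h - 1) // tile_size
    tiles = [
        [(col, row) for col in range(first_col, last_col + 1)]
        for row in range(first_row, last_row + 1)
    ]
    # Zero or negative offsets: the grid usually starts above and left of the viewport.
    offset_x = first_col * tile_size - left
    offset_y = first_row * tile_size - top
    return tiles, offset_x, offset_y


tiles, dx, dy = tile_grid(center_x=1000, center_y=600, view_w=512, view_h=256)
print(tiles)   # -> [[(2, 1), (3, 1), (4, 1)], [(2, 2), (3, 2), (4, 2)]]
print(dx, dy)  # -> -232 -216
```

In this example a 512 by 256 pixel viewport needs a grid of three by two tiles, shifted 232 pixels left and 216 pixels up so that only the part inside the clipping rectangle is shown.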
This patent provides a framework for a digital mapping system which includes searches for locations, local searches, and driving directions. Some additional patent applications from Google that focus upon the display and functionality of Google Maps (implemented or not) include:
- Generating and serving tiles in a digital mapping system
- Generating, storing, and displaying graphics using sub-pixel bitmaps
- Visually-oriented driving directions in digital mapping system
- Method and apparatus for customizing travel directions
- Secondary map in digital mapping system
- Combined map scale and measuring tool
My standard disclaimer regarding patents: Patents are filed to protect ideas and methods developed as part of the intellectual property of a company, and may be used to exclude others from using the same or similar processes. But the granting of a patent or the publication of a patent application doesn’t necessarily mean that the processes involved have been fully developed, or will be in the future. Still, the documents can provide some insight into the ideas that an organization is working on, and may act as a starting point for more research.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.