One of the key elements of how the Google search engine works is its use of anchor text, the words that appear in a link on a source page, to describe the page that the link targets.
We know this from statements about anchor text in documents such as The Anatomy of a Large-Scale Hypertextual Web Search Engine, written by Lawrence Page and Sergey Brin, and the early PageRank patents authored by Lawrence Page: Method for node ranking in a linked database and Method for scoring documents in a linked database.
A newly granted patent from Google, Anchor tag indexing in a web crawler system, may provide a more detailed look at the mechanics of how the search engine uses anchor text as a relevancy signal for the page being linked to. It also describes some other processes for using links to rank pages and for crawling websites. I’ve written a detailed breakdown of the patent at SEO by the Sea in Google Patent on Anchor Text and Different Crawling Rates.
Danny asked me if I might hit on some of the highlights of the document here.
Link Discovery and Crawling Layers
Links are at the heart of the patented process, and they are discovered in at least three different ways: direct submissions of URLs, crawling of URLs, and submissions of content containing links through syndication methods like RSS.
The crawling of URLs may be done in three separate layers, based upon factors that could involve how frequently the content at those URLs may be updated, and what PageRank or page ranking they may have:
- A base layer, in which most known URLs are divided into segments, and those segments are crawled during a specific period such as a day, in a round robin manner until all have been visited by robot programs.
- A daily layer, in which a smaller group of URLs that have a higher crawl score, crawl frequency, or both, may be visited over the same period of time that segments are crawled in the base layer.
- A real time layer, in which an even smaller group of URLs which have even higher crawl scores, crawl frequencies or both, may be visited in much shorter intervals such as minutes or hours.
The patent provides some simple formulas which define crawl scores and crawl frequencies, and also a directed approach that may favor URLs in specific categories, such as news sites and pages in specific languages or in certain file formats.
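To make the three layers a little more concrete, here is a minimal sketch of how URLs might be sorted into them. It does not reproduce the patent's formulas. The normalized `crawl_score`, the two threshold values, and the names `UrlRecord`, `assign_layer`, and `split_into_segments` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only for illustration; the patent's own
# crawl score and crawl frequency formulas are not reproduced here.
DAILY_THRESHOLD = 0.5
REAL_TIME_THRESHOLD = 0.9


@dataclass
class UrlRecord:
    url: str
    crawl_score: float  # assumed to be normalized to the range 0..1


def assign_layer(record: UrlRecord) -> str:
    """Pick a crawl layer for a URL from its (assumed) crawl score."""
    if record.crawl_score >= REAL_TIME_THRESHOLD:
        return "real_time"
    if record.crawl_score >= DAILY_THRESHOLD:
        return "daily"
    return "base"


def split_into_segments(urls, segment_count):
    """Deal base-layer URLs into segments, one URL at a time in turn.

    A scheduler could then crawl one segment per period (such as a day),
    cycling through the segments until every URL has been visited.
    """
    segments = [[] for _ in range(segment_count)]
    for index, url in enumerate(urls):
        segments[index % segment_count].append(url)
    return segments
```

In this sketch a higher score simply moves a URL into a layer that is visited more often. A real scheduler would presumably also weigh how often a page changes, as the patent suggests.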
Link Logs, Anchor Maps, Duplicates, and Annotations
When a crawling program visits a URL, it may record the links and content that it finds on the page in a link log. That log can be sent back to other programs that look at page content, at duplicate content on pages, at duplicate file structures at hosts, and at text taken both from the anchors of links and from within a certain distance of those links.
URLs that contain duplicate content may be reviewed, and one URL may be chosen as a canonical, or best, version with the possibility that the other duplicate or duplicates are then ignored.
Identifying duplicate file/linking structures at different hosts may also result in one version being identified as a version to continue being indexed, and the other or others as versions to be ignored in the future.
The patent tells us that it is possible that anchor text in links pointing to duplicate URLs may be considered as anchor text pointing to the canonical version of those URLs.
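A small sketch may help show what crediting duplicates' anchor text to a canonical URL could look like. The function name `merge_anchor_text`, the two dictionaries it takes, and the example URLs are hypothetical and are not drawn from the patent.

```python
from collections import defaultdict


def merge_anchor_text(anchors_by_url, canonical_of):
    """Credit anchor text that points at duplicates to the canonical URL.

    anchors_by_url: maps each linked-to URL to a list of anchor texts.
    canonical_of:   maps a duplicate URL to its chosen canonical URL
                    (both structures are assumptions for this sketch).
    """
    merged = defaultdict(list)
    for url, anchors in anchors_by_url.items():
        # URLs with no known canonical version stand for themselves.
        target = canonical_of.get(url, url)
        merged[target].extend(anchors)
    return dict(merged)


anchors = {
    "http://example.com/widgets": ["widgets"],
    "http://example.com/widgets?ref=1": ["blue widgets"],
}
canonical = {"http://example.com/widgets?ref=1": "http://example.com/widgets"}
merged = merge_anchor_text(anchors, canonical)
```

Here both anchor texts end up attached to the canonical URL, so links to a duplicate still count toward the version that stays in the index.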
Information about changes to pages is determined at this stage, and link maps and anchor maps are made from the link logs.
The change information may impact the frequency with which specific URLs are crawled, and together with something like PageRank, may determine which of the three layers a URL may be placed within.
The link maps may be used to determine a page ranking, such as PageRank, for documents at the different URLs.
The anchor maps may be used to associate anchor text and additional “annotation” information with the URLs that they point at, and that text and annotation information may be used in conjunction with other information to determine relevancy of a page to different words and phrases.
Here’s an example from the patent that I paraphrased in my post:
For example, a link pointing to a picture of Mount Everest might read “to see a picture of Mount Everest click here.” The anchor text might be the “click here” but the additional text “to see a picture of Mount Everest” could be included in the link record.
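The Mount Everest example can be expressed as a simple data structure. This is only a sketch: the `LinkRecord` fields, the `build_anchor_map` function, and the example URLs are names I've made up for illustration, and the patent's actual record layout may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LinkRecord:
    source_url: str
    target_url: str
    anchor_text: str  # the words inside the link itself
    annotation: str   # nearby text recorded along with the link


def build_anchor_map(link_log):
    """Group anchor text and annotations by the URL they point at."""
    anchor_map = defaultdict(list)
    for record in link_log:
        anchor_map[record.target_url].append(
            (record.anchor_text, record.annotation)
        )
    return dict(anchor_map)


link_log = [
    LinkRecord(
        source_url="http://example.com/travel",
        target_url="http://example.com/everest.jpg",
        anchor_text="click here",
        annotation="to see a picture of Mount Everest",
    )
]
anchor_map = build_anchor_map(link_log)
```

The anchor map is keyed by the target URL, so the text "to see a picture of Mount Everest" becomes information about the picture rather than about the page that links to it.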
Robots and Temporary and Permanent Redirects
A robot crawling through links found at URLs might come across redirected links, and the patent tells us that temporary (302) and permanent (301) redirects are treated differently.
Temporary redirects are identified and recorded, and are then followed by the robot.
Permanent redirects are also identified and recorded, but instead of being followed by a robot, information about them is sent back to a scheduling program that may crawl the URL being redirected to at another time.
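The difference between the two kinds of redirect can be sketched in a few lines. The function `handle_redirect`, the `redirect_log` list, and the `schedule_queue` list are hypothetical stand-ins for whatever the robot and scheduler actually use.

```python
def handle_redirect(url, status, location, redirect_log, schedule_queue):
    """Decide what a robot does with a redirect response.

    Returns the URL to fetch right away, or None if nothing should be
    followed immediately. All names here are assumptions for this sketch.
    """
    if status == 302:
        # Temporary redirect: record it, then follow it now.
        redirect_log.append((url, location, "temporary"))
        return location
    if status == 301:
        # Permanent redirect: record it, but hand the target to the
        # scheduler so it can be crawled at another time.
        redirect_log.append((url, location, "permanent"))
        schedule_queue.append(location)
        return None
    # Not a redirect this sketch knows about.
    return None


redirect_log, schedule_queue = [], []
followed = handle_redirect(
    "http://example.com/a", 302, "http://example.com/b",
    redirect_log, schedule_queue,
)
deferred = handle_redirect(
    "http://example.com/c", 301, "http://example.com/d",
    redirect_log, schedule_queue,
)
```

After these two calls, the temporary redirect's target is returned for immediate fetching, while the permanent redirect's target waits in the scheduler's queue.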
It’s important to note that this patent was written to protect Google’s intellectual property in the processes described. It may not describe the processes that Google has actually implemented, or it may describe only some of the processes being used. The patent is also 4 years old at this point, and there’s a possibility that Google may be doing some things very differently now.
But the processes described do seem to correspond well with many observations about the behavior of Google’s crawling processes and about the use of anchor text as a relevancy signal that helps determine how relevant the linked-to pages are for certain queries.
Opinions expressed in the article are those of the guest author and not necessarily those of Search Engine Land.