Supplemental Results and Google’s Extended Databases
Until very recently, you might have seen a label next to a search result in Google that indicated it was a “supplemental” result. A couple of patents from Google, one of which was granted this week and one from earlier this year, discuss how a search query might return results from an extended database that sound a lot like supplemental results.
The Official Google Webmaster Central Blog announced that they would stop labeling their supplemental results in a post from July 31st, titled Supplemental goes mainstream. The authors, Prashanth Koppula and Matt Cutts, tell us that the system for crawling and indexing supplemental results has been improved, and those results are fresher and more comprehensive than ever.
Danny wrote a detailed post on the same day – Google Dumps The Supplemental Results Label
The patents may provide some insight into how a supplemental or extended index works, and how partitions are used to speed up a search of extended results.
These patents don’t use the word “supplemental” but it is possible that they describe the way supplemental results work, or worked. In the Webmaster Central post, we are told that supplemental results were introduced in 2003. These patents were also originally filed in 2003.
What I found interesting about them is that they provide a view of how indexing could work in a search engine, and answer some questions such as: (1) when extended database results are triggered, (2) how search result numbers are estimated, and (3) why you sometimes see a link at the bottom of results telling you that there are more results that aren’t being shown, which you can see if you click that link.
Some information about the patents:
System and method for selectively searching partitions of a database
Invented by Kourosh Gharachorloo, Fay Wen Chang, Deborah Anne Wallach, Sanjay Ghemawat, and Jeffrey Dean
Assigned to Google
US Patent 7,254,580
Granted August 7, 2007
Filed: September 30, 2003
When a search query is received, a plurality of partition indexes are searched using the set of search terms in the search query. Each partition index corresponds to a partition of a document index. The search of each respective partition index identifies a subset of a plurality of document index sub-partitions corresponding to the respective partition index. Next, the search query is executed by only those document index sub-partitions identified by the subsets, thereby identifying documents that satisfy the search query. By using the partition index to reduce the number of document index sub-partitions searched while executing a search query, the execution of the search query is made more efficient.
System and method for searching an extended database
Invented by Kourosh Gharachorloo, Fay Wen Chang, Deborah Anne Wallach, Sanjay Ghemawat, and Jeffrey Dean
Assigned to Google
US Patent 7,174,346
Granted February 6, 2007
Filed September 30, 2003
Once a search query is received from a user, a standard index is searched based on the search query. The standard index forms part of a set of replicated standard indexes having multiple instances of the standard index. A signal is then determined based on the search of the standard index. When the received signal meets predefined criteria, an extended index is searched. The extended index forms part of a set of extended indexes having at least one instance of the extended index. There are fewer instances of the extended index than instances of the standard index. Extended search results are then obtained from the extended index and at least a portion of the extended search results is transmitted towards a user.
What follows is a walkthrough of the process of returning search results from the standard index and, when necessary, the extended index. Some of the alternative approaches mentioned in the patents aren’t covered or discussed in detail. The processes described may be very different from the reality, but hopefully this view of an extended index will provide you with some insights into how documents could be indexed by a search engine, and give you a slightly different perspective on the process of returning results in response to a query.
Searching the cache and standard index
A searcher submits a query to the search engine, and the query is received at one of a number of datacenters and sent to one of the query servers at the datacenter.
The query server receives the query and sends it to a mixer. The mixer transmits the query to the cache, to search the cache for results. The mixer might first normalize and hash the search request.
A hash value representing the query is received by the cache, and the cache is searched.
If a match for the hash value is found, those results would be sent back to the mixer. Results might be a list of located documents, with or without snippets, or an indication that there were no results in the cache.
The mixer or query server receives that response and determines whether results were located. If there are results, and snippets weren’t returned with them, they may be requested from the cache, and if they aren’t in the cache, they might be requested from the standard document server.
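The cache lookup described above can be sketched in a few lines. This is only an illustration: the patents don't say how a query is normalized or which hash function is used, so the lowercasing, whitespace collapsing, SHA-256 digest, and the `ResultCache` class are all assumptions made for the example.

```python
import hashlib


def normalize_query(query: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(query.lower().split())


def query_hash(query: str) -> str:
    # Assumed hash function; the patents do not name one.
    return hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()


class ResultCache:
    """Hypothetical in-memory cache keyed by the hash of the normalized query."""

    def __init__(self):
        self._entries = {}

    def get(self, query):
        # Returns the cached results, or None when there is no match.
        return self._entries.get(query_hash(query))

    def put(self, query, results):
        self._entries[query_hash(query)] = results


cache = ResultCache()
cache.put("Supplemental  Results", ["doc-17", "doc-42"])

hit = cache.get("supplemental results")   # same query after normalization
miss = cache.get("extended index")        # never cached
```

Because the key is built from the normalized query, differences in case or spacing still lead to the same cache entry.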
If no results were located, then the query is sent to the standard index servers. The search request could first be transmitted to multiple standard balancers (one within each partition) that transmit the search onward to the standard index servers.
Each balancer transmits the search request to a set of standard index servers.
Each standard index server stores and searches one or more partitions of the standard index to produce a set of search results. Each balancer may send the search query to between ten and one hundred standard index servers, and each standard index server is set up to store and search multiple (e.g., two to ten) index sub-partitions.
When the query is received by the standard index servers, their index sub-partitions are searched, and the results are sent back to the mixer. Those results could be a list of located documents or an indication that no results were found.
The mixer receives a response, and if no search results were located, notifies the searcher that there were none.
If search results were located, snippets might be requested from the standard document servers, or the results might be sent to the query server, which might request the snippets.
The standard document servers receive that request for snippets, generate them from the documents identified in the search results, and send the snippets back to the mixer.
The mixer then sends the results and snippets to the cache, where they are saved in memory for future searches for that query.
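The fan-out from mixer to balancers to index servers might look something like the sketch below. It is a toy model under stated assumptions: the `IndexServer` and `Balancer` classes, the dictionary-based postings, and the implicit AND across terms are all invented for illustration, and real servers would run in parallel across machines rather than in a loop.

```python
from dataclasses import dataclass


@dataclass
class IndexServer:
    # One dict per sub-partition held by this server: term -> set of doc IDs.
    sub_partitions: list

    def search(self, terms):
        hits = set()
        for postings in self.sub_partitions:
            docs = None
            for term in terms:
                found = postings.get(term, set())
                # Assumed default: every term must match (implicit AND).
                docs = found if docs is None else docs & found
            hits |= docs or set()
        return hits


@dataclass
class Balancer:
    servers: list

    def search(self, terms):
        # Forward the query to every index server behind this balancer.
        hits = set()
        for server in self.servers:
            hits |= server.search(terms)
        return hits


def mixer_search(balancers, terms):
    # The mixer gathers the results returned through each balancer.
    hits = set()
    for balancer in balancers:
        hits |= balancer.search(terms)
    return sorted(hits)


server_a = IndexServer([{"google": {1, 2}, "patent": {2}}, {"google": {3}}])
server_b = IndexServer([{"google": {4, 5}, "patent": {4}}])
balancers = [Balancer([server_a]), Balancer([server_b])]

results = mixer_search(balancers, ["google", "patent"])
no_results = mixer_search(balancers, ["supplemental"])
```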
At this point, a decision needs to be made as to whether more results are needed.
Signals indicating whether or not a search of the extended index should be conducted
- Number of results – for instance, if there are fewer than ten results (and that is the signal threshold value)
- Whether the amortized cost of performing the extended search is small, comparing the cost of performing the search to the quality of search results
- Whether the user appears unsatisfied with the standard results returned from the standard index server, judged by something like the user repeatedly selecting a “next set of results” button
- When the query scores (frequency and PageRank) of the results are low on average
- If the load on the extended index servers is low
- If for a given query the cost is low (different queries have different costs), or
- Any combination of these signals.
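One way to picture how a few of these signals could be combined is the sketch below. Only the threshold of ten results comes from the example above; the other thresholds, the field names, and the particular way the signals are combined are assumptions chosen for illustration, not something the patents specify.

```python
from dataclasses import dataclass


@dataclass
class StandardSearchSignals:
    result_count: int            # results found in the standard index
    mean_query_score: float      # assumed to be scaled to 0.0 .. 1.0
    next_page_clicks: int        # times the user asked for more results
    extended_server_load: float  # assumed to be scaled to 0.0 .. 1.0


def should_search_extended(signals, min_results=10, min_score=0.3,
                           max_next_clicks=3, max_load=0.8):
    # Assumed policy: never add work to heavily loaded extended servers.
    if signals.extended_server_load > max_load:
        return False
    # Otherwise any single "not enough" signal triggers the extended search.
    return (signals.result_count < min_results
            or signals.mean_query_score < min_score
            or signals.next_page_clicks >= max_next_clicks)


few_results = StandardSearchSignals(4, 0.9, 0, 0.2)
plenty = StandardSearchSignals(50, 0.9, 0, 0.2)
busy_servers = StandardSearchSignals(4, 0.9, 0, 0.95)
low_scores = StandardSearchSignals(50, 0.1, 0, 0.2)
```

Here `should_search_extended(few_results)` is true because there are fewer than ten results, while `should_search_extended(busy_servers)` is false because the assumed load limit is exceeded.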
How estimates of the number of results might be calculated
An estimate might be calculated on the fly, while the search is being performed, based upon how frequently results are being obtained from the standard index servers. For example, the estimate might be based upon a search of a small percentage of the full index – less than 10 percent, and perhaps even less than 2 percent.
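A simple way to read this is as an extrapolation: count the hits found in the part of the index searched so far, then scale up by the fraction searched. The function below is a sketch of that idea only; the patents don't give a formula, and the name `estimate_total_results` and the sample numbers are made up.

```python
def estimate_total_results(hits_in_sample: int, fraction_searched: float) -> int:
    # Extrapolate from the sampled portion of the index to the whole index.
    if not 0.0 < fraction_searched <= 1.0:
        raise ValueError("fraction_searched must be in (0, 1]")
    return round(hits_in_sample / fraction_searched)


# Hypothetical numbers: 240 hits after searching 2 percent of the index.
estimate = estimate_total_results(240, 0.02)
```

An estimate built this way would explain why the number of results reported for a query is approximate rather than exact.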
Queries sent to the extended server
If the search of the cache and standard index returned enough results, as measured by the threshold values for the signals listed above, those results are sent to the searcher.
If not, then the query is sent to an extended mixer, and an extended cache is checked for results. If enough results are received there, those are sent to the extended mixer, and extended search results, with associated snippets, are transmitted to the mixer from the extended mixer.
The mixer would take those results and aggregate them with any standard search results, if there were any. Those would be sent to the query server, and then the searcher.
But, imagine that there weren’t any extended search results located in the extended cache. The search request may be sent to the extended index servers.
Filtering at the extended server
As with the standard index, there are multiple extended balancers that transmit the search onward to the extended index servers.
Balancer procedures in the extended balancer use a balancer filter to perform a lookup operation for each term in the received search query to locate corresponding information in the partition index.
The balancer filter uses the information in the partition index to produce, for each term in the search query, a map of the extended document index sub-partitions.
The map can be encoded a few different ways, including as a bit map. The map would contain a bit for each sub-partition of the extended index partition serviced by the extended balancer, with a first value of the bit indicating that the term is found in at least one document in the corresponding sub-partition of the extended index, and a second value of the bit indicating that the term is not found in any document in the corresponding sub-partition of the extended index.
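As a rough illustration of the bit map encoding, the sketch below stores one bit per sub-partition in a Python integer. The function name and the use of plain sets of terms to stand in for the partition index are assumptions made for the example.

```python
def build_term_bitmap(term, sub_partition_terms):
    # Bit i is 1 when the term appears in at least one document
    # indexed by sub-partition i, and 0 when it appears in none.
    bitmap = 0
    for i, terms in enumerate(sub_partition_terms):
        if term in terms:
            bitmap |= 1 << i
    return bitmap


# Hypothetical partition index: the terms present in each of 4 sub-partitions.
sub_partition_terms = [
    {"google", "patent"},   # sub-partition 0
    {"index"},              # sub-partition 1
    {"patent", "index"},    # sub-partition 2
    {"google"},             # sub-partition 3
]

patent_map = build_term_bitmap("patent", sub_partition_terms)
google_map = build_term_bitmap("google", sub_partition_terms)
missing_map = build_term_bitmap("supplemental", sub_partition_terms)
```

A map of zero, as for the last term here, means no sub-partition needs to be searched for that term at all.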
Using combined bit maps
Each term has a bit map made for it. A combined map is made from the bit maps for each term in the query, using Boolean logic matching what was used in the query itself.
In Google, by default, if you don’t use a Boolean operator for your search, the search engine will attempt to perform a search using “AND” for all of the terms (or at least all of the non-stopwords). But, you could use the “OR” operator in your search query, or place a minus sign in front of a term, indicating a “NOT” for that term. The way the bit maps for each term would combine would be based upon your use of those Boolean operators.
This combined map would indicate the document index sub-partitions that may index one or more documents that satisfy the search query, and which document index sub-partitions don’t.
The query is sent only to the extended document index sub-partitions indicated by the combined map as potentially indexing documents satisfying the search query.
By limiting, or filtering, the extended search to only those sub-partitions that may contain the searched terms, there is a significant reduction in the number of sub-partitions that need to be searched. This makes extended searches more efficient and faster.
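Combining the per-term maps comes down to bitwise operations. The sketch below covers “AND” and “OR” queries, with per-term maps held as integers (one bit per sub-partition); the helper names and the sample maps are assumptions. It leaves “NOT” terms out on purpose: a sub-partition that contains an excluded term in some documents may still hold other documents that satisfy the query, so this sketch assumes such terms can't be used to rule a sub-partition out.

```python
def combine_and(bitmaps, num_sub_partitions):
    # Every term must appear somewhere in the sub-partition.
    combined = (1 << num_sub_partitions) - 1  # start with every bit set
    for bitmap in bitmaps:
        combined &= bitmap
    return combined


def combine_or(bitmaps):
    # Any one of the terms appearing is enough.
    combined = 0
    for bitmap in bitmaps:
        combined |= bitmap
    return combined


def selected_sub_partitions(combined, num_sub_partitions):
    # List the sub-partitions whose bit is set in the combined map.
    return [i for i in range(num_sub_partitions) if (combined >> i) & 1]


NUM_SUB_PARTITIONS = 4
google_map = 0b1001   # hypothetical: term found in sub-partitions 0 and 3
patent_map = 0b0101   # hypothetical: term found in sub-partitions 0 and 2

and_targets = selected_sub_partitions(
    combine_and([google_map, patent_map], NUM_SUB_PARTITIONS), NUM_SUB_PARTITIONS)
or_targets = selected_sub_partitions(
    combine_or([google_map, patent_map]), NUM_SUB_PARTITIONS)
```

With these sample maps, an “AND” query only needs to be sent to one of the four sub-partitions, while an “OR” query goes to three. A set bit only says that a match is possible, so the sub-partition still has to run the actual search.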
The maps produced could even be based upon sub-sub-partitions of an extended document index partition, instead of sub-partitions. There are fewer documents in the sub-sub-partitions.
A sub-partition might index the terms in approximately a half million documents. If each of those index sub-partitions is divided into 128 sub-sub-partitions, each sub-sub-partition will index about 4,000 documents.
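The arithmetic behind that figure, along with the size of a one-bit-per-sub-sub-partition map for a single term, can be checked quickly. The variable names are only for illustration, and the bit map size assumes one bit per sub-sub-partition, as in the bit map encoding described earlier.

```python
DOCS_PER_SUB_PARTITION = 500_000   # "approximately a half million documents"
SUB_SUB_PARTITIONS = 128

docs_per_sub_sub_partition = DOCS_PER_SUB_PARTITION / SUB_SUB_PARTITIONS

# Assumed encoding: one bit per sub-sub-partition for each term.
bitmap_bytes_per_term = SUB_SUB_PARTITIONS // 8

print(docs_per_sub_sub_partition)  # 3906.25, or roughly 4,000
print(bitmap_bytes_per_term)       # 16
```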
Returning results from the extended index
If results are found and sent to the extended mixer, snippets are requested from the documents associated with the query terms, and the results and snippets may be saved for future searches in the extended cache, and then sent to the standard mixer.
If no extended search results are found, the extended mixer informs the standard mixer of the lack of results.
If there are extended results, the mixer takes those (with snippets) and aggregates them with standard search results from the cache or standard index server. Aggregated search results are then sent to the query server, and then to the searcher.
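The aggregation step might be sketched like this. How results from the two indexes are ordered or deduplicated isn't covered here, so placing standard results first and dropping repeated documents are assumptions, as are the `doc_id` and `snippet` field names.

```python
def aggregate_results(standard, extended):
    # Assumed policy: standard results first, then extended results,
    # skipping any document that has already been listed.
    seen = set()
    merged = []
    for result in [*standard, *extended]:
        if result["doc_id"] not in seen:
            seen.add(result["doc_id"])
            merged.append(result)
    return merged


standard = [{"doc_id": "s1", "snippet": "standard result"}]
extended = [
    {"doc_id": "s1", "snippet": "duplicate of a standard result"},
    {"doc_id": "x9", "snippet": "obscure document"},
]

merged = aggregate_results(standard, extended)
```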
An alternative method for performing an extended search
Similar to the above method, after results have been found in the extended database, the extended mixer determines how many extended search results there are.
The number of extended search results is sent to the mixer, which has already received standard search results and snippets.
The standard results and the number of extended results are sent to the query server, and those are shown to the searcher.
The standard search results and snippets are presented to the searcher, along with a link stating the number of extended results that can be viewed by selecting it. That link may be provided without showing the number of extended search results, or before the extended results have even been obtained.
If the searcher selects the link, the search is repeated, but with the extended search results shown to the searcher, providing the user with the standard results, and with results from more uncommon or obscure documents.
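A sketch of how a results page might carry that link follows. The function name, the dictionary layout, and the link wording are all invented for illustration. The only behavior taken from the description above is that the link can appear with or without a count.

```python
def build_results_page(standard_results, extended_count=None):
    if extended_count is None:
        # Count unknown, or extended results not obtained yet.
        link_text = "Show additional results"
    elif extended_count > 0:
        link_text = f"Show {extended_count} additional results"
    else:
        # Assumed: no link when the extended search found nothing.
        link_text = None
    return {"results": standard_results, "extended_link": link_text}


page_with_count = build_results_page(["doc-1", "doc-2"], extended_count=37)
page_without_count = build_results_page(["doc-1", "doc-2"])
page_no_link = build_results_page(["doc-1"], extended_count=0)
```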
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.