Google’s OneBox Patent Application
One of my favorite search-related articles is one that Danny wrote a few years back titled Searching With Invisible Tabs. It stands out because it describes one of the major difficulties involving how search engines work – making a user interface as simple as possible, while still somehow providing information that can meet a wide range of the intentions behind a search.
Danny also introduced us to the one box results name used internally by Google for what he described as “invisible tab promotion of some of its specialty content.” Are these inserted vertical search results the way to serve invisible tab results to searchers?
OneBox results have been the topic of sessions during Search Engine Strategies conferences under the name Vertical Creep Into Regular Search Results, which provided a chance for conference attendees to talk about these more narrowly defined types of searches appearing above organic results in Web searches at Google. During the Q & A part of one of these sessions that I attended, someone asked, “How does Google determine whether or not to show OneBox results?” That may have been the only question left unanswered.
Earlier this month, Google published a patent application that may provide a little insight into how and why different OneBox results are shown.
What Google has Told Us About OneBox Results
Before describing the patent application, I want to briefly explore some of what we’ve learned about these additional results directly from Google.
The Google Help Center Search Results Page describes OneBox results:
Google’s search technology finds many sources of specialized information. Those that are most relevant to your search are included at the top of your search results. Typical onebox results include news, stock quotes, weather and local websites related to your search.
A tour of OneBox features for both Web Search and Enterprise search appears on the Google OneBox for Enterprise page (see the link labeled “Tour of OneBox features”).
Brian Smith recently interviewed Google Product Marketing Director Debbie Jaffe about these listings in A Closer Look at Google OneBox Results.
The OneBox Patent Application
Many patent filings include a “Description of Related Art” section where they often define a reason for the creation of their invention. This one tells us that:
Some search engine systems can provide various types of information as the search results. For example, a search engine system might be capable of providing search results relating to web pages, news articles, images, merchant products, usenet pages, yellow page entries, scanned books, and/or other types of information. Typically, a search engine system provides separate interfaces to these different types of information.
When a user provides a search query to a standard search engine system, the user is typically provided with links to web pages. If the user desires another type of information (e.g., images or news articles), the user typically needs to access a separate interface provided by the search engine system.
While Google shows tabs that searchers can select to view results for other kinds of information repositories, it’s not unusual for people to ignore those, or as Danny writes in his article on invisible tabs, to suffer from “tab blindness.” The OneBox is a solution to that problem. But how does Google know when to show which types of results?
Determination of a desired repository
Invented by Michael Angelo, David Braginsky, Jeremy Ginsberg, and Simon Tong
US Patent Application 20070005568
Published January 4, 2007
Filed: June 29, 2005
A system receives a search query from a user and searches a group of repositories, based on the search query, to identify, for each of the repositories, a set of search results. The system also identifies one of the repositories based on a likelihood that the user desires information from the identified repository and presents the set of search results associated with the identified repository.
A Mix of Possible OneBox Determination Methods
The patent application lists at least seven variations of a process that might determine whether OneBox results appear for a search, and which type of results appear within the OneBox, but they are mostly subtle variations of each other. All of them involve looking closely at the query used, a likelihood that the searcher is looking for information from a number of different data repositories, somehow scoring results from those repositories, and serving results from one or more of them.
One variation describes a process in which log data is collected about searchers and searches of repositories. The log data is represented as triples of data (u, q, r), with u being information about the searcher, q being information about the query, and r being information about the repository from which search results were provided. Labels for each of the triples of data (u, q, r) are created, where the label includes information about whether the user u desired information from the repository r when the user provided the search query q. Instructions are created to train a model based on the triples of data (u, q, r) and their associated labels, to predict whether a particular user desires information from certain repositories when providing a particular search query.
These triples of log data are referred to as “instances,” and the system that uses them may include millions of instances.
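To make the idea of a labeled instance concrete, here is a minimal sketch in Python. The field names, the sample log rows, and the rule that a click counts as a positive label are all my own assumptions for illustration; the patent application may define labels differently.

```python
from dataclasses import dataclass

# All names here are hypothetical; they are not taken from the patent filing.
@dataclass(frozen=True)
class Instance:
    user: dict        # u: information about the searcher
    query: str        # q: the search query
    repository: str   # r: the repository that results were served from
    label: bool       # True if the user appeared to want results from r


def label_from_click(clicked: bool) -> bool:
    # Assumption: a click on a repository's results is treated as evidence
    # that the searcher wanted information from that repository.
    return clicked


# Two made-up log entries for the same user and query.
log = [
    Instance({"country": "US"}, "lions", "images", label_from_click(True)),
    Instance({"country": "US"}, "lions", "news", label_from_click(False)),
]

# Repositories this user seemed to want for the query "lions".
positives = [inst.repository for inst in log if inst.label]
print(positives)
```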
Hundreds of thousands of distinct features may be included for any given (u, q, r), for example:
- The country in which user u is located,
- The language of the country in which user u is located,
- A cookie identifier associated with user u,
- The language of query q,
- Each term in query q,
- The time of day user u provided query q,
- The documents from repository r that were presented to user u,
- Each of the terms in the documents from repository r that were presented to user u,
- Each of the terms in the titles of the documents from repository r that were presented to user u,
- The fraction of queries that were provided to the interface for repository r,
- The fraction of queries that were provided to the interface for repository r versus the interfaces for other repositories,
- The fraction of queries that contain a term in query q that were provided to the interface for repository r versus the interfaces for other repositories,
- The overall click rate for queries provided to the interface for repository r,
- The click rate for queries provided to the interface for repository r for user u,
- The click rate for queries provided to the interface of repository r for users in the same country as user u,
- The click rate for query q provided to the interface of repository r,
- The click rate of query q provided to the interface of repository r for user u, and,
- The fraction of queries q that were provided to the interface of repository r for user u.
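A few of the features listed above could be computed from log rows along the following lines. This is only a sketch: the row layout, the function names, and the feature names are hypothetical, and a real system would presumably work from far larger logs.

```python
from collections import Counter

# Hypothetical log rows: (query, repository interface used, clicked?)
rows = [
    ("lions", "images", True),
    ("lions", "images", False),
    ("lions", "news", True),
    ("weather boston", "web", True),
]


def query_fraction(rows, repository):
    """Fraction of all logged queries provided to this repository's interface."""
    counts = Counter(repo for _, repo, _ in rows)
    return counts[repository] / len(rows)


def click_rate(rows, repository, query=None):
    """Click rate for a repository, optionally restricted to a single query."""
    matching = [clicked for q, repo, clicked in rows
                if repo == repository and (query is None or q == query)]
    return sum(matching) / len(matching) if matching else 0.0


def features(user, query, repository, rows):
    """Build a sparse feature dictionary for one (u, q, r) triple."""
    feats = {
        f"country={user['country']}": 1.0,
        "repo_query_fraction": query_fraction(rows, repository),
        "repo_click_rate": click_rate(rows, repository),
        "repo_query_click_rate": click_rate(rows, repository, query),
    }
    # One indicator feature per term in the query.
    for term in query.split():
        feats[f"term={term}"] = 1.0
    return feats


example = features({"country": "US"}, "lions", "images", rows)
```

Storing features in a dictionary keyed by name keeps the representation sparse, which matters if there really are hundreds of thousands of possible features and only a handful apply to any one instance.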
A model might be created based on this data, which could be used to predict, given a new (u, q, r), whether a searcher wants information from a specific repository when they provide a certain query. That model might then be used to decide whether or not to search a specific repository and present results from it on a search results page.
The patent filing lists a number of different types of repositories of documents, such as:
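As a rough illustration of what training and using such a model could look like, the sketch below fits a tiny logistic regression on labeled feature dictionaries. The choice of logistic regression is my own assumption, as are the feature names and the training data; nothing here is meant to represent the model Google actually uses.

```python
import math


def predict(weights, feats):
    """Estimated probability that the searcher wants this repository."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.5):
    """Plain stochastic gradient ascent on the logistic log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for name, value in feats.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# Made-up training data: each feature pairs a query term with a repository,
# and the label says whether the searcher wanted that repository.
examples = [
    ({"term=lions&repo=images": 1.0}, 1),
    ({"term=lions&repo=news": 1.0}, 0),
    ({"term=election&repo=news": 1.0}, 1),
    ({"term=election&repo=images": 1.0}, 0),
]

weights = train(examples)
p_images = predict(weights, {"term=lions&repo=images": 1.0})
p_news = predict(weights, {"term=lions&repo=news": 1.0})
```

The features combine the term and the repository on purpose. With separate term and repository features, a linear model could not learn that “lions” favors images while “election” favors news.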
- A web page repository,
- A news repository,
- An image repository,
- A products repository,
- A usenet repository,
- A yellow pages repository,
- A scanned books repository, and/or
- Other types of repositories.
A High Level Overview
1. A query is received from a searcher.
2. Information about the searcher may be collected, such as an IP address, cookie information, language preferences, and/or geographical information.
3. A search might be performed on each of the repositories based on the query, and sets of search results could be obtained for each.
4. Decisions would then be made as to which results would be presented to that searcher. This would be based upon information about the searcher, the search query used, and input from each of the repositories. There are at least three alternative approaches to returning results from more than one repository:
a) The results from the two highest scoring repositories would be presented.
b) Results from one repository may always be presented, and one or more of the highest scoring of the others would be shown.
c) Only results with scores above a certain threshold would be shown, and if there are none above that threshold, then the highest scoring result would be returned.
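The three alternatives above could be sketched as follows, treating each repository as having a single score. The function names, the sample scores, and the choice of “web” as the repository that is always shown are assumptions made for illustration.

```python
def top_two(scores):
    """(a) Present the two highest scoring repositories."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:2]


def always_plus_best(scores, always="web", extra=1):
    """(b) Always present one repository, plus the best of the others."""
    others = sorted((r for r in scores if r != always),
                    key=scores.get, reverse=True)
    return [always] + others[:extra]


def above_threshold(scores, threshold):
    """(c) Present everything above the threshold, or else the single best."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    passing = [r for r in ranked if scores[r] > threshold]
    return passing or ranked[:1]


# Hypothetical model scores for one query.
scores = {"web": 0.9, "news": 0.2, "images": 0.6, "products": 0.1}

print(top_two(scores))                # ['web', 'images']
print(always_plus_best(scores))       # ['web', 'images']
print(above_threshold(scores, 0.5))   # ['web', 'images']
print(above_threshold(scores, 0.95))  # ['web']
```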
The scores, and whether or not they are above a certain threshold, may determine the order or manner in which results are presented to a searcher. So, results from a repository that is shown but does not score above a threshold may appear at the bottom of the results, or may be shown only as a link to more results of that type instead of appearing on the initial results page.
The model may also contain an “exploration” policy that lets it gather information on different repositories. So, it might provide search results from a lower scoring repository (e.g., presenting news documents rather than images) to a small fraction of users at random, or show documents from a repository in proportion to the score (e.g., if the score for images is twice the score for news articles, then images may be presented twice as often as news articles).
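Both exploration approaches are easy to sketch. In the code below, the function names, the 5% exploration rate, and the scores are my own assumptions; the patent filing only describes the general ideas of showing a lower scoring repository to a small random fraction of users, or showing repositories in proportion to their scores.

```python
import random


def choose_with_exploration(scores, rng, epsilon=0.05):
    """Usually show the best repository; occasionally show another at random."""
    best = max(scores, key=scores.get)
    if rng.random() < epsilon:
        others = [r for r in scores if r != best]
        if others:
            return rng.choice(others)
    return best


def choose_proportionally(scores, rng):
    """Show each repository with probability proportional to its score."""
    repos = list(scores)
    return rng.choices(repos, weights=[scores[r] for r in repos], k=1)[0]


# Hypothetical scores in which images score twice as high as news.
scores = {"images": 2.0, "news": 1.0}
rng = random.Random(0)  # seeded so that repeated runs behave the same way

draws = [choose_proportionally(scores, rng) for _ in range(30000)]
ratio = draws.count("images") / draws.count("news")
print(round(ratio, 2))  # should land close to 2
```

The random choices are what let the system collect click data on repositories it would otherwise rarely show, at the cost of sometimes presenting a less likely result.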
If I read this patent filing correctly, user data about queries in the different vertical searches may influence which documents or objects appear in OneBox results. So, if a lot of people go to Google Image Search and look for pictures of “lions”, then OneBox results may show images of lions. If suddenly, a lot of people are looking for “lions” on Google News searches, then we might also see news results in the OneBox area, instead of the images or in addition to them.
If that’s correct, then a OneBox approach to invisible tabs means that we will still see tabs for some types of searches because individual searches in the different repositories influence which results are returned in the OneBox.
As a patent application, the methods described may or may not reflect accurately how OneBox results are chosen, but the document provides some insights from Google on considerations that may be taken into account in the decisions to provide those results. It is interesting to see how large a role user behavior could have in those decisions.