Mar 29, 2007 at 2:06am ET by Bill Slawski
A new paper from Microsoft Research, Web Object Retrieval (pdf), describes an approach to Web indexing that shifts the focus from indexing whole pages to indexing the objects found on those pages.
OK, so what does that mean? It’s easiest to show you first, rather than tell you…
Microsoft Product Search
Take a look at Microsoft’s Products Search (http://products.live.com/). Brian Smith went into a lot of detail on Microsoft’s product search last May in eCommerce, Microsoft Style. Microsoft’s Live Product Search allows people to upload product information into its database, but it also crawls the Web and extracts information about products.
Libra Academic Search
Another example of indexing on the object level from Microsoft Research Asia, Libra Academic Search, is a computer science bibliography search engine. The page “About the academic search” includes links to a number of papers on object-level retrieval, including an earlier technical report version of the Web Object Retrieval paper.
More than Products and Papers
The product search and the paper search are narrow vertical searches that crawl Web pages and find information that fits within those areas. The academic paper search tries to find not only the names of papers, but also authors, conferences, journals, and research communities. The Web Object Retrieval paper focuses upon extracting that information from pages. The goal of the research extends beyond products and papers. As the authors tell us:
We believe object-level Web search is particularly necessary in building vertical Web search engines such as product search, people search, scientific Web search, job search, community search, and so on.
Incorporation of Object Indexing into Live Search
The product search and the academic paper search are useful, but how well would they do as part of the Web search that Microsoft offers? According to a news article from Microsoft, Search Objective Gets a Refined Approach, those searches have already been integrated into Windows Live:
The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.
The Object-Level Vertical Search Process
The news article also describes the process of extracting and indexing objects in a nice summary:
The first three steps are:
In other words, after finding the information, and understanding that it relates to a specific category, they are putting it into a structured format so that, for instance, products can be compared to one another. There’s more to the process, though:
As they note in the article, this method could be used for job searches, for restaurant searches, and even for blog searches.
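To make the "structured format" idea more concrete, here is a minimal Python sketch of how pieces of information about the same product, found on different pages, might be gathered into one object so that products can be compared. The sample records, the field names, and the `integrate` helper are all hypothetical and made up for illustration; nothing here is taken from Microsoft's actual pipeline.

```python
from collections import defaultdict

# Hypothetical records, as if already extracted from three different pages.
# "key" stands in for whatever identifies the same real-world product.
records = [
    {"source": "shop-a.example/p1", "key": "acme x100",
     "attrs": {"name": "Acme X100", "price": "199.00"}},
    {"source": "shop-b.example/item9", "key": "acme x100",
     "attrs": {"price": "189.50", "zoom": "3x"}},
    {"source": "reviews.example/x200", "key": "acme x200",
     "attrs": {"name": "Acme X200", "zoom": "5x"}},
]


def integrate(records):
    """Group records by object key, keeping every value seen per attribute
    together with the page it came from."""
    objects = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for attr, value in rec["attrs"].items():
            objects[rec["key"]][attr].append((value, rec["source"]))
    # Convert the nested defaultdicts into plain dicts for easier reading.
    return {key: dict(attrs) for key, attrs in objects.items()}


objects = integrate(records)
prices_x100 = [value for value, _source in objects["acme x100"]["price"]]
```

The point of the sketch is only that one object ends up with attributes collected from several pages, which is what lets a vertical search engine line products up side by side.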
Ranking Objects by Link Analysis, or PopRank
The last item in the list above talks about ranking objects, and discusses two different parts to that ranking. One is a matter of relevance. The other is a query-independent ranking, which they refer to as PopRank. They state that ranking objects may be especially difficult because the object descriptions may come from more than one Web page. So, what is this PopRank?
The answer to that question is likely in another Microsoft paper, Object-Level Ranking: Bringing Order to Web Objects (pdf):
Because it is clear that the more popular the objects are, the more likely the user will be interested in them. So a natural question is: could the popularity of Web objects be effectively computed by also applying link analysis techniques? This paper targets to answer this question. Our answer to the question is yes, but quite different technologies are required because of the unique characteristics of object graph.
To see PopRank in action, try out the Libra Academic Search linked to above.
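As a rough illustration of what link analysis on an object graph could look like, here is a minimal Python sketch. It is a PageRank-style iteration in which each type of relationship between objects carries its own weight. This is an assumption-laden toy, not the formulation from the Microsoft paper: the `popularity` function, the node names, the link types, and the weights in `factors` are all invented for the example.

```python
def popularity(nodes, edges, factors, damping=0.85, iterations=50):
    """PageRank-style scores on a typed object graph.

    edges:   list of (source, target, link_type) tuples
    factors: hypothetical weight per link type, e.g. {"cites": 1.0}
    """
    # Total outgoing weight per node, used to split its score among targets.
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, kind in edges:
        out_weight[src] += factors[kind]

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dst, kind in edges:
            if out_weight[src] > 0:
                nxt[dst] += damping * score[src] * factors[kind] / out_weight[src]
        # Nodes with no weighted outgoing links spread their score evenly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            nxt[n] += damping * dangling / len(nodes)
        score = nxt
    return score


nodes = ["paper1", "paper2", "paper3", "author1"]
edges = [
    ("paper2", "paper1", "cites"),
    ("paper3", "paper1", "cites"),
    ("author1", "paper1", "authored"),
    ("author1", "paper2", "authored"),
]
factors = {"cites": 1.0, "authored": 0.5}  # made-up weights
scores = popularity(nodes, edges, factors)
```

In this toy graph, paper1 collects score from two citations and an author link, so it ends up ahead of paper2, which in turn is ahead of paper3. The different weights per link type are the part that separates an object graph from a plain page-to-page link graph.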
Ranking for Relevance
Another Microsoft paper that provides an overview of this object extraction and indexing process, Object-level Vertical Search (pdf), introduces the concept of relevancy ranking in its last section, but doesn’t go into much detail on the topic.
The newest paper (pdf), referred to at the top of this post, does explain how Microsoft might use different language models to estimate the relevance between an object and a query.
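For readers who have not seen a language model used for ranking, here is a minimal Python sketch of one common approach, query likelihood with Jelinek-Mercer smoothing, applied to an object whose description comes from several records. It is a generic textbook technique offered as an illustration, not the models from the paper. The function name, the sample texts, and the per-record weights (a stand-in for how much each extracted record is trusted) are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def query_log_likelihood(query, records, collection_text, lam=0.5):
    """Log-probability of the query under an object's language model.

    records: list of (text, weight) pairs describing one object.
    The object model is a weighted mixture of its record models,
    smoothed with a model of the whole collection.
    """
    coll_counts = Counter(tokenize(collection_text))
    coll_total = sum(coll_counts.values())
    total_weight = sum(weight for _text, weight in records)

    log_p = 0.0
    for term in tokenize(query):
        p_obj = 0.0
        for text, weight in records:
            counts = Counter(tokenize(text))
            length = sum(counts.values())
            if length:
                p_obj += (weight / total_weight) * counts[term] / length
        p_coll = coll_counts[term] / coll_total
        p = lam * p_obj + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere
        log_p += math.log(p)
    return log_p


camera = [("acme x100 digital camera 3x zoom", 0.9),
          ("digital camera acme x100 on sale", 0.6)]
laptop = [("acme l5 laptop 15 inch screen", 0.8)]
collection = " ".join(text for text, _w in camera + laptop)

camera_score = query_log_likelihood("digital camera", camera, collection)
laptop_score = query_log_likelihood("digital camera", laptop, collection)
```

With these sample texts, the camera object scores higher than the laptop object for the query "digital camera", because the query terms are much more probable under the camera's mixed record model.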
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.
Bill, your overview leaves it unclear to me whether you understand what they mean by a “Web object”. At the very least, you are not providing a definition for a “Web object”.
In the context of these papers, Web objects are concepts or topics and they are independent of Web pages. The proposed methodologies are looking at organizing data by topic.
That is, given a topic (a Web Object), the search engine needs to extract as much information as possible about that topic and organize it into a coherent presentation unit.
This is very high-level stuff that essentially proposes aggregating all known sources of information about any given topic under a unified structure (an “object model”).
You won’t be able to influence “rankings” through links. Attributions would be a better marker of value, but the model breaks down once you get outside the academic paper archive they tested this methodology against. Business, news, and hobbyist content is rarely organized for peer review (although, oddly enough, fan fiction sites do often incorporate peer review structures and methods into their presentations).
I don’t see much applicability in this brand of information retrieval science to the World Wide Web. It’s very specialized but they may be able to build on the principles and develop methods for extracting peer review information from non-academic sources.
Wish we could edit our comments:
“the model breaks down once you get outside the academic paper archive…” should have used the plural form “archives”, not “archive”.
Hi Michael,
Sorry I didn’t include a definition. I thought it was obvious what an object is from the papers, and from the description of extracting information from pages and integrating it together as an object. Here’s a definition from the first linked paper:
They’ve been testing this model on products in addition to the academic papers.
I think it’s an interesting paper, but I know many people look to you (not to put the pressure on you) to condense all the academic-speak down to something everyone else can shake and nod their heads over. That’s why I was being nit-picky.
Sounds fair, Michael. I did try to ease people into the concept by introducing the searches that it is being used in, and then by describing the process of finding information, extracting it, and integrating it together into a Web Object.
You’re probably quite right that I should have rephrased what a Web Object is within a sentence or two, perhaps at the end of the post.