Mar. 29, 2007 at 2:06am Eastern by Bill Slawski

Microsoft On PopRank And Indexing Objects For Vertical & Web Search

A new paper from Microsoft Research, Web Object Retrieval (pdf), discusses an approach to Web indexing that shifts the focus from the page level to the objects found on pages.

OK, so what does that mean? It's easiest to show you first, rather than tell you...

Microsoft Product Search

Take a look at Microsoft's Products Search (http://products.live.com/). Brian Smith went into a lot of detail on Microsoft's product search last May in eCommerce, Microsoft Style. Microsoft's Live Product Search allows people to upload product information into its database, but it also crawls the Web and extracts information about products.

Libra Academic Search

Another example of indexing on the object level from Microsoft Research Asia, Libra Academic Search, is a computer science bibliography search engine. The page "About the academic search" includes links to a number of papers on object-level retrieval, including an earlier technical report version of the Web Object Retrieval paper.

More than Products and Papers

The product search and the paper search are narrow vertical searches that focus upon crawling Web pages, and finding information that fits within those areas. The academic paper search not only tries to find the names of papers, but also authors, conferences, journals, and research communities. The Web Object Retrieval paper focuses upon extracting that information from pages. The goal of the research extends beyond products and papers. As the authors tell us:

We believe object-level Web search is particularly necessary in building vertical Web search engines such as product search, people search, scientific Web search, job search, community search, and so on.


Incorporation of Object Indexing into Live Search

The product search and the academic paper search are useful, but how well would they do as part of the Web search that Microsoft offers? According to a news article from Microsoft, Search Objective Gets a Refined Approach, those searches have already been integrated into Windows Live:

The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.


The Object-Level Vertical Search Process

The news article also describes the process of extracting and indexing objects in a nice summary:

The first three steps are:

  • Web Crawling: to collect relevant information on the Web efficiently
  • Classification: Does a page contain information on products, papers, people, or some other desired category?
  • Extraction: pulling specific information about the search query from the relevant Web pages. For a product, for instance, that could mean product name, brand, image, description, and price.

In other words, after finding the information, and understanding that it relates to a specific category, they are putting it into a structured format so that, for instance, products can be compared to one another. There's more to the process, though:

  • Integration: Combining the gathered object information into a concise whole. This includes resolving Web-page idiosyncrasies and naming conventions and making sure that similarly named objects are integrated only if they relate to the actual object being sought.
  • Ranking: There are two types of ranking. One, static rank, is handled well by the PopRank algorithm. The second, relevance, is trickier, because an object might be popular, but irrelevant to the query at hand. Because the object description is integrated from multiple Web pages, developing a ranking mechanism is a challenge.

As they note in the article, this method could be used for job searches, for restaurant searches, and even for blog searches.
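To make the extraction and integration steps a little more concrete, here is a minimal sketch in Python. It is not Microsoft's implementation. The record fields, the normalize-and-group rule, and the shop URLs and product names are all made up for illustration. The only idea taken from the article is that records extracted from different pages get merged when they describe the same real-world object.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedRecord:
    """One product mention pulled from a single page (hypothetical schema)."""
    source_url: str
    name: str
    brand: str
    price: float


@dataclass
class ProductObject:
    """An integrated object built from every record that refers to it."""
    key: tuple
    names: set = field(default_factory=set)
    prices: dict = field(default_factory=dict)  # source_url -> price


def normalize(text: str) -> str:
    # Crude normalization (an assumption): lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def integrate(records):
    """Group records by normalized (brand, name) and merge their attributes."""
    objects = {}
    for rec in records:
        key = (normalize(rec.brand), normalize(rec.name))
        obj = objects.setdefault(key, ProductObject(key=key))
        obj.names.add(rec.name)
        obj.prices[rec.source_url] = rec.price
    return objects


# Made-up records, as if extracted from three pages on two shops.
records = [
    ExtractedRecord("http://shop-a.example/p1", "Camera  X100", "Acme", 199.0),
    ExtractedRecord("http://shop-b.example/x9", "camera x100", "ACME", 189.5),
    ExtractedRecord("http://shop-b.example/x7", "Camera Z5", "Globex", 149.0),
]

objects = integrate(records)
canon = objects[("acme", "camera x100")]
lowest = min(canon.prices.values())
print(len(objects), "objects; lowest Acme X100 price:", lowest)
```

A real system would need far more careful matching than lowercasing names, which is exactly the "similarly named objects" problem the article mentions. Once the records are merged, though, comparing prices across sources becomes a simple lookup.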

Ranking Objects by Link Analysis, or PopRank

The last item in the list above talks about ranking objects, and discusses two different parts to that ranking. One is a matter of relevance. The other is a query-independent ranking, which they refer to as PopRank. They state that ranking objects may be especially difficult because the object descriptions may come from more than one Web page. So, what is this PopRank?

The answer to that question is likely in another Microsoft paper, Object-Level Ranking: Bringing Order to Web Objects (pdf):

Because it is clear that the more popular the objects are, the more likely the user will be interested in them. So a natural question is: could the popularity of Web objects be effectively computed by also applying link analysis techniques? This paper targets to answer this question. Our answer to the question is yes, but quite different technologies are required because of the unique characteristics of object graph.

To see PopRank in action, try out the Libra Academic Search linked to above.
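For readers who want a feel for what link analysis over an object graph could look like, here is a rough Python sketch. It is a generic weighted PageRank in which each type of link passes along a different share of popularity. That matches my reading of the paper's idea that different kinds of relationships between objects should count differently. It is not the PopRank algorithm itself: the function name, the link types, the factor values, and the tiny graph are all assumptions made for illustration.

```python
def typed_link_rank(nodes, links, factors, damping=0.85, iterations=50):
    """Weighted PageRank over typed links.

    links   -- list of (source, target, link_type) tuples
    factors -- link_type -> weight (hypothetical values, not from the paper)
    """
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}

    # Total outgoing weight per node, used to split what it passes on.
    out_weight = {node: 0.0 for node in nodes}
    for src, _dst, kind in links:
        out_weight[src] += factors[kind]

    for _ in range(iterations):
        # Nodes with no weighted out-links spread their score evenly.
        dangling = sum(score[v] for v in nodes if out_weight[v] == 0.0)
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src, dst, kind in links:
            if out_weight[src] > 0.0:
                share = factors[kind] / out_weight[src]
                new[dst] += damping * score[src] * share
        score = new
    return score


# A made-up object graph: three papers and one conference.
nodes = ["paper_a", "paper_b", "paper_c", "conf_y"]
links = [
    ("paper_b", "paper_a", "cites"),
    ("paper_c", "paper_a", "cites"),
    ("paper_a", "conf_y", "published_in"),
    ("paper_b", "conf_y", "published_in"),
    ("paper_c", "conf_y", "published_in"),
]
factors = {"cites": 1.0, "published_in": 0.3}

scores = typed_link_rank(nodes, links, factors)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))
```

In this toy graph, the paper cited by the other two ends up with a higher score than either of them, and the conference picks up popularity from the papers published in it. Changing the factor for a link type changes how much popularity flows along that kind of relationship.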

Ranking for Relevance

Another Microsoft paper that provides an overview of this object extraction and indexing process, Object-level Vertical Search (pdf), introduces the concept of relevancy ranking in its last section, but doesn't go into much detail on the topic.

The newest paper (pdf), referred to at the top of this post, does explain how Microsoft might use different language models to estimate the relevance between an object and a query.
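As a rough illustration of what estimating relevance with a language model can mean, here is a small Python sketch of query-likelihood scoring over an object's attributes. It uses a standard textbook technique (Jelinek-Mercer smoothing against a collection model) with a weighted mix across attributes. The attribute names, weights, smoothing value, and sample objects are my own assumptions, and this should not be read as the specific models in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_collection(objects):
    """Term counts over every attribute of every object (background model)."""
    counts = Counter()
    for obj in objects:
        for text in obj.values():
            counts.update(tokenize(text))
    return counts, sum(counts.values())


def score_object(query, obj, field_weights, collection, lam=0.5):
    """Return log P(query | object) using smoothed per-attribute models."""
    coll_counts, coll_total = collection
    models = {}
    for name, text in obj.items():
        counts = Counter(tokenize(text))
        models[name] = (counts, sum(counts.values()))

    log_prob = 0.0
    for term in tokenize(query):
        p_coll = coll_counts[term] / coll_total
        p_term = 0.0
        for name, weight in field_weights.items():
            counts, total = models.get(name, (Counter(), 0))
            p_field = counts[term] / total if total else 0.0
            # Mix the attribute's own model with the collection model.
            p_term += weight * ((1.0 - lam) * p_field + lam * p_coll)
        if p_term == 0.0:
            return float("-inf")  # query term unseen anywhere
        log_prob += math.log(p_term)
    return log_prob


# Made-up paper objects with two attributes each.
objects = [
    {"title": "web object retrieval", "authors": "a author b author"},
    {"title": "image compression methods", "authors": "c author"},
]
field_weights = {"title": 0.7, "authors": 0.3}  # hypothetical weights
collection = build_collection(objects)

s1 = score_object("object retrieval", objects[0], field_weights, collection)
s2 = score_object("object retrieval", objects[1], field_weights, collection)
print(round(s1, 3), round(s2, 3))
```

The first object scores higher because the query terms appear in its title. The per-attribute weights are where a system could, for example, trust a cleanly extracted title more than a noisier attribute.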




Reader Comments

Bill, your overview leaves it unclear to me whether you understand what they mean by a "Web object". At the very least, you are not providing a definition for a "Web object".

In the context of these papers, Web objects are concepts or topics and they are independent of Web pages. The proposed methodologies are looking at organizing data by topic.

i.e., given a topic (a Web Object), the search engine needs to extract as much information as possible about that topic and organize it into a coherent presentation unit.

This is very high-level stuff that essentially proposes aggregating all known sources of information about any given topic under a unified structure (an "object model").

You won't be able to influence "rankings" through links. Attributions would be a better marker of value, but the model breaks down once you get outside the academic paper archive they tested this methodology against. Business, news, and hobbyist content is rarely organized for peer review (although, oddly enough, fan fiction sites do often incorporate peer review structures and methods into their presentations).

I don't see much applicability in this brand of information retrieval science to the World Wide Web. It's very specialized but they may be able to build on the principles and develop methods for extracting peer review information from non-academic sources.

Wish we could edit our comments:

"the model breaks down once you get outside the academic paper archive..." should have used the plural form "archives", not "archive".

Hi Michael,

Sorry I didn't include a definition. I thought it was obvious from the papers, and from the description of extracting information from pages and integrating it together as an object, as to what an object is. Here's a definition from the first linked paper:

We define the concept of Web Objects as the principle data units about which Web information is to be collected, indexed, and ranked. Web objects are usually recognizable concepts, such as authors, papers, conferences, or journals that have relevance to the application domain. A Web object is generally represented by a set of attributes A = {a_1, a_2, ..., a_m}. The attribute set for a specific object type is predefined based on the requirements in the domain.

They've been testing this model on products in addition to the academic papers.

I think it's an interesting paper, but I know many people look to you (not to put the pressure on you) to condense all the academic-speak down to something everyone else can shake and nod their heads over. That's why I was being nit-picky.

Sounds fair, Michael. I did try to ease people into the concept by introducing the searches that it is being used in, and then by describing the process of finding information, extracting it, and integrating it together into a Web Object.

You're probably quite right that I should have rephrased what a Web Object is within a sentence or two, perhaps at the end of the post.
