Microsoft On Poprank And Indexing Objects For Vertical & Web Search


A new paper from Microsoft Research, Web Object Retrieval (pdf), describes an approach to Web indexing that shifts the focus from whole pages to the individual objects found on those pages.

OK, so what does that mean? It’s easiest to show you first, rather than tell you…

Microsoft Product Search

Take a look at Microsoft’s Products Search (http://products.live.com/). Brian Smith went into a lot of detail on Microsoft’s product search last May in eCommerce, Microsoft Style. Microsoft’s Live Product Search allows people to upload product information into its database, but it also crawls the Web and extracts information about products.

Libra Academic Search

Another example of indexing on the object level, Libra Academic Search from Microsoft Research Asia, is a computer science bibliography search engine. The page “About the academic search” includes links to a number of papers on object-level retrieval, including an earlier technical report version of the Web Object Retrieval paper.

More than Products and Papers

The product search and the paper search are narrow vertical searches that focus upon crawling Web pages, and finding information that fits within those areas. The academic paper search not only tries to find the names of papers, but also authors, conferences, journals, and research communities. The Web Object Retrieval paper focuses upon extracting that information from pages. The goal of the research extends beyond products and papers. As the authors tell us:

We believe object-level Web search is particularly necessary in building vertical Web search engines such as product search, people search, scientific Web search, job search, community search, and so on.

Incorporation of Object Indexing into Live Search

The product search and the academic paper search are useful, but how well would they do as part of the Web search that Microsoft offers? According to a news article from Microsoft, Search Objective Gets a Refined Approach, those searches have already been integrated into Windows Live:

The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.

The Object-Level Vertical Search Process

The news article also describes the process of extracting and indexing objects in a nice summary:

The first three steps are:

  • Web Crawling: to collect relevant information on the Web efficiently
  • Classification: Does a page contain information on products, papers, people, or some other desired category?
  • Extraction: pulling specific information about the search query from the relevant Web pages. For a product, for instance, that could mean product name, brand, image, description, and price.

In other words, after finding the information, and understanding that it relates to a specific category, they are putting it into a structured format so that, for instance, products can be compared to one another. There’s more to the process, though:

  • Integration: Combining the gathered object information into a concise whole. This includes resolving Web-page idiosyncrasies and naming conventions and making sure that similarly named objects are integrated only if they relate to the actual object being sought.
  • Ranking: There are two types of ranking. One, static rank, is handled well by the PopRank algorithm. The second, relevance, is trickier, because an object might be popular, but irrelevant to the query at hand. Because the object description is integrated from multiple Web pages, developing a ranking mechanism is a challenge.

As they note in the article, this method could be used for job searches, for restaurant searches, and even for blog searches.
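The integration step is the easiest one to picture in code. Here is a minimal sketch, assuming extraction has already produced one record per page; the field names, the matching rule in `object_key`, and the sample records are all hypothetical, and a real system would need far more careful matching than comparing normalized names.

```python
from dataclasses import dataclass, field


@dataclass
class ProductObject:
    """One integrated product, built from records extracted on several pages."""
    name: str
    brand: str
    prices: list = field(default_factory=list)
    source_urls: list = field(default_factory=list)


def object_key(record):
    # Hypothetical matching rule: same brand and same whitespace/case-normalized name.
    brand = record["brand"].strip().lower()
    name = " ".join(record["name"].lower().split())
    return (brand, name)


def integrate(records):
    """Merge extracted records that appear to describe the same product."""
    objects = {}
    for record in records:
        key = object_key(record)
        if key not in objects:
            objects[key] = ProductObject(
                name=record["name"].strip(), brand=record["brand"].strip()
            )
        obj = objects[key]
        obj.prices.append(record["price"])
        obj.source_urls.append(record["url"])
    return list(objects.values())


# Made-up records, as if extracted from three different pages.
records = [
    {"name": "Example Player 30GB", "brand": "Contoso", "price": 199.99,
     "url": "http://shop-a.example/player"},
    {"name": "example  player 30gb", "brand": "contoso ", "price": 189.00,
     "url": "http://shop-b.example/p/17"},
    {"name": "Player Case", "brand": "Fabrikam", "price": 12.50,
     "url": "http://shop-a.example/case"},
]
products = integrate(records)
```

With these sample records, the first two pages collapse into a single product object carrying both prices and both source URLs, which is what makes side-by-side comparison possible.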

Ranking Objects by Link Analysis, or PopRank

The last item in the list above talks about ranking objects, and describes two different parts to that ranking. One is a matter of relevance. The other is a query-independent ranking, which they refer to as PopRank. They state that ranking objects may be especially difficult because the object descriptions may come from more than one Web page. So, what is this PopRank?

The answer to that question is likely in another Microsoft paper, Object-Level Ranking: Bringing Order to Web Objects (pdf):

Because it is clear that the more popular the objects are, the more likely the user will be interested in them. So a natural question is: could the popularity of Web objects be effectively computed by also applying link analysis techniques? This paper targets to answer this question. Our answer to the question is yes, but quite different technologies are required because of the unique characteristics of object graph.

To see PopRank in action, try out the Libra Academic Search linked to above.
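To get a feel for what link analysis on an object graph might look like, here is a rough sketch. It is not the algorithm from the paper: it is a PageRank-style iteration in which each type of link passes on a different share of popularity, and the link types, the `factors` weights, and the tiny graph are assumptions made up for illustration.

```python
def popularity(nodes, edges, factors, damping=0.85, iterations=50):
    """PageRank-style popularity over a graph with typed links.

    edges is a list of (source, target, link_type) tuples;
    factors maps each link_type to a hypothetical propagation weight.
    """
    # Total outgoing weight of each node, used to split its popularity.
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, kind in edges:
        out_weight[src] += factors[kind]

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dst, kind in edges:
            share = factors[kind] / out_weight[src]
            new_rank[dst] += damping * rank[src] * share
        # Nodes with no outgoing links spread their popularity evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        for n in nodes:
            new_rank[n] += damping * dangling / len(nodes)
        rank = new_rank
    return rank


nodes = ["paper_a", "paper_b", "paper_c", "author_x"]
edges = [
    ("paper_a", "paper_b", "cites"),
    ("paper_c", "paper_b", "cites"),
    ("author_x", "paper_a", "wrote"),
    ("author_x", "paper_c", "wrote"),
]
factors = {"cites": 1.0, "wrote": 0.5}  # made-up weights
scores = popularity(nodes, edges, factors)
```

In this toy graph `paper_b` is cited by both other papers, so it ends up with the highest score. The point is only that objects, unlike pages, are connected by several kinds of relationships, and a ranking method has to decide how much each kind should count.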

Ranking for Relevance

Another Microsoft paper that provides an overview of this object extraction and indexing process, Object-level Vertical Search (pdf), introduces the concept of relevancy ranking in its last section, but doesn’t go into much detail on the topic.

The newest paper (pdf), referred to at the top of this post, does explain how Microsoft might use different language models to estimate the relevance between an object and a query.
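For readers who haven't met language models in a retrieval setting, here is a minimal sketch of the general idea, assuming the simplest textbook form: score each object by how likely its text is to "generate" the query, smoothed against the whole collection. This is a generic query-likelihood model, not the specific models in the paper; the function name, the `lam` smoothing weight, and the sample objects are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, object_text, collection_counts, collection_size, lam=0.5):
    """Log-probability of the query under a smoothed unigram model of the object."""
    terms = object_text.lower().split()
    counts = Counter(terms)
    score = 0.0
    for term in query.lower().split():
        p_obj = counts[term] / len(terms) if terms else 0.0
        p_coll = collection_counts[term] / collection_size
        # Mix the object's own model with the collection-wide model.
        p = lam * p_obj + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score


# Made-up object descriptions, as if integrated from several pages.
objects = {
    "paper_1": "object level ranking of web objects",
    "paper_2": "language models for information retrieval",
}
collection_counts = Counter(
    word for text in objects.values() for word in text.lower().split()
)
collection_size = sum(collection_counts.values())

query = "web objects"
relevance = {
    name: query_log_likelihood(query, text, collection_counts, collection_size)
    for name, text in objects.items()
}
best = max(relevance, key=relevance.get)
```

Here `paper_1` scores higher for the query "web objects" because both query terms appear in its description, while `paper_2` only gets credit through the collection-wide smoothing.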

Opinions expressed in the article are those of the guest author and not necessarily those of Search Engine Land.



Bill Slawski


5 COMMENTS ON Microsoft On Poprank And Indexing Objects For Vertical & Web Search

Michael Martinez,

Bill, your overview leaves it unclear to me whether you understand what they mean by a “Web object”. At the very least, you are not providing a definition for a “Web object”.

In the context of these papers, Web objects are concepts or topics and they are independent of Web pages. The proposed methodologies are looking at organizing data by topic.

i.e., given a topic (a Web Object), the search engine needs to extract as much information as possible about that topic and organize it into a coherent presentation unit.

This is very high-level stuff that essentially proposes aggregating all known sources of information about any given topic under a unified structure (an “object model”).

You won’t be able to influence “rankings” through links. Attributions would be a better marker of value, but the model breaks down once you get outside the academic paper archive they tested this methodology against. Business, news, and hobbyist content is rarely organized for peer review (although, oddly enough, fan fiction sites do often incorporate peer review structures and methods into their presentations).

I don’t see much applicability in this brand of information retrieval science to the World Wide Web. It’s very specialized but they may be able to build on the principles and develop methods for extracting peer review information from non-academic sources.



Michael Martinez,

Wish we could edit our comments:

“the model breaks down once you get outside the academic paper archive…” should have used the plural form “archives”, not “archive”.



Bill Slawski,

Hi Michael,

Sorry I didn’t include a definition. I thought it was obvious from the papers, and from the description of extracting information from pages and integrating it together as an object, as to what an object is. Here’s a definition from the first linked paper:

We define the concept of Web Objects as the principle data units about which Web information is to be collected, indexed, and ranked. Web objects are usually recognizable concepts, such as authors, papers, conferences, or journals that have relevance to the application domain. A Web object is generally represented by a set of attributes A = {a_1, a_2, …, a_m}. The attribute set for a specific object type is predefined based on the requirements in the domain.

They’ve been testing this model on products in addition to the academic papers.



Michael Martinez,

I think it’s an interesting paper, but I know many people look to you (not to put the pressure on you) to condense all the academic-speak down to something everyone else can shake and nod their heads over. That’s why I was being nit-picky.



Bill Slawski,

Sounds fair, Michael. I did try to ease people into the concept by introducing the searches that it is being used in, and then by describing the process of finding information, extracting it, and integrating it together into a Web Object.

You’re probably quite right that I should have rephrased what a Web Object is within a sentence or two, perhaps at the end of the post.



