Microsoft On Poprank And Indexing Objects For Vertical & Web Search

A new paper from Microsoft Research, Web Object Retrieval (pdf), describes an approach to Web indexing that shifts the focus from whole pages to the objects found on those pages.

OK, so what does that mean? It’s easiest to show you first, rather than tell you…

Microsoft Product Search

Take a look at Microsoft’s Product Search (http://products.live.com/). Brian Smith went into a lot of detail on Microsoft’s product search last May in eCommerce, Microsoft Style. Microsoft’s Live Product Search allows people to upload product information into its database, but it also crawls the Web and extracts information about products.

Libra Academic Search

Libra Academic Search, from Microsoft Research Asia, is another example of indexing at the object level. It is a computer science bibliography search engine. Its “About the academic search” page links to a number of papers on object-level retrieval, including an earlier technical report version of the Web Object Retrieval paper.

More than Products and Papers

The product search and the paper search are narrow vertical searches that focus upon crawling Web pages, and finding information that fits within those areas. The academic paper search not only tries to find the names of papers, but also authors, conferences, journals, and research communities. The Web Object Retrieval paper focuses upon extracting that information from pages. The goal of the research extends beyond products and papers. As the authors tell us:

We believe object-level Web search is particularly necessary in building vertical Web search engines such as product search, people search, scientific Web search, job search, community search, and so on.

Incorporation of Object Indexing into Live Search

The product search and the academic paper search are useful, but how well would they do as part of the Web search that Microsoft offers? According to a news article from Microsoft, Search Objective Gets a Refined Approach, those searches have already been integrated into Windows Live:

The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.

The Object-Level Vertical Search Process

The news article also describes the process of extracting and indexing objects in a nice summary:

The first three steps are:

  • Web Crawling: to collect relevant information on the Web efficiently
  • Classification: Does a page contain information on products, papers, people, or some other desired category?
  • Extraction: pulling specific information about the search query from the relevant Web pages. For a product, for instance, that could mean product name, brand, image, description, and price.

In other words, after finding the information, and understanding that it relates to a specific category, they are putting it into a structured format so that, for instance, products can be compared to one another. There’s more to the process, though:

  • Integration: Combining the gathered object information into a concise whole. This includes resolving Web-page idiosyncrasies and naming conventions and making sure that similarly named objects are integrated only if they relate to the actual object being sought.
  • Ranking: There are two types of ranking. One, static rank, is handled well by the PopRank algorithm. The second, relevance, is trickier, because an object might be popular, but irrelevant to the query at hand. Because the object description is integrated from multiple Web pages, developing a ranking mechanism is a challenge.

As they note in the article, this method could be used for job searches, for restaurant searches, and even for blog searches.
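To make the extraction and integration steps more concrete, here is a minimal Python sketch. It is not Microsoft's implementation: the `ProductRecord` fields, the `normalize` matching rule, and the example URLs are all assumptions made up for illustration. It only shows the general idea of pulling structured records from several pages and merging the ones that describe the same object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductRecord:
    """One extracted record from one page (hypothetical attribute set)."""
    source_url: str
    name: str
    brand: Optional[str] = None
    price: Optional[float] = None


def normalize(text: str) -> str:
    """Crude matching key: lowercase and collapse whitespace (an assumption)."""
    return " ".join(text.lower().split())


def integrate(records):
    """Merge records that share a normalized (brand, name) key into one object."""
    objects = {}
    for rec in records:
        key = (normalize(rec.brand or ""), normalize(rec.name))
        obj = objects.setdefault(
            key, {"name": rec.name, "brand": rec.brand, "prices": [], "sources": []}
        )
        if rec.price is not None:
            obj["prices"].append(rec.price)
        obj["sources"].append(rec.source_url)
    return list(objects.values())


# Made-up records, as if extracted from three different shop pages.
records = [
    ProductRecord("http://shop-a.example/1", "Zune  30GB", "Microsoft", 249.99),
    ProductRecord("http://shop-b.example/x", "zune 30gb", "microsoft", 239.00),
    ProductRecord("http://shop-c.example/z", "Zune 30GB", "Acme", 10.00),
]
merged = integrate(records)
for obj in merged:
    print(obj["brand"], obj["name"], obj["prices"], len(obj["sources"]))
```

The third record has the same name but a different brand, so it stays a separate object. That mirrors the article's point that similarly named objects should be integrated only when they refer to the same real thing. A real system would need far more careful matching than this string comparison.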

Ranking Objects by Link Analysis, or PopRank

The last item in the list above talks about ranking objects, and discusses two different parts to that ranking. One is a matter of relevance. The other is a query-independent ranking, which they refer to as PopRank. They state that ranking objects may be especially difficult because the object descriptions may come from more than one Web page. So, what is this PopRank?

The answer to that question is likely in another Microsoft paper, Object-Level Ranking: Bringing Order to Web Objects (pdf):

Because it is clear that the more popular the objects are, the more likely the user will be interested in them. So a natural question is: could the popularity of Web objects be effectively computed by also applying link analysis techniques? This paper targets to answer this question. Our answer to the question is yes, but quite different technologies are required because of the unique characteristics of object graph.

To see PopRank in action, try out the Libra Academic Search linked to above.
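As a rough illustration of what link analysis over an object graph could look like, here is a toy Python sketch. It is a PageRank-style power iteration in which each link type carries its own weight. It is not the PopRank algorithm from the paper; the link types, the weights, the damping value, and the tiny graph are all assumptions chosen for the example.

```python
def popularity(nodes, edges, type_weight, damping=0.85, iterations=50):
    """PageRank-style scores over a graph whose links have types.

    edges: list of (source, target, link_type) tuples.
    type_weight: weight per link type (hypothetical values).
    """
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}

    # Total outgoing weight per node, used to normalise what it passes on.
    out_weight = {node: 0.0 for node in nodes}
    for src, _dst, kind in edges:
        out_weight[src] += type_weight[kind]

    for _ in range(iterations):
        nxt = {node: (1.0 - damping) / n for node in nodes}
        for src, dst, kind in edges:
            nxt[dst] += damping * score[src] * type_weight[kind] / out_weight[src]
        # Nodes without outgoing links spread their score evenly.
        dangling = sum(score[nd] for nd in nodes if out_weight[nd] == 0.0)
        for node in nodes:
            nxt[node] += damping * dangling / n
        score = nxt
    return score


# Made-up academic object graph: papers, an author, and a conference.
nodes = ["paper_a", "paper_b", "author_x", "conf_y"]
edges = [
    ("paper_b", "paper_a", "cites"),
    ("author_x", "paper_a", "wrote"),
    ("author_x", "paper_b", "wrote"),
    ("paper_a", "conf_y", "published_in"),
    ("paper_b", "conf_y", "published_in"),
]
weights = {"cites": 1.0, "wrote": 0.5, "published_in": 0.3}

scores = popularity(nodes, edges, weights)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

The point of the per-type weights is that a citation between papers and an authorship link probably should not pass along popularity in the same way. That is one plausible reading of the "unique characteristics of object graph" mentioned in the quote.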

Ranking for Relevance

Another Microsoft paper that provides an overview of this object extraction and indexing process, Object-level Vertical Search (pdf), introduces the concept of relevancy ranking in its last section, but doesn’t go into much detail on the topic.

The newest paper (pdf), referred to at the top of this post, does explain how Microsoft might use different language models to estimate the relevance between an object and a query.
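For readers who have not met language-model ranking before, here is a small Python sketch of the general idea. It uses a standard query-likelihood score with Jelinek-Mercer smoothing, applied to the text gathered about one object from several pages. The specific models in the paper may well differ; the function names, the smoothing weight `lam`, and the sample data are assumptions for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def query_log_likelihood(query, object_texts, collection_counts,
                         collection_size, lam=0.5):
    """log P(query | object), mixing object and collection term frequencies.

    object_texts: snippets about ONE object, possibly from different pages.
    """
    obj_counts = Counter()
    for text in object_texts:
        obj_counts.update(tokenize(text))
    obj_len = sum(obj_counts.values())

    score = 0.0
    for term in tokenize(query):
        p_obj = obj_counts[term] / obj_len if obj_len else 0.0
        p_bg = collection_counts.get(term, 0) / collection_size
        p = (1.0 - lam) * p_obj + lam * p_bg
        if p == 0.0:
            return float("-inf")  # term appears nowhere in the collection
        score += math.log(p)
    return score


# Made-up objects, each described by text from one or more pages.
objects = {
    "paper_1": ["object level ranking", "ranking web objects by popularity"],
    "paper_2": ["image compression methods"],
}
collection = Counter()
for texts in objects.values():
    for text in texts:
        collection.update(tokenize(text))
size = sum(collection.values())

for name, texts in objects.items():
    print(name, query_log_likelihood("ranking", texts, collection, size))
```

Because an object's description is assembled from multiple pages, the object "document" here is simply the pooled text of all its snippets. That is the simplest possible choice, and a real system would likely weight sources by how trustworthy the extraction was.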

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Bill Slawski is the Director of Search Marketing for Go Fish Digital and the editor of SEO by the Sea. He has been doing SEO and web promotion since the mid-90s, and was a legal and technical administrator in the highest level trial court in Delaware.

  • http://www.seo-theory.com/ Michael Martinez

    Bill, your overview leaves it unclear to me whether you understand what they mean by a “Web object”. At the very least, you are not providing a definition for a “Web object”.

    In the context of these papers, Web objects are concepts or topics and they are independent of Web pages. The proposed methodologies are looking at organizing data by topic.

    i.e., given a topic (a Web Object), the search engine needs to extract as much information as possible about that topic and organize it into a coherent presentation unit.

    This is very high-level stuff that essentially proposes aggregating all known sources of information about any given topic under a unified structure (an “object model”).

    You won’t be able to influence “rankings” through links. Attributions would be a better marker of value, but the model breaks down once you get outside the academic paper archive they tested this methodology against. Business, news, and hobbyist content is rarely organized for peer review (although, oddly enough, fan fiction sites do often incorporate peer review structures and methods into their presentations).

    I don’t see much applicability in this brand of information retrieval science to the World Wide Web. It’s very specialized but they may be able to build on the principles and develop methods for extracting peer review information from non-academic sources.

  • http://www.seo-theory.com/ Michael Martinez

    Wish we could edit our comments:

    “the model breaks down once you get outside the academic paper archive…” should have used the plural form “archives”, not “archive”.

  • http://www.seobythesea.com Bill Slawski

    Hi Michael,

    Sorry I didn’t include a definition. I thought it was obvious from the papers, and from the description of extracting information from pages and integrating it together as an object, as to what an object is. Here’s a definition from the first linked paper:

    We define the concept of Web Objects as the principle data units about which Web information is to be collected, indexed, and ranked. Web objects are usually recognizable concepts, such as authors, papers, conferences, or journals that have relevance to the application domain. A Web object is generally represented by a set of attributes $A = \{a_1, a_2, \ldots, a_m\}$. The attribute set for a specific object type is predefined based on the requirements in the domain.

    They’ve been testing this model on products in addition to the academic papers.

  • http://www.seo-theory.com/ Michael Martinez

    I think it’s an interesting paper, but I know many people look to you (not to put the pressure on you) to condense all the academic-speak down to something everyone else can shake and nod their heads over. That’s why I was being nit-picky.

  • http://www.seobythesea.com Bill Slawski

    Sounds fair, Michael. I did try to ease people into the concept by introducing the searches that it is being used in, and then by describing the process of finding information, extracting it, and integrating it together into a Web Object.

    You’re probably quite right that I should have rephrased what a Web Object is within a sentence or two, perhaps at the end of the post.
