A new paper from Microsoft Research, Web Object Retrieval (pdf), discusses an approach to Web indexing that shifts the focus from the page level to the objects found on pages.
OK, so what does that mean? It’s easiest to show you first, rather than tell you…
Microsoft Product Search
Take a look at Microsoft’s Product Search (http://products.live.com/). Brian Smith went into a lot of detail on Microsoft’s product search last May in eCommerce, Microsoft Style. Microsoft’s Live Product Search allows people to upload product information into its database, but it also crawls the Web and extracts information about products.
Libra Academic Search
Another example of indexing on the object level, Libra Academic Search from Microsoft Research Asia, is a computer science bibliography search engine. The page “About the academic search” includes links to a number of papers on object-level retrieval, including an earlier technical report version of the Web Object Retrieval paper.
More than Products and Papers
The product search and the paper search are narrow vertical searches that focus on crawling Web pages and finding information that fits within those areas. The academic paper search tries to find not only the names of papers, but also authors, conferences, journals, and research communities. The Web Object Retrieval paper focuses on extracting that information from pages. The goal of the research extends beyond products and papers. As the authors tell us:
We believe object-level Web search is particularly necessary in building vertical Web search engines such as product search, people search, scientific Web search, job search, community search, and so on.
Incorporation of Object Indexing into Live Search
The product search and the academic paper search are useful, but how well would they do as part of the Web search that Microsoft offers? According to a news article from Microsoft, Search Objective Gets a Refined Approach, those searches have already been integrated into Windows Live:
The “vertical” in Object-Level Vertical Search refers to a specific domain, such as academic search or product search, both of which have been incorporated into Windows Live™. The “object” is an item embedded in Web pages or Web databases, such as a product, a person, a paper, or an organization.
The Object-Level Vertical Search Process
The news article also describes the process of extracting and indexing objects in a nice summary:
The first three steps are:
- Web Crawling: to collect relevant information on the Web efficiently
- Classification: Does a page contain information on products, papers, people, or some other desired category?
- Extraction: pulling specific information about the search query from the relevant Web pages. For a product, for instance, that could mean product name, brand, image, description, and price.
In other words, after finding the information, and understanding that it relates to a specific category, they are putting it into a structured format so that, for instance, products can be compared to one another. There’s more to the process, though:
- Integration: Combining the gathered object information into a concise whole. This includes resolving Web-page idiosyncrasies and naming conventions and making sure that similarly named objects are integrated only if they relate to the actual object being sought.
- Ranking: There are two types of ranking. One, static rank, is handled well by the PopRank algorithm. The second, relevance, is trickier, because an object might be popular, but irrelevant to the query at hand. Because the object description is integrated from multiple Web pages, developing a ranking mechanism is a challenge.
As they note in the article, this method could be used for job searches, for restaurant searches, and even for blog searches.
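The extraction and integration steps described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `ProductRecord` and `ProductObject` classes, the `normalize` and `integrate` helpers, and the example URLs are all hypothetical, and the name matching here is far cruder than anything a real system would use.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProductRecord:
    """One extraction result from a single page (hypothetical schema)."""
    source_url: str
    name: str
    brand: Optional[str] = None
    price: Optional[float] = None


@dataclass
class ProductObject:
    """An integrated object assembled from several page-level records."""
    key: str
    brand: Optional[str] = None
    prices: list = field(default_factory=list)
    sources: list = field(default_factory=list)


def normalize(name: str) -> str:
    # Naive matching key: lowercase and collapse whitespace. Real integration
    # has to resolve naming conventions far more carefully than this.
    return " ".join(name.lower().split())


def integrate(records):
    """Group page-level records that appear to describe the same product."""
    objects = {}
    for rec in records:
        key = normalize(rec.name)
        obj = objects.setdefault(key, ProductObject(key=key))
        if obj.brand is None:
            obj.brand = rec.brand  # keep the first brand value seen
        if rec.price is not None:
            obj.prices.append(rec.price)
        obj.sources.append(rec.source_url)
    return objects


# Made-up records, as if extracted from three different pages.
records = [
    ProductRecord("http://shop-a.example/1", "Acme  Camera X100", "Acme", 199.0),
    ProductRecord("http://shop-b.example/7", "acme camera x100", None, 189.5),
    ProductRecord("http://shop-b.example/9", "Acme Tripod T1", "Acme", 35.0),
]
catalog = integrate(records)
cheapest = min(catalog["acme camera x100"].prices)
```

Once the two camera listings are merged into one object, comparing prices across sites becomes a simple lookup, which is the kind of comparison the structured format is meant to enable.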
Ranking Objects by Link Analysis, or PopRank
The last item in the list above talks about ranking objects, and describes two different parts of that ranking. One is a matter of relevance. The other is a query-independent ranking, which they refer to as PopRank. They state that ranking objects may be especially difficult because the object descriptions may come from more than one Web page. So, what is this PopRank?
The answer to that question is likely in another Microsoft paper, Object-Level Ranking: Bringing Order to Web Objects (pdf):
Because it is clear that the more popular the objects are, the more likely the user will be interested in them. So a natural question is: could the popularity of Web objects be effectively computed by also applying link analysis techniques? This paper targets to answer this question. Our answer to the question is yes, but quite different technologies are required because of the unique characteristics of object graph.
To see PopRank in action, try out the Libra Academic Search linked to above.
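To give a feel for what link analysis on an object graph might look like, here is a minimal sketch. It is not the PopRank algorithm from the paper. It is a plain PageRank-style iteration in which each link carries a weight that depends on its type, and the function name, the link types, and the weight values are all assumptions made up for illustration.

```python
def object_popularity(nodes, edges, type_weight, damping=0.85, iterations=50):
    """PageRank-style scores on a typed object graph (illustrative only).

    edges: list of (source, target, link_type) tuples
    type_weight: dict mapping link_type to a positive weight (assumed values)
    """
    # Total outgoing weight per node, used to normalise each node's votes.
    out_weight = {n: 0.0 for n in nodes}
    for src, _dst, kind in edges:
        out_weight[src] += type_weight[kind]

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dst, kind in edges:
            new[dst] += damping * score[src] * type_weight[kind] / out_weight[src]
        # Objects with no outgoing links spread their score evenly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0.0)
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        score = new
    return score


# A tiny made-up graph of papers and one author.
nodes = ["paper_a", "paper_b", "paper_c", "author_x"]
edges = [
    ("paper_b", "paper_a", "cites"),
    ("paper_c", "paper_a", "cites"),
    ("paper_c", "paper_b", "cites"),
    ("paper_a", "author_x", "written_by"),
    ("paper_b", "author_x", "written_by"),
]
weights = {"cites": 1.0, "written_by": 0.5}  # assumed, not from the paper
scores = object_popularity(nodes, edges, weights)
```

In this toy graph, paper_a is cited twice and paper_c is never cited, so paper_a ends up with the higher score. Treating a citation differently from an authorship link is one way to read the quoted remark about the "unique characteristics" of an object graph, since its links are not all of one kind.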
Ranking for Relevance
Another Microsoft paper that provides an overview of this object extraction and indexing process, Object-level Vertical Search (pdf), introduces the concept of relevancy ranking in its last section, but doesn’t go into much detail on the topic.
The newest paper (pdf), referred to at the top of this post, does explain how Microsoft might use different language models to estimate the relevance between an object and a query.
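As a rough picture of what estimating relevance with a language model can mean, the sketch below scores an object by the likelihood that its text would generate the query. It uses a generic query-likelihood model with simple interpolation smoothing, not the specific models from the paper. The attribute names, the attribute weights, the smoothing value, and the sample objects are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def query_log_likelihood(query, obj, collection, attr_weight, lam=0.5):
    """log P(query | object) under a smoothed mixture of attribute models.

    obj and each item of collection map attribute name -> text.
    attr_weight maps attribute name -> weight; weights should sum to 1.
    """
    # Background model built from every attribute of every object.
    coll_counts = Counter(
        t for o in collection for v in o.values() for t in tokenize(v)
    )
    coll_total = sum(coll_counts.values())
    attr_counts = {a: Counter(tokenize(v)) for a, v in obj.items()}

    log_p = 0.0
    for term in tokenize(query):
        # Weighted mixture of per-attribute term probabilities.
        p_obj = 0.0
        for attr, counts in attr_counts.items():
            total = sum(counts.values())
            if total:
                p_obj += attr_weight[attr] * counts[term] / total
        # Interpolate with the background model so unseen terms are not fatal.
        p = (1 - lam) * p_obj + lam * coll_counts[term] / coll_total
        if p == 0.0:
            return float("-inf")  # term appears nowhere in the collection
        log_p += math.log(p)
    return log_p


# Made-up product objects with two attributes each.
camera = {
    "title": "Acme X100 digital camera",
    "description": "compact digital camera with zoom lens",
}
tripod = {
    "title": "Acme T1 tripod",
    "description": "lightweight tripod for any camera",
}
collection = [camera, tripod]
weights = {"title": 0.6, "description": 0.4}  # assumed weights

camera_score = query_log_likelihood("digital camera", camera, collection, weights)
tripod_score = query_log_likelihood("digital camera", tripod, collection, weights)
```

For the query "digital camera", the camera object scores higher than the tripod, even though the tripod's description also mentions a camera. Weighting attributes separately is a natural fit for objects, since their text arrives already split into fields such as title and description.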
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.