PhraseRank, Not PageRank, To Fight Search Spam

Can indexing the phrases on pages be an effective way to identify and filter keyword-stuffed pages, and honeypot pages that exist solely to attract visitors and get them to click on ads?

A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.

OK, so “PhraseRank” doesn’t appear in the document. But it’s a term that might be worth thinking about. Phrase-based indexing may do much more than just help fight spam.

Danny noticed that I had a long writeup this morning on the Anna Patterson-penned filing, and I think that this passage from the document jumped out at both of us:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
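The passage above boils down to an outlier test on one number per document. Here is a minimal Python sketch of that idea, not the method claimed in the filing: the function name, the `threshold` value, and the choice of median and median absolute deviation (picked so that a few spam pages don't drag the "expected" count upward) are all my own assumptions, and the per-document counts are presumed to come from a phrase index that already exists.

```python
from statistics import median


def flag_spam_by_related_phrases(related_phrase_counts, threshold=5.0):
    """Return ids of documents with far more related phrases than is typical.

    related_phrase_counts: dict of document id -> number of related phrases
        found in that document (assumed to be computed elsewhere).
    threshold: hypothetical cutoff, in robust standard deviations.
    """
    counts = list(related_phrase_counts.values())
    if not counts:
        return set()

    # Use the median as the "expected number" for the collection.
    expected = median(counts)

    # Median absolute deviation, scaled by 1.4826 so it is comparable
    # to a standard deviation for roughly normal data.
    mad = median(abs(c - expected) for c in counts)
    scale = 1.4826 * mad
    if scale == 0:
        scale = 1.0  # all counts identical; avoid dividing by zero

    # Flag only documents that deviate upward, i.e. too many related phrases.
    return {
        doc_id
        for doc_id, count in related_phrase_counts.items()
        if (count - expected) / scale > threshold
    }


# Made-up counts: five ordinary pages and one stuffed page.
sample = {"a": 8, "b": 12, "c": 15, "d": 20, "e": 10, "f": 400}
flagged = flag_spam_by_related_phrases(sample)
```

With those made-up numbers, only the page with 400 related phrases is flagged. A real system would tune the cutoff per document collection, since the filing itself notes that the normal range depends on the collection.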

This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:

*assigned to Google

The inventor, Anna Patterson, wrote a demo search engine for the Internet Archive a couple of years back, which disappeared around the time she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:

There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.

