

Dec. 29, 2006 at 7:22am Eastern by Bill Slawski

PhraseRank, Not PageRank, To Fight Search Spam

Can indexing phrases from pages be an effective way to identify and filter keyword-stuffed pages and honeypot pages that exist solely to get visitors to click on ads?

A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.

Ok, so "Phraserank" doesn't appear in the document. But it's a term that might be worth thinking about. It may do much more than just help fight spam.

Danny noticed that I had a long writeup this morning on the Anna Patterson penned filing, and I think that this passage from the document jumped out at both of us:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
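To make the idea in that passage concrete, here is a minimal sketch of flagging documents whose related-phrase count deviates sharply from what is typical for the collection. The quoted passage does not say which statistical test is used, so everything here is an assumption for illustration: the function name flag_spam_candidates, the median/MAD "modified z-score", the 3.5 cutoff, and the sample counts are all hypothetical and are not taken from the patent application.

```python
from statistics import median


def flag_spam_candidates(related_phrase_counts, threshold=3.5):
    """Return IDs of documents with an unusually HIGH related-phrase count.

    related_phrase_counts maps a document ID to the number of related
    phrases found in that document. The median/MAD modified z-score used
    here is an assumed stand-in for "statistically significant deviation";
    the filing quoted above does not specify the test.
    """
    counts = list(related_phrase_counts.values())
    if not counts:
        return []

    center = median(counts)
    # Median absolute deviation: a spread estimate that a few extreme
    # documents cannot inflate the way they inflate a standard deviation.
    mad = median(abs(c - center) for c in counts)
    if mad == 0:
        # No usable spread estimate in this sketch; flag nothing.
        return []

    flagged = []
    for doc_id, count in related_phrase_counts.items():
        score = 0.6745 * (count - center) / mad
        # Only the high side matters: spam has an excessive number of phrases.
        if score > threshold:
            flagged.append(doc_id)
    return flagged


# Made-up counts: nine documents in the 8 to 20 range, one with 400.
example_counts = {
    "doc_a": 8, "doc_b": 12, "doc_c": 15, "doc_d": 20, "doc_e": 10,
    "doc_f": 14, "doc_g": 9, "doc_h": 18, "doc_i": 11, "doc_j": 400,
}
print(flag_spam_candidates(example_counts))  # ['doc_j']
```

A median-based spread is used in this sketch because a plain mean and standard deviation would be dragged upward by the very outliers being hunted; with only ten documents, the 400-phrase page would not even reach three standard deviations above the mean.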

This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:

*assigned to Google

The inventor, Anna Patterson, wrote a search engine for the Internet Archive a couple of years back as a demo, which disappeared around the time she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:

There is a major field of study about the different things to index on. Don't get a Ph.D.; just index on words. Words are what people search for; they don't search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.



Reader Comments

_____________________________________________

This patent is basically an advanced branch of CONCEPT SEARCHING (remember EXCITE?) and, fundamentally, is already being used to some extent in some legal search applications.

However, it is possible that classic sites like this one:
cuiwww.unige.ch/meta-index.html
(which for years, until recently, was in the top 20 on Google for the term SEARCH ENGINES) would be seen as spammy.

There will also be false positives with this method, simply because of aggressive SEOing or extremely information-packed homepages, or the aggressive use of synonyms and acronyms to cover all bases. Also, those who write in a taxonomy style (just using keywords without stop words) will suffer!

However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.

But it must be remembered that the new priority of search engine algorithms, analyzing the anchor text and backlinks from highly trusted sites, now makes it nearly impossible for those honeypot sites to rank high in the SERPs.

Most searchers do not use complex search terms, so many no longer get spam sites to the degree they would have a few years ago.
And for those who do use complex terms, reference sites usually come up first. Poor sites usually remain near the bottom of the SERPs, around positions 900 to 1000.

If Google does buy into this, the so-called bad-phrase sites might go into the supplemental listings.

"PhraseRank" is a bad label to pin on this donkey, in my opinion. "PhraseMetric Filtration" might work better, but I doubt it has the buzz or zing value that SEOs will want to use. Given that so many SEOs now wrongly apply "TrustRank" (a Yahoo!-coined phrase) to Google's trust filtering, and "Latent Semantic Indexing" to Google's non-semantic indexing, it's almost certain they'll adopt the erroneous PhraseRank (after all, this is not rocket science -- it's just SEO).

Still, I propose that we open the floor for nominations for more accurate labels to describe what these patents portend.

Here are a few suggestions to kick off the list:

"Latent Phrasic Symbology"

"InterPhrase Pseudo-Semantic Analysis"

"PostPhrasal Spam Syndrome"

"PhraseGraphic Filtering"

"PhraseToponymy"

"Phrase-based Filtering"

"Phrase Index Scoring"

"Phrase: Got Spam?"

Hey, search engines web, that unige.ch page would be considered a sitemap with 242 links.

On a more serious note, let me resurrect the ancient concept of "power keyword optimization" (which I neither coined nor had any part in defining or popularizing). The old PKO concept could be summed up thusly (with respect to the KEYWORDS meta tag):

Given that you may want to optimize for "michael martinez", "michael martinez blows", and "blows me down", you could define a meta tag value of "michael martinez blows me down" and it would cover all those ideas.

So, I propose we refer to the coming swarm of hyperoptimization techniques as "Search Phrase Optimization".

Or SPO.

You could then have Search Phrase Optimization Techniques (SPOTs), Search Phrase Over Optimization Networks (SPOONs), Search Phrase Optimization Keyword Engines (SPOKEs), and Search Phrase Optimization Indexing Lexicons (SPOILs).

SPOTs, SPOONs, SPOKEs, and SPOILs will soon become the popular SEO buzzwords, displacing "quality links", "relevant backlinks", and "my pages are highly optimized with a PR of 7 but they don't appear in Google".

Hi Search Engines WEB,
You wouldn't say that these patents might be influenced by something like the Graham Spencer penned System and method for accelerated query evaluation of very large full-text databases, which allows for phrase-based indexing within a separate cache? Probably a coincidence that Graham Spencer now works for Google, but kind of fun to see.

You're having too much fun, Michael, trying to home in on the creation of new buzzwords for the industry. It's hard enough that folks might think that the "page" in pagerank has something to do with "pages" instead of being taken from its inventor's name.

It wouldn't be hard for spammers to beat this filter by throwing in varying numbers of related phrases; eventually something will stick. Also, by keeping a few test pages up, spammers will be able to detect when this filter is toned down or cranked up, and adjust accordingly.

Good points. I think sometimes the battle against spam is an incremental one, in which the gains aren't going to be measured in defeating it completely, overnight. Rather, its demise might come from making spamming computationally more expensive and more difficult.

Regardless, it potentially has a number of other benefits in addition to acting as a spam filter, and it doesn't replace existing indexing and relevancy methods, but rather adds a layer on top of them.

Somehow I'm reminded of the long-running comic in the pages of Mad Magazine, "Spy vs. Spy", but I don't recall either character applying for a patent for their countermeasures, even when vanquishing their foe in the final frame.

Comment by MondoTofu [TypeKey Profile Page] | January 2, 2007 4:23 PM
