PhraseRank, Not PageRank, To Fight Search Spam


Can indexing phrases from pages be an effective approach to identifying and filtering keyword-stuffed pages and honeypot pages aimed at attracting visitors solely to have them click on ads?

A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.

Ok, so “Phraserank” doesn’t appear in the document. But it’s a term that might be worth thinking about. It may do much more than just help fight spam.

Danny noticed that I had a long writeup this morning on the Anna Patterson penned filing, and I think that this passage from the document jumped out at both of us:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
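To make the idea in that passage concrete, here is a minimal sketch of flagging documents whose related-phrase count sits far above what is typical for the collection. The patent filing only speaks of a "statistically significant deviation" from an expected number; the median/MAD score, the 3.5 cutoff, the function name and the sample counts below are all my own assumptions for illustration, not anything taken from the filing.

```python
from statistics import median

def flag_spam_candidates(related_phrase_counts, threshold=3.5):
    """Return ids of documents with an unusually HIGH related-phrase count.

    related_phrase_counts maps a document id to the number of related
    phrases found in that document. The scoring below (a robust z-score
    built from the median and the median absolute deviation) is an
    assumed stand-in for the unspecified test in the patent filing.
    """
    counts = list(related_phrase_counts.values())
    if not counts:
        return []

    expected = median(counts)                          # assumed "expected number"
    mad = median(abs(c - expected) for c in counts)    # typical spread around it
    if mad == 0:
        # No spread to measure against, so this sketch flags nothing.
        return []

    flagged = []
    for doc_id, count in related_phrase_counts.items():
        # 0.6745 rescales MAD so the score is comparable to a z-score.
        score = 0.6745 * (count - expected) / mad
        if score > threshold:                          # one-sided: only excess counts
            flagged.append(doc_id)
    return flagged

# Hypothetical collection: nine ordinary pages and one honeypot page.
counts = {
    "page-a": 8, "page-b": 12, "page-c": 15, "page-d": 20, "page-e": 10,
    "page-f": 14, "page-g": 9, "page-h": 18, "page-i": 11, "honeypot": 400,
}
print(flag_spam_candidates(counts))
```

I used the median rather than the mean here because a handful of pages with hundreds of related phrases would drag a mean-based expectation upward and hide the very pages the filter is meant to catch.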

This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:

*assigned to Google

The inventor, Anna Patterson, wrote a search engine for the Internet Archive a couple of years back, as a demo, which disappeared sometime around when she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:

There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



Bill Slawski


8 COMMENTS ON PhraseRank, Not PageRank, To Fight Search Spam

★ ★ Search Engines WEB ★ ★,

_____________________________________________

This patent is basically an advanced branch of CONCEPT SEARCHING (remember EXCITE) – and fundamentally, is already being used to some extent in some legal search applications.

However, it is possible, that classic sites like this:
cuiwww.unige.ch/meta-index.html
(which, for years, until recently, has been in the top 20 on Google for the term SEARCH ENGINES) would be seen as spammy.

There will also be false positives with this method, only because of aggressive SEOing or extremely information-packed homepages. Or the aggressive use of synonyms or acronyms to cover all bases. Also, those who write in a taxonomy style (just using keywords without stop words) will suffer!!!!!!!

However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.

But it must be remembered that, the new priority of search engines ALGOs – analyzing the anchor text/ back links from high trust ranked sites, makes it now nearly impossible for those honeypot sites to get high on the SERPs.

Most searchers do not use complex search terms – so many no longer get spam sites to the degree they would have gotten a few years ago,
And for those who do use complex terms, usually reference sites come up first. Poor sites usually remain near the bottom 900 – 1000 end of the serps

If Google does buy into this, the so-called bad phrases sites might go into the supplemental listings.



Michael Martinez,

“PhraseRank” is a bad label to pin on this donkey, in my opinion. “PhraseMetric Filtration” might work better, but I doubt it has the buzz or zing value that SEOs will want to use. Given that so many SEOs now wrongly apply “TrustRank” (a Yahoo!-coined phrase) to Google’s trust filtering, and “Latent Semantic Indexing” to Google’s non-semantic indexing, it’s almost certain they’ll adopt the erroneous PhraseRank (after all, this is not rocket science; it’s just SEO).

Still, I propose that we open the floor for nominations for more accurate labels to describe what these patents portend.

Here are a few suggestions to kick off the list:

“Latent Phrasic Symbology”

“InterPhrase Pseudo-Semantic Analysis”

“PostPhrasal Spam Syndrome”

“PhraseGraphic Filtering”

“PhraseToponymy”

“Phrase-based Filtering”

“Phrase Index Scoring”

“Phrase: Got Spam?”



Rose Water,

Hey, search engines web, that unige.ch page would be considered a sitemap with 242 links.



Michael Martinez,

On a more serious note, let me resurrect the ancient concept of “power keyword optimization” (which I neither coined nor had any part in defining or popularizing). The old PKO concept could be summed up thusly (with respect to the KEYWORDS meta tag):

Given that you may want to optimize for “michael martinez”, “michael martinez blows”, and “blows me down”, you could define a meta tag value of “michael martinez blows me down” and it would cover all those ideas.

So, I propose we refer to the coming swarm of hyperoptimization techniques as “Search Phrase Optimization”.

Or SPO.

You could then have Search Phrase Optimization Techniques (SPOTs), Search Phrase Over Optimization Networks (SPOONs), Search Phrase Optimization Keyword Engines (SPOKEs), and Search Phrase Optimization Indexing Lexicons (SPOILs).

SPOTs, SPOONs, SPOKEs, and SPOILs will soon become the popular SEO buzzwords, displacing “quality links”, “relevant backlinks”, and “my pages are highly optimized with a PR of 7 but they don’t appear in Google”.



Bill Slawski,

Hi Search Engines WEB,
You wouldn’t say that these patents might be influenced by something like the Graham Spencer penned System and method for accelerated query evaluation of very large full-text databases, which allows for phrase-based indexing within a separate cache? Probably a coincidence that Graham Spencer now works for Google, but kind of fun to see.

You’re having too much fun, Michael, trying to home in on the creation of new buzzwords for the industry. It’s hard enough that folks might think that the “page” in pagerank has something to do with “pages” instead of being taken from its inventor’s name.



Halfdeck,

It wouldn’t be hard for spammers to beat this filter by throwing varying degrees of related phrases – eventually something will stick. Also, by having a few test pages up, spammers will be able to detect when this filter is toned down or cranked up, and adjust accordingly.



Bill Slawski,

Good points. I think sometimes the battle against spam is an incremental one, in which the gains aren’t going to be measured in defeating it completely, overnight. Rather, its demise might come from making spamming computationally more expensive and more difficult.

Regardless, it potentially has a number of other benefits in addition to acting as a spam filter, and it doesn’t replace existing indexing and relevancy methods, but rather adds a layer on top of them.



MondoTofu,

Somehow I’m reminded of the long-running comic in the pages of Mad Magazine, “Spy vs. Spy”, but I don’t recall either character applying for a patent for their counter-measures, even when vanquishing their foe in the final frame.



