PhraseRank, Not PageRank, To Fight Search Spam

Can indexing the phrases on pages be an effective way to identify and filter out keyword-stuffed pages, and honeypot pages built solely to attract visitors who will click on ads?

A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.

Ok, so “Phraserank” doesn’t appear in the document. But it’s a term that might be worth thinking about. It may do much more than just help fight spam.

Danny noticed that I had a long writeup this morning on the filing, which was penned by Anna Patterson, and I think that this passage from the document jumped out at both of us:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
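To make the idea in that passage concrete, here is a minimal sketch of flagging documents whose related-phrase count sits far above what is typical for the collection. This is not the patent's method: the filing speaks only of a "statistically significant deviation" from an expected number, and the choice of the median as the expected count, the median absolute deviation as the spread, the threshold of 5, the function name `flag_spam_candidates`, and the sample counts are all assumptions made for illustration.

```python
from statistics import median


def flag_spam_candidates(related_phrase_counts, threshold=5.0):
    """Return ids of documents whose related-phrase count is far above typical.

    related_phrase_counts maps a document id to the number of related
    phrases found in that document (hypothetical input format).
    """
    counts = list(related_phrase_counts.values())
    # Assumption: use the collection median as the "expected" count.
    expected = median(counts)
    # Median absolute deviation: a spread estimate that a handful of
    # very large spam counts cannot inflate the way a mean/stddev would.
    mad = median([abs(c - expected) for c in counts])
    spread = max(mad, 1.0)  # avoid dividing by zero when all counts match
    return sorted(
        doc_id
        for doc_id, count in related_phrase_counts.items()
        if (count - expected) / spread > threshold
    )


# Made-up counts: ordinary pages in the 8-20 range, two stuffed pages.
counts = {
    "recipe-page": 9,
    "news-story": 14,
    "product-review": 20,
    "how-to-guide": 11,
    "honeypot-a": 240,
    "honeypot-b": 850,
}
print(flag_spam_candidates(counts))  # prints ['honeypot-a', 'honeypot-b']
```

Only counts well above the expected value are flagged here, since the passage describes spam as having an excessive number of related phrases rather than too few.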

This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:

*assigned to Google

The inventor, Anna Patterson, wrote a search engine for the Internet Archive a couple of years back, as a demo, which disappeared sometime around when she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:

There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Bill Slawski is the Director of Search Marketing for Go Fish Digital and the editor of SEO by the Sea. He has been doing SEO and web promotion since the mid-90s, and was a legal and technical administrator in the highest level trial court in Delaware.

  • http://search-engines-web.com/ ★ ★ Search Engines WEB ★ ★

    This patent is basically an advanced branch of CONCEPT SEARCHING (remember EXCITE) – and fundamentally, it is already being used to some extent in some legal search applications.

    However, it is possible, that classic sites like this:
    cuiwww.unige.ch/meta-index.html
    (which, for years, until recently, has been in the top 20 on Google for the term SEARCH ENGINES) would be seen as spammy.

    There will also be false positives with this method, if only because of aggressive SEOing or extremely information-packed homepages, or the aggressive use of synonyms or acronyms to cover all bases. Also, those who write in a taxonomy style (just using keywords without stop words) will suffer!

    However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.

    But it must be remembered that the new priority of search engine ALGOs – analyzing the anchor text and backlinks from highly trusted sites – now makes it nearly impossible for those honeypot sites to get high on the SERPs.

    Most searchers do not use complex search terms, so many no longer get spam sites to the degree they would have a few years ago.
    And for those who do use complex terms, reference sites usually come up first. Poor sites usually remain near the bottom of the SERPs, around positions 900 – 1000.

    If Google does buy into this, the so-called bad phrases sites might go into the supplemental listings.

  • http://seo-theory.blogspot.com/ Michael Martinez

    “PhraseRank” is a bad label to pin on this donkey, in my opinion. “PhraseMetric Filtration” might work better, but I doubt it has the buzz or zing value that SEOs will want to use. Given that so many SEOs now wrongly apply “TrustRank” (a Yahoo!-coined phrase) to Google’s trust filtering, and “Latent Semantic Indexing” to Google’s non-semantic indexing, it’s almost certain they’ll adopt the erroneous PhraseRank (after all, this is not rocket science; it’s just SEO).

    Still, I propose that we open the floor for nominations for more accurate labels to describe what these patents portend.

    Here are a few suggestions to kick off the list:

    “Latent Phrasic Symbology”

    “InterPhrase Pseudo-Semantic Analysis”

    “PostPhrasal Spam Syndrome”

    “PhraseGraphic Filtering”

    “PhraseToponymy”

    “Phrase-based Filtering”

    “Phrase Index Scoring”

    “Phrase: Got Spam?”

  • http://www.royalessence.com/ Rose Water

    Hey, search engines web, that unige.ch page would be considered a sitemap with 242 links.

  • http://seo-theory.blogspot.com/ Michael Martinez

    On a more serious note, let me resurrect the ancient concept of “power keyword optimization” (which I neither coined nor had any part in defining or popularizing). The old PKO concept could be summed up thus (with respect to the KEYWORDS meta tag):

    Given that you may want to optimize for “michael martinez”, “michael martinez blows”, and “blows me down”, you could define a meta tag value of “michael martinez blows me down” and it would cover all those ideas.

    So, I propose we refer to the coming swarm of hyperoptimization techniques as “Search Phrase Operation”.

    Or SPO.

    You could then have Search Phrase Optimization Techniques (SPOTs), Search Phrase Over Optimization Networks (SPOONs), Search Phrase Optimization Keyword Engines (SPOKEs), and Search Phrase Optimization Indexing Lexicons (SPOILs).

    SPOTs, SPOONs, SPOKEs, and SPOILs will soon become the popular SEO buzzwords, displacing “quality links”, “relevant backlinks”, and “my pages are highly optimized with a PR of 7 but they don’t appear in Google”.

  • http://www.seobythesea.com Bill Slawski

    Hi Search Engines WEB,
    You wouldn’t say that these patents might be influenced by something like the Graham Spencer penned System and method for accelerated query evaluation of very large full-text databases, which allows for phrase-based indexing within a separate cache? Probably a coincidence that Graham Spencer now works for Google, but kind of fun to see.

    You’re having too much fun, Michael, trying to home in on the creation of new buzzwords for the industry. It’s hard enough that folks might think that the “page” in pagerank has something to do with “pages” instead of being taken from its inventor’s name.

  • http://www.seo4fun.com/blog/ Halfdeck

    It wouldn’t be hard for spammers to beat this filter by throwing varying degrees of related phrases – eventually something will stick. Also, by having a few test pages up, spammers will be able to detect when this filter is toned down or cranked up, and adjust accordingly.

  • http://www.seobythesea.com Bill Slawski

    Good points. I think sometimes the battle against spam is an incremental one, in which the gains aren’t going to be measured in defeating it completely, overnight. Rather, its demise might come from making spamming computationally more expensive and more difficult.

    Regardless, it potentially has a number of other benefits in addition to acting as a spam filter, and it doesn’t replace existing indexing and relevancy methods, but rather adds a layer on top of them.

  • MondoTofu

    Somehow I’m reminded of the long-running comic in the pages of Mad Magazine, “Spy vs. Spy,” but I don’t recall either character applying for a patent for their counter-measures, even when vanquishing their foe in the final frame.
