Dec 29, 2006 at 7:22am ET by Bill Slawski
Can indexing phrases from pages be an effective approach to identifying and filtering keyword-stuffed pages and honeypot pages aimed at attracting visitors solely to have them click on ads?
A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.
Ok, so “Phraserank” doesn’t appear in the document. But it’s a term that might be worth thinking about. It may do much more than just help fight spam.
Danny noticed that I had a long writeup this morning on the Anna Patterson-penned filing, and I think that this passage from the document jumped out at both of us:
From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
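To make the idea in that passage concrete, here is a minimal sketch of flagging documents whose related-phrase count sits far above what the collection would lead you to expect. The filing excerpt doesn't say which statistical test is used, so the z-score comparison, the function name flag_spam_candidates, the z_threshold value, and the sample counts below are all my own assumptions for illustration, not anything taken from the patent application.

```python
from statistics import mean, stdev


def flag_spam_candidates(related_phrase_counts, baseline_counts, z_threshold=3.0):
    """Return ids of documents whose related-phrase count is far above the baseline.

    related_phrase_counts: dict mapping a document id to its related-phrase count.
    baseline_counts: counts from documents assumed to be typical of the collection.
    z_threshold: hypothetical cutoff, in standard deviations above the expected count.
    """
    expected = mean(baseline_counts)   # expected number of related phrases
    spread = stdev(baseline_counts)    # sample standard deviation of the baseline
    flagged = []
    for doc_id, count in related_phrase_counts.items():
        # With zero spread no z-score can be computed, so nothing is flagged.
        z = (count - expected) / spread if spread > 0 else 0.0
        if z > z_threshold:
            flagged.append(doc_id)
    return flagged


# Made-up baseline in the "8 to 20" range the passage calls normal.
baseline = [8, 10, 12, 14, 15, 17, 20]
docs = {"normal-page": 13, "dense-page": 24, "stuffed-page": 400}
print(flag_spam_candidates(docs, baseline))
```

A one-sided test is used here because the passage only talks about an excessive number of related phrases; a page with unusually few would not be flagged by this sketch.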
This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:
*assigned to Google
The inventor, Anna Patterson, wrote a search engine for the Internet Archive a couple of years back, as a demo, which disappeared sometime around when she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:
There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.
_____________________________________________
This patent is basically an advanced branch of CONCEPT SEARCHING (remember EXCITE) and, fundamentally, is already being used to some extent in some legal search applications.
However, it is possible that classic sites like this:
cuiwww.unige.ch/meta-index.html
(which, for years, until recently, was in the top 20 on Google for the term SEARCH ENGINES) would be seen as spammy.
There will also be false positives with this method, simply because of aggressive SEOing, extremely information-packed homepages, or the aggressive use of synonyms or acronyms to cover all bases. Also, those who write in a taxonomy style (just using keywords without stop words) will suffer!!!!!!!
However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.
But it must be remembered that the new priority of search engine ALGOs, analyzing the anchor text and backlinks from high trust ranked sites, now makes it nearly impossible for those honeypot sites to get high on the SERPs.
Most searchers do not use complex search terms, so many no longer get spam sites to the degree they would have gotten a few years ago.
And for those who do use complex terms, reference sites usually come up first. Poor sites usually remain near the bottom (900 – 1000) end of the SERPs.
If Google does buy into this, the so-called bad phrases sites might go into the supplemental listings.
“PhraseRank” is a bad label to pin on this donkey, in my opinion. “PhraseMetric Filtration” might work better, but I doubt it has the buzz or zing value that SEOs will want to use. Given that so many SEOs now wrongly apply “TrustRank” (a Yahoo!-coined phrase) to Google’s trust filtering, and “Latent Semantic Indexing” to Google’s non-semantic indexing, it’s almost certain they’ll adopt the erroneous PhraseRank (after all, this is not rocket science; it’s just SEO).
Still, I propose that we open the floor for nominations for more accurate labels to describe what these patents portend.
Here are a few suggestions to kick off the list:
“Latent Phrasic Symbology”
“InterPhrase Pseudo-Semantic Analysis”
“PostPhrasal Spam Syndrome”
“PhraseGraphic Filtering”
“PhraseToponymy”
“Phrase-based Filtering”
“Phrase Index Scoring”
“Phrase: Got Spam?”
Hey, search engines web, that unige.ch page would be considered a sitemap with 242 links.
On a more serious note, let me resurrect the ancient concept of “power keyword optimization” (which I neither coined nor had any part in defining or popularizing). The old PKO concept could be summed up thusly (with respect to the KEYWORDS meta tag):
Given that you may want to optimize for “michael martinez”, “michael martinez blows”, and “blows me down”, you could define a meta tag value of “michael martinez blows me down” and it would cover all those ideas.
So, I propose we refer to the coming swarm of hyperoptimization techniques as “Search Phrase Optimization”.
Or SPO.
You could then have Search Phrase Optimization Techniques (SPOTs), Search Phrase Over Optimization Networks (SPOONs), Search Phrase Optimization Keyword Engines (SPOKEs), and Search Phrase Optimization Indexing Lexicons (SPOILs).
SPOTs, SPOONs, SPOKEs, and SPOILs will soon become the popular SEO buzzwords, displacing “quality links”, “relevant backlinks”, and “my pages are highly optimized with a PR of 7 but they don’t appear in Google”.
Hi Search Engines WEB,
You wouldn’t say that these patents might be influenced by something like the Graham Spencer-penned System and method for accelerated query evaluation of very large full-text databases, which allows for phrase-based indexing within a separate cache? Probably a coincidence that Graham Spencer now works for Google, but kind of fun to see.
You’re having too much fun, Michael, trying to home in on the creation of new buzzwords for the industry. It’s hard enough that folks might think that the “page” in pagerank has something to do with “pages” instead of being taken from its inventor’s name.
It wouldn’t be hard for spammers to beat this filter by throwing varying degrees of related phrases – eventually something will stick. Also, by having a few test pages up, spammers will be able to detect when this filter is toned down or cranked up, and adjust accordingly.
Good points. I think sometimes the battle against spam is an incremental one, in which the gains aren’t going to be measured in defeating it completely, overnight. Rather, its demise might come from making spamming computationally more expensive and more difficult.
Regardless, it potentially has a number of other benefits in addition to acting as a spam filter, and it doesn’t replace existing indexing and relevancy methods, but rather adds a layer on top of them.
Somehow I’m reminded of the long-running comic in the pages of Mad Magazine, “Spy vs. Spy,” but I don’t recall either character applying for a patent for their counter-measures, even when vanquishing their foe in the final frame.