
PhraseRank, Not PageRank, To Fight Search Spam


Bill Slawski on December 29, 2006 at 7:22 am


Can indexing the phrases on pages be an effective way to identify and filter keyword-stuffed pages and honeypot pages that exist solely to attract visitors and have them click on ads?

A new patent application published yesterday and assigned to Google, Detecting spam documents in a phrase based information retrieval system, presents a reasonable argument in favor of the method.

OK, so “PhraseRank” doesn’t appear in the document. But it’s a term that might be worth thinking about. It may do much more than just help fight spam.

Danny noticed that I had a long writeup this morning on the filing, which was penned by Anna Patterson, and I think that this passage from the document jumped out at both of us:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
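The test described in that passage can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the method in the filing: the patent application does not say which statistic it uses, so the sketch swaps in a modified z-score built on the median and the median absolute deviation, with the commonly used cutoff of 3.5. The function name, the document IDs, and the counts are all hypothetical.

```python
from statistics import median


def flag_spam(related_phrase_counts, threshold=3.5):
    """Return IDs of documents with an unusually HIGH related-phrase count.

    Assumption: "statistically significant deviation" is approximated here
    with a modified z-score (median / MAD based); the filing does not name
    a specific statistic.
    """
    counts = list(related_phrase_counts.values())
    expected = median(counts)  # stand-in for the "expected number"
    mad = median([abs(c - expected) for c in counts])
    if mad == 0:
        # No spread in the collection, so nothing can be called a deviation.
        return []

    flagged = []
    for doc_id, count in related_phrase_counts.items():
        # One-sided: only an excess of related phrases counts as suspicious.
        score = 0.6745 * (count - expected) / mad
        if score > threshold:
            flagged.append(doc_id)
    return flagged


# Hypothetical collection: most documents fall in the 8-20 range from the
# quoted passage, two fall in the 100-1000 range.
counts = {
    "doc1": 8, "doc2": 12, "doc3": 15, "doc4": 20, "doc5": 10,
    "doc6": 14, "doc7": 9, "doc8": 18, "doc9": 11,
    "spam1": 400, "spam2": 750,
}
print(flag_spam(counts))
```

A median-based score is used rather than a plain mean and standard deviation because a few extreme spam documents would inflate both of those and could mask themselves in a small collection.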

This is the sixth published patent application from Anna Patterson on some aspect of phrase-based indexing. Three of them are listed in the USPTO assignment database as being assigned to Google. Here are the others:

Multiple index based information retrieval system* (20060106792)
Phrase-based searching in an information retrieval system* (20060031195)
Phrase-based indexing in an information retrieval system (20060020607)
Phrase-based generation of document descriptions (20060020571)
Phrase identification in an information retrieval system (20060018551)

*assigned to Google

The inventor, Anna Patterson, wrote a search engine for the Internet Archive a couple of years back as a demo, which disappeared around the time she joined Google. Her four-page article, Why Writing Your Own Search Engine is Hard, is an excellent introduction to phrase-based indexing. My favorite quote:

There is a major field of study about the different things to index on. Don’t get a Ph.D.; just index on words. Words are what people search for; they don’t search for N-Grams or letters or PTrees or locations in streams, so any other method other than the simplest will make you seem clever. But, hey, writing your own search engine is hard enough. Save what cleverness you own for ranking.
