Search engine Blekko has announced that it has now blocked 1.1 million web sites from its search results using a new system it calls “AdSpam,” and new pages from any web site won’t be added if they don’t pass muster.
Blocking Spam Before It Gets In
From the Blekko press release:
“This new technology will block spam before it ever shows up in a search results page,” said Rich Skrenta, CEO of Blekko. “We have algorithmically identified multiple spam signals for every page in our index. Eliminating those domains from our index has dramatically cleaned up our search results.”
And from the company’s blog post:
Today, we are taking the next giant step forward with the launch of Blekko’s new AdSpam algorithm. This new technology will dramatically change search. It’s the first search algorithm ever created to find spam rather than rank results. The algorithm is specifically designed to recognize pages which are spam and eliminate them before they ever appear in search results.
That’s interesting, this idea of blocking spam pages before they are added to a search index. It may have been done before, but if so, I don’t recall by which service. Certainly, it was never something noteworthy enough for me to recall. If you keep the spam out entirely, potentially that makes for cleaner results.
Then again, it’s also something that’s far more likely to benefit Blekko than Google or Bing. The reason is that both those search engines have far more mature search algorithms than Blekko, so they might already do a better job of keeping spam out of the top results, even though the spam pages themselves are included in the overall searchable index — which is like a big book of all the web pages they’ve collected.
More important, both Google and Bing have such huge resources that indexing a million or even a billion spam pages doesn’t really leave less “room” to store the “good” stuff. They have thousands of servers. Storage for them is relatively cheap. But for Blekko, every page of spam they index is potentially more costly.
As for the “AdSpam” name — that’s terrible. I gather it comes from the idea that these are pages loaded with ads — but I find it pretty confusing.
Previously In “Banned On Blekko”
Last month, Blekko gained some attention by banning 20 “spam” sites from its index. From our coverage then:
Rich Skrenta, Blekko’s CEO, confirmed the ban with us today. He told us Blekko has decided to ban the “top 20 spam sites from blekko’s index entirely, based on our users click /spam on results.” This includes ehow.com, one of Demand Media’s top revenue generating web sites.
But wait. Are these the top 20 spam sites or, as Blekko’s release said today, the “top 20 content farms”? Both. Neither. It’s confusing.
Spam Is In The Eye Of The Search Engine
Search engine spam is whatever a search engine decides it to be. For example, both Google and Bing would generally consider pages that “cloak” — show content to the user that’s different than what their automated crawlers see — to be spam. Both agree on many other tactics that would be considered spam, but they may not agree precisely. Nor will they agree with Blekko.
Virtually none of the sites above, from my quick review, would be considered spam by either Google or Bing. Certainly Google caused some of them to lose rankings in its recent Farmer / Panda update. But that wasn’t because they were spamming Google. It was because they had some content that the new algorithm decided to no longer reward as well as in the past.
In short, low-quality content doesn’t equal spam, not to Google or Bing. It’s just something they won’t rank as highly, which is exactly what their algorithms are supposed to do.
With Blekko’s initial block list, it decided that sites were spam based on user reports, regardless of whether those sites violated any traditional search engine spam guidelines. With the latest move, Blekko is further deciding that low quality equals spam. Again, from the post:
So what is exactly is AdSpam? In short, it is a machine-learning algorithm that examines pages for specific spam signals — the presence of multiple display ad positions on a single page and thin to zero content.
The end result of Blekko’s approach versus Google’s could potentially be the same. Google aims to keep “shallow” content from showing up for many searches, even though the pages are among those it has collected. Blekko is also aiming to keep shallow content out — but unlike Google, it applies the “spam” label to such content and is preventing it from being indexed in the first place.
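To make that difference concrete, here is a minimal Python sketch of the two approaches. It is not Blekko’s or Google’s code. The `ad-slot` class marker, the thresholds and every function name are hypothetical, invented purely for illustration, and Blekko describes AdSpam as machine-learned rather than a fixed rule like this one.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
MAX_AD_SLOTS = 3   # "multiple display ad positions on a single page"
MIN_WORDS = 150    # "thin to zero content"


@dataclass
class Page:
    url: str
    html: str


def count_ad_slots(html: str) -> int:
    """Count elements whose class attribute contains 'ad-slot' (an assumed marker)."""
    return len(re.findall(r'class="[^"]*\bad-slot\b[^"]*"', html))


def count_words(html: str) -> int:
    """Crudely strip tags and count the words that remain."""
    text = re.sub(r"<[^>]+>", " ", html)
    return len(text.split())


def looks_like_adspam(page: Page) -> bool:
    """Flag pages that are both ad-heavy and thin on content."""
    return (count_ad_slots(page.html) >= MAX_AD_SLOTS
            and count_words(page.html) < MIN_WORDS)


def build_index(pages):
    """Index-time filtering: flagged pages never enter the index."""
    return [p.url for p in pages if not looks_like_adspam(p)]


def rank_with_demotion(pages):
    """Rank-time alternative: keep every page, but sort flagged pages last."""
    return [p.url for p in sorted(pages, key=looks_like_adspam)]
```

In the first function a flagged page can never be returned for any search. In the second it stays in the collection and simply sorts to the bottom, so it could still surface if nothing better exists.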
What’s Gone Now?
Over at the New York Times, Claire Cain Miller had a good piece on Blekko’s move and whether good sites might be harmed. No, says Blekko:
Though it seems like many legitimate sites could be considered spam under this algorithm — newspapers cover a wide variety of topics, for instance, and many bloggers may be amateur writers but are experts in their fields — Mr. Skrenta said that when he combed through thousands of sites that the algorithm banned, he found only two false positives.
Examples of the sites Blekko now bans: cheap-refrigerators.net, best-weddinggifts and Boston.diningguide.com.
Does It Help?
I haven’t done any widespread testing. But noticing that the cheap-refrigerators.net web site above, which was confirmed as removed, is titled “Refrigerators Buying Tips,” I thought a search on that topic might be interesting:
I’ve removed the ad that was at the top of results from both Blekko and Google, so you can focus on the top five editorial results. My take:
- OK, but short and basic
- Irrelevant – you have to search further in the site to get tips
- Good basic tips from major retailer
- Thin content that just links to more thin content
- Irrelevant — about water filters for refrigerators
- Good, substantial multi-part article
- Good, short tips leading to further reviews
- Good tips
- Good tips from Consumer Reports, a major trusted brand
- OK tips, about three years old
Sorry, Blekko, but I can’t say that dropping that refrigerator site, much less the other 1 million or so sites, helped you at all against Google for this particular query. Using the /reviews slashtag did help: it got one of the good sites that Google had to be listed first. But the other three good sites that Google had in the top five results didn’t show.
Moreover, most typical searchers aren’t going to use slashtags — and there’s even less reason to use them when the same search at Google brings up better results, no slashtag required.
The Human Factor
Blekko’s post also says:
Unlike algorithms used by other search engines, AdSpam is being used in conjunction with human curation to detect to continue the War on Spam.
True, Blekko is making use of human efforts to decide what’s good and bad. In particular, Blekko recently partnered with Stack Exchange (formerly Stack Overflow) for curation of programming and technical topics.
Expect Google to push back on the entire “it has no humans” aspect, however. It’s done this before, the last time when both Mahalo and Search Wikia tried that angle. Google stressed that it has human reviewers, who serve as a sort of “double-check” on the computer algorithm changes it makes, for example.
Google stressed this again recently when it made the Farmer Update, to highlight that the computer-based change seemed to be supported by the human data it seeks to model. Google’s also suggested that what people block using its Chrome Personal Blocklist extension could be data that’s used in its search algorithm, in the future.
Still, Google has nothing like the slashtag curation that Blekko offers. Having said that, Blekko has yet to show that this curation is turning into higher quality results that are attracting significant users from Google, much less Bing. But on the PR front, there’s no doubt that Blekko’s moves are keeping pressure on Google to improve as well.
I’ve not had a chance to talk with Blekko more about the system, as I’m currently at our SMX West search marketing conference in San Jose. Blekko — along with Google and Bing — is taking part in our “The Spam Police” and “Ask The Search Engines” sessions tomorrow, so I expect more specific under-the-hood details will emerge from that. Stay tuned (and also watch for related coverage on Techmeme). Also see the articles below for more background on some of the things I’ve mentioned above.