Blekko Blocks More “Spam” Domains: 1.1 Million Of Them

Search engine Blekko has announced that it has now blocked 1.1 million web sites from its search results using a new system it calls “AdSpam,” and new pages from any web site won’t be added if they don’t pass muster.

Blocking Spam Before It Gets In

From the Blekko press release:

“This new technology will block spam before it ever shows up in a search results page,” said Rich Skrenta, CEO of Blekko. “We have algorithmically identified multiple spam signals for every page in our index. Eliminating those domains from our index has dramatically cleaned up our search results.”

And from the company’s blog post:

Today, we are taking the next giant step forward with the launch of Blekko’s new AdSpam algorithm. This new technology will dramatically change search. It’s the first search algorithm ever created to find spam rather than rank results. The algorithm is specifically designed to recognize pages which are spam and eliminate them before they ever appear in search results.

That’s interesting, this idea of blocking spam pages before they are added to a search index. It may have been done before, but if so, I don’t recall by which service. Certainly, it was never something noteworthy enough for me to recall. If you keep the spam out entirely, potentially that makes for cleaner results.

Then again, it’s also something that’s far more likely to benefit Blekko than Google or Bing. The reason is that both those search engines have far more mature search algorithms than Blekko, so they might already do a better job of keeping spam out of the top results, even though the spam pages themselves are included in the overall searchable index — which is like a big book of all the web pages they’ve collected.

More important, both Google and Bing have huge resources where indexing a million or even a billion spam pages doesn’t really leave less “room” to store the “good” stuff. They have thousands of servers. Storage for them is relatively cheap. But for Blekko, every page of spam they index is potentially more costly.

As for the “AdSpam” name — that’s terrible. I gather it comes from the idea that these are pages loaded with ads — but I find it pretty confusing.

Previously In “Banned On Blekko”

Last month, Blekko gained some attention by banning 20 “spam” sites from its index. From our coverage then:

Rich Skrenta, Blekko’s CEO confirmed the ban with us today. He told us Blekko has decided to ban the “top 20 spam sites from blekko’s index entirely, based on our users click /spam on results.” This includes ehow.com, one of Demand Media’s top revenue generating web sites.

The sites?

  • ehow.com
  • experts-exchange.com
  • naymz.com
  • activehotels.com
  • robtex.com
  • encyclopedia.com
  • fixya.com
  • chacha.com
  • 123people.com
  • download3k.com
  • petitionspot.com
  • thefreedictionary.com
  • networkedblogs.com
  • buzzillions.com
  • shopwiki.com
  • wowxos.com
  • answerbag.com
  • allexperts.com
  • freewebs.com
  • copygator.com

But wait. Are these the top 20 spam sites or, as Blekko’s release said today, the “top 20 content farms”? Both. Neither. It’s confusing.

Spam Is In The Eye Of The Beholder Search Engine

Search engine spam is whatever a search engine decides it to be. For example, both Google and Bing would generally consider pages that “cloak” — show content to the user that’s different than what their automated crawlers see — to be spam. Both agree on many other tactics that would be considered spam, but they may not agree precisely. Nor will they agree with Blekko.
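
To make the cloaking example concrete, here is a minimal sketch of how such a check might work, assuming the text served to a crawler and the text served to a regular visitor have already been fetched. The function name, the word-overlap measure and the 0.5 threshold are all illustrative assumptions, not anything Google or Bing has described.

```python
def looks_cloaked(crawler_text: str, user_text: str, threshold: float = 0.5) -> bool:
    """Flag a page when the crawler and user versions share few words.

    Hypothetical heuristic: Jaccard overlap of the two word sets.
    """
    crawler_words = set(crawler_text.lower().split())
    user_words = set(user_text.lower().split())
    if not crawler_words and not user_words:
        return False  # two empty versions are identical, so not cloaked
    overlap = len(crawler_words & user_words) / len(crawler_words | user_words)
    return overlap < threshold
```

A real system would need to tolerate legitimate differences, such as personalized or rotating content, which is one reason search engines may draw the line in different places.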

Virtually none of the sites above, from my quick review, would be considered spam by either Google or Bing. Certainly Google caused some of them to lose rankings in its recent Farmer / Panda update. But that wasn’t because they were spamming Google. It was because they had some content that the new algorithm decided to no longer reward as well as in the past.

In short, low-quality content doesn’t equal spam, not to Google or Bing. It’s just something they won’t rank as highly, which is exactly what their algorithms are supposed to do.

With Blekko’s initial block list, it decided that sites were spam based on user reports, regardless of whether those sites violated any traditional search engine spam guidelines. With the latest move, Blekko is further deciding that low quality equals spam. Again, from the post:

So what is exactly is AdSpam? In short, it is a machine-learning algorithm that examines pages for specific spam signals — the presence of multiple display ad positions on a single page and thin to zero content.
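
The two signals Blekko names can be illustrated with a rough sketch. Note the assumptions: Blekko describes a machine-learning algorithm, while this is a simple hand-written rule standing in for it; the class names treated as ad markers and the thresholds of three ads and 100 words are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical CSS class names treated as display ad positions.
AD_CLASS_MARKERS = {"ad", "ad-slot", "banner-ad"}


class PageSignals(HTMLParser):
    """Collects two signals: display ad positions and visible word count."""

    def __init__(self):
        super().__init__()
        self.ad_slots = 0
        self.word_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        classes = set((dict(attrs).get("class") or "").split())
        if classes & AD_CLASS_MARKERS:
            self.ad_slots += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count only text a reader would see, not script or style bodies.
        if not self._skip_depth:
            self.word_count += len(data.split())


def looks_like_adspam(html: str, min_ads: int = 3, max_words: int = 100) -> bool:
    """Assumed rule: many ad positions combined with thin content."""
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    return parser.ad_slots >= min_ads and parser.word_count <= max_words
```

A page with three ad blocks and a two-word body would be flagged under these assumed thresholds, while a long article carrying a single ad would not.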

The end result of Blekko’s approach versus Google’s could potentially be the same. Google aims to keep “shallow” content from showing up for many searches, even though the pages are among those it has collected. Blekko is also aiming to keep shallow content out — but unlike Google, it applies the “spam” label to such content and is preventing it from being indexed in the first place.
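
The difference between the two approaches can be sketched in a few lines. This is a toy illustration only: the function names, the page dictionaries and the quality numbers are assumptions made up for the example, and neither company has published code like this.

```python
def index_with_blocking(pages, is_spam):
    """Blocking approach: flagged pages never enter the index."""
    return [page for page in pages if not is_spam(page)]


def rank_with_demotion(pages, quality_score):
    """Demotion approach: every page stays indexed, shallow ones sink."""
    return sorted(pages, key=quality_score, reverse=True)


# Hypothetical pages with made-up quality scores.
pages = [
    {"url": "thin.example", "quality": 0.1},
    {"url": "good.example", "quality": 0.9},
]

blocked = index_with_blocking(pages, lambda page: page["quality"] < 0.3)
demoted = rank_with_demotion(pages, lambda page: page["quality"])
```

In this sketch, `blocked` holds only the good page, while `demoted` still holds both with the thin page last. A searcher looking at the top result sees the same thing either way; the difference is what the engine has to store.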

What’s Gone Now?

Over at the New York Times, Claire Cain Miller had a good piece about Blekko’s move and whether good sites might be harmed. No, says Blekko:

Though it seems like many legitimate sites could be considered spam under this algorithm — newspapers cover a wide variety of topics, for instance, and many bloggers may be amateur writers but are experts in their fields — Mr. Skrenta said that when he combed through thousands of sites that the algorithm banned, he found only two false positives.
Examples of the sites Blekko now bans: cheap-refrigerators.net, best-weddinggifts and Boston.diningguide.com.

Does It Help?

I haven’t done any widespread testing. But noticing that cheap-refrigerators.net, one of the sites confirmed above as removed, is titled “Refrigerators Buying Tips,” I thought a search on that topic might be interesting:

I’ve removed the ad that was at the top of results from both Blekko and Google, so you can focus on the top five editorial results. My take:

Blekko

  1. OK, but short and basic
  2. Irrelevant – you have to search further in the site to get tips
  3. Good basic tips from major retailer
  4. Thin content that just links to more thin content
  5. Irrelevant — about water filters for refrigerators

Google

  1. Good, substantial multi-part article
  2. Good, short tips leading to further reviews
  3. Good tips
  4. Good tips from Consumer Reports, a major trusted brand
  5. OK tips, about three years old

Sorry, Blekko, but I can’t say that dropping that refrigerator site, much less the 1 million or so other sites, helped you at all against Google for this particular query. Using the /reviews slashtag did help: it put one of the good sites that Google listed into first place. But the other three good sites that Google had in the top five results didn’t show.

Moreover, most typical searchers aren’t going to use slashtags — and there’s even less reason to use them when the same search at Google brings up better results, no slashtag required.

The Human Factor

Blekko’s post also says:

Unlike algorithms used by other search engines, AdSpam is being used in conjunction with human curation to detect to continue the War on Spam.

True, Blekko is making use of human efforts to decide what’s good and bad. In particular, Blekko recently partnered with Stack Exchange (formerly Stack Overflow) for curation of programming and technical topics.

Expect Google to push back on the entire “it has no humans” aspect, however. It’s done this before, the last time when both Mahalo and Search Wikia tried that angle. Google stressed that it has human reviewers, who serve as a sort of “double-check” on the computer algorithm changes it makes, for example.

Google stressed this again recently when it made the Farmer Update, to highlight that the computer-based change seemed to be supported by the human data it seeks to model. Google’s also suggested that what people block using its Chrome Personal Blocklist extension could be data that’s used in its search algorithm, in the future.

Still, Google has nothing like the slashtag curation that Blekko offers. Having said that, Blekko has yet to show that this curation is turning into higher quality results that are attracting significant users from Google, much less Bing. But on the PR front, there’s no doubt that Blekko’s moves are keeping pressure on Google to improve as well.

More Info

I’ve not had a chance to talk with Blekko more about the system, as I’m currently at our SMX West search marketing conference in San Jose. Blekko — along with Google and Bing — is taking part in our “The Spam Police” and “Ask The Search Engines” sessions tomorrow, so I expect more specific under-the-hood details will emerge from that. Stay tuned (and also watch for related coverage on Techmeme). Also see the articles below for more background on some of the things I’ve mentioned above.


About The Author: Danny Sullivan is a Founding Editor of Search Engine Land. He’s a widely cited authority on search engines and search marketing issues who has covered the space since 1996. Danny also serves as Chief Content Officer for Third Door Media, which publishes Search Engine Land and produces the SMX: Search Marketing Expo conference series. He has a personal blog called Daggle (and keeps his disclosures page there). He can be found on Facebook and Google+, and microblogs on Twitter as @dannysullivan.

  • http://feefighters.com Sheel Mohnot

    I like what they are doing in theory – it makes a lot of sense, but Blekko has a LOT of issues… like surfacing a domain that we got rid of 7 months ago, and showing sites that have nothing to do with the intended search (apart from a similar domain name).

    Details: http://feefighters.com/blog/?p=4554

  • Richard Stokes

    “That’s interesting, this idea of blocking spam pages before they are added to a search index. It may have been done before, but if so, I don’t recall by which service. Certainly, it was never something noteworthy enough for me to recall. If you keep the spam out entirely, potentially that makes for cleaner results.”

    Danny,

    This was a head scratcher to me as well. We’ve been using similar algorithms for detecting spam sites for about 18 months now, but it’s a well documented fact that the technology goes back much further. Researchers from Microsoft, Yahoo, and Google have written about various aspects of web spam detection extensively since 2005 and the basics go back even further to 1999 (Andrei Broder, et al. covered the basics even before then). It’s a bold (and false) claim.

    What I will buy is that they have their own tweaks. Everyone does.

    Sorry I’m missing you at SMX! I’m currently heading to Maui to talk on advanced webspam detection (and other topics) at Perry Marshall’s annual conference. Would love to connect sometime.

    Richard Stokes
    AdGooroo

  • http://www.michael-martinez.com/ Michael Martinez

    You must not have gotten the same Google results I did for your example query. I don’t think it really matters how much fluff is filtered out of the search engine. More seems to float to the top because, frankly, most sites that publish serious information have earned relatively few links and don’t use on-page optimization.

    The problem that even Google has masterfully failed to solve is how to surface truly useful, authoritative content. Links won’t help you do that. On-page optimization won’t help you do that.

    We need better indexing and analysis technologies that can dig past all the popularity factors and get to the meat of what people are really searching for. That’s at least a few search generations away.
