Are Manual Solutions The Answer To Content Farms?
It was interesting to see some of the recent reactions when upstart Blekko decided to toss some sites out of its index. For the uninitiated, it looked like a bit of a PR play against Google, which has been getting smacked about for thin-quality results of late. If you hadn’t guessed by now, we’re talking about (Demand Media’s) eHow and the other “top 20 spam sites” that were nuked.
Of course the question remains, why? It certainly does seem like a knee-jerk reaction that almost panders to the search community. Sure, I dislike running into weak content in the SERPs as much as the next guy. But I am pretty sure there are many other sites with equally thin content, in many cases much worse than what they’re churning out. Seriously? There are only 20 sites worth tossing?
The State Of Modern Search
All is not lost, my friends. One of the better developments over the last few years is all of the new (potential) signals and the infrastructure to deal with them. To a certain extent, there is every chance for Google (and other engines) to get past the link.
Why now, more so than in the past? The infrastructure (Caffeine) and the motivation (growing quality grumblings). Let’s consider some areas that might make sense, while also helping to combat spam and low-quality results.
Personalization. One of the longest-running goals at Google is deeper personalized search. Add to that the world of mobile, another personalization area of great interest, and we might see far more personalized results in the near future. If the new infrastructure enables a more granular personalization than is currently in place, this can give new signals that can lessen the spam we see on the web today.
Explicit User Feedback. When a user takes an action to tell the search engine something, it is a type of relevance feedback known as ‘explicit feedback’. Think of the (now defunct) SearchWiki as a good example. Others might include emailing a page, saving to favorites and so forth. Traditionally, this type of data has been hard for search engines to come by. The noisier implicit feedback is far more readily available. But explicit feedback would certainly be a great way to deal with spam or unwanted domains in a personalized search environment.
Temporal Data. Another area that has been prevalent with Google over the last few years is freshness (in many query spaces). There may be some of these types of signals that could increase relevancy while dealing with spam. Even for the link graph, stronger weighting of these might help decrease the power of authority in many situations.
Social Graph. Another obvious area is, of course, social. The social graph and real-time search are two areas Google has also been invested in over the last while. This can lead to deeper personalization as well as other potential signals. Once more, a lot of social signals are open to spam unless they’re used in a granular personalization approach. But in concert with the other elements mentioned here, it seems that it would also help root out weak content/sites while not opening the entire link graph up for manipulation.
At the end of the day, there needs to be an automated solution that protects not only against spam but also against those who wish to do their competitors harm. Having ‘votes’ of spam should only affect the individual user. You can’t spam yourself, and removing a competitor only from your own results is the kind of personalization that would work to deal with this.
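To make the "votes only affect the individual user" idea concrete, here is a minimal sketch in Python. It is purely illustrative and not how Google, Blekko or any other engine does it; the names user_blocklists, flag_domain and personalize, along with the example domains, are all made up for this example.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical per-user store: user id -> set of domains that user has flagged.
user_blocklists = defaultdict(set)


def flag_domain(user_id, domain):
    """Record a 'spam vote' that is only ever applied to this user's results."""
    user_blocklists[user_id].add(domain.lower())


def personalize(user_id, results):
    """Drop URLs whose host is on this user's own blocklist, preserving order."""
    blocked = user_blocklists.get(user_id, set())
    return [url for url in results if urlparse(url).hostname not in blocked]


# Made-up result list for illustration only.
results = [
    "http://thin-content.example/how-to-boil-water",
    "http://good-recipes.example/boiling-basics",
]

flag_domain("alice", "thin-content.example")
print(personalize("alice", results))  # alice's view, minus the domain she flagged
print(personalize("bob", results))    # bob's view is untouched by alice's vote
```

Because the vote never leaves the user's own list, flagging a competitor changes nothing for anyone else. The sketch matches hosts exactly, so a real version would at least need to decide how to treat subdomains such as "www".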
Some Thoughts From The Geeks
On the topic of explicit feedback mechanisms such as we’ve seen with Google SearchWiki, Rich Skrenta says they didn’t work because “there are too many possible queries, effectively an infinite set. How many different queries are there for all possible song lyrics?”
Skrenta then made the case for their approach:
“What we are doing is identifying the top sites per category. The top 100 /health sites collectively have millions of pages and can answer any medical question you have. The top 50 lyrics sites have lyrics for every song”.
That makes some sense, but I am also leery of ‘human powered’ solutions, a concern that was countered by Rich, who contends that it’s “(..) disingenuous to pretend that “the algorithm” drives the results. The algorithm gets changed on a day to day basis in response to new material appearing on the web.”
Ok, yes, there are people constantly messing with the algorithms at Google, which does mean they’re also making subjective statements of their own. Also, for the unfamiliar, Google does have raters in the system scoring on perceived relevance as part of search quality testing.
Mark Cramer, for his part, as someone familiar with user feedback mechanisms, feels that “the implicit feedback approach is always the best. In most all cases, people are not interested in providing explicit feedback.”
Google SearchWiki is the glaring example. Surf Canyon did get involved with the move by making it an option for their users. “We figured it wouldn’t hurt to throw it in there,” said Cramer, referring to a new option for the application.
In further clarification on Blekko’s approach, which is a more subjective stance, Skrenta once more uses the lyric SERP example:
“Rather than rolling the dice with a 200-weight algorithm that’s been trained by a bunch of minimum wage web contractors, you could actually just pick the top lyrics sites. They have the lyrics to every song ever published. And they won’t download malware or spyware onto your computer.”
This is once more a seemingly logical approach, but I can’t see it being something a search engine such as Google would consider. It speaks more to a personalized environment such as we looked at earlier.
As of this week, even Google is getting back into the explicit user feedback experience with a Chrome add-on for removing sites from your results. Will this fare any better than previous attempts? It is highly unlikely. Forgetting for a moment the market share for Chrome, users simply aren’t that interested. Just give them good results to start with.
While we can give kudos to the gang at Blekko for trying to say something on the need for higher quality search results, there are limits. This doesn’t scale well and would be a PR nightmare for any major search engine. Do eHow or Mahalo really have the worst result for everything they publish? It seems a slippery slope to venture onto.
Where will it end and what safeguards are in place?
Until it can be proven in some larger implementations that users will not only engage with explicit feedback but do it honestly, I don’t believe arbitrary, non-algorithmic actions are the answer. It’s certainly not the answer for Google, I know that much.
One thing is certain: producing high-quality, relevant results ain’t easy.
Finding An Algorithmic Solution
So let us consider: what if the shoe were on the other foot?
Imagine that Google had made such a move. It most certainly wouldn’t be hailed; in fact, I am pretty sure people would be screaming from the mountaintops that Google was biased, that they were the Internet judge and jury, on and on. I guarantee you that much.
This is one of the reasons that Google (and many other search engineers) tends to prefer developing an algorithmic solution to the problem. Another obvious reason is that constantly updating the index based on subjective calls would be massively resource-intensive and cause far more grief than poor results and search neutrality have of late.
The manual approach is not the answer.
What needs to be done is to find better filters and dampeners that can limit how well low-quality results are able to rank. Now this isn’t even close to being as easy as it sounds.
One element that is certainly a potential block is authority. Often, these types of sites have the link equity, age and trust that make ranking for many a long-tail term fairly easy. If you’ve ever worked on a strong (authority) domain, you know what I mean. But what if that dampening has an effect on the authority of your site?
See? Not so easy, is it? There will always be winners and losers when the goal posts are moved. You may be one of the losers. Be careful what you ask for.
Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.