David Harry, Author at Search Engine Land

Are Manual Solutions The Answer To Content Farms?

David Harry — Wed, 16 Feb 2011 17:13:15 +0000

It was interesting to see some of the recent reactions when upstart Blekko decided to toss some sites out of its index. For the uninitiated, it seemed to be something of a PR play against Google, which has been getting smacked about for thin-quality results of late. If you hadn’t guessed by now, we’re talking about (Demand Media’s) eHow and the other “top 20 spam sites” that were nuked.

Of course the question remains, why? It certainly does seem like a knee-jerk reaction that almost panders to the search community. Sure, I dislike running into weak content in the SERPs as much as the next guy. But I am pretty sure that there are many other sites with equally thin content, in many cases much worse than what they’re churning out. Seriously? There are only 20 sites worth tossing?

The State Of Modern Search

All is not lost, my friends. One of the better developments over the last few years is all of the new (potential) signals and the infrastructure to deal with them. To a certain extent, there is every chance for Google (and other engines) to get past the link.

Why now, more so than in the past? The infrastructure (Caffeine) and the motivation (growing quality grumblings). Let’s consider some areas that might make sense, while also helping to combat spam and low-quality results.

Personalization. One of the longest-running goals at Google is deeper personalized search. Add to that the world of mobile, another form of personalization and an area of great interest, and we might see far more personalized results in the near future. If the new infrastructure enables a more granular personalization than is currently in place, this can provide new signals that lessen the spam we see on the web today.

Explicit User Feedback. When a user takes an action to tell the search engine something, it is a type of relevance feedback known as ‘explicit feedback’. Think of the (now defunct) SearchWiki as a good example. Others might include emailing a page, saving to favorites and so forth. Traditionally, this type of data has been hard for search engines to come by. The noisier implicit feedback is far more readily available. But explicit feedback would certainly be a great way to deal with spam or unwanted domains in a personalized search environment.

Temporal Data. Another area that has been prevalent with Google over the last few years is freshness (in many query spaces). There may be some of these types of signals that could increase relevancy while dealing with spam. Even for the link graph, stronger weighting of these might help decrease the power of authority in many situations.

Social Graph. Another obvious area is, of course, social. The social graph and real-time search are two areas Google has also been invested in over the last while. This can lead to deeper personalization as well as other potential signals. Once more, a lot of social signals are open to spam unless they’re used in a granular personalization approach. But in concert with the other elements mentioned here, it seems that it would also help root out weak content/sites while not opening the entire link graph up for manipulation.

At the end of the day, there needs to be an automated solution that protects not only against spam but also against those who wish to do their competitors harm. ‘Votes’ of spam should only affect the individual user who casts them. You can’t spam yourself, and removing a competitor from only your own results is the kind of personalization that would work to deal with this.
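The idea that a spam vote should only affect the person who casts it can be sketched in a few lines. This is a minimal illustration, not how any engine actually does it: the `PersonalBlocklist` class, the user names and the `.example` domains are all made up for the example.

```python
from collections import defaultdict


class PersonalBlocklist:
    """Per-user spam votes; a vote never touches anyone else's results."""

    def __init__(self):
        # user_id -> set of domains that this user has flagged
        self._blocked = defaultdict(set)

    def flag(self, user_id, domain):
        """Record that one user no longer wants to see this domain."""
        self._blocked[user_id].add(domain.lower())

    def filter_results(self, user_id, results):
        """Drop results from domains this user flagged, keeping the order."""
        blocked = self._blocked.get(user_id, set())
        return [r for r in results if r["domain"].lower() not in blocked]


# Hypothetical result set shared by every searcher.
results = [
    {"domain": "thin-content.example", "title": "How to boil water"},
    {"domain": "competitor.example", "title": "Boiling water, explained"},
]

blocklist = PersonalBlocklist()
blocklist.flag("alice", "thin-content.example")

alice_view = blocklist.filter_results("alice", results)  # flagged site removed
bob_view = blocklist.filter_results("bob", results)      # unchanged
```

Because the flag is stored per user, someone who flags a competitor only hides that competitor from their own results, which is the safeguard described above.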

Some Thoughts From The Geeks

To try and gain some more insight into this and the larger considerations of user feedback, I contacted Rich Skrenta from Blekko and Mark Cramer of Surf Canyon (awesome tool, awesome geek).

On the topic of explicit feedback mechanisms such as we’ve seen with Google SearchWiki, Rich says they didn’t work because “there are too many possible queries, effectively an infinite set. How many different queries are there for all possible song lyrics?”

Skrenta then made the case for their approach:

“What we are doing is identifying the top sites per category. The top 100 /health sites collectively have millions of pages and can answer any medical question you have. The top 50 lyrics sites have lyrics for every song”.
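The curated approach Skrenta describes amounts to restricting results to a hand-picked list of sites per category. A rough sketch of that filtering step follows; the category names, the `CURATED_SITES` table and the `.example` domains are placeholders, not Blekko's actual lists or code.

```python
from urllib.parse import urlparse

# Hypothetical hand-picked lists; the domains are placeholders.
CURATED_SITES = {
    "lyrics": {"lyrics-one.example", "lyrics-two.example"},
    "health": {"clinic.example", "medical-library.example"},
}


def host_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def restrict_to_category(urls, category):
    """Keep only URLs hosted on a site curated for the given category."""
    allowed = CURATED_SITES.get(category, set())
    return [u for u in urls if host_of(u) in allowed]


candidates = [
    "http://www.lyrics-one.example/songs/123",
    "http://spam-farm.example/lyrics/123",
    "http://lyrics-two.example/a/b",
]
lyrics_results = restrict_to_category(candidates, "lyrics")
```

The filter itself is trivial; the human effort lies in deciding which sites belong in each list and keeping those lists current.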

That makes some sense, but I am also leery of ‘human powered’ solutions. Rich countered this, contending that it’s “(..) disingenuous to pretend that “the algorithm” drives the results. The algorithm gets changed on a day to day basis in response to new material appearing on the web.”

Ok, yes, there are people constantly messing with the algorithms at Google, which does mean they’re also making subjective statements of their own. Also, for the unfamiliar, Google does have raters in the system scoring on perceived relevance as part of search quality testing.

Mark Cramer, for his part, as someone familiar with user feedback mechanisms, feels that “the implicit feedback approach is always the best. In most all cases, people are not interested in providing explicit feedback.”

Google SearchWiki is the glaring example. Surf Canyon did get involved with the move by making it an option for its users. “We figured it wouldn’t hurt to throw it in there,” said Cramer, referring to a new option for the application.

In further clarification on Blekko’s approach, which is a more subjective stance, Skrenta once more uses the lyric SERP example:

“Rather than rolling the dice with a 200-weight algorithm that’s been trained by a bunch of minimum wage web contractors, you could actually just pick the top lyrics sites. They have the lyrics to every song ever published. And they won’t download malware or spyware onto your computer.”

This is once more a seemingly logical approach, but I can’t see it being something a search engine such as Google would consider. It speaks more to a personalized environment such as the one we looked at earlier.

As of this week, even Google is getting back into the explicit user feedback experience with a Chrome add-on for removing sites from your results. Will this fare any better than previous attempts? It is highly unlikely. Forgetting for a moment the market share for Chrome, users simply aren’t that interested. Just give them good results to start with.

Dear Blekko

While we can give kudos to the gang at Blekko for trying to say something on the need for higher quality search results, there are limits. This doesn’t scale well and would be a PR nightmare for any major search engine. Does eHow or Mahalo really have the worst result for everything it publishes? It seems a slippery slope to venture onto.

Where will it end and what safeguards are in place?

Until it can be proven in some larger implementations that users will not only engage with explicit feedback but do it honestly, I don’t believe arbitrary, non-algorithmic actions are the answer. It’s certainly not the answer for Google, I know that much.

One thing is certain; producing high quality relevant results ain’t easy.

Finding An Algorithmic Solution

So let us consider: what if the shoe were on the other foot?

Imagine that Google had made such a move. It most certainly wouldn’t be hailed; in fact, I am pretty sure people would be screaming from the mountain tops that Google was biased, that they were the Internet judge and jury, on and on. I guarantee you that much.

This is one of the reasons that Google (and many other search engineers) tend to prefer to develop an algorithmic solution to the problem. One of the other obvious reasons is that constantly updating the index from a subjective strategy would be massively resource intensive and cause far more grief than poor results and search neutrality have of late.

This approach is not the answer.

What needs to be done is to find better filters and dampeners that can limit the ranking advantages low-quality results enjoy. Now this isn’t even close to being as easy as it sounds.

One element that is certainly a potential block is authority. Often, these types of sites have the link equity, age and trust that make ranking for many long-tail terms fairly easy. If you’ve ever worked on a strong (authority) domain, you know what I mean. But what if that dampening has an effect on the authority of your site?

See? Not so easy, is it? There will always be winners and losers when the goal posts are moved. You may be one of the losers. Be careful what you ask for.
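To make the winners-and-losers point concrete, here is a toy sketch of a dampener that scales back the authority boost for pages with a low quality estimate. Everything in it is an assumption for illustration: the `dampened_score` formula, the `floor` parameter and the three made-up pages. It is not a description of anything a real engine is known to use.

```python
def dampened_score(relevance, authority, quality, floor=0.2):
    """Scale the authority boost by a quality estimate in [0, 1].

    With quality == 1 the page keeps its full authority boost; with
    quality == 0 it keeps only `floor` of it. All inputs are made-up
    illustrative signals.
    """
    damp = floor + (1.0 - floor) * quality
    return relevance * (1.0 + authority * damp)


# Hypothetical pages competing for the same long-tail query.
pages = {
    "content-farm": {"relevance": 0.6, "authority": 0.9, "quality": 0.1},
    "small-expert-site": {"relevance": 0.7, "authority": 0.2, "quality": 0.9},
    "big-decent-site": {"relevance": 0.6, "authority": 0.9, "quality": 0.6},
}

# Scores with no dampening (quality treated as perfect for everyone).
undamped = {
    name: dampened_score(p["relevance"], p["authority"], 1.0)
    for name, p in pages.items()
}

# Scores once the quality estimate is allowed to dampen authority.
damped = {name: dampened_score(**p) for name, p in pages.items()}

for name in pages:
    print(f"{name}: {undamped[name]:.3f} -> {damped[name]:.3f}")
```

In this toy setup the content farm falls behind the small expert site once dampening is applied, but the big site with merely decent quality also loses part of its score, which is exactly the risk raised above.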

A Tactical Guide To Becoming An SEO Ubergeek

David Harry — Thu, 14 Oct 2010 15:52:07 +0000

So you’re sitting there thinking: how can I take my SEO chops to the next level? Well, I am sure hoping you are, or have at some time along the path. But where to start? To me, an SEO who doesn’t understand information retrieval (IR) is like a web developer who doesn’t know HTML. You really should know how a search engine works. No, I am serious… it is two-thirds of the initialism, for cryin’ out loud.

You should be proud to say: Hi there. My name is Dave and I am an algoholic.

For my first post here on Search Engine Land, I want to bring you into my world and give you a glimpse into the types of articles I will be writing for you here. For those not familiar with me, I am afflicted with the IR bug in the most geeky of ways and today, I’ll give you a crash course on how you can be too.

Become Patently Obvious

The first thing we are going to look at is patents. Or at least we’ll get some perspective. You see, all too often the SEO world freaks out when a patent is awarded and starts hailing it as if someone discovered Atlantis. This is truly bad form. Right away one needs to consider the concept of patent pending. If the patent was filed in, say, 2004, then it may well have already been implemented, adapted, and possibly even discarded since then. Nothing from a patent is new, nor is it telling beyond gleaning the mindset of a search engineer. One must avoid the SEO Magic Bullet approach.

People also tend to look at patents in isolation. This is also short sighted. Google has been awarded more than 10 patents on local search in the last three years alone. As such we must look at the totality of them and consider whichever current award we’re looking at in that context.

Now let’s take the “avoid SEO magic bullet” perspective and look at some ways to stay on top of things. Some tips for keeping up (rationally) with patents include:

Set up some alerts via RSS with Latest Patents.

Create email alerts at Free Patents Online.

Remember to research the authors. This often gives insight into what they’ve worked on in the past and offers relevant content.

Always bear in mind we never know the exact uses or weighting of a given signal in search engine algorithms.

Seek out related patents that can offer context and perspective.

Always check the associated images with the patents; they simplify things.

Read other uber patent geeks such as Bill Slawski.
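If you would rather filter a patent feed yourself than read every entry, the alerting tip above can be sketched as a small script. This is only an illustration under stated assumptions: the embedded `SAMPLE_FEED`, its `.example` links and the keyword list are invented, and a real feed would need to be downloaded first and may use a different structure.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed standing in for a real patent alert feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hypothetical patent feed</title>
    <item>
      <title>Ranking documents based on user behavior</title>
      <link>http://patents.example/1</link>
    </item>
    <item>
      <title>Improved hinge for a folding chair</title>
      <link>http://patents.example/2</link>
    </item>
    <item>
      <title>Local search result scoring</title>
      <link>http://patents.example/3</link>
    </item>
  </channel>
</rss>"""

# Illustrative keywords; adjust to the topics you follow.
KEYWORDS = ("search", "ranking", "retrieval")


def matching_items(feed_xml, keywords):
    """Return (title, link) pairs whose title mentions any keyword."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(k in title.lower() for k in keywords):
            hits.append((title, link))
    return hits


hits = matching_items(SAMPLE_FEED, KEYWORDS)
for title, link in hits:
    print(f"{title} ({link})")
```

Matching on titles alone is crude, so treat the output as a reading list to review rather than a verdict on which patents matter.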

It will take you some time to get used to reading patents, but with time and practice it does get easier. The end goal is not as much about figuring out how the search engine in question is incorporating the patent into its algorithms. It is always about getting into the mind-set of a search engineer so that you develop a common sense approach to your own strategies and testing practices.

SEO Ain’t Rocket Science, It’s Computer Science

The next area we want to get into is the world of information retrieval. This is part of the computer science world. If patents are the past, IR world watching is the future. This is an important part of becoming a super uber search geek. While there is much you can glean from what’s already out there in the search world, in many ways it is about seeing what lies ahead. When doing SEO you want to always ensure your tactics are “future proofed.” As such, IR watching is paramount to delivering a strategy that stands the test of time.

To get you rolling with that here is some essential viewing:

How Search Engines Work

Machine Learning Section on VideoLectures.net

Natural Language Processing vid

Text Mining vids

Semantic Web Vids

And some essential reading:

ICML – the International Conference on Machine Learning

SIGIR – Special Interest Group on Information Retrieval

AIRweb – Adversarial Information Retrieval on the Web

Google Research

Microsoft Research

A selected list of IR research papers

And of course you can search places such as Stanford for even more. There is a ton of information out there but the above resources should give you a starting point for doing some research and reading of your own.

Free Courses And Learning On The Web

Introduction to Information Retrieval. This is an online version of the book from Cambridge University Press. This book is the result of a series of courses taught at Stanford University and at the University of Stuttgart, in a range of durations including a single quarter, one semester and two quarters. These courses were aimed at early-stage graduate students in computer science.

Information Retrieval. A book by C. J. van Rijsbergen. From the author’s preface: “The major change in the second edition of this book is the addition of a new chapter on probabilistic retrieval. This chapter has been included because I think this is one of the most interesting and active areas of research in information retrieval. There are still many problems to be solved so I hope that this particular chapter will be of some help to those who want to advance the state of knowledge in this area.”

Information Retrieval Interaction. A book by P. Ingwersen that focuses on user interaction in information retrieval. The aims of the book are to establish a unifying scientific approach to IR: a synthesis based on the concept of IR interaction and the cognitive viewpoint; to present research and developments in the field of information retrieval based on a new categorization; and to generate a consolidated framework of functional requirements for intermediary analysis and design.

Are We There Yet?

These resources should be enough to get you going and set you on the path to understanding more about search than your average SEO practitioner, and you should be well on the road to becoming an uber geeky search nut. And please remember, this is not something to take lightly. The more you dig into the deeper aspects of “this thing of ours,” the more prepared you will be. The next time you read some blog post or attend a seminar you will be far better equipped to distinguish the signal from the noise. You can look at the theory and ponder: does this even make sense? You will also be better prepared to conduct your own testing. How can one properly test a theory when they don’t even understand the rudiments of how a search engine works?

You don’t have to become a computer scientist to be an SEO. You don’t need to get your PhD to get pages to rank well in search results. But if you spend some time learning more about the very focus of what we do, I can guarantee that you will have a far more profound understanding of the job than you had previously.