Subscribe Via Web Feed Subscribe with Google Add to My Yahoo! Subscribe with Bloglines Add to netvibes Subscribe with Live.com

« SearchCap: The Day In Search, May 8, 2007 | Main | Netscape Compared To Digg, A Marketer's Perspective »

May. 8, 2007 at 4:52pm Eastern by Danny Sullivan

Newspapers Amok! New York Times Spamming Google? LA Times Hijacking Cars.com?

Back in March, Google warned that allowing your internal search results to be listed in Google might be considered spamming. Today, there's some buzz that one of the top listings for a search for sex on Google turns out to be an internal search results page from the New York Times. In looking at that, I also came across an example of the LA Times "hijacking" the listing of Cars.com for a search on cars, thought the fault for that lies with Google.

Let's start with the New York Times:

Sex On Google

John Andrews wrote yesterday that internal search pages from the query.nytimes.com domain were ranking well for various terms. Threadwatch started checking around and came up with the money shot today, that the New York Times grabbed a top spot for "sex"  with this page. Threadwatch headlined it "the web's freshest spam." Search Engine Journal picked up on the spamming charge, which got exposed to those over at Digg.

So is it spamming? Yep. As my earlier article, Google Warning Against Letting Your Search Results Get Indexed, explains, Google's guidelines on inclusion of search results content say:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.

This means that the New York Times ought to block any pages within the query.nytimes.com domain. They don't, so technically, they're spamming.

What's uncertain is whether these query pages have been crawled from before the guidelines change or not. IE: internal queries from the New York Times may have long been opened to crawling from before the recent change but only now becoming visible perhaps due to an algorithm change.

My article points out that many sites are now technically spamming, because of this little publicized change. Heck, I've yet to block our own search results from being crawled by Google. I just haven't gotten around to it. I'm sure I'm not the only one.

John's article also notes that the New York Times has an entire Times Topics area that is accessible to search engines. He describes these as:

The re-published, re-purposed, New York Times Archives. Each “article” is re-purposed on a clean, CSS-driven text page, clearly dated TODAY and not-co-clearly labeled as “originally published” back in 1997, 1998, or whatever all the way back to 1981. Of course cross-referenced, categorized, sub-categorized, ad-infinitum.

I know Marshall Simmonds who does oversees the SEO work at the New York Times well, and I'll throw him a break on this one. "Re-published, re-purposed" material sounds pretty bad, devious and spam like. Now how about if I say the New York Times is making categorized lists of its stories available to search engines, in the way that literally millions of blogs do? Bad then?

For example, here's John's republished, repurposed information on link building that is clearly dated today, at least according to the most important date that search engines examine -- what's in the http header information. That page is John's category page for posts he's done on link building. The last post was from July 9, 2006 -- but the http header info reports the page as having a "fresh" date of May 8, 2007.

John's doing absolutely nothing wrong. As I said, it's common for sites to have category pages for stories they've written. It's GOOD for them to have these, in most cases. As for the header, it's also common that sites don't provide last modified dates or that they reports the current date as the document's authored date. That's why search engines typically depend on their own internal comparison processes to determine if a document has changed or other means to assign actual dates to them. The visible date shown to human often means little.

Given this context, I find it hard to see how the New York Times is spamming with the Times Topics pages, any more than I'd say the Topix news search site is spamming with its long-standing topics pages.

Moreover, if you go back to that sex search, ranked above the NYT is a Salon category page on sex. And in the same first page of results, I also get category pages from the Village Voice and Metacafe. Spammers? No. Smart SEO? In some cases, yes, for the forethought in having optimized category pages.

Should category pages be treated the same as search results pages -- IE, be seen as something that should be blocked? If so, many blogs will be facing difficulties. In addition, the line between "search results" and "category results" can be unclear. As I noted when the policy was added to the Google guidelines:

In contrast, the new policy is a can of worms that's been opened for shopping and other sites that have learned to turn product search results into crawlable content. At the moment, I think we're in watch-and-see mode as to how aggressively or selectively Google applies removal. If you're concerned, start looking at your robots.txt files now. If you're a long-term thinker, understand the writing is clearly on the wall for sites that pretty knowingly have milked their search results to pull in Google traffic.

While the policy has been in place for several months, I still see plenty of search results and category results showing up. Valleywag recently highlighted how Technorati turns up often for generic Google searches. In addition, my From The Isn't It Ironic Dept: Google Product Search's Results Show Up In Google article focused on how Google Product Search results were showing up in Google, despite the policy. That was an understandable oversight -- but the screenshots also show how plenty of other shopping results remain in Google, despite supposedly being a bannable offense.

Finally, in checking on the New York Times, I plugged in cars to see if it was ranking for that term. It wasn't, but the Los Angeles Times was -- and that listing caught my eye. See it at the bottom here:

Cars On Google

The title is all in lower-case, and there's no cached version. That's the a sign that Google is listing a "partially-indexed URL," one that it can't crawl for some reason.

As it turns out, the title is in lower-case and without a cached version for a different reason. That listing is actually just a link from the navigation you'll find at the top of Los Angeles Times pages, like this:

LA Times Link To Cars.com

The link does a 302 temporary redirect to Cars.com. This is causing Google to think that the LA Times is somehow the owner of the listing that formerly showed Cars.com in that spot. Cars.com is still getting the traffic at the moment, but the LA Times controls it -- has technically hijacked it. If it wanted, it could redirect that URL to anyplace else other than its Cars.com partner.

Google had largely fixed this hijacking problem. Threadwatch noted last month that it looked to have returned. Indeed, the cars search shows it in action big time.

Like The Story? Vote For It On Yahoo Buzz!
Subscribe To Our Daily Search News Recap!
Your Email:
Send me the monthly search newsletter too! (Learn more about our newsletters and feeds)
Subscribe To Our Search Feed!
Subscribe Via Web FeedSubscribe with GoogleAdd to My Yahoo!Subscribe with BloglinesAdd to netvibes
Subscribe with Live.comSubscribe in NewsGator OnlineSubscribe in RojoAdd to My AOL
Share & Bookmark This Story!
By Danny Sullivan Permalink Jump To Comments See Related Stories In: Google: SEO, SEO: Spamming, SEO: Titles & Descriptions



Reader Comments

Search:

Search Marketing Expo

Save the date for:
SMX Madrid (in Spanish, May 20-21)
SMX Advanced - Seattle, WA (June 3-4) Register today! Early bird rate expires May 9!
SMX Local & Mobile - San Francisco, CA (July 24-25) (July 24-25) Pre-agenda rate expires May 2. Get the lowest rate by registering now.
SMX East - NYC - (Oct. 6-8)
SMX London - November 4 & 5, 2008

Search Marketing Now

Learn more about search marketing through free online webcasts and webinars from our sister site Search Marketing Now.

Upcoming Webcasts:

Most Recent News Posts

About Search Engine Land

Stay Updated!

Get Our Search Newsletters:
Email:
Daily Monthly

Get Our Search Feed:
Subscribe Via Web FeedSubscribe with Google
Add to My Yahoo!Subscribe with Bloglines
Add to netvibesSubscribe with Live.com
Subscribe in NewsGator OnlineSubscribe in Rojo
Add to My AOL
More About Our Feeds & Newsletters

Add to Technorati Favorites

Track Us Socially:
Facebook: Our Search News App
Facebook: Search Engine Land Page
Facebook: Search Engine Land Group
Flickr: Search Engine Land
LinkedIn: Search Engine Land Group
Twitter: Search Engine Land Feed

Bragroll