Todd Nemet, Author at Search Engine Land

Do As I Say, Not As I Do: A Look At Search Engines & SEO Best Practices

Todd Nemet — Thu, 01 Dec 2011 14:05:32 +0000

Now that the holidays are upon us, we all probably could use some cheering up. So I thought I’d have some fun with our favorite search engines: Google, Yahoo, Bing, YouTube, and Blekko.

At Nine By Blue, I have been developing software that automatically checks sites for technical SEO best practices. Normally we run it on our clients’ sites to quickly check for issues and monitor them for any future problems.

But I was curious to see what I would find if I pointed the software at some typical pages on the search engines’ sites and then compared their implementations with the technical SEO best practices that we typically recommend.

Below is a list of some of the issues that I found in no particular order.

Disclaimer #1: This list is intended to point out how difficult it is to fully optimize a site for SEO, especially large-scale enterprise sites. I’m not claiming that I could have done any better, even if I had full control of these sites.

Disclaimer #2: Yes, I’m aware of Google’s SEO report card, but I have never read it because it is too long. Also, I didn’t want to be influenced by it.

Use a Link Rel=Canonical Tag On The Homepage

Most of the sites that I reviewed had many different URLs that lead to the home page. This can be because of tracking parameters (e.g. https://www.site.com/?ref=affilliate1), default file names (e.g. https://www.site.com/index.php), or even duplicate subdomains (e.g. https://www1.site.com/).

Because of this, I always recommend putting a link rel=canonical tag on the home page. This ensures that links to these different home page URLs all get counted as pointing to the same URL. I also recommend adding this tag for any other pages that might have similar issues.

I was surprised to find that Bing was the only site that had a proper link rel=canonical tag on the home page.

YouTube also has a link rel=canonical tag, but it was pointing to an improper URL “/” instead of the full URL “https://www.youtube.com/”.
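A check like this one is easy to automate. The following is only a minimal sketch of the idea, not the software described above: it uses Python's standard html.parser module, and the names CanonicalFinder and check_canonical are made up for illustration. It reports whether a page has no canonical tag, a relative one (like "/"), or a full URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonicals.append(attr.get("href") or "")


def check_canonical(html):
    """Return 'missing', 'relative', or 'ok' for a page's canonical tag."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    parsed = urlparse(finder.canonicals[0])
    # A full URL has both a scheme and a host; "/" has neither.
    if parsed.scheme and parsed.netloc:
        return "ok"
    return "relative"
```

Only the first canonical tag is inspected here; a fuller check would probably also flag pages that contain more than one.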

Avoid Duplicate Subdomains & 301 Redirect Them To The Main Subdomain

With a few exceptions, I have been able to find a duplicate copy of the sites that I review.

I have a list of typical subdomains (www1, dev, api, m, etc.) that will generally turn up a copy of the site. Other duplicate copies of a site can be found at the IP address (e.g. https://192.168.1.1/ instead of https://www.site.com/) and by probing DNS for additional hostnames or domains.

These duplicate subdomains or duplicate sites have a negative effect on SEO because they make the search engines crawl multiple copies of your site just to get one copy. It can also cause links intended for a particular page to be spread out among multiple copies, reducing the page’s authority.

The best way to fix this is to use a permanent (301) redirect to the canonical subdomain’s version of that URL. If that isn’t possible, then a link rel=canonical tag pointing to the canonical subdomain page will work almost as well.

For example, an entire duplicate copy of Bing.com is available at https://www1.bing.com/. Compounding this is the fact that the page has a link rel=canonical tag also pointing to https://www1.bing.com/ and all the links on the page point to www1 as well.

Other subdomains, such as www2 through www5 and www01, all properly redirect to www.bing.com with a 301.

Blekko has an old, pre-launch copy of its site at https://api.blekko.com/. (Here is their old executive page.) Fortunately, this subdomain has a robots.txt file that is preventing it from being crawled. But pages like the old executive page at https://api.blekko.com/mgmt.html are also available at https://dev.blekko.com/mgmt.html and on the main subdomain at https://blekko.com/mgmt.html.

It would be better to 301 redirect these URLs to the current management page at https://blekko.com/ws/+/management than to leave multiple copies of them on different subdomains.

YouTube redirects its duplicate subdomains www1 through www5 to www.youtube.com, which is in line with best practices. Unfortunately, it redirects with a 302 (temporary) redirect rather than a recommended 301 (permanent) redirect.
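The three outcomes described in this section (a duplicate copy, a temporary redirect, and a proper permanent redirect) can be told apart from the status code and Location header alone. Here is a rough sketch in Python; the function name and the labels it returns are my own invention, and it assumes the response for each candidate subdomain has already been fetched without following redirects.

```python
from urllib.parse import urlparse


def classify_subdomain_response(status, location, canonical_host):
    """Classify the response a duplicate subdomain gave for the home page.

    status: HTTP status code returned by the subdomain
    location: value of the Location header, or None if there was none
    canonical_host: the hostname the site should live on
    """
    if status == 200:
        # The subdomain serves its own copy of the site.
        return "duplicate copy"
    target_host = urlparse(location or "").hostname
    if target_host != canonical_host:
        return "other"
    if status == 301:
        return "ok"
    if status in (302, 303, 307):
        return "temporary redirect"
    return "other"
```

For example, a www1 hostname answering with a 302 to the main hostname would come back as "temporary redirect", which is the YouTube case above.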

Use Permanent Redirects From https: URLs To http: URLs If They Don’t Require SSL

Another type of duplicate copy of a site that I usually find is the SSL/https version of the site. https is appropriate for pages that require security, like a login page or a page for editing a user profile, but for pages that don’t require security, it is a source of duplicate content causing crawl inefficiency and link diffusion.

The recommended solution for this is to redirect pages from https to http whenever possible.

Our software detected duplicate https copies of most pages, including Microsoft’s help pages, the YouTube about pages, Google’s corporate page, and even the Google webmaster guidelines.

The duplicate content issue with the Google webmaster guidelines page (and the other Google help pages) is compounded by a link rel=canonical tag that points to either the http or https version of the URL, depending on which URL is requested.

It is important to make sure that the link rel=canonical tag always points to the intended canonical version of the page, so be careful when dynamically generating this element.

A request for https://www.bing.com/ results in a security warning (shown below) due to a mismatched SSL certificate. This is common for sites using Akamai for global server load balancing.

It even pops up for https://www.whitehouse.gov/. I’m not aware of a way to get around this issue, though I would love to talk with someone at Akamai about this.

Use Robots.txt File To Prevent URLs From Being Crawled

Sites generally have different types of pages that they don’t want search engines to index. This could be because these pages are unlikely to convert or aren’t a good experience for users to land on, like a “create an account” or “leave a comment” page. Or it could be because the page is not intended for Web browsers, like an XML response to an API call.

Bing’s search API calls, which are made to URLs starting with https://api.bing.com/ or https://api.bing.net/ can be crawled by spiders according to the robots.txt file. This can be devastating to crawl efficiency because search engines will continue to crawl these XML results even though they are useless to browsers.

A search on Google for [site:api.bing.net OR site:api.bing.com] currently returns about 260 results, but based on analysis I have done on clients’ Web access log files, it is likely that many times more URLs than these have been crawled and rejected.
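To see whether a robots.txt file actually blocks a URL, Python's standard urllib.robotparser module can evaluate the rules locally. The sketch below is an illustration only: the robots.txt content and the api.example.com URL are hypothetical, not taken from any of the sites discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an API-only subdomain: block all crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules above, is_crawlable(ROBOTS_TXT, "https://api.example.com/xml?query=seo", "Googlebot") is False, while an empty robots.txt allows the same URL to be crawled.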

Use ALT Attributes In Images

Images should always be given alternate text via the ALT attribute (not TITLE or NAME as I have seen on some sites). This is good for accessibility, since screen readers rely on it, and it provides additional context about a page to search engines.

Though many images on the pages that were checked had appropriate alternate text, I couldn’t help but notice that Duane Forrester’s image on his profile page didn’t. But he is in good company because Larry, Sergey, Eric, and the rest of the Google executive team don’t either.
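Finding images without alternate text is simple to script. This is a minimal sketch using Python's standard html.parser; the class and function names are hypothetical. Note one assumption: it treats an empty alt="" the same as a missing one, even though an empty value can be legitimate for purely decorative images.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Records the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # TITLE or NAME attributes do not count as alternate text.
        if not (attr.get("alt") or "").strip():
            self.missing.append(attr.get("src") or "")


def images_missing_alt(html):
    """Return the src values of images lacking alt text, in page order."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```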

Avoid Use Of Rel=Nofollow Attributes On Links To “sculpt PageRank”

A rel=nofollow attribute on a link tells search engines not to consider the link as part of its link graph. Occasionally, I will review a site that attempts to use this fact to control the way that PageRank “flows” through a site.

This technique is generally considered to be ineffective and actually counterproductive, and I always recommend against it. (There are still valid uses for rel=nofollow attributes on internal links, such as links to pages that are excluded from being crawled by robots.txt.)

None of the search engine pages I checked were using rel=nofollow attributes in this way with the exception of the YouTube home page.

In the image below, nofollowed links are highlighted in red. Links to the most viewed and top favorited are being shown to search engines but general music, entertainment, and sports videos are not.

Return Response Codes Directly

A URL that doesn’t lead to a valid page should return a 404 (page not found) response code directly.

If an invalid URL is sent to Bing’s community blog site, it will redirect to a 404 page. Here is the chain:

The URL https://www.bing.com/community/b/nopagehere.aspx returns a 302 (temporary) redirect to
the URL https://www.bing.com/community/error-notfound.aspx?aspxerrorpath=/community/b/nopagehere.asp, which returns a 404 (page not found) response.

The recommended best practice would be for the first URL to return a 404 directly. If that isn’t possible, then the redirect should be changed to a 301 (permanent) redirect.
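A chain like the one above can be audited once the hops have been recorded. The sketch below assumes the hops were already captured as (URL, status) pairs by whatever HTTP client you prefer; the function name and the wording of the messages are made up for illustration.

```python
def audit_not_found_chain(hops):
    """Return a list of issues for the response chain of an invalid URL.

    hops: list of (url, status) tuples in the order they were returned.
    """
    if not hops:
        return ["no response recorded"]
    issues = []
    final_status = hops[-1][1]
    if final_status != 404:
        issues.append(f"final status is {final_status}, expected 404")
    if len(hops) > 1:
        issues.append("404 is not returned directly")
    for url, status in hops[:-1]:
        if status == 302:
            issues.append(f"temporary (302) redirect at {url}")
    return issues
```

An invalid URL that answers with a 404 on the first hop produces an empty list; the Bing chain above would produce two issues.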

Yahoo’s corporate information pages do something interesting when they get an invalid URL.

A request to https://info.yahoo.com/center/us/yahoo/anypage.html, which is not a valid URL, correctly returns a 404 (page not found) response.

But the 404 page contains an old school meta refresh with a time of one second that redirects to https://info.yahoo.com/center/us/yahoo/.

A 301 redirect to this page is the recommended way to handle these types of invalid URLs.

Support If-Modified-Since/Last-Modified Conditional GETs

I am a big fan of using cache control headers to increase crawl efficiency and improve page speed. (My article on this topic is here.)

I found it interesting that out of all the URLs that were checked only a few Google URLs supported If-Modified-Since requests and none of the URLs supported If-None-Match.
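For readers who haven't looked at conditional GETs before, the server-side decision is small. This is a simplified sketch of the If-Modified-Since half only (it ignores If-None-Match and ETags), written with Python's standard email.utils date parser; the function name is hypothetical.

```python
from email.utils import parsedate_to_datetime


def conditional_get_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200.

    last_modified: the resource's Last-Modified value (HTTP date string)
    if_modified_since: the request's If-Modified-Since header, if any
    """
    if not if_modified_since:
        return 200
    try:
        modified = parsedate_to_datetime(last_modified)
        since = parsedate_to_datetime(if_modified_since)
        unchanged = modified <= since
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: fall back to a full response.
        return 200
    return 304 if unchanged else 200
```

A 304 response carries no body, which is where the crawl efficiency gain comes from.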

Periodically Check Your DNS Configuration

As part of a site review, I like to use on-line resources like https://intodns.com/ and https://robtex.com/ to check the DNS configuration.

DNS is an important part of technical SEO because if something breaks with DNS, then the site will go down and it isn’t going to get crawled. Fortunately, this rarely happens.

However, I have reviewed sites that had their crawling affected by DNS changes. And I have reviewed several large sites that had their DNS servers on the same subnet, essentially creating a single point of failure for their entire business.

As expected, all the search engines had no serious DNS issues. I was surprised to see that two of them had recursion enabled on their name servers because in some rare instances that can be a security risk.

My recommended best practice is to run these types of checks at least once a quarter.

Conclusion

These are a few of the issues that were found that I commonly see or think are important. There were others, but they were relatively minor or subtle things like short titles, duplicate/missing meta descriptions, missing headers, and too many static resources per page.

Normally, I would have access to Web access log files and webmaster tools, which allows our software to check a lot more things.

I hope this gives you some ideas for things to check on your own site. And I hope that when you find something, you realize that even the search engines have their own technical SEO issues from time to time.

The Clickthrough Rate Equation In Organic Search, Part Two

Todd Nemet — Thu, 03 Nov 2011 16:24:00 +0000

In last month’s post, I talked about how improving organic clickthrough rate multiplies the effectiveness of the other work that goes into optimizing a website for search, such as keyword research, SEO, and usability. Most of these ways of increasing clickthrough rate are directly in our control by tweaking the on-page code.

I ended by covering the two most important components of the search result: titles and snippets.

In this post, I’m going to cover some of the other search result components that can also improve clickthrough rate.

The Green Text

URLs

I have noticed that some sites like to put lots of keywords in their URLs so that they will show up in the search results. (And possibly because they believe it helps with ranking, which is a separate issue.) Using keyword-rich URLs is fine as long as you take the following into consideration:

Don’t do this if your URL path elements are actually URL query parameters.

For example, you have a URL like https://www.example.com/t-shirt-id/1234/page/4 that was rewritten from a URL like https://www.example.com/product.php?t-shirt-id=1234&page=4. If you do, you are risking serious crawl efficiency issues because search engines can’t normalize path elements the way that they can with query parameters.

Make sure that you aren’t inadvertently causing any case-insensitivity issues or duplicate content issues.

I see a lot of sites that will return the same page for a URL like https://www.newssite.com/it-doesn’t-matter-what-you-put-here-12345 and the real canonical URL like https://www.newssite.com/kim-kardashian-files-for-divorce-12345. Be sure to use a 301 redirect or at least a link rel=canonical URL to normalize pages like these.

Don’t change all of the URLs on your site just for the sake of putting keywords in them. A significant site re-architecture like that is difficult to pull off without any hiccups.
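The second point above, normalizing URLs where only the trailing ID matters, might look something like this on the server side. It is a sketch under stated assumptions: CANONICAL_SLUGS stands in for a database lookup, and the URL pattern (slug followed by a numeric ID) is taken from the example URLs, not from any particular site's code.

```python
import re

# Hypothetical lookup of the canonical slug for each article ID.
CANONICAL_SLUGS = {"12345": "kim-kardashian-files-for-divorce"}


def canonical_redirect(path):
    """Return (status, location) for a request path ending in -<id>."""
    match = re.search(r"-(\d+)/?$", path)
    if not match or match.group(1) not in CANONICAL_SLUGS:
        return 404, None
    article_id = match.group(1)
    canonical = f"/{CANONICAL_SLUGS[article_id]}-{article_id}"
    if path == canonical:
        # Already the canonical URL: serve the page.
        return 200, None
    # Any other slug (or letter case) is permanently redirected.
    return 301, canonical
```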

Here is an example URL from a search for [xkcd t-shirts] that contains keywords in the URL:

Breadcrumbs

I think a far better way to get relevant keywords into a search result is by using breadcrumbs. Here are two more example search results for the same query:

These breadcrumbs are great not only because they contain relevant keywords, but also because they give a sense of how the page you are thinking about clicking on fits into the rest of the site. This will make it easier for users to navigate your site and make it more likely for them to convert.

Here are the corresponding breadcrumbs on the pages from the two search results above:

Thinkgeek.com:

Redbubble.com:

It isn’t possible to put together just any set of links and have search engines pick them up. At a minimum the links and link text need to:

be canonical
be relevant
be short (no more than 3 or 4 words)
most importantly, represent the actual navigable hierarchy of the site.

Google and Bing list their recommended best practices for breadcrumbs and describe the mark up language on this Google help page and this Bing help page. Both support microdata and RDFa. Schema.org also has support for a breadcrumb property if you are throwing in with microformats.

Structured Markup

RDFa, Microformats, Microdata

Structured markup can be used to explicitly indicate specific types of data to search engines. According to my notes from SMX East in September, these are supported:

Bing and Google: reviews, people, recipes
Google: products, events, music, and apps
Yahoo, Bing, and Google: Schema.org, which has a zillion types of data to annotate but which has limited support currently because it was recently announced in June of this year.

Here is an example showing rich snippet mark up for a product with reviews on Amazon:

Every site I have spoken with or that has presented at a session I’ve attended has indicated a large increase in click through rate after implementing their markup, especially for reviews and recipes. (One example: Topher Kohan of CNN mentioned at SMX East that adding hRecipe markup to one of their sites resulted in a 22% increase in traffic.)

Selecting the right type of markup and implementing it is an entire post in itself, so I’m going to recommend that if you have content of a type listed above, you should read through Google’s help article on rich snippets and structured data and the schema.org site.

Also, check out this great article by Aaron Bradley that gets into potential relevancy effects of marking up your pages with structured data.

Rel=author/me attributes

Indicating the author with structured markup on an article or blog post shows a profile picture along with a link to the author’s Google Plus profile page.

Setting this up requires a few steps that weren’t immediately clear to me, although Rick DeJarnette explained it well in How To Create Your Digital Footprint With Links. It involves setting attributes on three links:

rel="author" on the link from the article to your general author page (for example, https://searchengineland.com/author/danny-sullivan)
rel="me" on the link from your general author page to your Google Profile page (https://profiles.google.com/)
rel="me" or rel="contributor-to" on the link from your Google profile page to your general author page. To do this find your Google profile, click edit profile, and edit “Contributor to” to add a link to your general author page.

Sitelinks

Sitelinks are the block of related extra links that show up under a top search result. It’s a good idea to check these sitelinks periodically by searching for your most popular branded searches on Google and Bing.

If you see links you don’t like on Google, you can “demote” them by logging into Google Webmaster Tools and going to Site configuration > Sitelinks. The demotion will only last for 90 days.

As motivation to check your sitelinks, here is an unfortunate set of sitelinks that I found last week when trying to reset my Starbucks account password:

(Aside to anyone at Starbucks: I’m pretty sure this is happening because of the way your site returns a 200 and redirects for certain types of “page not found” pages. Contact me, and I’ll send you more information. By the way, I will work for coffee.)

Sitelinks can also occur within search results, not just at position one. For example, these two search results for the query [ancient egypt] show up with their own abbreviated sitelinks:

The standard advice for getting sitelinks to show up — again from my SMX East notes — is to make sure they are “prominent links on your site.” This Google help article also recommends making sure the links have anchor text that is “informative, compact, and avoids repetition.”

Table of content links within the same page

If your site has a lot of long, technical articles or other well-structured content that generally lends itself to having a table of contents, using fragment identifiers (also called named anchors) is a really great way to get additional links with keywords to show up in search results.

Here is an example from the query [exoplanet gravitational microlensing]:

Bing also has support for this as seen from this search for [ancient egypt]:

To increase the chances of having these show up make sure your pages are well-structured, the anchors have descriptive text, and that the pages have a table of contents with links to each individual anchor.

The table of contents containing the fragments doesn’t have to take up a lot of space on the page. Here is an example from a professor’s personal site that I thought was interesting:

This is the section of the page containing the table of contents:

Miscellaneous Tips

Rank higher

Ranking higher in the search result pages will result in a higher clickthrough rate, but that’s out of our direct control and a little beyond the scope of this post.

Character encoding

Occasionally, I see a site with character encoding issues. Usually it results from having the server configured for one character encoding while the page templates and/or the underlying database are configured with different character encoding.

Aside from server configuration issues, I’ve seen this happen with sites that include data from 3rd party sources with varying character encoding and when documents are copied and pasted from Word directly into webpages.

If character encoding issues surface on your site, it will definitely reduce click through. Compare this result:

with this one:

I faked this one by deliberately setting my browser to the wrong character encoding, but I have seen issues like this on sites. Generally, I recommend doing everything in UTF-8 as much as possible.
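The same effect is easy to reproduce in a few lines of Python. This sketch encodes text as UTF-8 and then decodes it as Windows-1252 (cp1252), one common mismatch; the function names are made up for illustration.

```python
def simulate_wrong_decoding(text, actual="utf-8", assumed="cp1252"):
    """Show what text looks like when bytes are read with the wrong encoding."""
    return text.encode(actual).decode(assumed)


def repair_wrong_decoding(garbled, actual="utf-8", assumed="cp1252"):
    """Reverse the mistake, provided no bytes were lost along the way."""
    return garbled.encode(assumed).decode(actual)


print(simulate_wrong_decoding("café"))  # cafÃ©
print(simulate_wrong_decoding("’"))     # â€™
```

The repair function only works when the garbled text still contains every original byte, so fixing the configuration at the source is the safer route.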

Instant Previews (Google)

In November 2010 Google started showing instant previews, which pop up a preview of the web page in the search results when you hover over the result. The announcement makes the claim that people who use them are “5% more likely to be satisfied with the results that they click.” We’ll take it.

You can test out your instant previews in Google Webmaster Tools at Labs > Instant Previews. There you can find out if Google is able to pre-render its instant previews or if it has to generate them on the fly. You can also see what your instant previews on mobile search look like.

If your CSS and JavaScript files are robotted out, like they are in Search Engine Land, Google will have to generate the preview on the fly, and you will see something like this in Google Webmaster Tools:

Notice how the one on the right has no formatting, like it’s a text-only cached version of the page. I didn’t notice any delay when viewing Search Engine Land’s instant preview, but I would still recommend that Google be allowed to pre-render these instant previews.

For more information check out Google’s very useful FAQ on instant previews, which is on a separate Google Sites page for some reason.

Social Signals

This is another area that is out of our direct control, but it shows some of the benefits that a good social media program can have on an organic campaign. Having friends and colleagues recommend links that show up in your search results can only increase clickthrough rate.

Bing integration with Facebook

Bing has excellent integration with Facebook, which annotates your search results with friends who have recommended the same pages. As an example, on a Bing search for [bay area college radio], I see that four of my friends recommend the venerable college station KFJC 89.7.

Google integration with everything but Facebook

With Google, depending on how the person who is searching has filled out his or her profile, you can get recommended results from Google+, Twitter, Blogger, and Buzz. I have even seen results that were recommended to me because someone I am linked to via Gmail shared them.

A recommendation from Blogger showing up in a search for [kfjc]:

A recommendation from Google+ showing up in a search for [google profile]:

Conclusion

I hope that this quick run through of different techniques that can affect how your pages show up in search results — URLs, breadcrumbs, structured markup, author tagging, sitelinks, named anchors, instant previews, correcting character encoding issues, and social signals — gives you at least a few ideas of how to increase your site’s clickthrough rate, which will multiply the effects of all the other optimizations you are doing on your site.

The Clickthrough Rate Equation In Organic Search

Todd Nemet — Thu, 06 Oct 2011 18:44:14 +0000

When I was in middle school, my favorite book and my favorite TV show were both Cosmos by Carl Sagan. I must have read the book at least 10 times, and I watched the series every time it was on the local PBS station.

One of the most interesting parts of Cosmos that has stuck with me is the Drake equation:

The Drake equation is an attempt to estimate the current number of intelligent civilizations in the Milky Way by breaking it down into component parts (such as “f(p), the fraction of stars that have planets” and “f(l), the fraction of planets capable of sustaining life”) and then multiplying them all together.

You can watch Dr. Sagan explain the Drake equation on YouTube. He pessimistically puts N at 10 (the early 80’s were a bummer, man) but then upgrades it to “millions” less than a minute later (short term memory is also a bummer, man).

As I was talking to someone at SMX East a few weeks ago, it occurred to me that measuring conversions from organic search could be expressed similarly to the Drake equation like this:

In this version, C is the number of conversions, N(k) is the number of people searching for a keyword (or a group of keywords), f(I) is the fraction of searches where one or more links from your site show up (also called an impression), f(CTR) is the clickthrough rate from the search engine results, and f(conv) is the fraction of people who convert after clicking through.*
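Written out from the definitions above, the equation is a simple product of the four terms:

```latex
C = N(k) \times f(I) \times f(\mathrm{CTR}) \times f(\mathrm{conv})
```

As a worked example with made-up numbers: if 10,000 people search for a keyword, your site shows up for half of those searches, 4% of those impressions are clicked, and 2% of the visitors convert, then C = 10,000 × 0.5 × 0.04 × 0.02 = 4 conversions. Doubling the clickthrough rate to 8% while leaving everything else alone doubles C to 8.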

Then it occurred to me that a lot of attention is paid to three of these terms. Roughly speaking, N(k) is covered by keyword research, f(I) is a major goal of SEO, and f(conv) is in the realm of usability and graphic design.

Relative to the other three terms, clickthrough rate doesn’t get a lot of attention or optimization even though gains in clickthrough rate multiply the effectiveness of these other factors. This is odd considering that most of the factors influencing CTR are within our direct control and won’t affect usability of the website at all.

So if we consider CTR as a highly leveraged but undervalued factor in converting users from organic search and one that is largely within our control, it is probably worth a column or two to take a high-level look at the various ways we can influence it.

The rest of this article only covers the title and snippet in search results and how they affect clickthrough rate. Next month in this column, I’ll cover many more.

The basic components of the search result are covered in an article by Vanessa Fox, so go check that out if you need a refresher or if you find some of the terminology unclear.

Title & Meta Description

The most visible and largest components of the typical search engine result are the title and snippet. The title is generally taken from the HTML title tag of the page. The snippet can be taken from several sources, but ideally it comes from a well-written meta description tag.

Note that both the title tag and meta description aren’t generally visible when viewing the page in a browser (especially with the number of tabs I usually have open). This gives a lot of latitude in influencing the search engine results display, but it also gives enough rope to hang yourself if you aren’t careful.

Search Engines Overriding Titles & Meta Descriptions

In the example above, the snippet is pretty good. It’s descriptive, and ultimately was the result that I clicked on to refresh my memory on the topic.

However, when I checked the source code of the page to see if the snippet was pulled from the meta description this is what I found:

So while the title is coming directly from the page, the meta description clearly is not. This is some boilerplate text left in the page template. Because this text probably appears in many pages on the site and because it is clearly unrelated to the content of the page and because it’s too short, Google generated the snippet for this result from text on the page.

Usually the results aren’t that good, which is why it is important to pay attention to the meta descriptions of each page. Here are some of the other results for the same query, none of which gives me a good sense of the page:

Look at it this way: If you wouldn’t let a computer write your AdWords ads, then you shouldn’t allow a computer to write snippets for your site.

From the sites I’ve evaluated for clients, duplication of titles and meta descriptions is the main reason that they are ignored by Google or Bing, so it’s important to take care to make these unique for each page.
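Duplicates are straightforward to detect once you have a crawl of your own site. The following sketch assumes the crawl results are available as a list of dictionaries with "url", "title", and "description" keys; that data shape and the function name are assumptions for illustration.

```python
from collections import defaultdict


def find_duplicates(pages, field):
    """Group URLs that share the same title or meta description.

    pages: list of dicts with a "url" key plus the field being compared
    field: "title" or "description"
    """
    groups = defaultdict(list)
    for page in pages:
        # Compare case-insensitively and ignore extra whitespace.
        key = " ".join(page.get(field, "").lower().split())
        groups[key].append(page["url"])
    return {value: urls for value, urls in groups.items() if len(urls) > 1}
```

Running it once with field="title" and once with field="description" gives a list of pages to rewrite first.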

In the SMX East session about rich snippets, Jack Menzel from Google listed some additional reasons that Google might overwrite the title in a search result:

The title is “unclear based on the query.” (I’m taking this to mean that important keywords are missing in the title.)
If the title is missing the company or site name, Google may tack it on the end.
If the title is “overoptimized” with keywords, Google may remove a few of them.

Jack was careful to point out that Google will only modify the title when they believe it is beneficial to users, but again, I think it’s important to retain as much control as possible over the way your pages are displayed in search results.

Another issue with duplication is the special case when the title and snippet generated are both identical. When this happens, Google will only show one result, suppress the rest, and show this message at the bottom of the search results:

This is a depressing message because it means that there are pages from your site that ranked for the query but won’t be shown because Google couldn’t differentiate them from the other page that ranked. (This message could also be a sign that your site has pagination issues, which should be dealt with accordingly.)

Placement Of Keywords Within Titles

When people are reading through the search results and deciding which one to click on, they are acting more like a monkey scanning a tree for fruit than someone sitting down with a glass of wine and a copy of Ulysses to ponder the classics.

This means people are scanning for the keywords that are already in their working memory (the current search), or — according to some theories — scanning for the general shapes of these keywords.

If you combine this observation with eye-tracking studies that show how people’s eyes trace around on a typical search engine result page, like this one and this one and this one, then it logically follows that important keywords should be put at the beginning of the title where they are more likely to be seen by the monkey-scanners.

(I have heard arguments against putting keywords on the left, but I’ll leave this discussion for people who are more interested in human psychology than I am.)

Thoughts On Scale For Larger Sites

For sites with hundreds of thousands of pages, obviously it’s not possible to write unique and meaningful titles and meta descriptions by hand.

It’s okay to automatically generate these in a way that strongly encourages click through using metadata for the item(s) the page is about.

Here is an example I came across recently:

If I were looking for a home in Willow Glen, it would be hard for me not to click on this result. It’s clearly generated automatically from an application database but in a way that’s unique and designed to encourage clickthrough.
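Generating titles and descriptions this way usually comes down to a template over the item's metadata. Here is a minimal sketch for a real estate listing; every field name, the site name "Example Realty", and the sample wording are hypothetical, and the important keywords are deliberately placed at the start of the title.

```python
def build_title(listing, site_name="Example Realty"):
    """Build a unique title with the key facts first and the brand last."""
    return (
        f"{listing['beds']} Bed, {listing['baths']} Bath Home in "
        f"{listing['neighborhood']}, {listing['city']} | {site_name}"
    )


def build_meta_description(listing):
    """Build a unique meta description from the same metadata."""
    return (
        f"{listing['address']}: {listing['beds']}-bedroom, "
        f"{listing['baths']}-bath home in {listing['neighborhood']} "
        f"listed at ${listing['price']:,}. See photos and details."
    )
```

Because every value comes from the listing record, two different listings should never end up with the same title and snippet.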

In a future article, I’ll cover other factors that can affect click through rate, like URLs, breadcrumbs, structured metadata, anchors, social signals, character encoding, the phase of the moon, etc…

*After writing this post, I realized that this is similar to the searcher persona workflow as described by Vanessa Fox in her book Marketing In The Age Of Google, so check out that excellent book to explore this concept further.

Tricks For Taming Keywords With Regular Expressions

Todd Nemet — Thu, 08 Sep 2011 13:30:07 +0000

So far my articles about technical SEO have focused on how to adjust a site’s configuration or architecture to make it more crawlable and indexable. In this post, I’m writing about the other end of the technical SEO process: using analytics data to analyze traffic and user behavior by keywords.

When looking at keyword data, it’s important to group them by type. Looking at individual keywords is not only inefficient, but it will generally lead to information that is either misleading or worse, can’t be acted on.

The most precise way to group keywords is by using regular expressions. Regular expressions are strings containing letters, numbers, and special characters that match a specific word or group of words.

Excellent tutorials for regular expressions are all over the Web, so I’m not going to include an overview here. Instead, I’ll present a few common recipes that I hope people will find useful and instructive. (Besides, it has been scientifically proven that people learn mainly by imitation.)

If you’d like to see some tutorials, this is an excellent one, and the Google Analytics help page for regular expressions is here. SEOMoz recently posted a good overview here.

Using Regular Expressions Within Google Analytics

I’m going to focus on search keywords using Google Analytics because it has the best support for regular expressions. Other analytics packages I have worked with support most of these concepts if not exactly the same syntax. Excel’s support for matching keywords out of the box is pretty thin, but it appears to be possible to configure it to use regular expressions.

I didn’t want to show any data from my clients, so I asked my friends at Google to give me access to Search Engine Land’s Google Analytics account.* I’ll be using searchengineland.com data in my examples below.

To get to the organic keywords in the new interface, search for “organic” in the Find A Report… box:

Or, browse to Traffic Sources > Sources > Search > Organic:

Branded Keywords

The most important regular expression to nail down is the pattern for branded keywords. User behavior for queries involving brand terms is going to be quite different than other queries. Branded search traffic tends to have a lower bounce rate, fewer new users, and a longer time on site.

So metrics for a group of keywords will be much more meaningful if you can exclude (or only include) queries containing branded terms.

To create the branded terms regular expression, I like to bring up the organic keyword report and try out a bunch of regular expressions, iterating slightly with each try.

The new Google Analytics interface doesn’t accept regular expressions by default, so it’s necessary to click on the “advanced” link next to the search box and select “Matching RegExp” from the drop down:

Now we are ready to start testing keywords, starting with “search engine land”.

This gets a lot of queries, but when I exclude that pattern, selecting “Exclude” from the dropdown to the left of Keyword, I see that I have missed a lot of other branded keywords.

The next iteration is:

“search ?engine ?land”

The ? means “0 or 1 of the previous character.” Now, the pattern matches whether or not spaces are included. This change nets an additional 15k visits for the time period that I selected.

I notice that many people are spelling search “serach,” so the next iteration is:

se(ar|ra)ch ?engine ?land

The parentheses/bar combination will match either option. This matches 118 more visits.

Unfortunately, my pattern is matching the website address searchengineland.com, which I want to exclude because that traffic is basically direct traffic.

First, I try to exclude a period at the end of the pattern with search ?engine ?land[^.], but this is no good because it excludes 99% of the visits that I wanted to include.

(Square brackets will match any of the characters listed, but if the first character is ^ then it will match anything but those characters.)

What I am trying to do is to match “any character that isn’t a period or the end of the query.” I can express this with search ?engine ?land([^.]|$).

$ is a special character meaning “the end of the string.”

This matches fewer visits, but I am now able to exclude queries for the website URL.

When excluding branded queries in combination with other regular expressions, se(ar|ra)ch ?engine ?land is probably a better choice.
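The iterations above can also be tried outside of Google Analytics. Below is a minimal sketch using Python's re module, assuming partial, case-insensitive matching similar to the report filter. The sample queries are made up for illustration and are not Search Engine Land data.

```python
import re

# Hypothetical sample queries; real ones would come from an analytics export.
queries = [
    "search engine land",
    "searchengineland",
    "serach engine land",
    "searchengineland.com",
    "seo news",
]

# Tolerates missing spaces and the "serach" misspelling.
branded = re.compile(r"se(ar|ra)ch ?engine ?land", re.IGNORECASE)

# Skips the bare domain: "land" must be followed by a character that is
# not a period, or by the end of the query.
branded_no_url = re.compile(r"search ?engine ?land([^.]|$)", re.IGNORECASE)

branded_hits = [q for q in queries if branded.search(q)]
no_url_hits = [q for q in queries if branded_no_url.search(q)]

print(branded_hits)
print(no_url_hits)
```

In this sample, the first pattern keeps the domain query and the misspelling, while the second drops the domain query.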

Now it is possible to compare the behavior of users who come to Search Engine Land from a branded versus an unbranded query. What I see is pretty typical for the sites that I work with.

Compared with visits from unbranded queries, visits from branded queries:

Are three times more likely to be new visitors
Spend five times as much time on site
Have one-half the bounce rate
View about twice as many pages per visit

In a pinch for tools with less sophisticated search, such as the Google Webmaster Tools query report or Excel, I would just use land to get a rough approximation.

Next, I’m curious about queries for search engines. This is easy to do with something like google|yahoo|bing. It isn’t always necessary to spell out the entire word if people are likely to misspell it.

For example, Baidu is searched for via three spellings (which I got by searching for ^b.*d[ou]$):

baidu, bai du, bidu

I can easily match any of those with ba?i ?du. So, I update my regex to:

google|yahoo|bing|ba?i ?du

Oops! I forgot Blekko!

google|yahoo|bing|ba?i ?du|blek

Another useful group of searches is for stock symbols. But the problem with goog is that it will match both “Google” and “GOOG.”

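Here, it is necessary to use the very handy but somewhat obscure \b, which matches the boundary between a word and whatever surrounds it (a space, punctuation, or the start or end of the query) without matching any character itself, or more simply “word break.”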

So, I could use \b(goog|yhoo|msft|bidu)\b to match a group of stock symbols.
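To see what \b buys you, here is a small Python sketch that compares a bare goog with the word-break version. The queries are hypothetical, and case-insensitive matching is an assumption on my part.

```python
import re

# Hypothetical queries for illustration only.
queries = ["google stock", "goog stock price", "msft earnings", "microsoft"]

loose = re.compile(r"goog", re.IGNORECASE)
symbols = re.compile(r"\b(goog|yhoo|msft|bidu)\b", re.IGNORECASE)

# The bare pattern matches "google" as well as the ticker symbol.
loose_hits = [q for q in queries if loose.search(q)]

# With \b on both sides, only standalone ticker symbols match.
symbol_hits = [q for q in queries if symbols.search(q)]

print(loose_hits)
print(symbol_hits)
```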

I would also track metrics for social networking-related queries with a regular expression like google ?(\+|plus)|face ?book|twitter|social net and exclude branded queries from the search.

Note that + is a special character, so I had to escape it with a \.

Of course, I would track \bnemet\b, which resulted in 25 visits this year, half of which bounced.

Other Useful Patterns

These are a few regular expression patterns that I use for every site or certain types of sites.

Long unbranded tail

The “long unbranded tail,” which I define as queries containing three or more terms, excluding branded terms, is always important to track. I have seen sites for which this accounts for over half of organic traffic.

There are several ways to write this regular expression, but .+\b.+\b.+\b.+ is the way I do it.

.+ means “one or more of any character” and \b means “word break.”

The entire expression could be interpreted as “at least three word breaks inside the query string.”
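A quick way to convince yourself that this pattern really needs three terms is to run it against a few made-up queries. This is a sketch in Python, with the branded pattern from earlier used as the exclusion. The queries themselves are hypothetical.

```python
import re

long_tail = re.compile(r".+\b.+\b.+\b.+")
branded = re.compile(r"se(ar|ra)ch ?engine ?land", re.IGNORECASE)

# Hypothetical queries for illustration only.
queries = [
    "seo",
    "link building",
    "how to build links",
    "search engine land",
    "google panda update",
]

# Keep queries with three or more terms that are not branded.
unbranded_long_tail = [
    q for q in queries if long_tail.search(q) and not branded.search(q)
]

print(unbranded_long_tail)
```

A two-word query only has two word breaks with text on both sides of them, so it falls short of the three that the pattern asks for.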

Because the query [search engine land] makes up most of the three word queries, excluding the branded pattern is important:

Unbranded queries with three or more terms make up almost 70% of the organic traffic to Search Engine Land. Search features like Google Instant and autocomplete have definitely increased the average number of words per query.

Queries From Google Finance

The Google Finance page for a particular stock, like Yahoo, has a URL like this: https://www.google.com/finance?client=ob&q=NASDAQ:YHOO.

Traffic from Google.com with “q=” in the URL will get treated as query traffic by Google Analytics.

A search using the regex (nasdaq|nyse|amex):[a-z]{1,4} will match these queries. [a-z] means “any character from a to z” and {1,4} means “repeated one, two, three, or four times.”
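Here is the same pattern as a short Python sketch. The sample queries are hypothetical, and I am assuming case-insensitive matching, since the ticker in the URL is upper case.

```python
import re

finance = re.compile(r"(nasdaq|nyse|amex):[a-z]{1,4}", re.IGNORECASE)

# Hypothetical queries for illustration only.
queries = ["NASDAQ:YHOO", "nyse:t", "nasdaq composite", "yhoo"]

finance_hits = [q for q in queries if finance.search(q)]

print(finance_hits)
```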

This doesn’t include the traffic from Google Finance for arbitrary queries, of course. And depending on what types of stocks your site covers, you may need to include more indexes like ftse.

To get a more accurate sense of traffic from Google Finance, be sure to include the referring traffic from www.google.com/finance/…

Addresses

Sometimes it isn’t possible to list out all of the possible query keywords. In that case, the best you can do is write a regular expression that captures enough of the queries to get meaningful data for trending, even if the absolute numbers aren’t so reliable.

For example, it’s not possible to list every possible street address. But limiting the regex to typical elements in a street address does a surprisingly good job.

I generally use \b(road|rd|drive|dr|lane|way|ave|avenue|st|street)\b, which probably matches about 80% of the queries for a specific address.

It would further improve the accuracy to exclude branded terms or exclude another regex like:

sale|estate|pending

Another thing to try is putting a number in front of it like this:

[0-9].*\b(road|rd|drive|dr|way|ave|avenue|st|street)\b

The .* means “match any number (including zero) of any character,” so there could be any number or type of characters between the number and the rest of the regex.
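The trade-offs between these variations are easier to see side by side. The Python sketch below runs them against a handful of invented queries, with the rd alternative written as plain text. Nothing here is real query data.

```python
import re

street = re.compile(
    r"\b(road|rd|drive|dr|lane|way|ave|avenue|st|street)\b", re.IGNORECASE
)
numbered = re.compile(
    r"[0-9].*\b(road|rd|drive|dr|way|ave|avenue|st|street)\b", re.IGNORECASE
)
noise = re.compile(r"sale|estate|pending", re.IGNORECASE)

# Hypothetical queries for illustration only.
queries = [
    "123 main st",
    "elm street homes for sale",
    "best way to sell a house",
    "dr smith dentist",
    "real estate agents",
]

street_hits = [q for q in queries if street.search(q)]
filtered_hits = [q for q in street_hits if not noise.search(q)]
numbered_hits = [q for q in queries if numbered.search(q)]

print(street_hits)
print(filtered_hits)
print(numbered_hits)
```

The plain pattern picks up false positives like “way” and “dr,” and requiring a leading number trims those at the cost of missing addresses typed without one.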

The need to match queries containing a state abbreviation is pretty common. This regex assumes that only the two letter abbreviations are being used and that they appear at the end of the query:

\b(a[klrz]|c[aot]|d[ce]|fl|ga|hi|i[adln]|k[sy]|la|m[adeinost]|n[ehjmv]|n[cdy]|o[hkr]|pa|ri|s[cd]|t[nx]|ut|v[at]|w[aivy])$

It gets a few false positive matches (like “LA” meaning Los Angeles versus Louisiana or “CT” meaning court instead of Connecticut), but it brings back enough meaningful data for tracking metrics on these types of queries.
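As a rough check of both the matches and the false positives, here is the state pattern in a Python sketch. The queries are invented, and case-insensitive matching is assumed.

```python
import re

state = re.compile(
    r"\b(a[klrz]|c[aot]|d[ce]|fl|ga|hi|i[adln]|k[sy]|la|m[adeinost]"
    r"|n[ehjmv]|n[cdy]|o[hkr]|pa|ri|s[cd]|t[nx]|ut|v[at]|w[aivy])$",
    re.IGNORECASE,
)

# Hypothetical queries for illustration only.
queries = [
    "homes in austin tx",
    "seattle wa",
    "hotels in la",   # false positive: Los Angeles, not Louisiana
    "12 oak ct",      # false positive: court, not Connecticut
    "texas homes",
]

state_hits = [q for q in queries if state.search(q)]

print(state_hits)
```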

Other Resources

For testing or debugging regular expressions I generally use this handy dashboard widget (for Mac) or the Python interactive shell. There are many regular expression testers on-line and even Chrome extensions and Firefox add-ons.

I hope this post gave you some ideas for grouping and tracking keywords. If you have interesting regular expressions that you commonly use and want to share, please feel free to include them in the comments below.

* This is obviously a joke. My friends would want money before giving me access to someone’s Google Analytics account. ;)

How To Improve Crawl Efficiency With Cache Control Headers

Todd Nemet — Thu, 11 Aug 2011 15:45:46 +0000

Way back at the end of the last century, I worked for a company called Inktomi. Most people remember Inktomi as a search engine, but it had several other divisions. One of these divisions (the one I worked for) sold networking software, including a proxy-cache called Traffic Server.

It seems weird now, but Inktomi made more money from Traffic Server than it did from the search engine. Such were the economics of the pre-Google Internet. It was a great business until 1) bandwidth got really, really cheap and 2) almost all of the customers went out of business in late 2000/early 2001. (Most of Inktomi was acquired by Yahoo! in 2002, and Traffic Server was released as an open source project in 2009.)

Because of my work with proxy caches, I’m always surprised when I do a technical review of a site and find that it has been configured not to be cached. When optimizing a website for crawling, it’s helpful to think of a search engine crawler as a web proxy cache that is trying to prefetch the website.

One quick note: When I talk about a “cached” page, I’m not referring to the “Cached” link in Google or Bing. I’m referring to a temporarily stored version of a page in a search engine, proxy-cache, or web browser.

As an example of a typical cache-unfriendly website, here are the HTTP response headers from my site, which is running my ISP’s default Apache install and WordPress more or less out of the box:

The three lines circled in red are HTTP-ese for “Don’t cache this ever, under any circumstances.”

A little more detail about these headers:

Expires: indicates how long a proxy-cache or browser can consider a document “fresh” and not have to go back and get it. By setting this to a date two decades ago, the server is indicating that it should never be considered fresh.
Cache-control: is used to explicitly tell proxy-caches or browsers information about the cacheability of the document. “no-store” and “no-cache” tell it not to cache the document. “must-revalidate” means that the cache should never serve the document without checking with the server first. “post-check” and “pre-check” are IE-specific settings that tell IE to always retrieve the document from the server.
Pragma: is an HTTP request header, so it has no meaning in this instance.

Cache Control Headers & Technical SEO

So what do cache control headers have to do with technical SEO? They matter in two ways:

They help search engines crawl sites more efficiently (because they don’t have to download the same content over and over unnecessarily).
They increase the page speed and improve user experience for most visitors to your site. It can even potentially improve the experience for first-time visitors.

In other words, by adding a few lines to your Web server configuration to support caching, it’s possible to have more of your site crawled by search engines while also speeding up your site for users.

Let’s look at crawl efficiency first.

Crawl Efficiency

Only two pairs of cache control headers matter for search engine crawling. These types of requests are called “conditional GETs” because the response to a GET will be different depending on whether the page has changed or not.

Searchengineland.com happens to support both methods, so I will be using it in the examples below.

Last-Modified/If-Modified-Since

This is the most common and widely-supported conditional GET. It is supported by both Google’s and Bing’s crawlers (and all browsers and proxy caches that I’m aware of).

It works like this. The first time a document is requested, a Last-Modified: HTTP header is returned indicating the date that it was last modified.

The next time the document is requested, Googlebot or Bingbot will add an If-Modified-Since: header to the request that contains the Last-Modified date that it received. (In the examples below, I’m using curl and the -H option to send these HTTP headers.)

If the document hasn’t been modified since the If-Modified-Since date, then the server will return a 304 Not Modified response code and no document. The client, whether it is Googlebot, Bingbot, or a browser, will use the version that it requested previously.

If the document has been modified since the If-Modified-Since date, then the server returns a 200 OK response along with the document as if it were responding to a request without an If-Modified-Since header.
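The server side of this exchange can be sketched in a few lines of Python. This is only an illustration of the decision, not how any particular web server implements it. The function name conditional_get and the dates are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET on a single resource.

    last_modified: UTC-aware datetime of the resource's last change.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore it and send the page
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # not modified: no body is sent
    return 200, headers  # full response with the document


modified = datetime(2011, 8, 11, 15, 45, 46, tzinfo=timezone.utc)

# First request: no conditional header, so the full page comes back.
first_status, first_headers = conditional_get(modified)

# Second request echoes the Last-Modified value it received.
repeat_status, _ = conditional_get(modified, first_headers["Last-Modified"])

# The page changes later, so the same conditional request gets a full page.
changed_status, _ = conditional_get(
    datetime(2011, 8, 12, 9, 0, 0, tzinfo=timezone.utc),
    first_headers["Last-Modified"],
)

print(first_status, repeat_status, changed_status)
```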

ETag/If-None-Match

If-None-Match requests work in a similar way. The first time a document is requested, an ETag: header is returned. The ETag is generally a hash of several file attributes.

The second request includes an If-None-Match: header containing that ETag value. If this value matches the ETag that would have been returned, the server returns a 304 Not Modified response.

If the ETag doesn’t match, then a normal 200 OK response is returned.

ETag/If-None-Match is definitely supported by Bing, but it’s unclear whether Google supports it. Based on the analysis of log files that I have done, I’m pretty sure that Googlebot web requests don’t support it. (It’s possible that other Google crawlers support it, though. I’m still researching this, and I’ll post a follow up article if/when I get more information.)

One common problem with ETag/If-None-Match support pops up with websites that load-balance between different back end servers. Many times, the ETag is generated from something that varies from server to server, such as the file’s inode, which means that the ETag will be different for each back end server.

This greatly reduces the cacheability of load-balanced websites because the odds of requesting the same document from the same server decrease in proportion to the number of back end servers.
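One way around the load-balancing problem is to build the ETag only from things that are identical on every back end server. The Python sketch below hashes the response body, which is my own assumption for illustration rather than what any specific server does by default. It also ignores details like weak validators and multiple values in If-None-Match.

```python
import hashlib


def make_etag(body):
    """Build a quoted ETag from the body only (no inode, no server name)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body, if_none_match=None):
    """Return (status, etag) for a request, honoring If-None-Match."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, etag  # client copy is current: send no body
    return 200, etag


page_v1 = b"<html><body>version 1</body></html>"
page_v2 = b"<html><body>version 2</body></html>"

first_status, etag = respond(page_v1)

# Any server holding the same bytes produces the same ETag, so the
# conditional request succeeds no matter which back end answers it.
repeat_status, _ = respond(page_v1, if_none_match=etag)

# Once the content changes, the ETag no longer matches.
changed_status, _ = respond(page_v2, if_none_match=etag)

print(first_status, repeat_status, changed_status)
```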

In general, I recommend implementing Last-Modified/If-Modified-Since instead of ETag/If-None-Match because it is supported more widely and has fewer problems associated with it.

When To Use These Conditional GETs

Conditional GETs should be implemented on any static Web resources, including HTML pages, XML sitemaps, image files, external JavaScript files, and external CSS files.

For Apache, the mod_cache module should be installed and configured. If the server still isn’t supporting conditional GETs, check for a CacheDisable line in the httpd.conf or a .htaccess file somewhere.

For IIS7, caching is controlled by the element in the site configuration file. I’m not sure how to enable it in IIS6, though it appears to be enabled by default.

For dynamic, programmatically generated files, the HTTP headers associated with conditional GETs need to be sent from the page code. You need to do some back of the envelope calculations on two factors to determine if this is worth it.

Does it take as many resources (for example, calls to back-end databases) to determine whether the page has changed as it does to generate the page itself?
Does the page change often compared to how often the page is crawled by search engines?

If the answer to both questions is yes, then it may not be worth implementing support for conditional GETs in your code for dynamic pages.

Page Speed

I also recommend setting expiry times for static resources that don’t change often, such as images, JavaScript files, CSS files, etc.

This allows browsers to store these resources and reuse them on other pages on your site without having to unnecessarily download them from the Web server.

Also, it is likely that these resources will get stored in a proxy cache somewhere in the Internet where they will be served more quickly to other users, even on their first visit.

There are two ways to set an expiry time using HTTP cache control headers.

Expires: <date>, which indicates the date before which a resource can be stored.
Cache-control: max-age=<seconds>, which indicates the number of seconds that a resource can be stored.

The expiry time can be set up to a maximum of one year, according to the HTTP spec. I recommend setting it at a minimum of several months.

Configuring Expiry Time

For Apache, this requires installing the mod_expires module and creating some ExpiresDefault or ExpiresByType lines. Cache-control also requires mod_headers.

IIS7 can be configured through IIS Manager or some command line tools. See this link for more details.

For resources that are generated dynamically, these headers can be added programmatically like any other header. Just make sure that the Expires: date is in the right format or it likely will be ignored.
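For dynamically generated resources, something like the following Python sketch would produce both headers with the date in the format HTTP expects. The helper name expiry_headers and the 180-day lifetime are just examples I picked, not recommendations from any spec.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expiry_headers(now, lifetime):
    """Build Expires and Cache-Control headers for a given lifetime.

    now: UTC-aware datetime for the moment the response is generated.
    lifetime: timedelta for how long the resource may be stored.
    """
    return {
        # usegmt=True yields the "Thu, 11 Aug 2011 15:45:46 GMT" style date.
        "Expires": format_datetime(now + lifetime, usegmt=True),
        "Cache-Control": "max-age=%d" % int(lifetime.total_seconds()),
    }


now = datetime(2011, 8, 11, 15, 45, 46, tzinfo=timezone.utc)
headers = expiry_headers(now, timedelta(days=180))

for name, value in headers.items():
    print(name + ": " + value)
```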

Other Resources

Below are some additional resources related to caching, since this article only scratches the surface of the HTTP cache control protocol. I recommend checking out the links below to learn more about it.

Testing cache control headers

Redbot.org, written by “mnot”, is the best cache-checking tool I am aware of. I use it all the time when assessing sites.
Microsoft has a very useful tool for looking at headers that is available here.

I’m also a big fan of using curl -I from the command line to look at headers directly.

Advanced reading

Google’s page speed article on leveraging caching.
Yahoo’s best practices article for speeding up a web site contains some information about caching (click on the “Server” category).
Bing outlines their support for conditional GETs and includes some helpful links here.
Mnot has an excellent, though slightly dated, overview of caching that is very useful.

4 Ideas To Improve IIS & .NET For Technical SEO

Todd Nemet — Thu, 14 Jul 2011 13:37:55 +0000

In June 2011, I spoke at SMX Advanced about SEO issues that I commonly run into during technical SEO site evaluations. The part of my presentation that dealt with Microsoft’s Internet Information Server (IIS) generated a lot of comments and questions afterward, so this column addresses some of those questions about how to improve technical SEO on the Microsoft stack.

First, a caveat: The majority of my experience has been with Linux- and BSD-based operating systems, starting with SunOS way back at Berkeley, so I’m definitely not an expert on deploying servers on Windows and/or .NET.

I’ve asked Microsoft-stack expert Colin Cochrane to correct anything Windows-related that I have stated incorrectly. (Thank you, Colin. Your link is in the mail.) Any remaining errors in this article are definitely mine, and not his.

After completing technical SEO assessments on numerous sites running on IIS and .NET, I believe that it is a very scalable and production-worthy platform, but I have found that its default settings are far from optimal from a technical SEO point of view.

This article describes the most common issues I’ve seen. Several of these issues cause canonicalization problems, as described in more detail in this article about Google’s parameter handling feature.

Oh, and here is a second caveat: Please be sure to test any changes on a staging server before rolling them out to production. I would hate for something to happen to your website because I made a typo or worded something unclearly.

1. Default Pages (Default.aspx)

The problem

Directory pages are available at two URLs, one with and one without the default page. For example, these two URLs would lead to the same page:

https://www.site.com/directory/
https://www.site.com/directory/Default.aspx

In this example, the default page is Default.aspx, though it could be configured to be a different name.

Why it is bad

Link diffusion. Inbound links to the page could point at either of these two URLs. It would be much better to focus the inbound links on only one URL.
Crawl inefficiency. Crawlers have to crawl two URLs to get one page for each directory on the site.

The usual way to deal with duplicate URLs like these is to permanently (with a 301) redirect one URL to the other. However, in this case, it will result in an infinite redirect loop.

The culprit

Redirecting one URL to the other leads to a redirect loop because both of these URLs look exactly the same to the .NET application. For directory URLs, the default page is always appended to the URL, so the application can’t tell whether it should redirect the URL or not.

Fixing it

The easiest way to fix this is to put a link rel=canonical tag on these pages and point to whichever URL you want to be the canonical. It’s not as good as a permanent redirect, but it will work in a pinch if you don’t want to mess around with your server configuration.

A more permanent fix is to use a 3rd party URL rewriter, which will redirect the URL before it gets to the .NET application. Some URL rewriters I have seen used successfully on sites are URLRewrite (for IIS7 only), URLRewriter, and ISAPI Rewrite 2.

2. Case Insensitive URLs

The problem

The path part of the URLs served by IIS is case-insensitive. So any of these URLs will usually lead to the same page:

https://www.site.com/directory/default.aspx
https://www.site.com/Directory/Default.ASPX
https://www.site.com/DIRECTORY/DeFaUlT.aSpX

Why it is bad

Crawl inefficiency. Google and Bing will crawl all of the different case variations that they see in links, even though they all lead to the same page.
Link diffusion. Inbound links could go to any of the variations of the same URL. I’ve even seen different capitalizations of URLs used in internal links within a website.
Robots.txt problems. Because the robots.txt file is case-sensitive, if your URLs aren’t, crawlers may be accessing URLs that you thought were blocked.
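The robots.txt problem is easy to reproduce with the robots.txt parser in Python's standard library. The sketch below uses a made-up Disallow rule and made-up URLs, and it shows one parser's behavior rather than any search engine's crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one directory.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The lower-case URL is blocked as intended.
lower_allowed = rules.can_fetch(
    "*", "https://www.site.com/private/report.aspx"
)

# A different capitalization of the same path slips past the rule, even
# though a case-insensitive server would return the same page for it.
mixed_allowed = rules.can_fetch(
    "*", "https://www.site.com/Private/report.aspx"
)

print(lower_allowed, mixed_allowed)
```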

The culprit

My guess is that it has something to do with the Windows path handling in general, which is also case-insensitive.

Some ideas for fixing it

Similar to the first issue, the easiest way to resolve this is to use a link rel=canonical tag that points to the URL with the correct capitalization.

The URL rewriters listed above are the best option for normalizing the case. They can be configured to permanently redirect a URL to the right capitalization. If you pick an easy method for canonicalizing URLs, like converting everything to lower case, it can be implemented with one general rule.

Here is an example rule that rewrites a URL to all lower case that will work with URLRewrite:

  
  

If you implement something like this, keep in mind that some URLs may require upper case, such as the Bing authorization file BingSiteAuth.xml. URLs like these need to be added to the rule as exceptions.
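The logic of a lower-casing rule with exceptions is simple enough to sketch in Python. This is not URLRewrite syntax, just an illustration of the decision the rule has to make. The function name and the exception list are hypothetical.

```python
# Paths that must keep their capitalization (hypothetical exception list).
CASE_SENSITIVE_PATHS = {"/BingSiteAuth.xml"}


def lowercase_redirect(path):
    """Return the lower-case path to 301 to, or None if no redirect is needed."""
    if path in CASE_SENSITIVE_PATHS:
        return None  # leave exceptions exactly as requested
    lowered = path.lower()
    # Only redirect when something actually changes, to avoid a loop.
    return lowered if lowered != path else None


print(lowercase_redirect("/Directory/Default.ASPX"))
print(lowercase_redirect("/directory/default.aspx"))
print(lowercase_redirect("/BingSiteAuth.xml"))
```

Checking that the lower-cased path differs from the requested one is what keeps the rule from redirecting a URL to itself.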

Here is a post containing 10 very useful rewriting rules, one of which converts URLs to lowercase.

3. Handling Page Not Found Errors & Internal Server Errors

The problem

In its default configuration, ASP.NET handles errors (like page not found or internal server problems) by redirecting with a 302 temporary redirect to an error page, which usually returns a 200 response.

Why it’s bad

Crawl inefficiency. Because a 302 redirect is a temporary redirect, search engines will continue to check that URL often in hopes of one day getting a page at that URL instead of a redirect. And if the target page returns a 200 response, then the search engines will index the initial URL, which means your site might start ranking with URLs that lead searchers to error pages.

This means that pages that are removed from the site or pages that throw an error will continue to be crawled as if they were regular pages, so the crawler spends time on these URLs instead of on actual pages with useful content.

And because the page not found page gets so much traffic and has so many URLs pointing to it, it tends to get crawled pretty frequently, which further reduces crawl efficiency.

“Non-graceful” site failure. If your site starts returning an error — due to a temporary database problem, for example — large portions of your site could get de-duplicated out of the index because they are suddenly redirecting to the same URL.

The culprit

This is the default behavior in ASP.NET.

Some ideas for fixing it

Fortunately, this issue has a fix that is pretty straightforward and requires a minor change to the web.config file.

Here is part of an example web.config file that prevents these redirects:

  

The attribute redirectMode needs to be set to ResponseRewrite instead of its default value of ResponseRedirect.

redirectMode is not available in all versions of .NET, so you may need to update first. More detail can be found in this article.

4. Browser-dependent code

The problem

.NET has some hooks that make it pretty easy to write code that changes a page depending on the user agent requesting it.

Why it’s bad

Cloaking. Pages that change based on the user agent (e.g. Googlebot or Firefox) are dangerous for a lot of reasons, but from an SEO perspective they are dangerous because they could lead to unintentional cloaking of content, which can result in having a severe penalty put on your site.

By default, there is nothing user agent-dependent about the code that is served by IIS/.NET. But because the functionality is there, it is possible that browser-dependent code exists in your site.

The culprit

I believe this functionality dates back to the late 1990s/early 2000s when browsers had widely different support for web standards. If you are feeling nostalgic for those days, here is an old browser compatibility chart that you can look at until the feeling goes away.

Some ideas for fixing it

Chances are there is nothing to fix, but if you want to look at your source code for potential browser-dependent logic, here is an article with sample code that should give you an idea of what to look for.

Conclusion

I hope this article helps you make your IIS installation more search engine-friendly. I have spoken with some very smart Windows developers who initially swore to me that there was no fix for some of the issues in this list, so there is a pretty good chance that your development team isn’t aware of all of these issues or even that these fixes exist.

Of course, these are only a few of the issues that I see with IIS on a regular basis. Others include cacheability of the site, character encoding issues, and URL redirects.

The easiest way to pinpoint these types of issues is by looking at your server logs.

(Blatant Product Placement/Disclaimer: It just so happens that at Nine By Blue, where I work, I created server log analysis software for just this purpose when I got tired of looking for all of these issues manually, so if you’re interested in that product for either your IIS or Apache logs, ask me about an invite to our private alpha.)

I guess the real lesson of this article is that IIS and .NET are a great help to SEO job security.

DIY SEO: How To Check On-Page Ranking Factors Using Google Docs

Todd Nemet — Thu, 16 Jun 2011 15:50:22 +0000

My kids and I really enjoy watching the MAKE Magazine video podcasts together. It’s one of those rare and happy things that a ten-year-old girl, an eight-year-old boy, and an adult can watch together and find interesting.

Inspired by these podcasts, I thought it would be a good idea to create a do-it-yourself SEO project. So today, we’ll make a Google Spreadsheet that checks a web page for various on-page factors that can affect SEO.

Getting Started

What you need:

A Google Account for logging into Google Spreadsheets
A URL that you want to check.

In this article, I’ll be checking https://searchengineland.com/. The spreadsheet that we will create in this article is here.

Once you are signed in to Google Spreadsheets, you will be able to make your own copy to work with by opening the spreadsheet and selecting File -> Make A Copy…

If you would rather start with a blank spreadsheet and fill it in as you go through this article, select File -> New -> Spreadsheet.

How It Works

Our on-page checking spreadsheet uses the importXML() function in Google Spreadsheets. This very useful function takes two arguments, a URL to a document to be parsed and an Xpath query that tells it which information to import into the spreadsheet.

More information about the importXML() function can be found in Google’s documentation.

Xpath is a query language that is used to match elements (better known to those of us who are more familiar with HTML as “tags,” as in “title tag” or “H1 tag”) and the attributes of these elements (for example, “alt” or “href”) in an XML document and to tell it what information to extract.

For example, the Xpath query “//a[@href="index.htm"]/text()” will return the anchor text for any link pointing to the file index.htm. Don’t worry if this doesn’t make any sense yet. As you work with a few examples, it will become clearer.
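If you want to experiment with this kind of query outside of a spreadsheet, Python's standard library supports a limited subset of Xpath. The sketch below runs against a tiny, made-up, well-formed page. That subset has no text() step, so the anchor text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this example.
page = """<html>
  <head><title>Example page</title></head>
  <body>
    <a href="index.htm">Home</a>
    <a href="about.htm">About</a>
    <a href="index.htm">Back to start</a>
  </body>
</html>"""

root = ET.fromstring(page)

# Roughly equivalent to //a[@href="index.htm"]/text()
anchor_text = [a.text for a in root.findall(".//a[@href='index.htm']")]

# Roughly equivalent to //title
title = root.findtext(".//title")

print(anchor_text)
print(title)
```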

A good resource for Xpath queries can be found here.

Testing The Basics

Let’s get started. First we will do a simple query to extract the title from an HTML document. To do this, follow these steps:

Enter the URL you are checking in the cell A1.
Put “Title” in cell A2.
In cell B2 enter this exact text: =importXML(A1, "(//title|//TITLE)")

The parts that are “//title” and “//TITLE” will match all elements (tags) that are either “title” or “TITLE.” (Xpath queries are case-sensitive by default, so we are matching all uppercase or all lowercase.) The parentheses and vertical bar “|” tell Xpath to return elements that match either of the two.

Once you hit return, the text should change to the title of the page you are checking. You may see “Loading…” for a few seconds while Google retrieves and parses the page.

If something went wrong, check the following things:

Does the page at your URL have a title tag?
Does the URL redirect anywhere?
Is the title tag written as “Title” in the HTML? Remember that Xpath queries are case sensitive, so the query above will only match “title” and “TITLE.”
Did you type the URL correctly?

It’s also possible that the HTML in the page you are checking is too badly formed to be correctly parsed by the importXML() function. In this case, either pick a new URL or validate and tidy the page’s HTML and try again.

Checking Header Tags

If everything is working up to this point, we are now ready to run more queries against our pages.

Let’s check for the header tags H1 and H2.

Follow these steps:

Put “H1” in cell A4
In cell B4 enter this text: =importXML(A1, "(//h1|//H1)")
Put “H2” in cell A10
In cell B10 enter this text: =importXML(A1, "(//h2|//H2)")

At this point, you should see the text of the H1 and H2 tags of your page. Notice how a cell is filled out for each matching tag. It’s important to leave enough room for additional cells so that you can see all matching values.

Creating Alerts & Testing The Results

Another useful thing we can do with Google Spreadsheets is write tests that check the output of importXML() and flag any problems or deviations from best practices.

In this webmaster help video, Matt Cutts says that more than one H1 is okay for some pages, but he also recommends not overdoing it. So let’s write two alerts, one to make sure there is at least one H1 tag and another one to alert us if there is more than one H1 tag. Follow these steps:

In cell C4 enter this text: =IF(ISERROR(B4),"No H1 tag found!","OK")
In cell C5 enter this text: =IF(COUNTA(importXML(A1,"(//H1|//h1)"))>1,"Multiple H1 tags found!","OK")

The ISERROR() function will check for any error in a cell, including “#N/A,” which is the result of an Xpath query that doesn’t match anything. (The similarly named ISERR() function ignores “#N/A,” so it would miss this case.)

The COUNTA() function counts the number of elements in an array, which is what is returned by importXML(). This is the most efficient way to get the number of matches for a particular Xpath query.
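The same two alerts can be expressed in a few lines of Python, which may help if you want to see the logic apart from the spreadsheet functions. This sketch assumes a well-formed page and uses the standard library's limited Xpath support. The function name h1_alert is made up for the example.

```python
import xml.etree.ElementTree as ET


def h1_alert(markup):
    """Return an alert string based on how many H1 tags the page has."""
    root = ET.fromstring(markup)
    # This Xpath subset has no "|" union, so run one query per capitalization.
    count = len(root.findall(".//h1")) + len(root.findall(".//H1"))
    if count == 0:
        return "No H1 tag found!"
    if count > 1:
        return "Multiple H1 tags found!"
    return "OK"


print(h1_alert("<html><body><h1>One</h1></body></html>"))
print(h1_alert("<html><body><h1>One</h1><H1>Two</H1></body></html>"))
print(h1_alert("<html><body><p>None</p></body></html>"))
```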

If you want to make the alerts stand out more, use conditional formatting in column C to turn the alerts red if they don’t pass.

To do this, select column C, go to Format > Conditional formatting… and set the text to red when the text contains an exclamation point.
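The logic behind the two alerts is simple enough to express in a few lines of code. The sketch below is only an illustration of the same rules in Python; the function name h1_alerts is made up for this example and is not part of any spreadsheet or library.

```python
def h1_alerts(h1_texts):
    """Mirror the two spreadsheet alerts for a list of H1 texts.

    Returns a pair of messages: the first flags a missing H1,
    the second flags more than one H1.
    """
    missing = "No H1 tag found!" if len(h1_texts) == 0 else "OK"
    multiple = "Multiple H1 tags found!" if len(h1_texts) > 1 else "OK"
    return missing, multiple


# Three hypothetical pages: no H1, exactly one H1, and two H1s.
for texts in ([], ["Main heading"], ["First", "Second"]):
    print(texts, h1_alerts(texts))
```

As in the spreadsheet, any message ending in an exclamation point is the one to highlight.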

Extracting Attributes

Xpath queries are also useful for extracting the value of attributes within a tag, which means that we can check the usual SEO-related meta tags.

For example, let’s look for the link canonical tag and meta robots tags on the document. Follow these steps:

Put “Robots meta” in cell A30
In cell B30 enter this text: =importXML(A1, “//meta[@name=’robots’]/@content”)
Put “Link canonical” in cell A31
In cell B31, enter this text: =importXML(A1, “//link[@rel=’canonical’]/@href”)

The “[@foo=’bar’]” syntax that we have added is a way of restricting the matching tags to only elements containing that attribute-value pair. The /@content and /@href at the end of each Xpath query return the values of those attributes.

Note that attribute and value matching is also case-sensitive. So if any of the elements, attributes, or values being matched contain an upper-case letter then our Xpath query won’t match it. You may need to adjust the Xpath queries to match the style of HTML that your CMS outputs.

You should now see the meta robots directives and link canonical values for the page you are checking. If you see “#N/A” in the cell after hitting return then the page doesn’t have these meta tags, you typed the Xpath query incorrectly, or there are case-sensitivity problems.
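Here is a rough Python equivalent of the two attribute queries, in case you want to experiment with the matching rules. It assumes a small, well-formed sample document (the markup and the example.com URL are hypothetical). Python's standard library supports the “[@name=’value’]” filter but not the trailing “/@attribute” step, so the attribute is read from each matching element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the URL is a placeholder.
SAMPLE_HEAD = """<html><head>
<meta name="robots" content="noindex, follow" />
<link rel="canonical" href="https://www.example.com/" />
</head><body></body></html>"""


def attribute_values(markup, path, attribute):
    """Return one attribute's value from every element matching path."""
    root = ET.fromstring(markup)
    values = []
    for element in root.findall(path):
        value = element.get(attribute)
        if value is not None:
            values.append(value)
    return values


robots = attribute_values(SAMPLE_HEAD, ".//meta[@name='robots']", "content")
canonical = attribute_values(SAMPLE_HEAD, ".//link[@rel='canonical']", "href")
# Value matching is case-sensitive, so this query finds nothing.
upper_case_miss = attribute_values(SAMPLE_HEAD, ".//meta[@name='ROBOTS']", "content")
print(robots, canonical, upper_case_miss)
```

The empty result for the upper-case query corresponds to the “#N/A” you would see in the spreadsheet.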

Checking Links & Anchor Text

Let’s finish with some queries that count the number of links on the page and list the anchor text and outbound links.

Because pages usually have many links, let’s do this on a new tab so we will have enough room for the output. Follow these steps:

Go to Insert > New Sheet to create a new tab for the spreadsheet.
In cell A1 enter the URL you are checking
In cell A2 enter the following: =COUNTA(importXML(A1,”//a”)) & ” links”
In cell A3 enter the following: =importXML(A1,”(//a/text()|//a/img/@alt)”)
In cell B3 enter the following: =importXML(A1,”//a/@href”)

Remember that the parentheses and vertical bar (or “|”) in the Xpath query for cell A3 match either one of the Xpath queries separated by the “|”. So in this example, we are returning any anchor text or alt text of an image within that link.

The ampersand (or “&”) in the query for cell A2 combines text into one string.

If everything was entered correctly, you should see the number of links on the page in cell A2 with a list of all anchor text and image alt text listed below that. In column B, you should see a list of all the links on the page.

Ideally, the list of anchor text and links will match up. But it is possible that some of the links won’t have any anchor text and will be skipped. If the text and the links don’t match up, then it is very likely that not all links have consistent anchor text.
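One way around the alignment problem is to collect each link together with its label, so that a link with no anchor text gets an empty label instead of being skipped. The sketch below does this with Python's built-in HTML parser; the LinkCollector class and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, label) pairs; the label is anchor text or image alt text."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, label) pairs, in page order
        self._href = None   # href of the <a> currently open, if any
        self._label = []    # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._label = []
        elif tag == "img" and self._href is not None:
            # Image inside a link: its alt text acts as the anchor text.
            self._label.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            parts = [part.strip() for part in self._label if part.strip()]
            self.links.append((self._href, " ".join(parts)))
            self._href = None


# Hypothetical sample: a text link, an image link, and an empty link.
SAMPLE = (
    '<p><a href="/about">About us</a> '
    '<a href="/home"><img src="logo.png" alt="Home logo"></a> '
    '<a href="/empty"></a></p>'
)

collector = LinkCollector()
collector.feed(SAMPLE)
links = collector.links
print(len(links), "links")
for href, label in links:
    print(href, "|", label or "(no anchor text)")
```

Because every href is paired with its own label, the third link still shows up, just with an empty label, which makes links without anchor text easy to spot.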

Extra Credit

If you want to continue exploring the use of Google Spreadsheets to check on page factors, I created another spreadsheet with more examples here.

This spreadsheet contains a few more advanced examples that can check things like:

The meta description tag
The Safe browsing diagnostics page for a domain
Whether or not the page is in Google’s index
Images and their alt text
Images that don’t contain alt text

The Xpath queries are in the “Queries” tab, and you can double click on the cells to see the underlying formulas.

Make a copy for yourself and start exploring. Feel free to share any interesting Xpath queries or formulas you come up with in the comments. Happy hacking!

Search Engines Don’t Like You? Don’t Jump To Conclusions

Todd Nemet — Thu, 19 May 2011 17:15:21 +0000

One of the most frustrating things about technical problems with a site is that the ways they show up in search engines are usually unexpected or subtle. What looks like a penalty can actually be a problem introduced with a new version or new feature of a website.

Because the true causes of problems like these are usually not at all obvious, they can lead to hypotheses that border on the paranoid (“Google doesn’t like my site”) or to wild speculation (“I was put in the sandbox and then hit with Panda. I call it the Pandbox.”).

Since Google isn’t alive and doesn’t have emotions (yet), we can safely set aside (for now) any search engine anthropomorphizing and focus on finding root causes that may be lurking in the site’s technical infrastructure.

Symptoms: Fewer Pages In The Index, Drop In Long Tail Traffic

The main causes for problems with site coverage include duplicate content, allowing pages with no SEO value to be crawled, and network problems.

Duplicate content occurs when you can get to a page through multiple URLs.

Sometimes this is caused by having an entire copy of a site available on another subdomain, like https://www1.yoursite.com/, or on an IP address, like https://192.168.1.1/.

Duplicate content can also happen at the page level, when a page is available at multiple URLs like this:

Both types of duplicate content reduce the number of pages in the index because search engines are wasting their time crawling multiple copies of a website or a page.

Search engines throw away these extra copies because there is no point in including redundant pages in the index. This means that time that could have been spent crawling more pages on your site was wasted crawling extra copies of pages that won’t be used anyway.

For the example pages above, that site would have to be crawled at least five times to get each page of the site.

If you have a duplicate site, you can use a 301 to permanently redirect any visitors to the main site.

Fixing duplicate content at the page level is a bit trickier.

Select one canonical URL from each set of potential duplicate URLs and make sure that each duplicate URL permanently redirects to the canonical one. If this isn’t possible – for example, due to tracking parameters like referral_id=1 above – use a link rel=canonical tag that points to the canonical URL and configure Bing and Google webmaster tools to ignore the appropriate parameters.
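To make the idea of picking one canonical URL concrete, here is a minimal Python sketch that maps several duplicate URLs onto a single form. Everything site-specific in it is an assumption for illustration: the example.com host names, the list of tracking parameters (only referral_id comes from this article), and the default file names. Your own rules will depend on how your site generates URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumptions for illustration only; adjust to your own site.
CANONICAL_HOST = "www.example.com"
DUPLICATE_HOSTS = {"www1.example.com", "example.com"}
TRACKING_PARAMS = {"referral_id", "ref"}
DEFAULT_FILES = ("index.php", "index.html")


def canonical_url(url):
    """Map a URL to the canonical form a redirect or canonical tag would use."""
    parts = urlsplit(url)

    # Fold duplicate host names into the one canonical host.
    host = parts.netloc.lower()
    if host in DUPLICATE_HOSTS:
        host = CANONICAL_HOST

    # Treat an empty path as "/" and drop default file names.
    path = parts.path or "/"
    for name in DEFAULT_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break

    # Keep only query parameters that actually change the page content.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


for url in (
    "https://www1.example.com/index.php?referral_id=1",
    "https://www.example.com/products?color=red&ref=affiliate1",
    "https://www.example.com",
):
    print(url, "->", canonical_url(url))
```

The output of a function like this is what the 301 redirect target or the link rel=canonical href should be for each duplicate URL.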

Diagnosing Crawl Inefficiencies

Allowing pages with no value to be crawled means that the search engines are spending valuable resources crawling things like API calls, log files, or pages with an infinite number of combinations like a web calendar.

Similar to duplicate content, crawl inefficiency means that search engines are crawling useless pages, at the expense of pages that you would like crawled.

These zero-value pages aren’t going to lead to any conversions, assuming that they are even indexed by search engines or rank well for anything.

To fix these types of problems, use the robots.txt file to exclude these types of pages. Be sure to test any changes to your robots.txt file in Google Webmaster Tools before pushing them live.
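You can also do a quick sanity check of a draft robots.txt file on your own machine before it goes anywhere near production. The sketch below uses the robots.txt parser in Python's standard library; the draft rules and the example.com URLs are hypothetical. Note that this parser only does simple path-prefix matching and may not interpret every rule the way a particular search engine does, so treat it as a first pass rather than a replacement for the webmaster tools test.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules that exclude API calls, log files, and a calendar.
DRAFT_ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Disallow: /logs/
Disallow: /calendar/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the draft robots.txt would block for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


URLS_TO_CHECK = [
    "https://www.example.com/api/v1/items",
    "https://www.example.com/products/widget",
    "https://www.example.com/calendar/2011/05",
]

blocked = blocked_urls(DRAFT_ROBOTS_TXT, URLS_TO_CHECK)
print(blocked)
```

A list of URLs you expect to be blocked and a list you expect to stay crawlable make a handy regression test whenever the file changes.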

Networking problems can be very elusive. Most of the networking problems I have seen involve either load balancing or DNS.

Load balancing is used on larger sites to spread web requests among a number of back-end servers. Sometimes it is misconfigured so that most crawler requests go to a single back-end server, which eventually slows to a crawl.

DNS problems can make a website unnecessarily slow for first-time visitors or, in extreme cases, make it intermittently unavailable.

You can easily check your DNS configuration with an on-line tool like IntoDNS. Checking the load balancers or other aspects of the back end network is not so easy, so it’s probably best to ask a network engineer about any recent changes to the infrastructure.

Symptoms: Wrong Pages Ranking, Decline In Ranking

These symptoms are usually caused by duplicate copies of important pages or by search engines not being able to understand the linking structure of your site.

Duplicate content can have a negative effect on ranking because inbound links to a particular page – a very important signal for search engines – are spread out among different URLs. As a result, the search engine is only aware of the number of inbound links for the one copy of the page that it decides to keep.

Make sure that all of the intended inbound links count towards the page by fixing these duplicate URLs as described above.

Another important signal for search engines is how a page is linked within a site. For example, a page with a link from the homepage will be considered a more important page than a page that is orphaned on the site with no links.

Coding site navigation elements in Flash, Silverlight, or JavaScript can make it impossible for search engines to extract these links. As a result, they are missing key information about what pages on a site are the most important.
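If you already have a list of which pages link to which (from a crawl of your own site, for example), a few lines of code can show how the internal linking signal described above looks for each page. The sketch below is only an illustration: the site map is made up, and counting inbound links is a rough stand-in for whatever search engines actually compute.

```python
def inbound_link_counts(site_links):
    """Count how many distinct pages link to each page.

    site_links maps each page to the list of internal pages it links to.
    """
    counts = {page: 0 for page in site_links}
    for source, targets in site_links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts


def orphaned_pages(site_links, homepage="/"):
    """Return pages (other than the homepage) with no inbound internal links."""
    counts = inbound_link_counts(site_links)
    return sorted(
        page for page, count in counts.items() if count == 0 and page != homepage
    )


# Hypothetical crawl result: page -> internal links extracted from it.
SITE_LINKS = {
    "/": ["/products", "/about"],
    "/products": ["/", "/products/widget"],
    "/about": ["/"],
    "/products/widget": ["/products"],
    "/old-landing-page": ["/"],
}

counts = inbound_link_counts(SITE_LINKS)
orphans = orphaned_pages(SITE_LINKS)
print(counts)
print(orphans)
```

If the navigation is built in a way that a crawler can't read, the extracted link lists will be nearly empty and most of the site will look orphaned, which is roughly what the search engine sees.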

Investigate Before You Make Assumptions

This is not a complete list of root causes for indexing issues and traffic loss, but it does contain the most common issues that I have seen with sites that I have been asked to review.

Other causes of similar symptoms are page speed, cache unfriendliness, internationalization issues, server misconfigurations, and security vulnerabilities. Each one is worthy of an article in itself.

I hope this provides some additional ideas of where to hunt down causes of particularly vexing problems with the way your site is performing in search.

Fortunately, it is much easier to redirect a duplicate copy of your site or fix a DNS misconfiguration than it is to influence Google or Bing’s algorithms.

While search engines definitely penalize some sites and it is possible for a site to get caught up in algorithm changes, make sure you have thoroughly reviewed your technical architecture before jumping to any conclusions about what search engines don’t “like” about it.