If you’re dealing with a large, complex website, rewriting your URLs from dynamic to static and putting all the necessary 301 redirects in place is, as programmers would say, nontrivial. The devil is in the details. Granted, the redirecting piece may not be quite as onerous anymore thanks to the advent of the search engines’ new canonical tag, but you still have to worry about the experience of users arriving at the obsolete URLs (through bookmarks, old links, etc.). Ideally, therefore, you’ll still want the 301s in place. Regular expressions and mod_rewrite (or alternatively, ISAPI_Rewrite for IIS) to the rescue!
Last week, when I presented on the 301 Redirect panel at SMX West, I gave folks a look “under the hood” at accomplishing redirects using rewrite rules and regular expressions, complete with the code necessary to pull it off. It wasn’t for the faint of heart. (My PowerPoint can be downloaded here.)
Regular expressions are so complex there are entire books dedicated to the topic (such as the excellent Mastering Regular Expressions, 3rd Ed. by Jeffrey E. F. Friedl and published by my long-time favorite book publisher, O’Reilly). Before we delve into the use of regular expressions in rewrite rules, however, let’s step back and look at the URL rewriting process.
The three types of URL rewrites
Rewriting URLs that are sub-optimal for search engines can be accomplished through three approaches. The first, using a “URL rewriting” server module/plugin such as mod_rewrite for Apache or ISAPI_Rewrite for Microsoft IIS Server, is the most popular. If you can’t use a URL rewriting module on your server, you might take the second approach: recode your scripts to extract variables from the “path_info” part of the URL instead of the “query_string”. An example of this might look like http://www.example.com/index.php/category/widgets.
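To make the path_info idea concrete, here is a minimal sketch in Python (your own scripts may well be PHP or something else). The function name parse_path_info and the assumption that path segments alternate between a variable name and its value are both hypothetical; your application may lay out its paths differently.

```python
def parse_path_info(path_info):
    """Turn a path_info string such as '/category/widgets' into a dict.

    Assumption (hypothetical): segments alternate name, value, name, value...
    """
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [s for s in path_info.split("/") if s]
    # Pair segment 0 with 1, 2 with 3, and so on; a trailing unpaired name is ignored.
    return dict(zip(segments[0::2], segments[1::2]))


# For http://www.example.com/index.php/category/widgets the server would
# hand the script a path_info of "/category/widgets".
print(parse_path_info("/category/widgets"))
```

The script then reads its variables from the returned dict instead of from the query string.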
With either approach, you’d want to replace all occurrences of your old URLs in links on your site with your new search-friendly URLs. Additionally, you may wish to 301 redirect the old URLs to the new ones, but this is apparently optional with the advent of the canonical tag. The third approach would be to use a proxy server based solution (e.g. GravityStream) that eliminates the need to recode your site or re-architect your CMS/e-commerce platform. This can be useful when IT department involvement with SEO projects must be minimized, for whatever reason.
Let’s assume you’re going with the first approach: utilizing a rewriting module. If you are running Apache as your web server, you would place “rules” within your .htaccess file or your Apache configuration file (e.g. httpd.conf or the site-specific config file in the sites_conf directory). Similarly, if you are running IIS Server, you’d use an ISAPI plugin such as ISAPI_Rewrite and place rules in an httpd.ini config file. Note that rules can differ slightly on ISAPI_Rewrite compared to mod_rewrite. For Apache and mod_rewrite, your .htaccess would start off with:
RewriteEngine on
RewriteBase /
Note that you should omit the second line above if adding the rewrites to the main body of your server config file, since RewriteBase is only valid in a per-directory context (.htaccess or a <Directory> block), not at the server or virtual host level. In .htaccess, the rewrite engine strips the leading slash from the path before matching, so we won’t have to have “^/” at the beginning of all the rules, just “^”; RewriteBase sets the URL prefix that is added back to relative substitutions.
After this come the rewrite rules. Let’s say we wanted requests for product page URLs of the format http://www.example.com/products/123 to display the content found at http://www.example.com/get_product.php?id=123, without the URL changing in the location bar of the user’s browser and without you having to recode the get_product.php script. (Of course this doesn’t replace all occurrences of dynamic URLs within the links contained on all the site pages; that’s a separate issue.) Accomplishing this can be done with a single rewrite rule, like so:
RewriteRule ^products/([0-9]+)/?$ /get_product.php?id=$1 [L]
In the above example, ^ signifies the start of the URL following the domain, $ signifies the end of the URL, [0-9] signifies a digit and the + immediately following it means one or more occurrences of a digit. Similarly, the ? immediately following the / means zero or one occurrences of a slash character. The () puts whatever is wrapped within it into memory. You can then access what’s been stored in memory with $1 (i.e. what is in the first set of parentheses). Not surprisingly, if you included a second set of parentheses in the rule, you’d access that with $2. And so on. The [L] flag saves on server processing by telling the rewrite engine to stop if it matched on that rule. Otherwise all the remaining rules will be run as well.
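If you want to experiment with the pattern before touching your server, a rough way is to try it in Python’s re module, whose syntax is close to what mod_rewrite uses for simple patterns like this one. The sketch below is only an approximation of the rule above: the rewrite function is a hypothetical stand-in for the rewrite engine, and the paths are written without a leading slash, as they would be seen in .htaccess.

```python
import re

# The pattern from the rule above.
PATTERN = re.compile(r"^products/([0-9]+)/?$")


def rewrite(path):
    """Return the internal target for a path, or None if the rule does not match."""
    m = PATTERN.search(path)
    if m is None:
        return None
    # m.group(1) plays the role of $1: whatever the first parentheses captured.
    return "/get_product.php?id=" + m.group(1)


for path in ["products/123", "products/123/", "products/abc", "products/123/reviews"]:
    print(path, "->", rewrite(path))
```

The first two paths map to /get_product.php?id=123; the last two return None because of the digits-only class and the $ anchor.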
Sound complicated? You ain’t seen nothin’ yet! Here’s a slightly more complex example, where URLs of the format http://www.example.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=4&productID=123 would be rewritten to http://www.example.com/4/123.htm:
RewriteRule ^([^/]+)/([^/]+)\.htm$ /webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=$1&productID=$2 [QSA,L]
The [^/] signifies any character other than a slash. That’s because, within square brackets, ^ is interpreted as “not”. The [QSA] flag above is for when you don’t want the query string dropped (like when you want a tracking parameter preserved).
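Here is a small Python sketch of what this rule does with its two captures and with the query string. It is an illustration, not how mod_rewrite is implemented: the rewrite function and its qsa parameter are hypothetical, and the tracking parameter utm_source is just an example.

```python
import re

# The pattern from the rule above: two slash-free segments, the second ending in .htm
PATTERN = re.compile(r"^([^/]+)/([^/]+)\.htm$")

TARGET = (
    "/webapp/wcs/stores/servlet/ProductDisplay"
    "?storeId=10001&catalogId=10001&langId=-1"
    "&categoryID={cat}&productID={prod}"
)


def rewrite(path, query_string="", qsa=True):
    """Return the rewritten target, or None if the path does not match."""
    m = PATTERN.search(path)
    if m is None:
        return None
    # group(1) and group(2) correspond to $1 and $2.
    target = TARGET.format(cat=m.group(1), prod=m.group(2))
    # With QSA the original query string is appended to the new one;
    # without it, the original query string is dropped.
    if qsa and query_string:
        target += "&" + query_string
    return target


print(rewrite("4/123.htm", "utm_source=newsletter"))
print(rewrite("4/123.htm", "utm_source=newsletter", qsa=False))
```

Note that [^/]+ accepts any non-slash characters, not just digits, so a path like shoes/red.htm would match as well.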
To write good rewrite rules you will need to become a master of “pattern matching” (which is simply another way to describe the use of regular expressions). Let’s look at some of the most important special characters and how they are interpreted by the rewrite engine:
- * means 0 or more of the immediately preceding character
- + means 1 or more of the immediately preceding character
- ? means 0 or 1 occurrence of the immediately preceding character
- ^ means the beginning of the string
- $ means the end of it
- . means any character (i.e. wildcard)
- \ “escapes” the character that follows, e.g. \. means dot
- [ ] is for character ranges, e.g. [A-Za-z] for any lower or upper case letter
- ^ inside brackets means “not”, e.g. [^/] means not slash
It’s incredibly easy to make errors in regular expressions. Some of the common gotchas that lead to unintentional sub-string matches include:
- using .* when you should be using .+ since .* can match on nothing
- not “escaping” with a backslash special characters that you don’t want interpreted, like when you specify . instead of \. and you really meant a dot rather than any character. (thus default.htm would match on defaultshtm)
- omitting ^ or $ on the assumption that the start or end is implied (thus default\.htm would match on mydefault.html whereas ^default\.htm$ would only match on default.htm)
- using “greedy” expressions that will match on all occurrences rather than stopping at the first occurrence.
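The first three gotchas are easy to reproduce. The sketch below uses Python’s re.search, which, like a rewrite rule pattern, succeeds if the pattern is found anywhere in the string unless you anchor it. The matches helper is hypothetical, and the products/ pattern is made up for the .* versus .+ case.

```python
import re


def matches(pattern, path):
    """True if the pattern is found anywhere in path (search, not full match)."""
    return re.search(pattern, path) is not None


# Unescaped dot: "." matches any character, so "defaultshtm" slips through.
print(matches(r"^default.htm$", "defaultshtm"))      # True
print(matches(r"^default\.htm$", "defaultshtm"))     # False

# Missing anchors: the pattern matches in the middle of a longer string.
print(matches(r"default\.htm", "mydefault.html"))    # True
print(matches(r"^default\.htm$", "mydefault.html"))  # False

# .* can match nothing at all, while .+ needs at least one character.
print(matches(r"^products/.*$", "products/"))        # True
print(matches(r"^products/.+$", "products/"))        # False
```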
What do I mean by “greedy”? The easiest way to explain it is to show you an example. Let me illustrate:
RewriteRule ^(.*)/?index\.html$ /$1/ [L,R=301]
will redirect requests for http://www.example.com/blah/index.html to http://www.example.com/blah//. Probably not what was intended. Why did this happen? Because .* will capture the slash character within it before the /? gets to see it. Thankfully, there’s an easy fix. Simply use a negated character class such as [^/]* or the non-greedy .*? instead of .* to do your matching. For example, use ^(.*?)/? instead of ^(.*)/? or [^/]+/[^/]+ instead of .*/.*
So, to correct the above rule you could use the following:
RewriteRule ^(.*?)/?index\.html$ /$1/ [L,R=301]
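To see the difference between the greedy and non-greedy versions, you can compare what each one captures. This is a sketch in Python’s re module rather than in the rewrite engine itself, and redirect_target is a hypothetical helper that mimics the /$1/ substitution.

```python
import re

GREEDY = r"^(.*)/?index\.html$"
LAZY = r"^(.*?)/?index\.html$"


def redirect_target(pattern, path):
    """Build the /$1/ style target, or return None if the pattern does not match."""
    m = re.search(pattern, path)
    if m is None:
        return None
    return "/" + m.group(1) + "/"


# Greedy: .* swallows the slash, so the optional /? matches nothing.
print(redirect_target(GREEDY, "blah/index.html"))        # /blah//
# Lazy: .*? stops as early as it can, leaving the slash for /?
print(redirect_target(LAZY, "blah/index.html"))          # /blah/
print(redirect_target(LAZY, "store/cheese/index.html"))  # /store/cheese/
```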
Why wouldn’t you use the following?
RewriteRule ^([^/]*)/?index\.html$ /$1/ [L,R=301]
Because it would only match on URLs with a single directory level. URLs containing multiple subdirectories, such as http://www.example.com/store/cheese/swiss/wheel/index.html, would not match.
And why wouldn’t you use the following?
RewriteRule ^(.*)index\.html$ $1/ [L,R=301]
Because it would match on http://www.example.com/myindex.html as well (since it’s not specified that the character immediately preceding index must be a slash).
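A quick way to check claims like these is to run every candidate pattern against a handful of test paths. The sketch below does that with Python’s re module; the names in CANDIDATES are my own labels. One caveat it surfaces: because /? is optional, the other patterns above, including the non-greedy one, also appear to match myindex.html. The "strict" entry is a hypothetical variant, not one of the rules above, that requires a slash before index.html whenever anything precedes it (it would pair with a substitution of /$1 rather than /$1/).

```python
import re

CANDIDATES = {
    "greedy": r"^(.*)/?index\.html$",
    "lazy": r"^(.*?)/?index\.html$",
    "one_level": r"^([^/]*)/?index\.html$",
    "no_slash": r"^(.*)index\.html$",
    # Hypothetical stricter variant: an optional prefix that must end in a slash.
    "strict": r"^(.*/)?index\.html$",
}

PATHS = [
    "blah/index.html",
    "store/cheese/swiss/wheel/index.html",
    "myindex.html",
]


def matching_rules(path):
    """Return the names of the candidate patterns that match the given path."""
    return {name for name, pattern in CANDIDATES.items() if re.search(pattern, path)}


for path in PATHS:
    print(path, "->", sorted(matching_rules(path)))
```

Keeping a list of paths like this, including ones that should not match, makes it easier to catch unintended matches before a rule goes live.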
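Does your head hurt yet? As you might imagine, testing/debugging is a big part of URL rewriting. When debugging, the RewriteLog and RewriteLogLevel directives are your friend! Set the RewriteLogLevel to 4 or more to start seeing what the rewrite engine is up to when it interprets your rules. (In Apache 2.4 and later, these two directives were removed; use the LogLevel directive with a rewrite-specific level instead, e.g. LogLevel alert rewrite:trace4.)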
By the way, the [R=301] flag in the last few examples above — as you might guess — tells the rewrite engine to do a 301 redirect instead of a standard rewrite.
Continue reading Part 2 of URL Rewrites and Redirects: The Gory Details, where I cover the all-important RewriteCond directive, lookup tables, ISAPI rewrite rules, proxying, alternatives to RewriteRule for redirecting, and more.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.