URL Rewrites and Redirects: The Gory Details (Part 2 of 2)

Welcome back from Part 1, where I discussed in detail how to implement URL rewriting with Apache’s mod_rewrite module — complete with example rewrite rules, the more common regular expressions and how to use them. If you recall, I was just starting to get into rewrite rules for 301 redirects using the [R=301] flag. (Incidentally, I much prefer using the RewriteRule directive for setting up my 301 redirects rather than Redirect, RedirectPermanent, or RedirectMatch; more on that later.)

There’s another handy directive I often use in conjunction with RewriteRule, called RewriteCond. You would use RewriteCond if you’re trying to match on something in the query string, the domain name, or anything else not found between the domain name and the question mark in the URL (that path portion is all RewriteRule looks at). Note that neither RewriteRule nor RewriteCond can access the anchor part of a URL, i.e. whatever follows a #, because the browser uses it internally and never sends it to the server as part of the request. The following RewriteCond example tests the host name and allows the rewrite rule that follows to run only when the host is not the canonical one:

RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]

Let’s deconstruct what’s happening here. For any host name other than www.example.com, a 301 redirect is issued to the equivalent canonical URL on the www subdomain. The [NC] flag makes the rewrite condition case-insensitive. Where is the [QSA] flag so that the query string is preserved, you might ask? It’s not needed here: when the destination URL contains no query string of its own, the original query string is carried over automatically.

If you don’t want a query string retained on a rewrite rule with a redirect, put a question mark at the end of the destination URL in the rule. Like so:

RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1? [L,R=301]

Note the exclamation point at the beginning of the regular expression. That is interpreted as “NOT” by the rewrite engine.
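If your server runs Apache 2.4 or later, the QSD (“query string discard”) flag is an alternative to the trailing question mark. Here is a minimal sketch of the same canonical-host redirect, assuming Apache 2.4+ and the same placeholder host names used above:

```apache
# Assumes Apache 2.4 or later; the QSD flag does not exist in 2.2.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
# QSD drops the incoming query string, so no trailing "?" is needed.
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301,QSD]
```

The trailing question mark works on older versions too, so it remains the more portable choice.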

Why didn’t I use ^example\.com$ instead? Consider:

RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1? [L,R=301]

Because that wouldn’t have matched on typo domains such as exampel.com that the DNS server and virtual host would be set to respond to (assuming that misspelling was a domain you registered and owned).

Under what circumstances might we want to omit the query string from the redirected URL, as was done in the last two examples? When a session ID or a tracking parameter (like “source=banner_ad1”) needs to be dropped. Retaining a tracking parameter after the redirect is not only unnecessary (because the original URL with the tracking code appended would already have been recorded in your access log files when it was requested), it’s undesirable from a canonicalization standpoint. What if you wanted to drop the tracking parameter from the redirected URL, but retain the other parameters in the query string? Here’s how you’d do it for static URLs, where the tracking parameter is the only thing in the query string:

RewriteCond %{QUERY_STRING} ^source=[a-z0-9_]*$
RewriteRule ^(.*)$ /$1? [L,R=301]

and for dynamic URLs, where other parameters precede the tracking parameter and need to be kept:

RewriteCond %{QUERY_STRING} ^(.+)&source=[a-z0-9_]+(&?.*)$
RewriteRule ^(.*)$ /$1?%1%2 [L,R=301]
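The dynamic-URL condition above only matches when at least one parameter comes before source. A companion rule for the case where source is the first of several parameters might look like the following. It is a sketch under the same assumptions as the rules above (rules placed in .htaccess, tracking values made up of lowercase letters, digits, and underscores); page.php is a hypothetical script name used only in the comment:

```apache
# Tracking parameter comes first, followed by at least one other parameter,
# e.g. /page.php?source=banner_ad1&id=5 redirects to /page.php?id=5
RewriteCond %{QUERY_STRING} ^source=[a-z0-9_]+&(.+)$
RewriteRule ^(.*)$ /$1?%1 [L,R=301]
```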

Need to do some fancy stuff with cookies before redirecting the user? Invoke a script that cookies the user then 301s them to the canonical URL:

RewriteCond %{QUERY_STRING} ^source=([a-z0-9_]*)$
RewriteRule ^(.*)$ /cookiefirst.php?source=%1&dest=$1 [L]

Note the lack of a [R=301] flag above. That’s on purpose. No need to expose this script to the user. Use a rewrite and let the script itself send the 301 after it has done its work.

Other canonicalization issues worth correcting with rewrite rules and the [R=301] flag include when the engines index: 1) online catalog pages under HTTPS URLs, and 2) URLs missing a trailing slash that should be there. First the HTTPS fix:

# redirect online catalog pages in the /catalog/ directory to HTTP if requested over HTTPS
RewriteCond %{HTTPS} on
RewriteRule ^catalog/(.*) http://www.example.com/catalog/$1 [L,R=301]

Note that if your secure server is separate from your main server, you can skip the RewriteCond line above.

Now to append the trailing slash:

RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
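As written, the rule above appends a slash to every URL that lacks one, including requests for real files such as a hypothetical /logo.png. One common guard is to test the filesystem first. This sketch assumes the rules live in .htaccess and that only non-file URLs should get the slash:

```apache
# Skip the redirect when the request maps to an existing file on disk
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
```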

WordPress handles missing trailing slashes by default. Yay WordPress!

Speaking of WordPress, did you know that when you update the “post slug” on a published post (i.e. revise the URL), WordPress will automatically 301 redirect all requests for the previous URL to the new URL? In fact, if you modify the post slug multiple times, all previous iterations will be responded to with a 301! And there won’t be a series of 301s daisy chained together; there is just one redirect issued to the latest iteration. Thus, you can employ a continuous improvement approach to your URL optimization, employing the “thin slicing” methodology I described in a recent column and my SEO Title Tag plugin to mass edit all your permalink post URLs and let WordPress handle the 301s automagically. It’s a beautiful thing.

After completing a URL rewriting project to migrate from dynamic URLs to static, you’ll want to phase out the dynamic URLs not just by replacing all occurrences of the legacy URLs on your site, but also by 301 redirecting the legacy dynamic URLs to their static equivalents. That way, any inbound links pointing to the retired URLs will end up leading both spiders and humans to the correct new URL — thus ensuring the new URLs are the ones that are indexed, blogged about, linked to, and bookmarked. Generally, here’s how you’d accomplish that:

RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]

However, you’ll get an infinite loop of recursive redirects if you’re not careful. One quick-and-dirty way to avoid that situation is by adding a nonsense parameter to the destination URL for the rewrite and ensuring this nonsense parameter isn’t present before doing the redirect. Specifically:

RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteCond %{QUERY_STRING} !blah=blah
RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]

RewriteRule ^products/([0-9]+)\.html$ /get_product.php?id=$1&blah=blah [L]
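An alternative to the nonsense parameter, if you’d rather keep it out of your internal URLs, is to test %{THE_REQUEST}, which holds the original request line sent by the browser and does not change when a URL is rewritten internally. A rough sketch, assuming the static URLs end in .html as in the redirect target above:

```apache
# Redirect only when the browser itself asked for get_product.php;
# THE_REQUEST looks like "GET /get_product.php?id=1001 HTTP/1.1"
RewriteCond %{THE_REQUEST} \s/get_product\.php\?
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]

# Internal rewrite back to the script; no marker parameter required
RewriteRule ^products/([0-9]+)\.html$ /get_product.php?id=$1 [L]
```

After the internal rewrite, THE_REQUEST still shows the /products/ URL, so the first rule does not fire again and no loop occurs.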

Notice above that I used two RewriteCond lines, stacked on top of each other. All rewrite conditions listed together in the same block will be “ANDed” together. If you wanted the conditions to be “ORed”, it would require the use of the [OR] flag.
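For example, to catch the bare domain and one specific typo domain (and nothing else), the two conditions could be ORed like this. It’s just a sketch; exampel.com is the hypothetical misspelling mentioned earlier:

```apache
# Either host name triggers the redirect; [OR] goes on every condition except the last
RewriteCond %{HTTP_HOST} ^example\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^exampel\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```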

Enough about redirects. Let’s move on to lookup tables and RewriteMap, a directive that functions within your server config file (not .htaccess). Let’s say you’d like to rewrite URLs that contain ID numbers to URLs that contain keywords. A laudable goal. Now let’s say you don’t have the lookup table in your database. You could reference a flat file — in text, or in DBM format (which is faster) — containing your mappings of ID numbers to keywords using RewriteMap, then base your RewriteRule on data found in that flat file. Here’s a hypothetical lookup table:

canon-g10-digital-camera /get_product.php?id=1001&blah=blah
128-gig-ipod-classic /get_product.php?id=1002&blah=blah

And here’s what the corresponding RewriteMap and RewriteRule directives might look like:

RewriteMap prodmap txt:/home/someusername/prodmap.txt
RewriteRule ^/products/(.+)\.html$ ${prodmap:$1} [L]
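Two refinements are worth knowing about. A map lookup can take a fallback value after a pipe character, used when the key isn’t found, and a text map can be converted to DBM with the httxt2dbm utility that ships with Apache. In the sketch below, the /not-found.html fallback page and the prodmap.map file name are my own placeholders, and the DBM flavor and resulting file name can vary by platform:

```apache
# Build the DBM file once from the text map (run from the shell):
#   httxt2dbm -i /home/someusername/prodmap.txt -o /home/someusername/prodmap.map
RewriteMap prodmap dbm:/home/someusername/prodmap.map

# "|/not-found.html" is the default used when the key is missing from the map
RewriteRule ^/products/(.+)\.html$ ${prodmap:$1|/not-found.html} [L]
```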

Conversely, you’ll want to 301 all of the legacy ID-containing URLs to the new keyword-containing ones. Like so:

RewriteMap prodmap2 txt:/home/someusername/prodmap2.txt
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteCond %{QUERY_STRING} !blah=blah
RewriteRule ^/?get_product\.php$ ${prodmap2:%1}? [L,R=301]

The corresponding lookup table for the above (“prodmap2.txt”) would look something like:

1001 /products/canon-g10-digital-camera.html
1002 /products/128-gig-ipod-classic.html

You aren’t restricted to text or DBM files. You could alternatively install a script that looks up what the rewrite rule had captured into memory (between the parentheses) and then delivers back to the rewriting engine the corresponding destination. Here’s the slightly modified RewriteMap for such an instance:

RewriteMap prodmap prg:/home/someusername/mapscript.pl
RewriteRule ^/products/(.+)\.html$ ${prodmap:$1} [L]

On to a different problem. Let’s say you wanted to rewrite to a URL located on another server. You can do it with the [P] flag. The “P” stands for “proxy”. For example, you could proxy the Google home page on your server (sans images) with the following rewrite rule:

RewriteRule /google\.html$ https://www.google.com/ [P,L]

Without the [P] flag, the rewrite rule above would behave like a redirect.
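The [P] flag relies on mod_proxy (plus mod_proxy_http for HTTP backends), and proxying to an https:// destination additionally needs SSLProxyEngine switched on. A minimal server-config sketch follows; the module paths are assumptions and differ between distributions:

```apache
# Module file locations below are assumed; adjust for your installation
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule ssl_module modules/mod_ssl.so

# Required before Apache will proxy to an https:// backend
SSLProxyEngine On

RewriteEngine On
RewriteRule /google\.html$ https://www.google.com/ [P,L]
```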

You might be wondering how to accomplish all this wizardry if your server is running Microsoft IIS instead of Apache. As I mentioned in Part 1, the rules don’t differ greatly between mod_rewrite and ISAPI_Rewrite. For instance, instead of initializing things with “RewriteEngine on”, you would specify “[ISAPI_Rewrite]” on the first line of the httpd.ini file. Instead of [R=301], you would use [RP] to issue a 301. Instead of [NC] for case insensitivity, you would use [I]. And so on. The easiest way to convey this is through some illustrative examples:

#Capitalization and IIS' case insensitivity with regard to URLs
RewriteRule (.*) https://www.example.com$1 [I,RP,L]

#Non-www and typo domains
RewriteCond Host: (?!www\.example\.com)
RewriteRule (.*) https://www.example.com$1 [I,RP,L]

#Drop the "default"
RewriteRule (.*)/default.htm $1/ [I,RP,L]

#Add trailing slash if it's missing
RewriteCond Host: (.*)
RewriteRule ([^.?]+[^.?/]) http\://$1$2/ [I,RP,L]

At the start of this article, I promised we’d revisit my reasoning as to why rewrite rules are my preferred method of redirecting. It’s simply because RewriteRule is so darned powerful and flexible in comparison to Redirect, RedirectPermanent and RedirectMatch. Even though RedirectMatch supports regular expressions, it doesn’t offer nearly as comprehensive a feature set as RewriteRule and RewriteCond. However, you may be on a web host or server that doesn’t have mod_rewrite installed or enabled. If that’s the case, it may be helpful to see a few examples of these alternative directives, which can be used in either .htaccess or httpd.conf:

# 301 an individual URL
Redirect 301 /old_url.htm https://www.example.com/new_url.htm

# 301 the contents of a directory
Redirect 301 /old_dir/ https://www.example.com/new_dir/

# 301 an entire domain (the trailing slash matters: the rest of the path is appended to it)
Redirect 301 / https://www.example.com/

# drop the index.html off the end of subdirectories and 301
RedirectMatch 301 ^/(.+)/index\.html$ https://www.example.com/$1/
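RedirectPermanent, mentioned above but not shown, is simply shorthand for Redirect 301. For instance, using the same placeholder URLs:

```apache
# Equivalent to: Redirect 301 /old_url.htm https://www.example.com/new_url.htm
RedirectPermanent /old_url.htm https://www.example.com/new_url.htm
```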

That’s all I’ve got — for now. I promise in my next article I won’t geek out so much. If you stayed with me this whole time, you deserve a cookie. Hit me up at the next conference and I’ll swipe one for you out of the speaker lounge.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.


About the author

Stephan Spencer
Contributor
Stephan Spencer is the creator of the 3-day immersive SEO seminar Traffic Control; an author of the O’Reilly books The Art of SEO, Google Power Search, and Social eCommerce; founder of the SEO agency Netconcepts (acquired in 2010); inventor of the SEO proxy technology GravityStream; and the host of two podcast shows Get Yourself Optimized and Marketing Speak.
