A Marketer’s Guide To Using Regular Expressions In SEO

Regular expressions (regex) are one of the most powerful tools we have in our SEO arsenal, but they’re incredibly intimidating! Here are some tips and tricks from one SEO to another that I hope will help you dip your toes into the powerful world of regex.

I must begin with a disclaimer: I’m not a coder, a developer or a network admin. My use of regex is very entry-level, but what I’m about to share has worked well for me across a variety of platforms. I want to share three of them with you: Google Analytics, Screaming Frog and htaccess.

Fundamentals Of Regular Expressions (Regex)

Let’s begin with a few fundamentals of what regex is and what it can do for you. Regex commands basically help you find (and/or replace) values that you can’t spell out exactly in advance. For example, let’s say you have a list of URLs and you need to break them down into just the domain (hooli.com, with no protocol, www or file path).

You can use a simple find/replace for http and www, but how do you easily knock all of the filenames off? You could remove all of them manually, but that’s a pain. Once the prefix is gone, a simple regex wildcard (/.*) lets you drop the first slash and everything that comes after it.
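To make that cleanup concrete, here is a minimal sketch in Python (my choice of language for illustration; the same patterns work in most find/replace tools that support regex). The helper name strip_to_domain and the sample URLs are made up for this example. It uses the pattern /.* for "a slash and everything after it."

```python
import re

def strip_to_domain(url: str) -> str:
    """Reduce a URL to its bare domain (hypothetical helper for illustration)."""
    # Step 1: remove the protocol and an optional leading "www."
    without_prefix = re.sub(r"^https?://(www\.)?", "", url)
    # Step 2: remove the first slash and everything that comes after it
    return re.sub(r"/.*$", "", without_prefix)

sample_urls = [
    "https://www.hooli.com/products/pied-piper.html",
    "http://hooli.com/blog/some-post?utm_source=x",
    "https://www.example.com",
]
for url in sample_urls:
    print(strip_to_domain(url))
```

The order matters: if the slash wildcard ran first, it would start at the slashes in "https://" and wipe out the domain too.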

Basic Commands In Regex

Now, to begin with, it helps to have a grasp of the most basic commands and what they mean:

  • First, there’s this little guy: (.*). While this is technically a combination of a couple of discrete commands (the dot means “any character” and the asterisk means “any number of times”), for the regex newbie, just know that it means “match an unlimited number of characters.” Basically, what this command means is anything. You usually use it before or after something else, so that you’re saying, “show me anything that starts with, ends with or contains x,” depending on what you’re looking for. For example, let’s say you wanted to find any keyword in a list that contains “tiggers,” regardless of what came before it. You would use the command (.*)tiggers
  • The (^) will signal the command to only match items that “start with” whatever you put after it. So, if you wanted to pull all the values that start with “tiggers,” you could use this: ^tiggers

  • The ($) marks the end of the value being matched. It keeps anything extra at the end, like a query string, from being included in a match you select. For example, let’s say you wanted to match anything that contains “tiggers,” but only if tiggers is the end of the string. You would use a query like this: (.*)tiggers$

    This will match “I-like-tiggers” but not “the-best-thing-is-tiggers-are-wonderful-things.”

    Annie Cushing has a great little trick to remember these two — she says you “lead with a carrot(^), but at the end of the day, it’s all about money($).”

  • The next one you want to know is the ($1) command. This allows you to replace one thing with something else but keep whatever else came before or after it. For example, let’s say you have pages all about tigger, but you’ve decided to go into piglet instead. You want to replace “tigger” with “piglet” in all of the values that contain it, and they all follow the same structure. Let’s say your values are:

    tiggers-piglets

    Since all of these start with “/tigger/,” it’s easy to achieve this with regex. All you have to do is match ^/tigger/(.*)$ and replace it with /piglet/$1

    Basically, what you’re saying with the command above is: for anything that starts with (^) “/tigger/,” take anything that comes after it (.*) and replace “tigger” with “piglet,” but keep everything ($1) that comes after it the same. You can use more than one set of parentheses in the same command; $1 is whatever the first set matched, $2 is the second, $3 the third, and so on.

    For an example of two dollar signs in one command, say you had /something/tigger/bouncing/something-else and you wanted to replace “tigger” with “piglet” but keep everything else the same regardless of what it was. You would use:

    ^/(.*)/tigger/(.*) becomes /$1/piglet/$2 (“becomes” is not a valid operator, just an example)

    This will cause the value above to result in /something/piglet/bouncing/something-else.

  • Finally, you should know about the pipe bar, because it’s a powerful little tool. The pipe bar allows you to give options. In the example above, maybe you currently have pages on tigger, but you also have pages about kanga. So you need to replace both tigger and kanga. To do this, you would use the same command, except include tigger and kanga both as options. Because the options sit inside their own set of parentheses, they become the second value ($2), and whatever comes after them becomes the third ($3). It would be written this way:

    ^/(.*)/(tigger|kanga)/(.*) becomes /$1/piglet/$3 (“becomes” is not a valid operator, just an example)

    The pipe bar means “or.” So the command above says: take anything that starts with a slash, contains either tigger or kanga as a directory in the middle, and replace only tigger or kanga with piglet, keeping everything else around those values.

    It’s a lot harder to show these concepts without concrete examples, so below, I’ve provided an example of how this works in a real program.
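Before moving on to real tools, the commands above can be tried out in a few lines of Python. This is only a sketch with made-up sample values. One difference to be aware of: Python writes the replacement references as \1, \2, \3 where this article (and htaccess) writes $1, $2, $3.

```python
import re

keywords = [
    "I-like-tiggers",
    "tiggers-are-great",
    "the-best-thing-is-tiggers-are-wonderful-things",
]

# ^ only matches values that start with what follows it
starts_with = [k for k in keywords if re.search(r"^tiggers", k)]
# $ only matches values that end with what precedes it
ends_with = [k for k in keywords if re.search(r"(.*)tiggers$", k)]

# One set of parentheses: \1 (the article's $1) is whatever (.*) matched
one_group = re.sub(r"^/tigger/(.*)$", r"/piglet/\1", "/tigger/bouncing")

# Two sets of parentheses: \1 and \2
two_groups = re.sub(
    r"^/(.*)/tigger/(.*)",
    r"/\1/piglet/\2",
    "/something/tigger/bouncing/something-else",
)

# The pipe options are wrapped in their own parentheses, so they become
# the second value and the trailing part becomes the third
with_pipe = re.sub(
    r"^/(.*)/(tigger|kanga)/(.*)",
    r"/\1/piglet/\3",
    "/something/kanga/hopping/something-else",
)

print(starts_with, ends_with, one_group, two_groups, with_pipe, sep="\n")
```

Counting parentheses from left to right is the easiest way to work out which number refers to which piece.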

Regex in Google Analytics

Have you ever used regex matching in Google Analytics? It’s so powerful. Let’s say you have a brand name called Hooli and a product called Pied Piper. You want to see how much of your traffic is coming from landing pages that don’t have the brand name in the URL. You could do a separate report for each brand name and then deduplicate and subtract from the total, or you could just use regex. And let’s say Hooli is often misspelled holi and wholi. You can account for those too.

In analytics, select “landing page” as the primary dimension. Then click on “advanced” and select “Exclude” and “Matching RegExp.” Formulate your query to include any of the options – hooli, holi, or wholi. There are other ways to do this; for example, you could use w?hoo?li as the command instead, but that gets a little too complicated. So stick with hooli|holi|wholi and it will eliminate any landing pages that contain those words in the URL.

Want to add pied and piper too? Just add them: hooli|holi|wholi|pied|piper
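If you want to sanity-check a pattern like this before pasting it into a report, a rough Python sketch can mimic the "Exclude" filter. The landing pages below are invented, and the sketch assumes the filter behaves like a "contains" match on lowercase URLs, which is worth confirming in your own account.

```python
import re

landing_pages = [
    "/hooli/pricing",
    "/blog/what-is-holi",
    "/wholi-review",
    "/products/pied-piper",
    "/blog/compression-tips",
    "/about",
]

brand_pattern = re.compile(r"hooli|holi|wholi|pied|piper")
# The more compact spelling mentioned above, for comparison
compact_pattern = re.compile(r"w?hoo?li|pied|piper")

# "Exclude" keeps only the pages where the pattern is NOT found
non_brand = [p for p in landing_pages if not brand_pattern.search(p)]
non_brand_compact = [p for p in landing_pages if not compact_pattern.search(p)]

print(non_brand)
print(non_brand == non_brand_compact)
```

On this sample, both spellings leave the same two non-brand pages, so the longer, more readable version costs nothing.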

Let’s take another example. Say you need a report that filters only pages from a certain part of your site, like “music.” But your site architecture is broken, and the /music subdirectory might appear in any position. You only need the one that appears in the third position. You can’t use starts with, or ends with, or contains, so what do you do? The answer is regex. Using what you’ve learned above, plus one new piece, you can create a report that only shows music in the third subdirectory. The new piece is [^/]+, which means “one or more characters that are not a slash.” You would code it like this: ^/[^/]+/[^/]+/music(/|$)

The command is telling analytics to match any landing page that starts with a slash, then one directory name, then a slash, then a second directory name, then a slash, and then music, followed by either another slash or the end of the URL. In other words, only match “music” if it’s three directories deep. A looser version built from the wildcard, ^/.*/.*/music/.*, looks similar, but because .* can also match slashes, it will match “music” at any depth from the third directory down.
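Directory-depth patterns are easy to get subtly wrong, so it helps to run them against a handful of paths first. The sketch below uses invented paths and compares a pattern built only from the .* wildcard with a stricter one built from [^/]+ ("one or more characters that are not a slash"). The point to watch is that .* is happy to match slashes as well.

```python
import re

paths = [
    "/us/en/music/",
    "/us/en/music/jazz/miles-davis",
    "/us/en/music",
    "/us/en/shop/music/jazz",   # music is four directories deep
    "/music/us/en/",            # music is the first directory
    "/us/music/",               # music is the second directory
]

loose = re.compile(r"^/.*/.*/music/.*")
strict = re.compile(r"^/[^/]+/[^/]+/music(/|$)")

loose_hits = [p for p in paths if loose.search(p)]
strict_hits = [p for p in paths if strict.search(p)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

On these samples the wildcard version lets the four-deep path through and misses the path with no trailing slash, while the stricter version keeps only the three paths where music is the third directory.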

You can imagine how you could learn a few more commands to pull all pages with more than x directories, or to create really detailed custom segments.

Regular Expressions With Screaming Frog

Now how about Screaming Frog? Did you know you can crawl just certain areas of the site, or look for specific bits of code even when they’re nonstandard? Here are two of my favorites:

Includes/Excludes: under the configuration tab in Screaming Frog, you can select Include or Exclude. The example given in the interface is a very simple one. For example, if you didn’t want to include the blog in your crawl efforts, you could exclude https://www.site.com/blog/.*. But if you wanted to try something a little more complicated, you could easily use a regular expression like one of the ones above. For example, if you know that the login and admin pages of a site are going to be a problem, you could modify the above command to https://www.site.com/(login|admin)/.* instead. If you aren’t sure where in the hierarchy the login or admin directory would appear, you could use:
.*(login|admin).*
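Here is a rough way to preview what an exclude list would catch, sketched in Python. It assumes each pattern has to match the whole URL (which is why the patterns begin or end with .*); that is my understanding of how the tool applies them, so treat it as an assumption and confirm with a small test crawl. The URLs are invented.

```python
import re

# Patterns taken from the examples above
EXCLUDE_PATTERNS = [
    r"https://www.site.com/blog/.*",
    r".*(login|admin).*",
]

def is_excluded(url: str) -> bool:
    """Return True if any pattern matches the entire URL (assumed behavior)."""
    return any(re.fullmatch(pattern, url) for pattern in EXCLUDE_PATTERNS)

test_urls = [
    "https://www.site.com/blog/post-1",
    "https://www.site.com/shop/admin/settings",
    "https://www.site.com/products/",
    "https://www.site.com/news/badminton-results",
]
for url in test_urls:
    print(is_excluded(url), url)
```

The last URL is the interesting one: "badminton" contains "admin," so the broad pattern excludes it too. A tighter pattern such as .*/(login|admin)/.* would avoid that kind of accident.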

Custom Config: One of the most powerful capabilities of Screaming Frog, though, is the custom configuration feature. This can tell you if a certain string appears anywhere in the code of a page that you crawl. It’s particularly powerful for picking out nofollow links when you’re doing a link audit. For example, let’s say that you have a list of pages that link to your site. You want to know if those pages still contain your link, and if they do, is it nofollowed? It’s easy to modify a regex to do this:

<a.{0,100}href=.{0,100}?website\.com(.{0,100}?)(nofollow)

This will show you any link where your target website (replace website\.com with your target) has nofollow appearing within 100 characters after the website address.

Notice I said modify, not create. And this code contains curly braces, something we didn’t talk about above. One of the great things about regex is that once you find code that works, you can modify it for your own purposes.

For example, if you wanted to take the same code and find any pages that contain images as links, you could easily modify nofollow to img instead. But always test and retest your code – it’s easy to make a mistake, especially if you don’t really understand what the code is doing. In the code above, the {0,100} means any amount from 0 to 100 characters can appear. In this particular case, that’s fine. But if you were modifying this code for something where you needed to look forward or backward more than 100 characters, you’d probably want to do this a different way.
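In the spirit of "test and retest," here is a small Python sketch that runs the nofollow pattern against three invented snippets of HTML. It is not how Screaming Frog runs the search internally, just a quick way to see what the pattern does and does not catch.

```python
import re

NOFOLLOW_LINK = re.compile(
    r"<a.{0,100}href=.{0,100}?website\.com(.{0,100}?)(nofollow)"
)

samples = {
    # The link to the target site carries rel="nofollow"
    "nofollowed": '<a href="https://website.com/page" rel="nofollow">site</a>',
    # The link to the target site is a normal, followed link
    "followed": '<a href="https://website.com/page">site</a>',
    # Followed link, but a DIFFERENT nofollowed link sits right after it
    "neighbor": (
        '<a href="https://website.com/page">site</a> '
        '<a href="https://other.com/" rel="nofollow">other</a>'
    ),
}

results = {name: bool(NOFOLLOW_LINK.search(html)) for name, html in samples.items()}
print(results)
```

The "neighbor" case comes back as a match even though the link to website.com is followed, because the word nofollow appears within 100 characters of it. That is the kind of false positive worth spot-checking by hand before acting on the results.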

Redirects & Regular Expressions

And finally, htaccess. If you don’t know what this is, it’s the file (named .htaccess) that controls how an Apache server responds to requests. You can control IP addresses, WordPress functions, user agent detection and lots of other things with this file.

For purposes of this exercise, let’s talk about URLs. Now, I must begin with a warning: HTACCESS CAN BRING DOWN YOUR ENTIRE SITE! Always, before you touch htaccess, do these four critical things:

  1. Make a backup copy of your htaccess file. You will need it if you mess something up. And you will.
  2. Do not change an htaccess file if you do not have direct access to the server’s files via FTP. If you screw up and bring your site down, you will not be able to get to the cPanel or WordPress login to fix it. I strongly recommend never changing htaccess inside of a plugin, cPanel, or anywhere other than a text editor.
  3. Make sure there are no other plugins doing things like redirection, 404 handling, or other server-based commands that would interfere with your changes. These aren’t the devil, but you do need to know exactly what they are doing.
  4. Ensure that there is only one htaccess file and that you are editing the right one. There should only be one, but if you see more than one (perhaps in a different directory), you probably need professional help.
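For the backup step, copying the file by hand is perfectly fine. If you prefer to script it, here is a minimal Python sketch that makes a timestamped copy of a local file. The function name backup_htaccess is made up, and the demo runs against a throwaway temporary folder rather than a real server.

```python
import shutil
import tempfile
import time
from pathlib import Path

def backup_htaccess(htaccess_path: str) -> Path:
    """Copy the file next to itself with a timestamp and .bak suffix."""
    source = Path(htaccess_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, backup)  # copy2 also preserves the file's timestamps
    return backup

# Demo in a temporary folder so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / ".htaccess"
    original.write_text("Redirect 301 /somefile https://www.hooli.com/someotherfile\n")
    copy_path = backup_htaccess(str(original))
    backup_matches = copy_path.read_text() == original.read_text()

print(copy_path.name, backup_matches)
```

Keep at least one copy somewhere off the server as well, since a backup that lives only on a broken site is hard to reach.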

Ok, now that you’re safe and secure and your original is backed up (it is backed up, right?!) you can start to play. You can do simple 301 redirects this way:

Redirect 301 /somefile https://www.hooli.com/someotherfile

But RedirectMatch is much more powerful and allows you to use those cool regular expressions. A quick aside… I know RewriteEngine is more elegant, but it’s also a lot more complicated. My goal is to share some simple techniques I’ve used that work for a regex newbie.

Using our Hooli example again, let’s say you’re moving from a structure where all of your blog URLs have /blog/ in them and in the new structure, they’ll have the same filenames but /blog/ will be removed. You can redirect all of them (whether there’s 10 or 10,000) with a single command:

RedirectMatch 301 ^/blog/(.*)$ https://www.hooli.com/$1

This command basically means: for URLs where the path starts with /blog/, with anything at all after it, redirect with a 301 status to the domain with the same things after it, but without /blog/.

Now let’s say you’re not just removing /blog/, you’re replacing it with /news/silicon-valley/. You’ll change the command to this:

RedirectMatch 301 ^/blog/(.*)$ https://www.hooli.com/news/silicon-valley/$1

(Note: This may display as line-wrapping, but the actual command should all be on one line.)
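Since a bad rule can take a site down, it can be reassuring to dry-run the pattern on a few paths before uploading anything. The Python sketch below imitates what a RedirectMatch line does with its pattern and target. The function simulate_redirect is a made-up helper, not part of Apache, and it only approximates the real behavior.

```python
import re
from typing import Optional

def simulate_redirect(pattern: str, target: str, path: str) -> Optional[str]:
    """Return the URL a path would be sent to, or None if the rule does not apply."""
    match = re.search(pattern, path)
    if match is None:
        return None
    # Swap each $1, $2, ... in the target for the text its parentheses matched
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))) or "", target)

remove_blog = (r"^/blog/(.*)$", "https://www.hooli.com/$1")
move_to_news = (r"^/blog/(.*)$", "https://www.hooli.com/news/silicon-valley/$1")

removed = simulate_redirect(*remove_blog, "/blog/my-first-post")
moved = simulate_redirect(*move_to_news, "/blog/my-first-post")
untouched = simulate_redirect(*remove_blog, "/about")
nested = simulate_redirect(*remove_blog, "/company/blog/my-first-post")

print(removed, moved, untouched, nested, sep="\n")
```

The last two come back as None: /about has no /blog/ in it, and /company/blog/... does not start with /blog/, so the ^ keeps the rule from touching it.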

Now, what if you have a very messy legacy site and you need to change all of these URLs to point to one particular page? All you have to do is look for a repeatable pattern:

https://www.hooli.com/products/pied-piper
https://www.hooli.com/products/pied
https://www.hooli.com/products/pie-piper
https://www.hooli.com/products/pieds-pipers

These all start with /products/pie. You’ll want to make sure the new site isn’t going to have any legitimate pages that match this pattern, but once you know they won’t, you can redirect all of these with a single command.

RedirectMatch 301 ^/products/pie(.*)$ https://www.hooli.com/services/pied-piper

(Note: This may display as line-wrapping, but the actual command should all be on one line.)
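Checking for legitimate pages that would get swept up is easy to do with a short script. In this Python sketch, the first four paths are the legacy URLs from above; the last two are hypothetical pages invented to show what the pattern would and would not catch.

```python
import re

catch_all = re.compile(r"^/products/pie(.*)$")

paths = [
    "/products/pied-piper",
    "/products/pied",
    "/products/pie-piper",
    "/products/pieds-pipers",
    "/products/pipeline",        # hypothetical page: "pip", not "pie"
    "/products/pie-chart-tool",  # hypothetical page: starts with "pie"
]

caught = [p for p in paths if catch_all.search(p)]
print(caught)
```

All four legacy URLs are caught, /products/pipeline is left alone, and /products/pie-chart-tool would be redirected along with the rest. If a page like that existed on the new site, the pattern would need to be tightened first.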

Finally, maybe you have a structure where everything ends in .html, and none of your new pages will end that way. Again, find the pattern if there is one:

RedirectMatch 301 ^/(.*)\.html$ https://www.hooli.com/$1

(Note: This may display as line-wrapping, but the actual command should all be on one line.)

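What if everything matches but one or maybe two specific files? There’s a fix for that too. Add an exclusion, written as (?!...) right after the opening slash, with no spaces anywhere inside the pattern (here, notthisfile.html is the file to leave alone):

RedirectMatch 301 ^/(?!notthisfile\.html$)(.*)\.html$ https://www.hooli.com/$1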

(Note: This may display as line-wrapping, but the actual command should all be on one line.)

If you have more than one or two exclusions, you really need to use RewriteEngine rules instead.
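The .html patterns have a couple of details that are easy to miss, so here is a Python sketch for testing them. It makes two assumptions about the intended form: the dot before html is escaped as \. so that it only matches a real dot, and the exclusion is written as a lookahead, (?!...), with no spaces inside the pattern. The file names are invented.

```python
import re

unescaped = re.compile(r"^/(.*).html$")    # a bare dot matches ANY character
escaped = re.compile(r"^/(.*)\.html$")     # \. matches only a real dot
with_exclusion = re.compile(r"^/(?!notthisfile\.html$)(.*)\.html$")

def new_path(rule, path):
    """Return the part that would replace $1, or None if the rule does not apply."""
    match = rule.search(path)
    return match.group(1) if match else None

print(new_path(escaped, "/about.html"))
print(new_path(unescaped, "/about-html"))
print(new_path(escaped, "/about-html"))
print(new_path(with_exclusion, "/notthisfile.html"))
print(new_path(with_exclusion, "/notthisfile-2.html"))
```

The unescaped version treats /about-html as if it ended in .html, while the escaped version leaves it alone. The exclusion skips exactly notthisfile.html and still redirects a similarly named file.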

This won’t work for every redirect you have; you will definitely still have some 1:1s, but it will help a lot and make your structure much more manageable going forward. For further reading on using RedirectMatch, check out apache.org. For more on regular expressions, this tutorial is great.

Did You Mess Something Up?

Don’t panic. Save what you built somewhere off the server, and re-upload your saved backup file. You didn’t keep a backup? That was dumb. Now you have to call someone and pay them to help you. Sorry, but that’s why you always keep a backup! Worst case scenario, upload an empty htaccess file; if the site runs on WordPress, add back the bit at the top that makes WordPress function. If you lost that too, Google it. This will take your site back to “factory settings” for the server. You won’t lose any content, but you’ll reset any redirects you had.

Hopefully these tips and tricks will help you become more efficient at your daily work as an SEO. Like I said above, I’m no regex genius, so if I’ve said something wrong or missed a caveat, please let me know in the comments. Likewise, if you have a trick you want to share with the community, please do!

 (Stock image via Shutterstock.com. Used under license.)

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


About the author

Jenny Halasz
Contributor
Jenny Halasz is the President of an online marketing consulting company offering SEO, PPC, and Web Design services. She's been in search since 2000 and focuses on long term strategies, intuitive user experience and successful customer acquisition. She occasionally offers her personal insights on her blog, JLH Marketing.
