Programming Data Collection For SEO Research


Last month, I showed you three tricks I use when gathering data on websites. I used these techniques to download webpages into a local folder. In and of themselves, these procedures are not SEO; however, a search engine optimization professional working on a large or enterprise website ought to know how to do this. In this article, I’ll show you how to:

  1. Make a list of pages inside a folder
  2. Set up a development environment
  3. Open webpages from a script and extract data

If you learn these procedures, I am certain you will find legitimate opportunities to use each, together or alone.

Create A List Of Files In A Directory

Mac users may wonder why I am bothering to go over how to turn the names of the files inside a directory into a text list. On the Mac, you just have to:

  • Select all the file names in the folder and hit copy
  • Create an empty text file
  • In the menu, select Edit, then Paste and Match Style

In Windows, on the other hand, there is no easy way to do this. Here’s my recipe:

  • Create a text file named dir.bat
  • Into the file, type the line dir /b /o:en > dir.txt
  • Save, close, then drop this file into the directory for which you want the list of files
  • Double click on the file to run the script
  • The .bat file will make a new text file with the list of file names
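If you would rather stay in one language for everything, PHP can produce the same list once your development environment (set up in the next section) is running. A minimal sketch; the folder path is a placeholder you should point at your own download directory:

```php
<?php
// list_files(): return the file names in $dir, sorted alphabetically,
// with the . and .. directory entries removed.
function list_files($dir) {
    $entries = scandir($dir); // scandir() sorts ascending by default
    return array_values(array_diff($entries, array('.', '..')));
}

// '.' is a placeholder; point this at the folder holding your downloaded pages.
foreach (list_files('.') as $name) {
    echo $name . "\n";
}
?>
```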

[Image: File Directory and bat Script]

Now that you have a list of files, let’s open one up and find the content you are looking for.

Set Up A PHP Environment

If the thought of setting up a PHP environment scares you, relax. All you need is a hosted website or a local disk drive. A hosting account is the easiest route: it already includes PHP, so all you need is an FTP program to create a subfolder and upload your script files.

For example, I created a simple Hello World script on one of my sites. If you do not have a hosted site, you can create your own Apache-and-PHP environment for free with XAMPP. XAMPP installs Apache, PHP, MySQL and some other programs that, together, create a development environment on your disk. I keep XAMPP on a thumb drive so I always have my scripts available wherever there is a PC with a USB port. After installing XAMPP:

  • Visit the XAMPP directory and run xampp-control.exe
  • Start Apache
  • In a Web browser, visit http://localhost

Websites go into /xampp/htdocs/ as subfolders; for example, these scripts go in /xampp/htdocs/scripts/. Copy and save the following as /xampp/htdocs/scripts/hello-world.php:

<?php echo "<html><head><title>Hello World</title></head><body><p style=\"font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: x-large; color: #FFFFFF; background-color: #8E8E17; margin-top: 250px; padding: 25px 25px 25px 25px; text-align: center\">Hello World. Seattle Calling.</p><p style=\"font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: x-large; color: #FFFFFF; background-color: #8E8E17; margin-top: 0px; padding: 25px 25px 25px 25px; text-align: center\">You can <a href=\"javascript:history.go(-1)\">[Go Back]</a> now.</p></body></html>"; ?>

To run the script, visit http://localhost/scripts/hello-world.php. Whether you put a subfolder on a hosted Web account or install XAMPP, either will function as your development environment. I prefer XAMPP on my thumb drive because I can save and execute files without having to upload them.

Extract Content From A Webpage

Enter this script into your development environment as twitter-followers.php.

<?php
$a[1] = '';
$a[2] = '';
$a[3] = '';
$a[4] = '';
$a[5] = '';
foreach ($a as $objectURL) {
    $handle = file_get_contents($objectURL);
    if (!$handle) die("Can't open device");
    preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $handle, $followers);
    echo "\"" . $objectURL . "\";\"" . $followers[2] . "\"<br />";
    usleep(500000); // pause half a second between requests; sleep() only takes whole seconds
}
?>

Now, run it: http://localhost/scripts/twitter-followers.php or wherever your development environment is located. It should output a delimited file with each Twitter URL and the number of followers the account has. Here is what is happening in this script.


$a[1] = '';

Each of the five $a[ ] assignments defines a variable as a Twitter address; fill in the empty quotes with the profile URLs you want to check. Notice the [1], [2], [3], and so on. This numbering makes it easy to write a formula in Excel that generates a line of PHP code for each Twitter address in a list.

foreach($a as $objectURL){}

This creates a loop that will go through all the $a[ ] variables you created. With each pass it assigns the contents of $a[ ] to $objectURL.

$handle = file_get_contents($objectURL);

This line reads the web URL and stores the HTML markup into $handle.

preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $handle, $followers);

This is where the magic happens. This is a Perl-compatible regular expression match.
    • It tests the HTML markup in $handle and writes the results to $followers
    • Each parenthesized group () is captured into its own element of $followers
    • The opening / and closing / mark the start and end of the pattern
    • .* is a regular expression that matches any run of characters, of any length; in this case, everything between <strong> and </strong> Followers
    • The \ character is an escape: placed before a reserved character, like / or *, it tells the regex engine to treat it as a literal letter, number or punctuation mark
    • The small i after the closing / makes the match case-insensitive
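To see the capture groups without fetching anything, here is a tiny sketch run against a hard-coded snippet of markup (the follower count is invented for the example):

```php
<?php
// A hard-coded sample of the kind of markup the pattern above targets.
$html = '<li>Stats: <strong>1,234</strong> Followers</li>';

// Group 2 (the second set of parentheses) captures the text between the tags.
preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $html, $followers);

echo $followers[2]; // prints 1,234
?>
```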

echo "\"" . $objectURL . "\";\"" . $followers[2] . "\"<br />";

This line prints the results. Notice the escape characters before the printed quotes. Also note the 2 in $followers[2]; this matches the second set of parentheses.

Knowing what you have learned, what do you think this line of code will do?

preg_match('/(<div class=\"fsm fwn fcg\">)(.*)( likes · )(.*)( talking about this<\/div>)/i', $handle, $followers);

If you said, “match Facebook like and talking about,” you’re right. However, if you try it, it will not work. Why? Because Facebook tests for a user agent, and the script does not provide one. Here is a case where you can go back to part one, write a macro that will download the webpages to your local computer, then capture the data you want. In your script, just change the URLs to files.
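Switching from live URLs to saved files is a small change, because file_get_contents() accepts a local path as readily as a URL. A sketch, assuming the pages were saved with the part-one macro; the folder and file names are placeholders:

```php
<?php
// extract_likes(): apply the Facebook pattern above to one page's markup.
// $m[2] holds the likes count, $m[4] the talking-about-this count.
function extract_likes($html) {
    preg_match('/(<div class=\"fsm fwn fcg\">)(.*)( likes · )(.*)( talking about this<\/div>)/i',
        $html, $m);
    return $m;
}

// Placeholder file names; point these at the pages you saved locally.
$files = array('saved-pages/page1.html', 'saved-pages/page2.html');

foreach ($files as $file) {
    if (!is_file($file)) continue; // skip any placeholder that does not exist
    $m = extract_likes(file_get_contents($file));
    echo '"' . $file . '";"' . $m[2] . '";"' . $m[4] . '"<br />';
}
?>
```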


PHP is perfectly capable of sending a user agent; however, you can still get blocked for other reasons. One example is sites that require you to be logged into an account, such as Open Site Explorer. Once you have saved those pages locally, a pattern like this pulls out the subdomain mozRank:

preg_match('/(subdomain_mozrank\":)(.*)(,\"subdomain_mozrank_raw)/i', $handle, $followers);
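For the user-agent case, PHP's stream contexts are one way to attach the header. A minimal sketch; the user-agent string and URL in the example call are only illustrations:

```php
<?php
// fetch_with_ua(): fetch a URL while sending a User-Agent header.
// file_get_contents() accepts a stream context as its third argument,
// and the context is where extra HTTP headers go.
function fetch_with_ua($url, $ua) {
    $context = stream_context_create(array(
        'http' => array(
            'header' => "User-Agent: " . $ua . "\r\n",
        ),
    ));
    return file_get_contents($url, false, $context);
}

// Example call; the user-agent string here is only an illustration.
// $html = fetch_with_ua('http://www.example.com/', 'Mozilla/5.0 (compatible; my-script)');
?>
```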

Rather than learning a new PHP recipe for every situation, the iMacros I shared last month will get you going immediately. If you want to take this further, study up on regular expressions.

There is a whole world of learning out there. It does not have to be PHP, either: Python, Ruby, Perl and others will all work. Do some research, browse through some tutorials and talk to your developer friends. Make the choice that is right for you.

A word of caution: when code is copied from a webpage, straight quotes are sometimes converted to smart quotes. Be sure to use straight quotes in your code.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Tom Schmitz operates Schmitz Marketing, an Internet Marketing consultancy helping brands succeed at Inbound Marketing, Social Media and SEO. You can read more from Tom at Hitchhiker's Guide to Traffic.



  • Michael Martinez

    You’re just bound and determined to teach people how to do black hat SEO tactics. What’s up with that? I would REALLY appreciate you’re not teaching people how to scrape Websites. This is highly unethical behavior. This is a really shitty thing for you to be doing.

  • Patrick Wagner

    I agree with Michael Martinez – I want Google to be even harder on those who practice black hat SEO tactics.

  • TomSchmitz

    I think you’re painting with too broad a brush. If all web scraping were unethical then Google, Bing, SEOmoz, and Majestic SEO are all unethical companies. Of course that’s patently ridiculous. Part of being an SEO professional is data collection. It’s great to purchase or subscribe to tools. But sometimes, to get the data we want, it’s best to do it ourselves. I could do it by hand or outsource with Mechanical Turk or I can make my own tool in the form of a script or macro. The work being done and the end result is the same. Using a script does not make it black hat, intent and extent does. Yes this is powerful knowledge. Yes it can be misused. So can programs like Xenu and Screaming Frog, but we don’t lock those up and throw away the key.

  • TomSchmitz

    Would it surprise you to know I agree with your sentiment? The methods and procedures I present are neither good nor bad. It all depends on how you use them.

  • Michael Martinez

    Google sends traffic. The rest of those companies ARE unethical. You’re just rationalizing your article with bullshit. There is nothing professional about scraping other people’s Websites for your own personal gain — especially when it’s for “Link Research”, which is THE most idiotic time-wasting practice people in this industry invest themselves in.

    People aren’t going to get links from these sites — they’re just going to eat up server resources and bandwidth without compensating the site owners for the inconvenience.

    Only human filth try to justify this kind of behavior.

  • Michael Martinez

    No, the methods you’re presenting are bad. They are teaching bad SEO practices (time-wasting nonsense) that are coupled with theft of bandwidth. Put a macro in your spreadsheet that looks for robots.txt and honors exclusion directives and I’ll lighten up on you.

