Programming Data Collection For SEO Research

Tom Schmitz on May 23, 2013 at 9:57 am | Reading time: 7 minutes

Last month, I showed you three tricks I use when gathering data on websites. I used these techniques to download webpages into a local folder. In and of themselves, these procedures are not SEO; however, a search engine optimization professional working on a large or enterprise website ought to know how to do this. In this article, I’ll show you how to:

Make a list of pages inside a folder
Set up a development environment
Open webpages from a script and extract data

If you learn these procedures, I am certain you will find legitimate opportunities to use each, together or alone.

Create A List Of Files In A Directory

Mac users may wonder why I am bothering to go over how to take a list of files inside a directory and turn their names into a text list. On the Mac, you just have to:

Select all the file names in the folder and hit copy
Create an empty text file
In the menu, select Edit, then Paste and Match Style

In Windows, on the other hand, there is no easy way to do this. Here’s my recipe:

Create a text file named dir.bat
Into the file, type the line dir /b /o:en > dir.txt
Save, close, then drop this file into the directory for which you want the list of files
Double click on the file to run the script
The .bat file will make a new text file with the list of file names
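
If PHP is available (the next section covers setting it up), the same list can be produced on Windows and Mac alike. The following is a minimal sketch, not part of the recipe above; the function name listFiles and the paths in the usage comment are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the names of the files in $dir, sorted by name.
function listFiles(string $dir): array
{
    $names = [];
    $entries = scandir($dir);
    if ($entries === false) {
        return $names; // the folder could not be read
    }
    foreach ($entries as $entry) {
        // Keep regular files only; this also skips ".", ".." and subfolders
        if (is_file($dir . DIRECTORY_SEPARATOR . $entry)) {
            $names[] = $entry;
        }
    }
    sort($names);
    return $names;
}

// Usage (assumed paths): write one file name per line to dir.txt
// file_put_contents('dir.txt', implode(PHP_EOL, listFiles('C:/pages')) . PHP_EOL);
```

Unlike the .bat file, this version sorts by name only and leaves subfolders out of the list.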

Now that you have a list of files, let’s open one up and find the content you are looking for.

Set Up A PHP Environment

If the thought of setting up a PHP environment scares you, relax. All you need is a hosted website or a disk drive. A hosting account is the easiest way to go here. Most hosting accounts include PHP, so all you need is an FTP program to create a subfolder and upload your script files.

For example, I created a simple Hello World script on one of my sites. If you do not have a hosted site, you can create your own Apache and PHP environment for free with XAMPP. XAMPP installs Apache, PHP, MySQL, and some other programs that, together, create a development environment on your disk. I keep XAMPP on a thumb drive so I always have my scripts available wherever there’s a PC with a USB port. After installing XAMPP:

Visit the XAMPP directory and run xampp-control.exe
Start Apache
In a Web browser, visit http://localhost

Website spaces go into /xampp/htdocs/ as a subfolder; for example, the scripts go in /xampp/htdocs/scripts/. Copy and save the following as /xampp/htdocs/scripts/hello-world.php:

<?php
echo "<html><head><title>Hello World</title></head><body><p style=\"font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; font-size: x-large; color: #FFFFFF; background-color: #8E8E17; margin-top: 250px; padding: 25px 25px 25px 25px; text-align: center\">Hello World. Seattle Calling. You can <a href=\"javascript:history.go(-1)\">[Go Back]</a> now.</p></body></html>";
?>

To run the script, visit http://localhost/scripts/hello-world.php. Whether you put a subfolder on a hosted Web account or install XAMPP, either will function as your development environment. I prefer XAMPP on my thumb drive because I can save and execute files without having to upload them.

Extract Content From A Webpage

Enter this script into your development environment as twitter-followers.php.

<?php
$a[1]='https://twitter.com/seomoz';
$a[2]='https://twitter.com/sengineland';
$a[3]='https://twitter.com/apple_worldwide';
$a[4]='https://twitter.com/microsoft';
$a[5]='https://twitter.com/smartsheet';
foreach($a as $objectURL){
    // Download the page markup
    $handle = file_get_contents($objectURL);
    if(!$handle) die("Can't open device");
    // Capture the text between <strong> and </strong> Followers
    preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $handle, $followers);
    echo "\"" . $objectURL . "\";\"" . $followers[2] . "\"<br>";
    // Pause half a second between requests; sleep() only accepts whole seconds
    usleep(500000);
}
?>

Now, run it: http://localhost/scripts/twitter-followers.php or wherever your development environment is located. It should output a delimited list with each Twitter URL and the number of followers the account has. Here is what is happening in this script.

$a[1]='https://twitter.com/seomoz';

Lines 2 to 6 each define an array element that holds a Twitter address. Notice the [1], [2], [3], etc. This numbering makes it easy to write a formula in Excel that will write a line of PHP code for each Twitter address, such as $a[1]='https://twitter.com/seomoz';

foreach($a as $objectURL){}

This creates a loop that will go through all the $a[ ] variables you created. With each pass it assigns the contents of $a[ ] to $objectURL.

$handle = file_get_contents($objectURL);

This line downloads the page at the URL and stores its HTML markup in $handle.

preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $handle, $followers);

This is where the magic happens. preg_match is PHP's Perl-compatible regular expression match function.

It checks the pattern against the HTML markup in $handle and writes the results to the $followers array
Each set of parentheses () is a group; $followers[0] holds the whole match, and $followers[1], $followers[2], and so on hold the text matched by each group
The beginning / and ending / mark the start and end of the pattern
.* is a regular expression that matches any run of characters of any length (except line breaks), in this case the text between <strong> and </strong> Followers
The \ character is an escape. When placed before a reserved character, like / or *, it tells PHP to treat it as a literal character
The small i after the second / instructs the matching command to ignore letter case, or be case insensitive
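
To see these rules at work without downloading anything, here is a minimal sketch that runs the pattern against a made-up snippet of markup. It assumes the first group is the opening <strong> tag; the sample string and the count 1,234 are invented for illustration.

```php
<?php
// Made-up sample markup standing in for a downloaded profile page
$handle = '<li><strong>1,234</strong> Followers</li>';

// Three groups: the opening tag, the text to capture, the closing tag plus label
$found = preg_match('/(<strong>)(.*)(<\/strong> Followers)/i', $handle, $followers);

if ($found === 1) {
    // $followers[0] is the whole match; [1], [2] and [3] are the three groups
    echo $followers[2], PHP_EOL; // prints 1,234
} else {
    echo 'No match', PHP_EOL;
}
```

Checking the return value of preg_match before reading $followers avoids a warning when a page does not contain the expected markup.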

echo "\"" . $objectURL . "\";\"" . $followers[2] . "\"<br>";

This line prints the results. Notice the escape characters before the printed quotes. Also note the 2 in $followers[2]; this matches the second set of parentheses.
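
If you would rather not build the quotes and semicolons by hand, PHP's built-in fputcsv function can write the delimited rows for you. This is a sketch of an alternative, not the method used in the script above; the helper name toDelimited and the counts are made up for illustration.

```php
<?php
// Hypothetical helper: turn rows of [url, count] into semicolon-delimited text
function toDelimited(array $rows): string
{
    $out = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        // fputcsv adds quotes and escapes only where a field needs them
        fputcsv($out, $row, ';', '"', '\\');
    }
    rewind($out);
    $text = stream_get_contents($out);
    fclose($out);
    return $text;
}

// Made-up example values, not real follower counts
echo toDelimited([
    ['https://twitter.com/seomoz', '1,234'],
    ['https://twitter.com/microsoft', '5,678'],
]);
```

Because fputcsv quotes a field only when it contains the delimiter, a quote, or whitespace, the output may look slightly different from the hand-built version, but a spreadsheet will read both the same way.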

Knowing what you have learned, what do you think this line of code will do?

preg_match('/(<div class=\"fsm fwn fcg\">)(.*)( likes · )(.*)( talking about this<\/div>)/i', $handle, $followers);

If you said, “match the Facebook likes and talking-about-this counts,” you’re right. However, if you try it, it will not work. Why? Because Facebook tests for a user agent, and the script does not provide one. Here is a case where you can go back to part one, write a macro that will download the webpages to your local computer, then capture the data you want. In your script, just change the URLs to file paths.

$a[1]='c:\file-1.html';
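
With many saved pages, you can skip typing the paths and read them from the list of file names created in the first section. The following is a rough sketch under assumptions: the function name extractFromFiles, the folder, and the list file name are all hypothetical, and the pattern is whichever one fits the pages you saved.

```php
<?php
// Hypothetical helper: apply $pattern to each saved page named in $listFile
// and return [file name => captured text]. $group selects the parenthesized group.
function extractFromFiles(string $folder, string $listFile, string $pattern, int $group): array
{
    $results = [];
    // file() reads the list into an array, one file name per element
    $names = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($names === false) {
        return $results;
    }
    foreach ($names as $name) {
        $path = $folder . DIRECTORY_SEPARATOR . $name;
        if (!is_file($path)) {
            continue; // skip entries that are not files in this folder
        }
        $handle = file_get_contents($path);
        if ($handle !== false && preg_match($pattern, $handle, $m) === 1) {
            $results[$name] = $m[$group];
        }
    }
    return $results;
}

// Usage (assumed folder and list names):
// print_r(extractFromFiles('c:\pages', 'c:\pages\dir.txt', '/(<strong>)(.*)(<\/strong> Followers)/i', 2));
```

Files that do not contain the pattern, such as dir.bat or dir.txt themselves, are simply left out of the results.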

PHP is perfectly capable of sending a user agent; however, you can still be blocked for other reasons. One example is the requirement to be logged into your account on websites like Open Site Explorer.

preg_match('/(subdomain_mozrank\":)(.*)(,\"subdomain_mozrank_raw)/i', $handle, $followers);
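
On the user agent point above, here is a minimal sketch of one way to send a user agent with file_get_contents, using a stream context. The helper name buildContext, the agent string, and the example URL are placeholders I have assumed for illustration, and a site may still refuse the request for the other reasons mentioned.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header
function buildContext(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds to wait before giving up
        ],
    ]);
}

// Usage (the agent string and URL are placeholders):
// $context = buildContext('Mozilla/5.0 (compatible; MyResearchScript/1.0)');
// $handle  = file_get_contents('https://www.example.com/', false, $context);
```

The context is passed as the third argument to file_get_contents, so the rest of the script can stay as it is.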

Rather than learning a PHP technique for every situation, the iMacros I shared last month will get you going immediately. If you want to take this further, study up on PHP and regular expressions.

There is a whole world of learning out there. It does not have to be PHP, either. Python, Ruby, Perl, and others will all work. Do some research, browse through some tutorials and talk to your developer friends. Make the choice that is right for you.

A word of caution: some of the straight quotes may be changed to smart quotes. Be sure to use straight quotes in your coding.
