How To Prune The Enterprise Link Tree

As Barry Schwartz pointed out earlier this month, Google is warning sites about spammy link practices. And it's no April Fool's joke. While most of the attention has focused on affiliates, link networks and the like, enterprise sites need to take a careful look at their own link profiles.

[Photo: pruning a tree. If only it were this easy.]

But that's not easy. Instead of hundreds of links, you may be looking at thousands, tens of thousands, or more. In my experience, a moderately popular enterprise client can have 30,000-40,000 links from 2,000-3,000 domains.

You need a process, and a few tools, if you’re going to complete this task and maintain your sanity. You have to automate what you can and reduce the steps necessary for any required hand-filtering.

Do We Have To…?

The first question I usually hear from this type of client is: “Why do we even have to check our link profile? We’re a big company. We’ve accumulated lots of links over the years. We’re fine, right?”

Maybe. Maybe not. I’m not just spreading FUD here. Google has made it crystal-clear that they’re cracking down on all manner of ‘over-optimization’, both on- and offsite. Unless you know every SEO tactic that’s ever been used on your site, you need to audit your link profile.

The Tools

To run an enterprise-scale link profile audit, you’re going to need a few tools:

  1. A link database. SEOmoz's Open Site Explorer or MajesticSEO's database will work; using both works even better. Ahrefs has a new tool that's worth a look, too.
  2. Microsoft Excel. Say what you want about Microsquish. Excel is still the most kickass toolset an SEO can have. Google Spreadsheets is awesome, but Excel still has the edge. If you somehow don’t already have it, get it.
  3. WHOIS data. You'll want access to the WHOIS database, either via scripting (see the next item) or through a paid service. The ability to perform bulk WHOIS lookups will save you a lot of time, so paying a bit extra for a bulk lookup service could make sense. It's cheaper than therapy.
  4. A Web crawler of some kind. Screaming Frog or Xenu will do the trick.
  5. A scripting language. Yes, I said it again: you need to know a programming language. If you don't, OK, but this would really be a good time to learn.

The 19-Step Process

Here’s how I go about it. Of course, this is not the only way. It’s probably not even the best. I tend to find these shortcuts and design this kind of stuff on the fly.

On the other hand, this process lets me sift through 30,000+ links in less than 3 hours. Which means more Skyrim time – a win-win.

  1. Create a ‘whitelist’. That’s a list of domain names that are 100% (cough OK 90%) legitimate link sources.
  2. Grab the basic link data from Open Site Explorer and Majestic. Import both into Excel.
  3. Combine the two URL lists, including SEOMOZ Domain Authority and/or Majestic ACRank so that you have a single list of all linking URLs. Filter out any duplicates.
  4. Pull a list of unique domain names from that list. I use Python to do this. You can use Excel's Text to Columns feature, too: split the text at each "/", keep the hostname chunk, and discard the folders and query strings. That leaves you with a list of domain names.
  5. Remove any whitelisted domains.
  6. Run a WHOIS query on each domain name. Be sure to get the hostname, registrant name and status, at a minimum. Store that in Excel, too. I use Python to perform the bulk lookup. You can also send a list of domains to a paid service and they’ll do it for you.
  7. Grab the IP address of each domain. You can use NSLOOKUP to do this, if you want to get all geeky about it. There are a few tools you can add to Excel, or you can script it in Google Spreadsheets. None of this is trivial, I know. It’s the price of success – you wanted your terrifying in-house SEO job for a Fortune 100. Time to pay up!
  8. Use VLOOKUP to combine the domains, WHOIS results and Majestic/SEOmoz/ahrefs data. It's important that you have all of this in one place.
  9. Now, look for sites that share common registrants. Ignore the private domain registration companies. Yes, that’s a lot of them. But you’ll be amazed how many link networks still operate ‘in the clear’.
  10. If you find groups of sites owned by a single person or company, flag them. Why? Because multiple sites under a single owner may be part of a link network.
  11. Compare IP addresses, the same way you did registrants. If you have collections of sites under the same IP address, flag those, too.
  12. Now you should have a list of flagged domains.
  13. Grab those domains and run your Web crawler, fetching the home page of each domain. I use Python for this, saving the HTML for each page for the next few steps.
  14. Check the results for phrases that are a dead giveaway for spam: “High pagerank,” “Link building,” “Upgrade your link” and “Free link” are some of my favorites.
  15. Get a word and link count for each page. Compute the ratio of words to links. I use Python and BeautifulSoup (an HTML parser for Python) to do this.
  16. Pull all this data into your domains list.
  17. Score your domains. I use a holistic 1-10 scale: the more 'spam factors' in evidence, the higher the score. So a page that's part of a 10-domain portfolio, has spammy-sounding phrases on it and has a low ratio of words to links will get a really high score.
  18. Sort your spreadsheet by score. Then do a quick check of the worst offenders. If they’re spam, get those links removed.
  19. Repeat this process as necessary.
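
Much of the middle of this process boils down to set operations that Python's standard library handles well. The sketch below is an illustration, not my exact scripts: it assumes you've already loaded WHOIS results into a plain dict per domain (the `registrant` and `ip` keys are just an assumed shape), and it flags any domain that shares a registrant or IP address with another.

```python
# Sketch of steps 3-5 and 9-11: dedupe linking URLs into unique
# hostnames, then flag domains that share a registrant or an IP.
from collections import defaultdict
from urllib.parse import urlparse

def unique_domains(urls, whitelist=frozenset()):
    """Reduce a combined OSE/Majestic URL list to unique hostnames,
    skipping anything on the whitelist (steps 3-5)."""
    domains = set()
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in whitelist:
            domains.add(host)
    return sorted(domains)

def flag_shared_owners(domain_info):
    """domain_info maps domain -> {'registrant': ..., 'ip': ...}.
    Returns the set of domains sharing a registrant or an IP with
    at least one other domain (steps 9-11). In practice you'd also
    skip the private-registration companies before grouping."""
    groups = defaultdict(set)
    for dom, info in domain_info.items():
        if info.get('registrant'):
            groups[('registrant', info['registrant'])].add(dom)
        if info.get('ip'):
            groups[('ip', info['ip'])].add(dom)
    flagged = set()
    for members in groups.values():
        if len(members) > 1:  # several sites, one owner or one box
            flagged |= members
    return flagged
```

Feed `unique_domains` the merged URL column from your spreadsheet, run your WHOIS and IP lookups on the result, and `flag_shared_owners` gives you the flagged list from step 12.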

Getting Fancy

A few additional, easily-automated steps you can try:

  1. Use natural language processing to compare 5 blog posts on any given blog. If they have little or no relation to each other—one’s about pharmaceuticals, and the next is about vacationing in Miami, for example—that could be a spam blog.
  2. Check the writing grade level. Super-low or super-high may mean badly written, spun content.
  3. Use an automated grammar checker like Queequeg and get an error count. More errors mean a higher likelihood of spun content.
  4. Check for blog sites using default templates. That’s a sure sign of a spam blog.
  5. Check for big collections of footer links. Then look for sites that are interlinked into ‘wheels’ or whatever the link sellers are calling them now.
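
The page-level checks, from the spam-phrase scan and words-to-links ratio in the main process to the fancier content signals above, all reduce to scoring fetched HTML. Here's a standard-library-only approximation (using `html.parser` where I'd normally reach for BeautifulSoup); the phrase list is illustrative, not exhaustive.

```python
# Rough sketch of steps 14-15: scan a fetched home page's HTML for
# giveaway spam phrases and compute a words-to-links ratio.
from html.parser import HTMLParser

SPAM_PHRASES = ('high pagerank', 'link building',
                'upgrade your link', 'free link')

class LinkTextCounter(HTMLParser):
    """Count anchor tags and visible words in one pass."""
    def __init__(self):
        super().__init__()
        self.links = 0
        self.words = 0
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links += 1
    def handle_data(self, data):
        self.words += len(data.split())

def spam_signals(html):
    lower = html.lower()
    phrase_hits = sum(p in lower for p in SPAM_PHRASES)
    counter = LinkTextCounter()
    counter.feed(html)
    ratio = counter.words / counter.links if counter.links else float('inf')
    return {'phrase_hits': phrase_hits,
            'links': counter.links,
            'words_per_link': ratio}
```

Run this over each saved home page, pull the numbers into your domains list, and they become inputs to the 1-10 score.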

Accumulate Knowledge

As you do this process, save your data. Keep a list of the best and worst domains, site owners and IP blocks. It’ll make future audits far easier.
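
A tiny SQLite table is enough to make that knowledge base persist between audits. The schema below is only a suggestion; store whatever fields your scoring uses.

```python
# Persist audit results between runs (schema is a sketch, not gospel).
import sqlite3

def save_audit(db_path, rows):
    """rows: iterable of (domain, registrant, ip, score) tuples.
    Returns the number of domains now stored."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS link_audit (
        domain TEXT PRIMARY KEY,
        registrant TEXT,
        ip TEXT,
        score INTEGER)""")
    con.executemany(
        "INSERT OR REPLACE INTO link_audit VALUES (?, ?, ?, ?)", rows)
    con.commit()
    count = con.execute("SELECT COUNT(*) FROM link_audit").fetchone()[0]
    con.close()
    return count
```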

Prune, But Also Plant

Of course, don’t neglect authority-building. Your content strategy, social media strategy and branding will help you grow your authority profile even as you prune back your low-quality links.

None of this is easy. But almost all of it can be automated. Put in the time now and you can stay ahead of future Google warnings, improve SEO and build a lasting information asset for your company.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: is Chief Marketing Curmudgeon and President at Portent, Inc, a firm he started in 1995. Portent is a full-service internet marketing company whose services include SEO, SEM and strategic consulting.



  • FScharnell

    This is a great article, super great process thanks for the tips. What I really enjoyed though was:

    “On the other hand, this process lets me sift through 30,000+ links in less than 3 hours. Which means more Skyrim time – a win-win.”

    Gotta have our priorities straight. lol

    Thanks again!
    Frank Scharnell

  • Scott Krager

    Really great post to distill links down to a 1-10 score. I would say though that while much of that process can be automated, the crux of getting a link removed often remains at the hands of the site where the link was posted. So you have to ultimately rely on someone else and their timeline to get the link removed. 

    It’s funny, I think the better the link, the easier it probably is to get removed. But the crappier the link (think comment spam or something like that) the more challenging it can be to get removed.

  • Notify Me Now

    What methods do you use to get your link removed?  What if the spammer simply says no..?

  • Adam Machado

    Very nice techniques, but a waste of time.  Any link that Google thinks is artificial/spammy would just be devalued anyways.  There is no need/benefit to go about trying to remove them.  Getting the unnatural links detected message means nothing.  Google has been finding and devaluing untrusted links for a very long time.  All the notification is meant to do is scare webmasters.  Fear is the only tool Google has against “unnatural link building”

    The argument, that some are making, that it is a good preventative measure to remove spammy looking links is poor advice at best.  All you are doing is hurting yourself. 

    Google doesn’t penalize a website purely for having some “spammy” looking links.  Any ranking losses that are experienced would be from previously valued links being devalued.  A short 30-90 day trust loss can occur if there are a ton of “unnatural links” detected but that almost never happens to sites with other “Natural” looking links, social signals, quality content etc…

    If you are an affiliate site with fluff content, zero natural links, no social signals, etc… Than yes you will probably be “penalized” if a ton of “Unnatural” links are found.  But again, this has been the way of the world for many many years. 

  • Shahzad Hassan Butt

    What a great post. Kind of research approach. A wonderful bunch of tools for better analysis. A complete SOP for Pruning the links. And a great warning, ”
     improve SEO and build a lasting information asset for your company. ” 

    thumbs up and hats off

  • Dannish Abby

    Excellent analysis and words .you have nicely covered  the warning part which is a must now a days to be saved from getting spammed or earn negativity . great work keep it up

  • SEM_Blog

     I totally agree. I recently created a post on how google is causing unneccessary panic among seos. What do you think about it:

  • Kit Pierce

    SEO blogging par excellence! Literally one of the best articles I’ve read in months. Thanks for the advice Ian; I hope I don’t need it any time soon. ^_^
    Oh and…. Skyrim ftw.

  • Ian hanson

     This paranoia is hurting the SEO community. I do not understand why somebody (Matt Cutts for example) comes out with a proper explanation of ‘over-optimization’.

  • Vengat Owen

    Nice post and useful information to make up. 

