Search Engine Land


3 Steps To Find And Block Bad Bots

Is your Web analytics data being skewed by bot visits to your site? If so, columnist Ben Goodsell has the solution.

Ben Goodsell on August 31, 2015 at 9:49 am

Most SEOs have heard about using log files to understand Googlebot behavior, but few seem to know that the same files can be used to identify bad bots crawling their sites. Increasingly, these bots execute JavaScript, inflate analytics numbers, consume server resources, and scrape and duplicate content.

The Incapsula 2014 bot traffic report looked at 20,000 websites (of all sizes) over a 90-day period and found that bots accounted for 56% of all website traffic, with malicious bots alone making up 29% of all traffic. The report also found that the more you build your brand, the larger a target you become.

[Image: distribution of bad and good bot traffic]

While there are services that automate much more advanced techniques than those shown here, this article is meant to be an easy starting point (using Excel) for understanding the basics of using log files, blocking bad bots at the server level and cleaning up analytics reports.

1. Find Log Files

Web servers keep a log of every request made to the sites they host. Whether a customer is using the Firefox browser or Googlebot is looking for newly created pages, all activity is recorded in a simple file.

The location of these log files depends on the type of server or host you have. Here are some details on common platforms.

  • cPanel: A common interface for Apache hosts (seen below); it makes finding log files as easy as clicking a link.
[Image: log files for SEO and bad bots]
  • Apache: Log files are typically found in /var/log and its subdirectories; running the command locate access.log will also quickly find server logs.
  • IIS: On Microsoft servers, logging can be enabled and configured in the Internet Services Manager. Go to Control Panel -> Administrative Tools -> Internet Services Manager -> Select website -> Right-click then Properties -> Website tab -> Properties -> General Properties tab.
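
For readers who would rather script the search than run locate, here is a minimal Python sketch of the same idea. The function name find_access_logs and the default pattern access.log* are assumptions for illustration; some hosts name their logs differently (for example access_log), so adjust the pattern to match your server.

```python
from pathlib import Path


def find_access_logs(root, pattern="access.log*"):
    """Return sorted paths under `root` whose file name matches `pattern`.

    `pattern` is a glob, so the default also picks up rotated files
    such as access.log.1.
    """
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


# Example call (reading /var/log may require elevated permissions):
# for path in find_access_logs("/var/log"):
#     print(path)
```

The search is recursive, so it covers the subdirectories of whatever starting folder you pass in.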

2. Identify Number Of Hits By IP & User Agents

Once you have found the log files, consolidate them, then open the result in Excel (or your preferred tool). Because some log files are very large, this is often easier said than done. For most small to medium sites, a computer with plenty of processing power should be sufficient.

Below, .log files were manually consolidated into a new .txt file using a plain text editor, then opened in Excel using text-to-columns and a “space” delimiter, with a little additional cleanup to get the column headers to line up.

[Image: consolidated log files for SEO and bad bots]
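
If the files are too large for Excel, the consolidation and column-splitting step can be sketched in Python instead. This is not the method used in the article; it assumes your server writes the Apache "combined" log format (IIS and custom formats will need a different pattern), and the names COMBINED, parse_line and read_logs are hypothetical.

```python
import re

# Assumed layout (Apache "combined" format):
# IP identd user [time] "request" status bytes "referer" "user agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def read_logs(paths):
    """Yield parsed records from several log files, skipping unparseable lines."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                record = parse_line(line)
                if record is not None:
                    yield record
```

Unlike a plain "space" delimiter, the pattern keeps quoted fields such as the user agent in a single column, which removes most of the manual cleanup. Lines containing escaped quotes inside a quoted field will not match this simple pattern and are skipped.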

Find Number of Hits by IP

After consolidating and opening logs in Excel, it’s fairly easy to find the number of hits by IP.

To do this:

  1. Create a Pivot Table with Client IP as the row label and a count of Client IP as the value.
  2. Copy and paste the results, rename the column headers to Client IP and Hits, sort by Hits in descending order, then insert a User Agent column to the right of Hits.
[Image: log files for SEO and bad bots, client IP pivot tables]
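
The pivot-table count above can also be sketched in a few lines of Python. This assumes each log line has already been parsed into a dict with a hypothetical ip key; the sample records use reserved documentation addresses, not real traffic.

```python
from collections import Counter


def hits_by_ip(records):
    """Count requests per client IP, like the pivot table's count of Client IP."""
    return Counter(record["ip"] for record in records)


# Hypothetical sample records for illustration only.
records = [
    {"ip": "203.0.113.7", "user_agent": "-"},
    {"ip": "198.51.100.4", "user_agent": "Mozilla/5.0"},
    {"ip": "203.0.113.7", "user_agent": "-"},
]

# most_common() returns (ip, hits) pairs sorted by hits, highest first.
for ip, hits in hits_by_ip(records).most_common():
    print(ip, hits)
```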

Find User Agents By IP

As a final step in identifying potential bad bots, find which user agents are associated with IPs hitting your site the most. To do this, go back to the pivot table and simply add the User Agent to the Row Label section of the Pivot Table.

Now, finding the User Agent associated with the top-hitting IP is as simple as a text search. In this case, the IP (which was from China) declared no User Agent and hit the site over 80,000 more times than any other IP.

[Image: log files for SEO and bad bots, top-hitting IP with no user agent]
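
As a rough scripted equivalent of adding User Agent to the pivot table, the sketch below groups user agents under each IP and lists the busiest IPs first. It assumes parsed records with hypothetical ip and user_agent keys; the function names are illustrative.

```python
from collections import Counter, defaultdict


def user_agents_by_ip(records):
    """Map each client IP to a Counter of the user agents it sent."""
    agents = defaultdict(Counter)
    for record in records:
        # Apache's combined format logs a missing user agent as "-".
        agents[record["ip"]][record["user_agent"]] += 1
    return agents


def top_offenders(records, limit=10):
    """Return (ip, hits, user_agents) tuples for the busiest IPs, highest first."""
    agents = user_agents_by_ip(records)
    totals = Counter({ip: sum(c.values()) for ip, c in agents.items()})
    return [(ip, hits, sorted(agents[ip])) for ip, hits in totals.most_common(limit)]
```

An IP near the top of this list whose only user agent is "-" is the kind of candidate the article describes, though a high hit count on its own does not prove a visitor is malicious.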

3. Block IPs From Accessing Site And Displaying In Analytics

Now that the malicious IP has been identified, use the instructions below to keep it from inflating your Analytics numbers, then block it from accessing the site completely.

Blocking An IP In Analytics

Using Filters in Google Analytics, you can exclude IPs. Navigate to Admin -> Choose View (always a good idea to Create New View when making changes like this) -> Filters -> + New Filter -> Predefined -> Exclude traffic from the IP addresses -> Specify IP (regular expression).

[Image: log files for SEO and bad bots, exclude IP in Google Analytics]
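
Because the filter field takes a regular expression, dots in an IP address should be escaped; otherwise a dot matches any character. The helper below is a minimal sketch for building such a pattern from a list of IPv4 addresses. The function name ip_exclude_pattern is hypothetical, and the addresses are reserved documentation examples.

```python
import re


def ip_exclude_pattern(ips):
    """Build one anchored regex that matches any of the given IPv4 addresses exactly."""
    escaped = "|".join(re.escape(ip) for ip in ips)
    # ^ and $ stop 203.0.113.7 from also matching 203.0.113.70.
    return f"^({escaped})$"


pattern = ip_exclude_pattern(["203.0.113.7", "198.51.100.4"])
print(pattern)
```

The printed pattern can be pasted into the filter's IP field; test it against a few known addresses before relying on it.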

Tip: Google Analytics can automatically exclude known crawlers identified by the IAB (a $14,000 value for non-members). Just navigate to Admin -> View Settings, and under “Bot Filtering,” check “Exclude all hits from known bots and spiders.” It’s always a best practice to create a new view before altering profile settings.

If you use Omniture, there are three methods to exclude data by IP.

  1. Exclude by IP. Excludes hits from up to 50 IPs.
  2. Vista Rule. For companies that need more than 50.
  3. Processing Rule. It’s possible to create a rule that prevents showing data from particular IPs.

Blocking An IP At The Server Level

As with locating log files, the method for blocking IPs from accessing your site at the server level depends on the type of server you use.

  • cPanel: Using the IP Address Deny Manager, IPs can be blocked and managed on an ongoing basis.
[Image: IP Deny Manager]
  • Apache: mod_authz_host is the recommended module for this, but an .htaccess file can also be used.
  • IIS: Open IIS Manager -> Features View -> IPv4 Address and Domain Restrictions -> Actions Pane -> Add Deny Entry.
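
For the Apache option, the sketch below generates a block of mod_authz_host directives in Apache 2.4 syntax (older 2.2 servers use a different Order/Deny syntax). The function name apache_deny_block is hypothetical, the address is a reserved documentation example, and where the output belongs (a Directory section or an .htaccess file) depends on your own configuration.

```python
import ipaddress


def apache_deny_block(ips):
    """Render an Apache 2.4 <RequireAll> block that denies the given IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    for ip in ips:
        # ip_address() raises ValueError on malformed input,
        # so a typo never makes it into the server config.
        lines.append(f"    Require not ip {ipaddress.ip_address(ip)}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


print(apache_deny_block(["203.0.113.7"]))
```

Validate the resulting configuration on a staging server first; a mistake in access rules can lock out legitimate visitors.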

Conclusion

Third-party solutions route all traffic through a network to identify bots (good and bad) in real time. They don’t just look at IPs and User Agent strings, but also HTTP headers, navigational site behavior and many other factors. Some sites are using methods like reCAPTCHA to ensure their site’s visitors are human.

What other methods have you heard of that can help protect against the “rise of the bad bots”?


Opinions expressed in this article are those of the guest author and not necessarily those of Search Engine Land. Staff authors are listed here.



About The Author

Ben Goodsell
Ben Goodsell is a lead SEO for RKG Merkle, a results-oriented digital marketing company. With deep experience in technical SEO, social media, link building, and content strategy, Ben has worked with some of the largest sites and brands on the web. Prior to RKG, Ben was the senior SEO technician for AudetteMedia, a leading boutique SEO agency.

© 2021 Third Door Media, Inc. All rights reserved.
