One of the most frustrating things about technical problems with a site is that the ways they show up in search engines are usually unexpected or subtle. What looks like a penalty can actually be a problem introduced with a new version or new feature of a website.
Because the true causes of problems like these are usually not at all obvious, they can lead to hypotheses that border on the paranoid (“Google doesn’t like my site”) or to wild speculation (“I was put in the sandbox and then hit with Panda. I call it the Pandbox.”).
Since Google isn’t alive and doesn’t have emotions (yet), we can safely set aside (for now) any search engine anthropomorphizing and focus on finding root causes that may be lurking in the site’s technical infrastructure.
Symptoms: Fewer Pages In The Index, Drop In Long Tail Traffic
The main causes for problems with site coverage include duplicate content, allowing pages with no SEO value to be crawled, and network problems.
Duplicate content occurs at the site level when an entire site can be reached at more than one address, so that every page is available through multiple URLs.
Duplicate content can also happen at the page level, when a page is available at multiple URLs like this:
Both types of duplicate content reduce the number of pages in the index because search engines are wasting their time crawling multiple copies of a website or a page.
Search engines throw away these extra copies because there is no point in including redundant pages in the index. This means that time that could have been spent crawling more pages on your site was instead wasted on extra copies of pages that won’t be used anyway.
For the example pages above, that site would have to be crawled at least five times to get each page of the site.
If you have a duplicate site, you can use a 301 redirect to permanently send any visitors to the main site.
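As a rough illustration of the site-level redirect, here is a minimal sketch written as Python WSGI middleware. The hostname www.example.com and the function name redirect_duplicate_hosts are assumptions made for the example; in practice this kind of redirect is often configured in the web server or load balancer instead.

```python
from urllib.parse import quote

# Assumption: the hostname you have chosen as the main site.
CANONICAL_HOST = "www.example.com"


def redirect_duplicate_hosts(app, canonical_host=CANONICAL_HOST):
    """Wrap a WSGI app so requests for any other hostname get a 301."""

    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "").lower()
        if host and host != canonical_host:
            scheme = environ.get("wsgi.url_scheme", "http")
            # PATH_INFO arrives URL-decoded, so re-quote it for the header.
            path = quote(environ.get("PATH_INFO", "/"))
            query = environ.get("QUERY_STRING", "")
            location = f"{scheme}://{canonical_host}{path}"
            if query:
                location += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", location), ("Content-Length", "0")],
            )
            return [b""]
        # Requests for the main hostname pass through untouched.
        return app(environ, start_response)

    return middleware
```

The path and query string are preserved so that a visitor or crawler arriving at a duplicate address lands on the matching page of the main site rather than on its homepage.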
Fixing duplicate content at the page level is a bit trickier.
Select one canonical URL from each set of potential duplicate URLs and make sure that each duplicate URL permanently redirects to the canonical one. If this isn’t possible – for example, due to tracking parameters like referral_id=1 above – use a link rel=canonical tag that points to the canonical URL and configure Bing and Google webmaster tools to ignore the appropriate parameters.
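One way to find candidate duplicates is to normalize a list of crawled URLs and see which ones collapse to the same address. The sketch below assumes that referral_id is the only tracking parameter and that hostnames differ only by letter case; the function names and the example.com URLs in the test data are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that only track visitors and never change the page.
TRACKING_PARAMS = {"referral_id"}


def canonical_url(url, tracking_params=TRACKING_PARAMS):
    """Return the URL with tracking parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each canonical URL to the crawled URLs that collapse onto it,
    keeping only the groups with more than one member."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return {canon: members for canon, members in groups.items() if len(members) > 1}
```

Each group that comes back is a set of URLs that should either redirect to, or declare a rel=canonical pointing at, the same canonical URL.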
Diagnosing Crawl Inefficiencies
Allowing pages with no value to be crawled means that the search engines are spending valuable resources crawling things like API calls, log files, or pages with an infinite number of combinations like a web calendar.
Similar to duplicate content, crawl inefficiency means that search engines are crawling useless pages, at the expense of pages that you would like crawled.
These zero-value pages aren’t going to lead to any conversions, even if search engines do index them or rank them for anything.
To fix these problems, use the robots.txt file to exclude these types of pages from crawling. Be sure to test any changes to your robots.txt file in Google Webmaster Tools before pushing them live.
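As a quick local sanity check before testing in the search engines’ own tools, Python’s standard library can parse a draft robots.txt and report which URLs it blocks. The /api/, /logs/ and /calendar/ paths below are assumptions chosen to match the examples in the text, and the standard library parser may not interpret every rule (wildcard patterns, for instance) the way a given search engine does.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a draft robots.txt excluding the kinds of pages described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Disallow: /logs/
Disallow: /calendar/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

Running this against a sample of URLs you do want crawled is a cheap way to catch a rule that blocks more than intended.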
Networking problems can be very elusive. Most of the networking problems I have seen involve either load balancing or DNS.
Load balancing is used on larger sites to spread web requests among a number of back end servers. Sometimes it is misconfigured so that most of the crawler requests go to a single back end server, which eventually slows to a crawl.
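If your access logs record which back end server handled each request, a rough way to spot this kind of skew is to count crawler requests per server. The sketch below assumes the logs have already been parsed into (user agent, back end name) pairs; that format, the server names, and the function name are all hypothetical.

```python
from collections import Counter


def crawler_share_by_backend(requests, crawler_token="Googlebot"):
    """Return each back end server's share of crawler requests.

    `requests` is an iterable of (user_agent, backend) pairs, an assumed
    format produced by whatever log parsing you already have.
    """
    counts = Counter(
        backend
        for user_agent, backend in requests
        if crawler_token.lower() in user_agent.lower()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {backend: count / total for backend, count in counts.items()}
```

With evenly balanced traffic, the shares should be roughly equal. One server taking the large majority of crawler requests is worth raising with whoever manages the load balancer.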
DNS problems can make a website unnecessarily slow for first-time visitors or, in extreme cases, make it intermittently unavailable.
You can easily check your DNS configuration with an online tool like IntoDNS. Checking the load balancers or other aspects of the back end network is not so easy, so it’s probably best to ask a network engineer about any recent changes to the infrastructure.
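Alongside an online checker, you can get a rough feel for slow or intermittent DNS by timing repeated lookups yourself. This is only a sketch: the function name is hypothetical, and because your operating system or network may cache answers, fast results here do not prove that first-time visitors see the same thing.

```python
import socket
import time


def time_dns_lookups(hostname, attempts=5, resolver=socket.getaddrinfo):
    """Resolve hostname several times.

    Returns (durations, failures): the duration in seconds of each
    successful lookup, and the number of lookups that failed.
    """
    durations = []
    failures = 0
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, 80)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures += 1
        else:
            durations.append(time.perf_counter() - start)
    return durations, failures
```

Any failures at all, or durations that vary wildly between attempts, are a hint that the DNS setup deserves a closer look.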
Symptoms: Wrong Pages Ranking, Decline In Ranking
These symptoms are usually caused by duplicate copies of important pages or by search engines not being able to understand the linking structure of your site.
Duplicate content can have a negative effect on ranking because inbound links to a particular page – a very important signal for search engines – are spread out among different URLs. As a result, the search engine counts only the inbound links that point to the one copy of the page that it decides to keep.
Make sure that all of the intended inbound links count towards the page by fixing these duplicate URLs as described above.
Another important signal for search engines is how a page is linked within a site. For example, a page with a link from the homepage will be considered a more important page than a page that is orphaned on the site with no links.
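To see how your own pages sit in the site’s linking structure, you can compute how many clicks each page is from the homepage and which pages cannot be reached by links at all. The sketch below assumes you already have a crawl exported as a mapping from each page to the pages it links to; the function names and that data format are hypothetical.

```python
from collections import deque


def click_depths(links, homepage):
    """Breadth-first search from the homepage.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> minimum number of clicks from the homepage.
    """
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def unreachable_pages(links, all_pages, homepage):
    """Pages known to exist (for example, from a sitemap) that no chain
    of links from the homepage reaches."""
    return sorted(set(all_pages) - set(click_depths(links, homepage)))
```

Important pages that turn up many clicks deep, or in the unreachable list, are candidates for a link from the homepage or from another prominent page.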
Investigate Before You Make Assumptions
This is not a complete list of root causes for indexing issues and traffic loss, but it does contain the most common issues that I have seen with sites that I have been asked to review.
Other causes of similar symptoms are page speed, cache unfriendliness, internationalization issues, server misconfigurations, and security vulnerabilities. Each one is worthy of an article in itself.
I hope this provides some additional ideas of where to hunt down causes of particularly vexing problems with the way your site is performing in search.
Fortunately, it is much easier to redirect a duplicate copy of your site or fix a DNS misconfiguration than it is to influence Google’s or Bing’s algorithms.
While search engines definitely penalize some sites and it is possible for a site to get caught up in algorithm changes, make sure you have thoroughly reviewed your technical architecture before jumping to any conclusions about what search engines don’t “like” about it.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.