A Data-Centric Approach To Identifying 404 Pages Worth Saving

A critical part of doing a site or link audit is checking to see how many 404 (page not found) pages there are in a site. I can’t tell you how many times I’ve seen an audit that lists the total number of 404 pages and advises developers to find appropriate pages to redirect these 404 pages to.

That’s no big deal if we’re talking about just 20 to 30 pages. But, when a site has 404 pages in the thousands, and you tell the developers to fix these pages, you’re going to look more than a little ridiculous. So, how can you find out which of those 404 pages are actually important?

Two of the most important metrics to look at are backlinks (so you don’t lose your most valuable links) and total landing page visits in your analytics software. You may have others, such as social metrics. Whatever metrics you choose, export them all from your tools du jour and wed them in Excel.

Gather 404 Pages

There are several different sources you can use to find your site’s 404 pages. My two favorites are Screaming Frog and Google Webmaster Tools (GWT).

To find your site’s 404s with Screaming Frog, after running a scan of the site, go to Response Codes > Filter: Client Error (4xx) > Export.

To find your site’s 404 pages with Google Webmaster Tools, go to Health > Crawl Errors > URL Errors > Web or Mobile: Not found > Download.


Strip your CSV download of everything but the list of 404 URLs, and save the file as an .xlsx file.
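If you’d rather script this step than do it by hand, here is a minimal Python sketch of the idea. It assumes the export has columns named Address and Status Code; those column names and the sample rows are assumptions for illustration, so adjust them to match your own file. It only returns the list of URLs; saving to .xlsx is left to Excel.

```python
import csv
import io


def extract_404_urls(csv_text, url_column="Address", status_column="Status Code"):
    """Return URLs whose status is 404, in file order, without duplicates.

    The column names are assumptions; change them to match your export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    urls = []
    for row in reader:
        url = (row.get(url_column) or "").strip()
        status = (row.get(status_column) or "").strip()
        # Keep only true 404s; other 4xx codes (e.g. 410) are skipped here.
        if status == "404" and url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample export, for illustration only.
sample = """Address,Status Code,Status
http://example.com/old-page,404,Not Found
http://example.com/gone,410,Gone
http://example.com/old-page,404,Not Found
http://example.com/missing,404,Not Found
"""

print(extract_404_urls(sample))
```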

Pull Landing Page Data

If you’re only responsible for SEO, you may want to restrict your export to organic traffic. In Google Analytics (GA), you’d navigate to Traffic Sources > Sources > Search > Organic > Primary Dimension: Landing Page.

But, I think that approach is a bit shortsighted. I much prefer looking for all important landing pages, which you get to by navigating to Content > Site Content  > Landing Pages.

To pull all of them, you need to look at the total number of landing pages in the bottom-right corner of the report (e.g., 1 – 10 of 441). If you have more than 500 landing pages, you’ll need to use this trick to get them all.

To get the full URL of your landing pages, you’ll need to use this technique in the third section of the post, especially if your site uses subdomains and you’re not including hostname in your content reports.
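The linked technique isn’t reproduced here. As a rough illustration of the end result only, the Python sketch below joins a hostname and a landing page path into a full URL. The function name, the default http scheme, and the sample values are all assumptions made for the example.

```python
def full_landing_url(hostname, page_path, scheme="http"):
    """Join a hostname and a page path into one full URL.

    Hypothetical helper: assumes you have exported both the hostname and
    the landing page path for each row.
    """
    hostname = hostname.strip().rstrip("/")
    path = page_path.strip()
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{hostname}{path}"


# Two subdomains with the same path become two distinct URLs.
print(full_landing_url("blog.example.com", "/pricing"))
print(full_landing_url("shop.example.com", "/pricing"))
```

Without the hostname, both rows above would show up as /pricing and could not be matched to the right 404 URL.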

Get Your Backlink Data

First, pull backlinks from your favorite tool. How to do that is outside the scope of this post, but here are links to learn more about how to use each of the tools:

Pull It All Together

Once you know that all of your URLs follow the same syntax (by either having them all start with http:// or removing the http:// from all of the URLs), you’re ready to stitch all of these metrics together using VLOOKUPs in Excel. If you’re new to VLOOKUPs, check out this introduction on the Microsoft site.
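For readers who prefer a script to a spreadsheet, here is a minimal Python sketch of the same lookup idea. It is not the Excel workflow itself: it removes the http:// or https:// prefix so every URL follows the same syntax, then looks up each 404 URL in two dictionaries. The function names and sample data are hypothetical, and where VLOOKUP would return #N/A for a missing URL, this sketch assumes a default of 0.

```python
def strip_scheme(url):
    """Remove a leading http:// or https:// so URLs share one syntax."""
    url = url.strip()
    for prefix in ("http://", "https://"):
        if url.lower().startswith(prefix):
            return url[len(prefix):]
    return url


def join_metrics(urls_404, visits_by_url, backlinks_by_url):
    """Look up visits and backlinks for each 404 URL (a VLOOKUP analogue)."""
    visits = {strip_scheme(u): v for u, v in visits_by_url.items()}
    links = {strip_scheme(u): v for u, v in backlinks_by_url.items()}
    rows = []
    for url in urls_404:
        key = strip_scheme(url)
        rows.append({
            "url": key,
            "visits": visits.get(key, 0),      # 0 when no match (assumption)
            "backlinks": links.get(key, 0),    # 0 when no match (assumption)
        })
    return rows


# Hypothetical data, for illustration only.
urls_404 = ["http://example.com/old-page", "example.com/missing"]
visits = {"example.com/old-page": 120}
backlinks = {"https://example.com/missing": 4}

for row in join_metrics(urls_404, visits, backlinks):
    print(row)
```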

Make sure you format your dataset as a table so that you can sort the data by the number of landing page visits, backlinks, or page authority — or whatever else you want to pull into the dataset.
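The same sorting can be sketched in Python. This example assumes each row is a dictionary with visits and backlinks counts (hypothetical field names and data), sorts by visits and then backlinks in descending order, and drops rows where both are zero. Dropping those rows is one reading of "worth saving," not a rule.

```python
def prioritize(rows, sort_keys=("visits", "backlinks")):
    """Keep rows with at least one nonzero metric, highest values first."""
    worth_saving = [
        r for r in rows if any(r.get(k, 0) > 0 for k in sort_keys)
    ]
    return sorted(
        worth_saving,
        key=lambda r: tuple(r.get(k, 0) for k in sort_keys),
        reverse=True,
    )


# Hypothetical dataset, for illustration only.
rows = [
    {"url": "example.com/a", "visits": 0, "backlinks": 12},
    {"url": "example.com/b", "visits": 340, "backlinks": 2},
    {"url": "example.com/c", "visits": 0, "backlinks": 0},
    {"url": "example.com/d", "visits": 340, "backlinks": 9},
]

for row in prioritize(rows):
    print(row["url"], row["visits"], row["backlinks"])
```

Changing the order of sort_keys to ("backlinks", "visits") ranks by links first, much like re-sorting the table on a different column.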

By taking this kind of data-centric approach, you can fairly easily identify the 404 pages you actually need to address and fix.

Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.



About The Author: Annie Cushing is an SEO and analytics consultant. Her areas of expertise are analytics, technical SEO, and everything to do with data: collection, analysis, and beautification. She’s on a mission to rid the world of ugly data, one spreadsheet at a time. If you just can’t get enough data visualization tips, you can check out her blog, Annielytics.com.

  • http://www.websitedoctor.com/ Alastair McDermott

    “Broken Link Checker” is a useful plugin for anyone running WordPress who wants to find, and easily remove links to 404 pages. It won’t automatically fix the problems though, only makes removing the links easier.

  • http://www.annielytics.com/ Annie Cushing

    Yeah, I don’t recommend doing that. It masks the problem and can make your posts unusable (like if you link to an important tool you talk about in your post or something like that).

  • http://twitter.com/jessemcfarlane Jesse McFarlane

    It also thrashes your server, so be judicious if you do use it.

  • http://www.websitedoctor.com/ Alastair McDermott

    Sure, but I’d always recommend running a caching plugin to avoid server thrashing.

  • http://www.websitedoctor.com/ Alastair McDermott

    Agree 100% with you that you shouldn’t just remove the link without making some sort of changes for usability, e.g. don’t leave “click here for PDF download” which doesn’t link to anything – edit the content appropriately.

  • http://twitter.com/paposhki Ernesto Badillo

    Nice info. Also, in order to get the ones indexed by Google, you can export them using the SEO Quake add-on after you search using the site: command.

  • http://www.annielytics.com/ Annie Cushing

    That’s great information to have for sure, but I didn’t include it in this analysis b/c it wouldn’t be a good determining factor for if it should be saved. The reason is you could have 404s that are still indexed that will eventually be removed and pages that aren’t in the index but will be reindexed after they’re fixed.

  • http://www.annielytics.com/ Annie Cushing

    And I, in turn, agree with you. :)

  • adam

    I’m slightly confused by your post. It does not appear you go into great detail as to how each part works with the other. E.g., how does pulling all the landing page info from GA give me any insight into which 404s need addressing? The same goes for backlinks.

    Maybe I missed the mark, but this has just confused me…..

    Would really appreciate some additional assistance.

    Thank you

  • http://twitter.com/jwdlatif Jawad Latif

    A great post indeed.

  • http://www.annielytics.com/ Annie Cushing

    Hey Adam, I look at analytics data to see which pages have gotten traffic in the past year. If the site has ecommerce tracking in place, I also look at revenue. You don’t want to lose a page that’s been getting traffic and/or making you money. And I look at inbound links b/c if a page is lost to a 404 (or 410 or anything in the 400 suite of status codes), you also lose the benefit of whatever links you had pointing to the page. When you fix those 404s you potentially reclaim whatever links were pointing to the page. Let me know if you have any other questions. I know it’s easy to get tangled up in this stuff!

  • 4u2discuss

    Annie Many times there is just a typo and the problem can be fixed by editing the link to point to the correct URL.

    Also be sure to have a good 404 landing page, asking the end-user to contact the web master with the relevant information. My host has enabled this for me in the cPanel…

    NEVER JUST REMOVE THE LINK == EDIT THE LINK AND POINT IT SOME WHERE WHERE IT WILL OPEN A PAGE. even if that page just explains that there was an error and you are working on fixing the problem, with a form to contact either your marketing staff or your public relations staff.

    Fixing 404 pages improves your SERP’s drastically, so take your time and do what needs to be done. your SERP’s are worth the effort.

  • 4u2discuss

    Thanx for a great article on issues around the page not found error (404)

    Sometimes there are typos, and one just needs to fix these, but as you say when there are many many 404 errors, then there are big issues, and they need to be sorted fast…. What many SEOP (Search Engine Optimisation Practitioners) fail to tell site owners is that these 404s slowly start to add negative SEOV (Search Engine Optimisation Value) to your work. This negative value starts off very very small, and slowly increases as the age of the 404 grows.

    Another problem is that each 404 adds to the negative SEOV score in an accumulative manner, and the SEA’s (Search Engine Algorithms) use these in more advanced ways today than they did a few years back. The issue of user happiness and related concepts of user friendly pages are becoming much more valuable in the SEA’s mathematical equations, and negative values here tend to cause sites to slowly drop out of the SERP’s (Search Engine Results Pages)

    As the age of each 404 increases so does its corresponding negative SEOV, and this is apparently connected to a complicated algorithm which has a logarithmic curve connected to the age of each individual 404 that the search engine locates within your entire domain, and adds the accumulative value into other algorithms which are used to create a number of different metrics which the SEA uses in other areas such as page rank, site rank, author rank, publisher rank and possibly many others.

    The tools you discuss will allow web site owners to locate and identify the source of these 404 requests, and your SEO team should be appointed to ensure that these requests are no longer forth coming.

    Please take note that internal requests that result in 404 errors are extremely harmful to your SERP’s and should be fixed as soon as they are discovered.

    Note for Adam EVERY SINGLE 404 should be addressed with the primary focus on all internal 404 first… as these add negative SEOV to the page that they originate from, the sub site they originate from as well as your domain as a whole.

  • http://www.annielytics.com/ Annie Cushing

    Thanks for the SEO advice. ;)

  • http://twitter.com/paposhki Ernesto Badillo

    Yeap, right about it, anyway, really nice post.

  • adam

    Thanks Annie,

    I really appreciate the help :)

    I guess I was just trying to wrap my head around the Excel part and using the VLOOKUP to assist in making this easier. I watched the videos you suggested, and will give it a go too.

    I’ll let you know if I run into any other issues.

    Thank you again!

  • adam


    Could you please provide me your lookup values?

    What is your lookup value, table array, etc.?


    P.S. I am using 404s, landing pages, revenue, goals and backlinks

    Thank you

  • http://www.annielytics.com/ Annie Cushing

    I didn’t save the pivot table because it wasn’t something I could share on the site. I’m sorry. If you want to send me an Excel sheet, I can take a look at it and help you. Then you can see what I did.

  • adam

    I would love to :)

    Thank you

    What’s the best email to reach you on?


  • Tory Reiss

    Thanks for the great SEO article. This is the first article I’ve read that accurately covers 404 issues. Great Job Annie Cushing!

  • http://www.annielytics.com/ Annie Cushing

    Thanks! Glad it helped!

  • http://www.annielytics.com/ Annie Cushing

    I disagree that every 404 page needs to be fixed. However, yes, webmasters should regularly go through their sites and make sure they don’t have broken links to other pages on their site. Screaming Frog is very useful for this purpose.

  • 4u2discuss

    Repairing the link or fixing any typos is a much better option than deleting / removing any bad links.

  • Finn_Jake

    Aside from ScreamingFrog and GWT, you can also use ColibriTool (http://colibritool.com) to locate 404 error pages :)

  • http://twitter.com/AllSearch52 AllSearch52

    Hey Annie, on a site with 15k indexed pages and 5k 404s, do you still recommend the same process? Given that 5k 404 pages is quite a few to work through. Also, any tips for avoiding 404s during site migration?

  • http://www.annielytics.com/ Annie Cushing

    I absolutely recommend this same process b/c it enables you to scale your decision-making process. All of the 404s that haven’t gotten traffic in the past year or links can be filtered out. For the most part, that is. I had a client once who had an image that was 404ing across multiple pages. When I looked into it, it had the call to action on most of their top money pages. So fixing that link was critical. As for avoiding 404s, I did a webinar with Conductor on this topic last year: http://www.conductor.com/resource-center/webinars/intelligent-site-redesign-and-migration-webinar-seer-interactive. Hope it helps!

  • http://www.annielytics.com/ Annie Cushing

    Not always. You have to do a cost-benefit analysis with any kind of SEO-oriented fixes, especially with enterprise sites.

  • 4u2discuss

    Some times by fixing a typo in a shared border or other shared include section you can fix many 404 at once, it depends on how you constructed your pages, and where your pages are extracting their data. If your pages are extracting bad data that is used to construct in page links, and these can be detected then a fix may be relatively cheap.

    That is one of the main reasons that you need to consult with an SEOP (Search Engine Optimisation Practitioner) who uses the FUFISM philosophy. That way you will ensure that 404s are dealt with instantly, to avoid the SEPA’s (Search Engines Penalty Algorithms) which have a substantial impact on your long-tail organic SERP’s (Search Engine Results Pages) which are normally your key business generating visitors who have done an extensive multi-phrase search.

    So this is an awful lot more important to your LTSEO (Long Term Search Engine Optimisation) than what you realise.

  • 4u2discuss

    Take a serious look at how your shared borders and other shared content is constructed and make sure there are no loop holes here….

    Also remember that search engines run checks seeking out 404 pages, and if located they are indexed and labeled with a variety of indicators. These indicators are used by Google and other search engines to evaluate your site, and give your site an SEO rating that is used in other calculations such as page rank, publisher rank, site rank and others.

    404s can thus have quite a serious knock-on effect on your SERP’s (Search Engine Results Pages). This may not be evident in your primary key words, but is extremely relevant when it comes to LTSEO (Long Tail Search Engine Optimisation)

  • 4u2discuss

    Hi, take care and get rid of all your 404 pages…. they are bad for your SEO.

    Also be sure that your custom made 404 landing page has all the relevant links to your admin section, your web master and other pages including your in-site search page if you have one.

    Google and other search engines check these things and they form part of the variables used to calculate your page rank, site rank, publisher rank and other metrics.

    If you are getting many 404s from strange sites, try and locate the exact page they were coming from, and see if you can get the web master in charge of the offending page to fix it. Some guys will help and others may tell you to take a hike, but you gotta try and fix this, as it is bad for your long term long tail search engine optimisation.

  • http://www.annielytics.com/ Annie Cushing

    The only acronym in this comment that’s actually an industry acronym is SERPs.

  • CloudComputingExpert

    Thanks for sharing nice information.

    Using Minalyzer, you can crawl and index a website, and get SEO related information like broken links, server response code errors, missing meta data, keywords, description, duplicate content, HTML code errors and more. Correcting those deficiencies enables your website to attain a higher rank in SEO.

    Neelam Sharma
    Minalyzer Development Team Member

