Google Using OCR To Index Scanned Documents

Published: October 30, 2008 at 7:14 pm

Read Time: 2 minutes

Written by Matt McGee

It used to be that, if you hoped Google would index a PDF file, you had to create a PDF that was text-based, not image-based; Googlebot couldn’t recognize the content of scanned or image-based documents. According to an announcement today, that’s no longer the case.

Google says it’s now using OCR (Optical Character Recognition) technology to read any scanned documents that it finds in PDF format:

This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.

Google’s announcement includes a few examples that show the OCR results in action. On a search for “repairing aluminum wiring,” the first result is a Consumer Product Safety Commission PDF that was clearly scanned as an image. You can now read the text of that scan through the “View as HTML” link on the search results page. As with any use of OCR, the output probably won’t be perfect, but the examples Google provides look quite accurate.

Countless new documents are now available to searchers — documents that were never available before. On the other hand, if you’ve been scanning and uploading image-based PDFs knowing that they’d never be found by searchers — and I know people who have — you may want to rethink that strategy.
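If you’re not sure which of your own PDFs fall into the image-only category, a rough check is possible. The sketch below is only a crude heuristic, not anything Google has described: it assumes the PDF’s dictionaries are stored uncompressed, and the function name `looks_image_only` is made up for illustration. It simply looks for image objects with no font resources in the raw bytes. Many newer PDFs compress those dictionaries, so treat the answer as a hint at best.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Crude heuristic: True if the raw bytes mention image objects but no fonts.

    Assumption: page resources are stored uncompressed. PDFs that keep their
    dictionaries in compressed object streams can fool this check, and a scan
    that already carries an OCR text layer will report False because it
    includes fonts.
    """
    # Text-based PDFs normally reference at least one font resource.
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    # Scanned pages are normally embedded as image XObjects.
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    return has_image and not has_font


# Tiny hand-made byte strings standing in for real files (illustrative only).
scanned = b"%PDF-1.4\n1 0 obj << /Type /XObject /Subtype /Image /Width 10 /Height 10 >> endobj\n"
texty = b"%PDF-1.4\n1 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj\n"

print(looks_image_only(scanned))
print(looks_image_only(texty))
```

To try it on a real file, read the file in binary mode and pass the bytes to the function. A dedicated PDF library would give a more reliable answer than this byte-level check.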

About the Author

Matt McGee

Matt McGee joined Third Door Media as a writer/reporter/editor in September 2008. He served as Editor-In-Chief from January 2013 until his departure in July 2017. He can be found on Twitter at @MattMcGee.
