Google Using OCR To Index Scanned Documents

It used to be that, if you hoped Google would index a PDF file, you had to create a PDF that was text-based, not image-based; Googlebot couldn’t recognize the content of scanned or image-based documents. According to an announcement today, that’s no longer the case.

Google says it’s now using OCR (Optical Character Recognition) technology to read any scanned documents that it finds in PDF format:

This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.

Google’s announcement includes a few examples where you can see the results of OCR scanning in action. In a search for “repairing aluminum wiring,” the first result is a Consumer Product Safety Commission PDF that was clearly scanned as an image. You can now get the text of that image thanks to Google’s OCR scanning and the “View as HTML” link on the search results page. As with any use of OCR, the results probably won’t be perfect, but the examples Google provides do look quite accurate.

Countless new documents are now available to searchers — documents that were never available before. On the other hand, if you’ve been scanning and uploading image-based PDFs knowing that they’d never be found by searchers — and I know people who have — you may want to rethink that strategy.


About The Author: Matt McGee is Search Engine Land's Executive News Editor, responsible for overseeing our daily news coverage. His news career includes time spent in TV, radio, and print journalism. His web career continues to include a small number of SEO and social media consulting clients, as well as regular speaking engagements at marketing events around the U.S. He blogs at Small Business Search Marketing and can be found on Twitter at @MattMcGee and on Google Plus.

