Google Using OCR To Index Scanned Documents

It used to be that, if you hoped Google would index a PDF file, you had to create a PDF that was text-based, not image-based; Googlebot couldn’t recognize the content of scanned or image-based documents. According to an announcement today, that’s no longer the case.

Google says it’s now using OCR (Optical Character Recognition) technology to read any scanned documents that it finds in PDF format:

This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.

Google’s announcement includes a few examples where you can see the results of OCR scanning in action. On a search for “repairing aluminum wiring,” the first result is a Consumer Product Safety Commission PDF that was clearly scanned as an image. You can now get the text of that image thanks to Google’s OCR scanning and the “View as HTML” link on the search results page. As with any use of OCR, the results probably won’t be perfect, but the examples Google provides do look quite accurate.

Countless new documents are now available to searchers — documents that were never available before. On the other hand, if you’ve been scanning and uploading image-based PDFs knowing that they’d never be found by searchers — and I know people who have — you may want to rethink that strategy.



About The Author: Matt McGee is Editor-In-Chief of Search Engine Land. His news career includes time spent in TV, radio, and print journalism. His web career continues to include a small number of SEO and social media consulting clients, as well as regular speaking engagements at marketing events around the U.S. He blogs at Small Business Search Marketing and can be found on Twitter at @MattMcGee and on Google Plus. You can read Matt's disclosures on his personal blog.


