Google Using OCR To Index Scanned Documents


It used to be that, if you hoped Google would index a PDF file, you had to create a PDF that was text-based, not image-based; Googlebot couldn’t recognize the content of scanned or image-based documents. According to an announcement today, that’s no longer the case.

Google says it’s now using OCR (Optical Character Recognition) technology to read any scanned documents that it finds in PDF format:

This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.

Google’s announcement includes a few examples where you can see the results of OCR scanning in action. On a search for “repairing aluminum wiring,” the first result is a Consumer Product Safety Commission PDF that was clearly scanned as an image. You can now get the text of that image thanks to Google’s OCR scanning and the “View as HTML” link on the search results page. As with any use of OCR, the results probably won’t be perfect, but the examples Google provides look quite accurate.

Countless new documents are now available to searchers — documents that were never available before. On the other hand, if you’ve been scanning and uploading image-based PDFs knowing that they’d never be found by searchers — and I know people who have — you may want to rethink that strategy.



Matt McGee is the Search Engine Land Assignment Editor, and offers search marketing consulting and training to businesses of all sizes. He blogs at Small Business Search Marketing and HyperlocalBlogger.com.




