Google Using OCR To Index Scanned Documents


It used to be that, if you hoped Google would index a PDF file, you had to create a PDF that was text-based, not image-based; Googlebot couldn’t recognize the content of scanned or image-based documents. According to an announcement today, that’s no longer the case.

Google says it’s now using OCR (Optical Character Recognition) technology to read any scanned documents that it finds in PDF format:

“This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.”

Google’s announcement includes a few examples that show the OCR results in action. On a search for “repairing aluminum wiring,” the first result is a Consumer Product Safety Commission PDF that was clearly scanned as an image. You can now read the text of that image thanks to Google’s OCR and the “View as HTML” link on the search results page. As with any use of OCR, the results probably won’t be perfect, but the examples Google provides do look quite accurate.

Countless new documents are now available to searchers — documents that were never available before. On the other hand, if you’ve been scanning and uploading image-based PDFs knowing that they’d never be found by searchers — and I know people who have — you may want to rethink that strategy.



Matt McGee is the Search Engine Land Assignment Editor, and offers search marketing consulting and training to businesses of all sizes. He blogs at Small Business Search Marketing and HyperlocalBlogger.com.
