Search Engine Land

When OCR Goes Bad: Google’s Ngram Viewer & The F-Word

Danny Sullivan on December 19, 2010 at 1:56 pm

Google launched its Google Books Ngram Viewer this week, a tool that lets you research how popular words and phrases have been over several centuries, based on their appearance in books. But can you trust it? In the case of the F-word, no — and perhaps in many other cases, as well.

I read several mainstream news stories about the viewer after it launched, including a long piece in the Wall Street Journal. Those articles were generally filled with excitement. My own reaction to the tool was more muted. I immediately wondered if the underlying data was actually that accurate.

Counting Words Often Goes Wrong

For years, I’ve seen people try to use regular search data to plot the popularity of terms and trends over time. That’s been fraught with issues, particularly when web pages carry the wrong date. I figured the Ngram Viewer might have its own issues, such as:

  • Does Google Books get the dates of some books wrong?
  • Is the distribution adjusted? That is, if you have more books in a particular year, can that cause some terms to spike?
  • Are the books “even” in subject matter? That is, do you have more scientific works scanned in one year than in another?
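
To make the distribution question concrete, here is a minimal Python sketch of one way raw counts could be adjusted: divide a term's count in each year by the total number of words scanned for that year. The function name and all of the numbers are made up for illustration; this is not a description of how Google's tool actually works.

```python
def relative_frequency(term_counts, total_counts):
    """Return the term's share of all words for each year.

    term_counts:  {year: occurrences of the term}   (hypothetical data)
    total_counts: {year: total words scanned}       (hypothetical data)
    """
    return {
        year: term_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0  # skip years with no scanned text to avoid dividing by zero
    }


# Made-up numbers: ten times more text scanned for 2000 than for 1900.
term_counts = {1900: 50, 2000: 500}
total_counts = {1900: 1_000_000, 2000: 10_000_000}

shares = relative_frequency(term_counts, total_counts)
```

In this invented example the raw count is ten times higher in 2000, yet the share of all words is the same in both years, so the apparent spike comes entirely from having more books.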

Scanning Isn’t Perfect

I hadn’t thought of an even more basic problem: OCR errors. OCR stands for optical character recognition, the technology of scanning an image of a word and recognizing it digitally as that word. It’s how Google has “read” the 5 million books that the Ngram Viewer lets you search against.

OCR isn’t perfect. Sometimes words aren’t recognized correctly. Google’s Ngram Viewer FAQ page addresses this (and covers some other issues like those I’ve raised above, and how they’re adjusted for):

Why does the word “Internet” occur before 1950?
Most of those are OCR errors; we do a good job at filtering out books with low OCR quality scores, but some errors do slip through.

What A Difference An S Makes

That leads me to the F-word. For those who are sensitive, look away. I’ll be using the full word shortly, as it’s pretty awkward to write about this particular case without using it.

Yesterday, I saw venture capitalist Dave McClure mention a tweet from Brad Feld that linked to a chart of the word “fuck” being used from the 1600s through today. Curious, I took a deeper look. Here’s the chart:

You can see these huge spikes in usage early on the chart, but then by the 1800s, usage disappears until around 1960. What happened?

Well, at the bottom of the chart, you can see different years listed. Click on one of those year segments, and you get back a listing of books that contain the word, for that time period.

For the first period, 1650-1676, this is what I got:

You can see the mentions of “fuck” highlighted in bold. You can also see that they make little sense. From one:

their desires to fuck the blood

Fuck the blood? Was that supposed to be “suck the blood?” Yes, it was. The F in most of these cases — probably all of them — is in reality an S.

The Medial S

What happened? Blame the “medial s,” also called the long s. It’s an archaic form of the letter S that looks similar to an F.

American students who have puzzled over early government documents like the Bill Of Rights and seen mentions of “Congrefs” are familiar with this (the image at the top of this article comes from an image of the Bill Of Rights from Wikipedia).

As a result, this usage of suck from the 1600s:

Is treated the same as the actual word “fuck” as written in 1991:

Google’s Ngram Viewer FAQ mentions this is a problem:

Why do I see so many misspellings like thif from pre-1800 Englifh books?

Use of the medial s.

To me, this seems like a big issue. S is a common letter in the English language. If it’s not being distinguished from F, how accurate are all these charts being produced?
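
As a rough illustration of how such misreads might be flagged, the Python sketch below treats every “f” in a scanned token as a possible misread “s” and checks whether any of the resulting spellings is a known word. The function names and the tiny vocabulary are assumptions made for this example; a real check would need a full dictionary for the period and would still produce ambiguous cases.

```python
from itertools import product


def long_s_candidates(token):
    """Yield every spelling made by turning one or more f's in token into s."""
    # Each 'f' may stay an 'f' or become an 's'; other letters are fixed.
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != token:
            yield candidate


def possible_misreads(token, vocabulary):
    """Return known words the token could be if an f was really a long s."""
    return sorted(c for c in long_s_candidates(token.lower()) if c in vocabulary)


# Tiny hypothetical vocabulary, just enough for the examples in this article.
vocabulary = {"suck", "this", "english", "congress", "fuck"}

suspects = {
    word: possible_misreads(word, vocabulary)
    for word in ["fuck", "thif", "Englifh", "Congrefs"]
}
```

Here `suspects["fuck"]` contains “suck,” which is exactly the ambiguity in the 1600s results: both spellings are real words, so a match in an old book needs to be read in context before it is counted.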

Not Found: First Written Usage Of “Fuck”

By the way, that 1991 reference about “fuck” is from Bill Bryson’s book, The Mother Tongue, where he explores the history of English. You can see in the screenshot from it above that Bryson writes that the first printed usage of the word “fuck” is in a poem by William Dunbar from 1503.

Google Books goes back that far, but ironically, it doesn’t find Dunbar’s poem with that word:

Instead, I had to do further research outside of Google Books to identify the exact work credited with the usage, “A Brash Of Wowing,” and discover that the exact spelling is “fukkit” rather than “fuck,” as you see here:

See the challenge? If you’re trying to track back to the first use of “fuck” (or any word) using the Ngram viewer, you’d better be checking for all forms of that word — and that means having a good knowledge of how language has changed, over time.
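
One way to picture checking for all forms of a word is to add up the counts for each known spelling, year by year. The Python sketch below does that for a hand-built table; the function name and the counts are invented for illustration, and the list of variant spellings is something a researcher would have to supply from their own knowledge of the language.

```python
def combined_counts(counts_by_form, forms):
    """Sum per-year counts across several spellings of the same word.

    counts_by_form: {spelling: {year: count}}   (hypothetical data)
    forms:          spellings to treat as one word
    """
    totals = {}
    for form in forms:
        for year, count in counts_by_form.get(form, {}).items():
            totals[year] = totals.get(year, 0) + count
    return dict(sorted(totals.items()))  # earliest year first


# Made-up counts; only the spellings themselves come from the article.
counts_by_form = {
    "fuck": {1960: 3, 1991: 7},
    "fukkit": {1503: 1, 2003: 2},
}

all_forms = combined_counts(counts_by_form, ["fuck", "fukkit"])
earliest_year = min(all_forms)
```

With these invented numbers, searching for “fuck” alone would put the earliest use at 1960, while including “fukkit” moves it back to 1503. The 2003 entry stands in for a modern reprint, which shows how later editions can add to the totals for an old spelling.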

Further, the task is complicated by reprints. After several searches, I couldn’t find the original printing of “A Brash Of Wowing” from the 1500s (which doesn’t surprise me, as it has to be extremely rare). But I had no problem finding copies from later dates, such as 2003. Those reprints may skew the usage of words higher, potentially, over time.

Searcher, Beware

I’m hoping that the academic researchers using this material are indeed adjusting for these and other potential traps. It would be terrible if they’re simply taking whatever numbers the Ngram viewer spits out without doing some deep analysis in each case they study.

For the casual searcher, the Ngram viewer needs to be taken with a huge grain of salt, I’d say. It’s fun. It might give you some idea of trends. But it could also be putting out data that’s all fukkit up.

Postscript: Gary Price of ResourceShelf pointed out this post from the Binder Blog that takes another look at problems with the Ngram viewer.



About The Author

Danny Sullivan
Danny Sullivan was a journalist and analyst who covered the digital and search marketing space from 1996 through 2017. He was also a cofounder of Third Door Media, which publishes Search Engine Land, Marketing Land and MarTech Today, and produces the SMX: Search Marketing Expo and MarTech events. He retired from journalism and Third Door Media in June 2017. You can learn more about him on his personal site and blog. He can also be found on Facebook and Twitter.
