The Case For Speech-to-Text Analysis In Multimedia Content Discovery
There have been several recent announcements surrounding the application of speech-to-text analysis in consumer search settings. Google announced its Political Gadget, which lets visitors search the spoken word of content within presidential candidates’ YouTube channels. Adobe plans to include speech-to-text features in future versions of its video authoring applications, such as Premiere. Sites such as WEEI and FOX Sports have been using similar tools to power search and publishing applications for their multimedia archives for some time (disclaimer: my company provides enabling technology to WEEI and FOX Sports).
Why all of the interest around speech-to-text?
First, a quick primer in speech-to-text. Speech recognition engines have been around for decades, but their use in web-scale applications is relatively new. A speech recognizer essentially “listens” to audio or video files in order to create a text transcript of the spoken word, incorporating different language models and controlled vocabularies to improve accuracy. Once processed, the transcript can be used in several downstream applications, including search, publishing, content management, and categorization. The value is obvious to web publishers: the actual content and context, or “aboutness”, of an audio or video file is captured in what is said, and it is an essential complement to the editorially provided title, description, and tags. Without the transcript, content is invisible to web crawlers and site search applications. The analogy would be Yahoo! or Google building a new search engine for web pages that indexed only the title and tags while ignoring the body text of the document: not very useful.
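To make the title-and-tags analogy concrete, here is a minimal sketch in Python. It assumes the transcripts already exist (it does no speech recognition), and the clip IDs, field names, and the helpers `tokenize`, `build_index`, and `search` are all hypothetical, invented for illustration. It builds a toy inverted index twice, once from metadata alone and once with the transcript included, to show which queries each version can answer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, fields):
    """Map each token to the set of clip IDs whose chosen fields contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field in fields:
            for token in tokenize(doc.get(field, "")):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return the clip IDs that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical clips: the metadata is generic, the transcript holds the topics.
docs = {
    "clip-1": {
        "title": "Morning sports talk",
        "tags": "sports radio",
        "transcript": "Alex Rodriguez hit two home runs last night",
    },
    "clip-2": {
        "title": "Market update",
        "tags": "business news",
        "transcript": "the mortgage meltdown continues to weigh on banks",
    },
}

metadata_index = build_index(docs, ["title", "tags"])
full_index = build_index(docs, ["title", "tags", "transcript"])

print(search(metadata_index, "Alex Rodriguez"))  # nothing found in metadata
print(search(full_index, "Alex Rodriguez"))      # found via the transcript
```

A real site search or crawler does far more (ranking, stemming, phrase matching), but the basic point holds even in this toy: a clip whose topic is only spoken cannot be matched until the spoken words are indexed.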
When thinking about the relevance of speech-to-text in a content discovery setting, it’s important to understand how multimedia content is currently discovered online. According to Hitwise, between April 2007 and April 2008 the paradigm for multimedia content discovery shifted significantly in favor of search engines:
1. Direct navigation, where content is discovered by going directly to a publisher’s web site, has remained steady over the past year, accounting for approximately 42% of referrals to online video.
2. Social/viral media, where content is discovered on sites such as MySpace and YouTube, accounted for 36% of video referrals in April 2007, dropping to 29% in April 2008.
3. Search engines, such as Yahoo!, MSN, and Google, provided 22% of referrals in April 2007, increasing to 29% in April 2008.
Why the shift? It can be explained primarily by two factors. The first is that the audience for online multimedia continues to grow, both in size and in the amount of content it consumes regularly. As video consumption goes mainstream, one would expect the web audience to rely more heavily on search engines for content discovery, just as it does for text content. In fact, it’s not uncommon for web searchers to append “audio” or “video” to their queries to bias SERPs towards multimedia content.
The second factor driving this shift is the amount of professionally produced content that is published to the web. Contrary to popular belief, most of the video-watching internet audience is not interested in consuming user-generated content. Rather, they are looking to the convenience and personalization that the web offers to provide a more engaging viewing experience for high-quality, professionally produced content. As the business model comes into focus, big media is responding by publishing more content to its web sites. Statistics from eMarketer reveal that premium content accounts for approximately 90% of monthly streams:
1. Entertainment, including movies, television shows, cartoons and trailers, makes up approximately 50% of all streamed content.
2. Infotainment, including current events, weather, sports and entertainment news, business media, and infomercials, represents approximately 40% of streamed content.
3. User generated, including amateur content, home movies, and other low-budget web-only content, makes up the remaining 10% of streamed content (while YouTube has a 37%+ share of streams viewed, the vast majority of these are apparently not streams of user-generated content).
For the purposes of content discovery, speech-to-text analysis is most relevant when applied to Infotainment content. Most people don’t “search” for Entertainment content, as they are already aware of the brand and know where to go to consume it. If I am interested in watching “The Office”, I simply go to its website. This likely accounts for most of the direct navigation referrals reported by Hitwise. Furthermore, searching within the transcript of a typical sitcom is less interesting, as the content has minimal informational value. User Generated Content, on the other hand, suffers from a lack of spoken word: typically less than 30% of UGC contains a speech audio layer. UGC is also consumed based on what’s popular, what has been shared with me, and how it is tagged. In other words, UGC is driven by “browsing”, not search.
Infotainment, by contrast, is topic-driven. Political content is about “candidates”, “vice presidential nominees”, “universal health care” and “the war in Iraq”. Sports media is about “Alex Rodriguez” and business media is about “the mortgage meltdown”. When people are interested in learning about topics, more often than not they rely on search engines to discover content. Hence, a typical text-based content site generates more than 20% of its monthly referrals from the crawler-based engines.
As audio and video gradually replace text-based content consumption on the web, consumers expect multimedia content to be readily findable within the major search engines. While this is intuitive, gaining inclusion of video in the search engines is difficult due to the Flash-based applications that house the video and the lack of text describing the content. With much of the informational value of Infotainment content trapped within the spoken word, speech-to-text applications provide a scalable and cost-effective means of unlocking content and presenting it to crawlers, placing multimedia content on an equal footing with text-based content. The same search uplift can be applied in a site search setting, where the effectiveness of content discovery is measured by “recall” (the share of relevant content that is returned) and “precision” (the share of returned content that is relevant). In this context, “recall” is impaired by the lack of descriptive text: if the content isn’t included in the result, then how could it possibly be discovered?
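For readers who want the two measures spelled out, here is a small Python sketch using the standard definitions of precision and recall. The clip IDs and result sets are made up for illustration and are not measurements from any real site, and the function name `precision_recall` is my own.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant items that exist
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four clips in the archive actually discuss the topic.
relevant_clips = {"a", "b", "c", "d"}

# Searching titles and tags only: one clip happens to name the topic.
metadata_results = {"a"}

# Searching transcripts as well: more relevant clips, plus one false match.
transcript_results = {"a", "b", "c", "e"}

print(precision_recall(metadata_results, relevant_clips))    # (1.0, 0.25)
print(precision_recall(transcript_results, relevant_clips))  # (0.75, 0.75)
```

In this invented example the metadata-only search is perfectly precise but misses three of the four relevant clips, which is the recall problem described above. Adding transcripts lifts recall at some cost in precision, since an imperfect transcript can produce false matches.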
In a future post, I’ll discuss the advantages of transcript creation using automated speech-to-text applications versus human editors, the challenge of leveraging closed captioning in online settings, and “how accurate” is “accurate enough” for a transcript to be useful.