A Visual Dictionary For The Web



One of the most popular vertical search features on the web is image search. What’s really remarkable, however, is how little has changed in the core technology approach to indexing multimedia over the last decade. When I was the head of product at FAST back in 1999, we launched the web’s biggest image search on Lycos, with over 50 million images (which seemed like a lot at the time!). The service included many leading-edge features, including black-and-white and color image filtering, size filtering, and filetype filtering. The main differentiator today continues to be index size and freshness, and companies with the strongest technology in web content discovery have the greatest advantage. Not surprisingly, Google leads here, as its web crawling capability is far superior to anyone else’s. What is surprising, however, is how little has changed over the last decade. Audio and video aren’t significantly different in this regard. The majority of multimedia indexing today relies on the classic “titles and tags” approach, and Google Video is perhaps Google’s most underwhelming search product because of this limitation.


While text-based keyword search continues to dominate web navigation, one can see a future where search input takes multiple formats. I’ve seen many demos where you can provide a particular image of a mountain scene and get back remarkably similar images. The same is true of video. The problems with these approaches today are twofold. First, image and video processing is still incredibly resource intensive, although with Moore’s Law and cloud computing capabilities this problem seems only temporary. The bigger challenge is the lack of a “visual dictionary” on the web.
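To make the “find similar images” idea concrete, here is a minimal, hypothetical sketch of one common pixel-level technique, a perceptual “average hash.” The libraries (Pillow, NumPy), the 8×8 grid size, and the file names are assumptions for illustration only, not a description of any vendor’s system.

```python
# Illustrative sketch: a simple perceptual "average hash" for judging
# whether two images look alike. Hash size and thresholds are assumptions.
import numpy as np
from PIL import Image

def average_hash(path, size=8):
    """Shrink the image to an 8x8 grayscale grid and compare each cell
    to the mean brightness, yielding a 64-bit visual fingerprint."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def similarity(path_a, path_b):
    """Fraction of matching fingerprint bits: 1.0 means visually near-identical."""
    a, b = average_hash(path_a), average_hash(path_b)
    return float((a == b).mean())

# e.g. similarity("mountain_query.jpg", "candidate.jpg") might return ~0.9
# for two similar mountain scenes -- without ever knowing they are mountains.
```

Note that the fingerprint captures only how the image looks; nothing in it says what the image is about, which is exactly the gap described above.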

What is the visual dictionary?

The state of the art in multimedia processing today, specifically around images, is to use a pixel-mapping process to find visually similar images. The challenge with this approach is that the pixels still don’t convey the “aboutness” of the image being processed. Said another way, the computer knows the image looks like the mountain in the original image provided, but can’t tell the user it’s a mountain scene. Facial recognition has a similar problem: it can find a face similar to the one provided, and has improved to the point where it can actually find the same face rather than merely a similar one. But again, it doesn’t know the name of the person it has discovered. What’s needed is an approach I’ll call the visual dictionary. The visual dictionary would be a master metadata collection in which all of the pixel representations of an object have been tagged for their “aboutness.” This would enable many exciting possibilities (a minimal sketch follows the list below):

  • Automatic tagging of new images: The moment an image is loaded to the web, it would have a set of “best fit” tags from the visual dictionary that would describe it.
  • Finding similar: The visual dictionary would aid in the discovery of similar images online, either from a user presenting a keyword or an image as the “query.”
  • Classification: Multimedia files could be automatically dropped into taxonomies and ontologies, which for the most part rely on text-based Boolean rules.
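As promised above, here is a hypothetical sketch of how a visual dictionary could drive automatic tagging: a store mapping pixel-level fingerprints to human-meaningful tags, queried by nearest-neighbor lookup. The fingerprint function, the tiny in-memory “dictionary,” the file names, and the match threshold are all illustrative assumptions, not an actual system.

```python
# Hypothetical sketch of the "visual dictionary" idea: fingerprints that
# have already been tagged for their aboutness, used to auto-tag new images.
import numpy as np
from PIL import Image

def fingerprint(path, size=8):
    """Same 64-bit average-hash fingerprint as in the similarity sketch above."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

# The "dictionary": pixel representations already tagged by humans or machines.
visual_dictionary = [
    (fingerprint("tagged/mountain_01.jpg"), ["mountain", "landscape"]),
    (fingerprint("tagged/beach_01.jpg"), ["beach", "ocean"]),
    (fingerprint("tagged/face_01.jpg"), ["person", "portrait"]),
]

def auto_tag(path, min_match=0.75):
    """Return the tags of the closest dictionary entry, if it is close enough."""
    query = fingerprint(path)
    best_score, best_tags = 0.0, []
    for entry, tags in visual_dictionary:
        score = float((query == entry).mean())
        if score > best_score:
            best_score, best_tags = score, tags
    return best_tags if best_score >= min_match else []

# auto_tag("new_upload.jpg") -> e.g. ["mountain", "landscape"], giving the
# new image "best fit" tags the moment it hits the web.
```

The same lookup serves all three possibilities: tagging on upload, finding similar images from a keyword or an image query, and dropping files into text-rule-based taxonomies once they carry tags.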

Google has recognized this problem and has put forward a human-generated approach, similar to Amazon’s Mechanical Turk. With this approach, two anonymous people are paired to “tag” an image; the pairing helps cut down on spam and maximizes tag coverage. While this approach is likely to yield high-quality results, human-tagged approaches have not scaled particularly well in search environments. The challenge with video is even greater: at 30 frames per second, a three-minute clip creates 5,400 “images”! In the case of a video news clip, those three minutes likely cover several entirely distinct concepts, people, and places, and therefore require granular tagging to provide really robust indexing.
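A back-of-the-envelope sketch makes the video math plain; the keyframe interval here is an assumption purely for illustration of how the tagging load might be reduced in practice.

```python
# Why video tagging is harder: frame counts grow fast, so only sampled
# keyframes would realistically be pushed through a visual dictionary.
FPS = 30
CLIP_SECONDS = 3 * 60          # a three-minute news clip
KEYFRAME_EVERY_N_FRAMES = 90   # assumed: tag one frame every 3 seconds

total_frames = FPS * CLIP_SECONDS                         # 5,400 frames
frames_to_tag = total_frames // KEYFRAME_EVERY_N_FRAMES   # 60 keyframes

print(total_frames, frames_to_tag)
```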

The combination of computing power and natural language processing seems poised to create real innovation in an area that has seen little of it over the last decade.

Tom Wilde is the CEO of EveryZing, a Cambridge-based company specializing in next-generation Universal Search and video search engine optimization (video SEO). The Video Search column appears on Thursdays at Search Engine Land.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.


