A Visual Dictionary For The Web

One of the most popular vertical search features on the web is image search. What’s really remarkable, however, is how little the core technology approach to indexing multimedia has changed over the last decade. When I was the head of product at FAST back in 1999, we launched the web’s biggest image search on Lycos with over 50 million images (which seemed like a lot at the time!). The service included many leading-edge features, including black-and-white and color image filtering, size filtering, and file type filtering. The main differentiator today continues to be index size and freshness, and companies with the strongest technology in web content discovery have the biggest advantage. Not surprisingly, Google leads here, as its web crawling capability is far superior to anyone else’s on the web. Audio and video aren’t significantly different in this regard. The majority of multimedia indexing today relies on the classic “titles and tags” approach, and Google Video is perhaps Google’s most underwhelming search product because of this limitation.

While text-based keyword search continues to dominate web navigation, one can see a future where the search input takes multiple formats. I’ve seen many demos where you can provide a particular image of a mountain scene and get back remarkably similar images. The same is true of video. The problems with these approaches today are twofold. First, image and video processing is still incredibly resource-intensive, although with Moore’s Law and cloud computing capabilities this problem seems only temporary. The bigger challenge is the lack of a “visual dictionary” on the web.

What is the visual dictionary?

The state of the art in multimedia processing today, specifically around images, is to use a pixel mapping process to find visually similar images. The challenge with this approach is that the pixels still don’t convey the “aboutness” of the image being processed. Said another way, the computer knows a result looks like the mountain scene in the original image provided, but it can’t tell the user that it’s a mountain scene. Facial recognition has a similar problem. It can find a face similar to the one provided, and it has improved to the point where it can actually find the same face rather than merely a similar one. But again, it doesn’t know the name of the person it has found. What’s needed is an approach I’ll call the visual dictionary. The visual dictionary would be a master metadata collection in which the pixel representations of an object are tagged for their “aboutness.” This would enable many exciting possibilities:

  • Automatic tagging of new images: The moment an image is uploaded to the web, it would receive a set of “best fit” tags from the visual dictionary that describe it.

  • Finding similar: The visual dictionary would aid in the discovery of similar images online, whether the user presents a keyword or an image as the “query.”

  • Classification: Multimedia files could be automatically dropped into taxonomies and ontologies, which today rely for the most part on text-based Boolean rules.
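
To make the idea concrete, here is a minimal sketch in Python of how a visual dictionary lookup might work. Everything in it is an assumption for illustration: the three-number feature vectors stand in for whatever pixel-derived representation a real system would compute, and the names VISUAL_DICTIONARY, cosine_similarity, and best_fit_tags are hypothetical. It is a toy nearest-neighbor vote, not a description of any production system.

```python
import math
from collections import Counter

# Hypothetical visual dictionary: each entry pairs a feature vector
# (a stand-in for a pixel-derived representation) with an "aboutness" tag.
VISUAL_DICTIONARY = [
    ([0.9, 0.1, 0.2], "mountain"),
    ([0.8, 0.2, 0.3], "mountain"),
    ([0.1, 0.9, 0.4], "beach"),
    ([0.2, 0.8, 0.5], "beach"),
    ([0.3, 0.3, 0.9], "face"),
]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_fit_tags(features, k=3):
    """Suggest tags for a new image from its k most similar dictionary entries.

    Tags are ordered by how many of the k nearest entries carry them.
    """
    ranked = sorted(
        VISUAL_DICTIONARY,
        key=lambda entry: cosine_similarity(features, entry[0]),
        reverse=True,
    )
    votes = Counter(tag for _, tag in ranked[:k])
    return [tag for tag, _ in votes.most_common()]


# A new, untagged image whose (assumed) features resemble the mountain entries.
suggested = best_fit_tags([0.85, 0.15, 0.25])
print(suggested)
```

The same dictionary could serve the “finding similar” case in the other direction: a keyword query would select the entries carrying that tag, and their feature vectors would then be used to find look-alike images.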

Google has recognized this problem and has put forward a human-generated approach, similar to Amazon’s Mechanical Turk. With this approach, two anonymous people are paired together to “tag” an image. The pairing helps cut down on spam and maximizes tag coverage. While this approach is likely to yield high-quality results, human-tagged approaches have not scaled particularly well in search environments. The challenge with video is even greater. At 30 frames per second, a three-minute clip creates 5,400 “images”! In the case of a video news clip, the three minutes likely cover several entirely distinct concepts, people, and places, and therefore require granular tagging to provide truly robust indexing.
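
Both points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I am assuming a tag is accepted when both paired taggers independently supply it, which is one plausible reading of how pairing cuts down on spam, and the function names agreed_tags and frame_count are hypothetical.

```python
def agreed_tags(tags_a, tags_b):
    """Keep only the tags that both paired taggers supplied.

    Assumption: agreement between two independent people is the spam
    filter. Tags are compared case-insensitively, ignoring outer spaces.
    """
    normalized_a = {tag.strip().lower() for tag in tags_a}
    normalized_b = {tag.strip().lower() for tag in tags_b}
    return sorted(normalized_a & normalized_b)


def frame_count(duration_seconds, frames_per_second=30):
    """Number of still frames in a clip of the given length."""
    return duration_seconds * frames_per_second


# Two taggers describe the same image; only the overlap is kept.
print(agreed_tags(["Mountain", "snow", "buy cheap pills"], ["snow", "mountain", "sky"]))

# A three-minute clip at 30 frames per second.
print(frame_count(3 * 60))
```

The arithmetic shows why human tagging struggles with video: even if only a small fraction of those frames needed distinct tags, a single short clip would demand far more effort than a single image.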

The combination of computing power and natural language processing seems poised to create real innovation in an area that has seen little of it over the last decade.

Tom Wilde is the CEO of EveryZing, a Cambridge-based company specializing in next-generation Universal Search and video search engine optimization (video SEO). The Video Search column appears on Thursdays at Search Engine Land.

Opinions expressed in the article are those of the guest author and not necessarily those of Search Engine Land.


