DeepDyve Explores The Invisible Web

As web search engines have improved over the years, there’s been less attention paid to an “inconvenient truth” about the indexes of our favorite information finding tools—namely, that search engines still miss the lion’s share of information available on the web. This so-called “deep web” remains largely impenetrable to search engines for a variety of reasons, and for many types of queries that’s just fine. But if you’re a serious searcher, looking for the best information possible, you can’t afford to overlook this vast “hidden” store of information.

And that’s a challenge, because search tools that probe the deep web are for the most part either obscure or fee-based. That’s changing, thanks to a company formerly known as Infovell and now called DeepDyve. The eponymous search engine rolls out today with an innovative approach to finding invisible web content. Despite limited coverage at the outset, it impressed me with both what it finds and the tools it offers to make the searching experience even richer.

DeepDyve’s approach is like no other I’ve seen. Its chief scientists come from a background in genomics research, rather than computer science or linguistics. Genomics researchers strive to decode the information contained in DNA to understand the very building blocks of life. Unlike search engineers, who focus on text and keywords, genomics researchers look at a billion three-letter “words” spelled out in the four-letter alphabet of DNA. These words are combined in “sequences” that determine everything from hair color to whether we’re predisposed to a particular disease. Cracking these codes requires massive amounts of data and the ability to see, and understand, hidden patterns of immense complexity.

DeepDyve takes a similar approach to understanding information on the web. Going far beyond basic keyword-based search, DeepDyve not only indexes every word in a document but also computes the factorial combination of words and phrases in the document, then uses some industrial-strength statistical techniques to assess the “informational impact” of these combinations. In essence, this approach looks at the meaning of an entire document and uses that to compute relevance, rather than relying on factors like snippets of text or anchor text in links pointing to documents.
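
DeepDyve hasn’t published the details of its algorithm, so the following is only a rough sketch of the general idea of weighting words and phrases by how informative they are. It swaps in a well-known stand-in, inverse document frequency (IDF), computed over single words and two-word phrases. The function names, the tiny corpus and the choice of IDF are all assumptions made for illustration, not DeepDyve’s actual method.

```python
import math
from collections import Counter

def ngrams(text, max_n=2):
    """Return every word and adjacent-word phrase (up to max_n words) in text."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

def informational_weights(corpus, max_n=2):
    """Weight each word or phrase by inverse document frequency (IDF).

    IDF is an illustrative stand-in, not DeepDyve's measure: terms that
    occur in fewer documents receive higher weights.
    """
    doc_freq = Counter()
    for doc in corpus:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(ngrams(doc, max_n)))
    total = len(corpus)
    return {term: math.log(total / df) for term, df in doc_freq.items()}

# Hypothetical three-document corpus.
corpus = [
    "gene expression in yeast cells",
    "gene therapy for inherited disease",
    "solar cells and clean energy",
]
weights = informational_weights(corpus)
```

In this toy corpus the phrase “gene expression” appears in only one document while “gene” appears in two, so the phrase gets the higher weight. That is the sense in which a combination of words can carry more information than any single keyword.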

It’s an interesting approach, and one that makes it quick and easy to refine searches in a powerful way. “We think that search is going away from keywords toward where content is your query,” said William Park, DeepDyve’s CEO.

Today’s launch is relatively modest: DeepDyve currently allows searches in the areas of life sciences, patents and Wikipedia, about 500 million pages of deep web content. (Arguably, Wikipedia isn’t really part of the deep web, given its prominence in many Google, Microsoft and Yahoo search results, but that’s a minor quibble.) Park says that the company is working hard to expand its coverage, adding physical sciences content in the areas of information technology, clean technology and energy, and doubling DeepDyve’s index by year end.

The company also offers a premium version for $45 per month that adds some nifty features, such as a “more like this” button that uses the full text of a document as a query and returns some pretty impressive results.
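
To make the “document as query” idea concrete, here is a minimal sketch that ranks candidate documents by cosine similarity between simple word-count vectors. This is a textbook technique chosen for illustration, and the names `term_vector`, `cosine_similarity` and `more_like_this` are hypothetical; it is not a description of how DeepDyve’s button works.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def more_like_this(query_doc, candidates):
    """Use the full text of query_doc as the query; return (score, doc) pairs, best first."""
    query = term_vector(query_doc)
    scored = [(cosine_similarity(query, term_vector(doc)), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical documents for illustration.
query = "patent for a solar cell with improved efficiency"
candidates = [
    "yeast gene expression study",
    "solar cell efficiency patent filing",
]
ranked = more_like_this(query, candidates)
```

Here the solar cell filing ranks first because it shares four words with the query document, while the yeast study shares none and scores zero. A production system would presumably weight terms rather than count them equally, but the shape of the query is the same: a whole document goes in, and similar documents come out.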

DeepDyve isn’t a threat to Google now or likely any time in the future. Instead, it’s a great tool for serious searchers who want to do comprehensive research in the content areas that DeepDyve covers (it’s also, much like Powerset, a vastly more powerful way to search Wikipedia). DeepDyve also offers a genuinely different “second opinion” of the web if you want to look beyond the top results returned by Google and the other major search engines.

With its limited initial offering, DeepDyve has just scratched the surface of what’s available on the invisible web, albeit in a very useful way. However, truly cracking the invisible web problem still seems like a distant dream.

About The Author: (@CJSherman) is a Founding Editor of and President of Searchwise LLC, a Boulder Colorado based Web consulting firm. He also programs and co-chairs the Search Marketing Expo - SMX conference series.
