As web search engines have improved over the years, there’s been less attention paid to an “inconvenient truth” about the indexes of our favorite information-finding tools: search engines still miss the lion’s share of information available on the web. This so-called “deep web” remains largely impenetrable to search engines for a variety of reasons, and for many types of queries that’s just fine. But if you’re a serious searcher, looking for the best information possible, you can’t afford to overlook this vast “hidden” store of information.
And that’s a challenge, because search tools that probe the deep web are for the most part either obscure or fee-based. That’s changing, thanks to a company formerly known as Infovell and now called DeepDyve. The eponymous DeepDyve.com rolls out today with an innovative approach to finding invisible web content that, despite limited coverage at the outset, impressed me with both what it finds and the tools it offers to make the searching experience even richer.
DeepDyve’s approach is like no other I’ve seen. Its chief scientists come from a background in genomics research, rather than computer science or linguistics. Genomics researchers strive to decode the information contained in DNA to understand the very building blocks of life. Unlike search engineers who focus on text and keywords, genomics researchers look at a billion three-letter “words” spelled out in the four-letter alphabet of DNA. These words are combined in “sequences” that determine everything from hair color to whether we’re predisposed to a particular disease. Cracking these codes requires massive amounts of data and the ability to see, and understand, hidden patterns of immense complexity.
DeepDyve takes a similar approach to understanding information on the web. Going far beyond basic keyword-based search, DeepDyve not only indexes every word in a document but also computes the factorial combination of words and phrases in the document, then uses some industrial-strength statistical techniques to assess the “informational impact” of these combinations. In essence, this approach looks at the meaning of an entire document and uses that to compute relevance, rather than relying on factors like snippets of text or anchor text in links pointing to documents.
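DeepDyve hasn’t published the details of its statistics, so the following is only a rough sketch of the general idea of scoring word combinations by how informative they are. It assumes pointwise mutual information (PMI) as a stand-in for “informational impact,” counts a pair once per document in which both words appear, and uses hypothetical names (`tokenize`, `pair_scores`) that are not part of any DeepDyve product.

```python
import math
from collections import Counter
from itertools import combinations

PUNCTUATION = ".,;:!?()\"'"


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def pair_scores(docs):
    """Score every unordered word pair by PMI over document co-occurrence.

    Assumption: PMI is an illustrative substitute for whatever measure
    DeepDyve actually uses. Pairs are keyed in alphabetical order.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for doc in docs:
        vocab = sorted(set(tokenize(doc)))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        # Positive when the pair co-occurs more often than chance predicts.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

In a toy corpus, a pair such as “gene” and “expression” that always appears together scores higher than a pair involving a common word like “in,” which is the kind of signal a plain keyword index would miss. Note that the number of pairs grows quadratically with vocabulary size, one reason this style of indexing demands heavy computation.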
It’s an interesting approach, and one that makes it quick and easy to refine searches in a powerful way. “We think that search is going away from keywords toward where content is your query,” said William Park, DeepDyve’s CEO.
Today’s launch is relatively modest, with DeepDyve currently allowing searches in the areas of life sciences, patents and Wikipedia, about 500 million pages of deep web content in all. (Arguably, Wikipedia isn’t really part of the deep web given its prominence in many Google, Microsoft and Yahoo search results, but that’s a minor quibble.) Park says that the company is working hard to expand its coverage, adding physical sciences content in the areas of information technology, clean technology and energy, with the aim of doubling DeepDyve’s index by year end.
The company also offers a premium version for $45 per month, with some nifty features like a “more like this” button that uses the full text of a document as a query and returns some pretty impressive results.
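To make the “content is your query” idea concrete, here is a minimal sketch of a document-as-query search. It assumes a standard TF-IDF weighting with cosine similarity, which is a textbook technique and not necessarily what DeepDyve runs; the function names (`tf_idf_vectors`, `cosine`, `more_like_this`) are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (word -> weight) for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF; a word found in every document gets weight 0.
        vectors.append(
            {w: c * math.log((1 + n) / (1 + df[w])) for w, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def more_like_this(docs, index):
    """Rank every other document by similarity to docs[index]."""
    vecs = tf_idf_vectors(docs)
    others = [i for i in range(len(docs)) if i != index]
    return sorted(others, key=lambda i: cosine(vecs[index], vecs[i]),
                  reverse=True)
```

Calling `more_like_this(docs, 0)` returns the indexes of the remaining documents, most similar first. Because the whole document acts as the query, every word it contains contributes to the ranking, not just the handful a searcher would think to type.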
DeepDyve isn’t a threat to Google now or likely any time in the future. Instead, it’s a great tool for serious searchers who want to do comprehensive research in the content areas that DeepDyve covers (it’s also, much like Powerset, a vastly more powerful way to search Wikipedia). DeepDyve also offers a genuinely different “second opinion” of the web if you want to look beyond the top results returned by Google and the other major search engines.
With its limited initial offering, DeepDyve has just scratched the surface of what’s available on the invisible web, albeit in a very useful way. However, truly cracking the invisible web problem still seems like a distant dream.