Powerset Launches “Understanding Engine” For Wikipedia Content
After nearly two years in the making — and plenty of hype —
Powerset has finally rolled out a "natural language" search engine. It’s not a Google killer.
It’s barely a business model right now. But at least it’s something the world
can finally play with, and under the hood, there’s lots of potential.
By the time you read this, the Powerset site should have changed into a tool
that allows you to search
against material within Wikipedia. Why bother using Powerset rather than using Wikipedia’s own search tool or even Google
set to look only within Wikipedia
pages? The Powerset pitch is that you’ll get better results because
Powerset’s technology has read
and understood what every word within Wikipedia actually means.
An Understanding Engine, Not Natural Language Search
To understand that more, I beg that you forget you ever heard "natural language"
being associated with Powerset. That’s not really describing what they do in
comparison to regular search engines.
To explain, you have to understand that Google and the other major search
engines are largely stupid.
They don’t really understand the content on the pages that they "read." If they see the word "walk" in a sentence, they don’t know if walk is
being used as a verb or a noun. In very general terms, they don’t even know that
words are words. Words are more or less patterns to them — collections of
letters — and when someone
searches, they try to find the pages that have those patterns in them or in
links to those pages.
That’s VERY simplified, OK? The major search engines DO have some smarts, some
ability to know that walk is related to walking or that walk and run might be
similar words. But this is largely done through statistical guessing, rather
than comprehending what the individual words actually mean, especially in terms
of their exact grammatical usage.
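To make the "words as patterns" idea concrete, here is a minimal sketch of keyword matching with an inverted index. It is a deliberately simplified illustration, not how Google or any real engine is implemented; the sample pages and the function names `build_index` and `search` are made up for this example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase token to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            # Strip trailing punctuation; the token is otherwise an opaque string.
            index[token.strip('.,!?"')].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample pages, invented for illustration.
pages = {
    "a": "We walk to the park every morning.",  # "walk" used as a verb
    "b": "The walk along the river was long.",  # "walk" used as a noun
    "c": "She enjoys walking at night.",
}

index = build_index(pages)
walk_hits = search(index, "walk")        # matches pages "a" and "b" alike
walking_hits = search(index, "walking")  # matches only page "c"
```

The index cannot tell the verb on page "a" from the noun on page "b", and it sees no connection between "walk" and "walking" unless something extra, such as stemming or statistics, is layered on top.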
Powerset is different. It says that its technology reads and comprehends each
word on a page. It looks at each sentence. It understands the words in each
sentence and how they relate to each other. It works out what that sentence
really means, all the facts that are being presented. This means it knows what
any page is really about.
In lieu of a better phrase, call it an "understanding engine." Maybe that’s
not the right phrase, but natural language search isn’t it, either.
"Understanding engine" at least highlights what is unique about Powerset: because it
understands what pages are about, it can extract facts from those pages plus
comprehend how those facts, as well as those pages, relate to each other.
Wikipedia Discovery Tool
One of the chief uses for Powerset is employing it as a Wikipedia discovery
or query refinement tool. To use the Powerset example they gave me during a briefing last week, consider a
search for [henry viii]. What’s someone interested in when they search on
that topic, given Henry did a lot of things during his reign?
Over at Google, we get query refinement suggestions at the bottom of the
page, like this:
Most of these are generated by looking at the relationships between those who have
searched for one topic and then may have gone off and done another search. Yahoo
has the most sophisticated of the pack (see
Search Suggestions On
Steroids: Yahoo Search Assist), but it still hasn’t actually
"read" about Henry VIII and tried to group him into subtopics, in the way a human
would.
That’s what Powerset tries. Here’s what you get in a search for Henry VIII:
Notice the tabs at the top, where it recognizes Henry VIII could refer to the
person, the opera, the play, or even a television drama. OK, so not too amazing
when you think about it. But look further to the "Factz" area. Here you can see
that Powerset, after reading through Wikipedia, has figured out that Henry VIII
"dissolved" things like monasteries or that he "granted" things like land. And
yes, he "married" a few people.
There are even more facts that can be found, like this:
This is nice refinement. Running down the list, you can quickly scan the many
facts that define Henry’s life. And from the list, with a click, you can drill
in more about topics and jump right to particular pages within Wikipedia:
See how there’s a link to the
page? Powerset has seen that there’s something Henry VIII built mentioned on
that page, Pendennis Castle. That’s not covered on the main
Henry VIII page,
but because Powerset has read both pages and understands what they are about, it
can link the facts together.
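One way to picture this is as a store of subject-verb-object facts, each tagged with the page it was read from. The sketch below is only an assumption about the general shape of such data, not Powerset's actual design; the `Fact` type, the helper functions, and the placeholder page name are all hypothetical, and the three facts are the ones mentioned above.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    verb: str
    obj: str
    source_page: str  # the page the fact was extracted from


# Placeholder name: the real source page is not named here.
OTHER_PAGE = "some other page"

facts = [
    Fact("Henry VIII", "dissolved", "monasteries", "Henry VIII"),
    Fact("Henry VIII", "granted", "land", "Henry VIII"),
    Fact("Henry VIII", "built", "Pendennis Castle", OTHER_PAGE),
]


def facts_by_verb(all_facts, subject):
    """Group a subject's facts by verb, keeping each object with its source page."""
    grouped = defaultdict(list)
    for fact in all_facts:
        if fact.subject == subject:
            grouped[fact.verb].append((fact.obj, fact.source_page))
    return dict(grouped)


def facts_from_other_pages(all_facts, subject, main_page):
    """Return facts about the subject that were found outside its main page."""
    return [
        fact
        for fact in all_facts
        if fact.subject == subject and fact.source_page != main_page
    ]


grouped = facts_by_verb(facts, "Henry VIII")
elsewhere = facts_from_other_pages(facts, "Henry VIII", "Henry VIII")
```

Grouping by verb gives the scannable list of things Henry did, and the source page on each fact is what lets a result link out to a page other than the main Henry VIII article.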
Overkill For Now?
In short, the refinement is cool. What’s not to love about it? For one, it
might be overkill. During the demo, Powerset made a big deal of how it
could build information from across various Wikipedia pages that isn’t written
on any single one of them. For example, a search for [hulk hogan]
brought this up:
See how those who Hulk Hogan has defeated are itemized? It’s nice — but do
you really trust that all the defeats have been captured? I wouldn’t. I’d
probably still go looking for an authoritative list that had been reviewed by a
human. Moreover, I can get lists
like that without great refinement. A search for
victories on Google brings me to this
page on About.com listing his world title victories.
In addition, while Powerset did a nice job of breaking down Henry VIII
according to Wikipedia, Wikipedia’s human editors do a pretty nice job right in
the opening paragraphs to the Henry VIII page:
Henry VIII (28
June 1491 – 28 January 1547) was
King of England and
Lord of Ireland, later
King of Ireland and claimant to the
Kingdom of France, from
1509 until his
death. Henry was the second monarch of the
House of Tudor, succeeding his father, Henry VII.
Henry VIII was a significant figure in the history of the English monarchy.
Although in the first parts of his reign he energetically suppressed the
Reformation of the
Anglican Church, which had been building steam since
John Wycliffe of the fourteenth century, he is more often known for his
ecclesiastical struggles with Rome. These struggles ultimately led to him
separating the Anglican Church from Roman authority, the
Dissolution of the Monasteries, and establishing the English monarch as
Supreme Head of the Church of England. Although some claim he became a
Protestant on his death-bed, he advocated Catholic ceremony and doctrine
throughout his life; royal backing of the English Reformation was left to his
heirs, Edward VI and
Elizabeth I. Henry also oversaw the legal union of
England and Wales (see Laws in Wales Acts 1535–1542). He is noted in popular culture for being
married six times.
I suspect most people hitting Wikipedia are already going to find an opening
paragraph like that, which does a
pretty good job guiding them in refining their topics about Henry VIII and pointing them to
related pages.
That’s a problem for Powerset, which told me it hopes to attract lots of
those Wikipedia users to its own site, where they’ll be eventually shown ads
alongside the content (ads aren’t present at launch).
Powerset was at pains to explain how popular Wikipedia is and what a
well-used resource it has become. Agreed, and plenty of those people wind up there
because they’ve done searches at Google. About 70 percent of Wikipedia users
come via search engines, according to Powerset itself. That’s a huge audience
that is NOT going to magically be routed to Powerset instead. Yes, some know to go directly
to Wikipedia. No doubt some of these users will hear of the new
Powerset tool and go there. However, it will be a
stunning achievement if these are more than a fraction of those who hit the main Wikipedia site.
Powerset has another trick up its sleeve that might pull in the people. For
any page you visit, there’s an "Article Outline" box that appears within it.
It’s very slick. Select an item, and you’re jumped to the spot within the
document related to it:
I think it’s self-evident that Powerset adds some nice value to Wikipedia.
Indeed, everyone would probably be smart to go to it directly rather than
Wikipedia itself. But as I’ve covered above, that’s not what I expect to happen.
Future In Site Search?
If Powerset fails to capture a wide audience, then what’s the way forward for
it? One area is to
provide better site-specific searching. Powerset’s technology can be applied to
any set of documents, to make it easier for people to find what they are looking
for within them. Site-specific search allows those visiting a particular web
site to look just within that site. That market, along with enterprise search
(making intranets searchable), continues to grow. And the audience doing those
types of search is likely more inclined to seek out refinement options and
other exploratory tools than they are when performing general searches.
Powerset said this is a market they’re interested in, so perhaps we’ll see it
grow in that area. But for those expecting it to produce Google-wealth, keep in
mind that long-time and mature enterprise search player FAST
sold for $1.2
billion earlier this year. Yes, that’s a huge amount of money, but it’s not
the multibillions Yahoo was going to go for, and it’s much less than what Google’s valued at.
Speaking of Yahoo, it used to be the leading candidate to
acquire Powerset, especially given some close ties between the companies (Powerset
has a number of former Yahoos on staff). Given Yahoo’s current troubles and
unstable state, I wouldn’t expect much here.
Could a tie-up with a major player like Google or Microsoft happen? Sure.
Aside from site search, the technology that allows machines to automatically
comprehend what text documents are about ought to have other applications and be
worth something. What those are and how much it is worth isn’t clear. Powerset’s
been smart to snap up many licenses and patents around the technology that
should make it attractive to a larger search player like Google or Microsoft to
acquire. Within one of these organizations, I suspect more innovative things
might emerge.
FYI, I wrote the above paragraph last Friday, before the rumors (see
here on News.com
and here on
Techmeme) that Microsoft might want to buy it came out over the weekend.
Actually, I started writing this article several months ago and, in that draft, was
covering how it might be an acquisition target. It’s a fairly obvious move to
expect any of the majors to take a look, and when I talked to Powerset several
months ago, I was given the impression that all the majors had taken a look.
Since then, of course, no one has acquired it — plus the company went
through a management
shake-up last year. It was already under fire for not getting a product out
for so long. Add to these strikes against it as a potential Google killer the fact that it takes
Powerset about a month to comprehend Wikipedia’s 2.5 million topic pages. In
that time, many of those pages will have changed and will need to be reread.
Powerset’s impressive, but with the web having in excess of 20 BILLION
constantly changing pages, this is no overnight secret weapon that Microsoft might
buy and employ to take the search lead.
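A quick back-of-the-envelope calculation shows the size of that gap. It uses only the two figures above and assumes, purely for illustration, that processing speed stays fixed at the Wikipedia rate, with no added hardware or parallelism and no pages changing along the way.

```python
# Figures quoted above; the constant-throughput assumption is mine.
WIKIPEDIA_PAGES = 2_500_000
MONTHS_FOR_WIKIPEDIA = 1
WEB_PAGES = 20_000_000_000


def months_to_process(total_pages, pages_per_month):
    """Months needed to read total_pages once at a fixed monthly rate."""
    return total_pages / pages_per_month


pages_per_month = WIKIPEDIA_PAGES / MONTHS_FOR_WIKIPEDIA
months = months_to_process(WEB_PAGES, pages_per_month)
years = months / 12

print(f"{months:,.0f} months, or roughly {years:,.0f} years")
# prints "8,000 months, or roughly 667 years"
```

In practice more machines would shrink that number a great deal, but the web is 8,000 times the size of the Wikipedia corpus, so the hardware would have to scale by a similar factor just to keep the one-month turnaround.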
Indeed, what Powerset says it
has developed — along with patents locked up to protect it — is overkill for
what’s needed today. It will be more useful probably five years from now, in
ways we’re not even envisioning. For those players thinking long-term, which
include both Google and Microsoft, sure — it might well make sense to buy.
By the way, the Powerset launch will no doubt inspire interest in another
"natural language" search engine, Hakia. Someday I want to revisit Hakia and
explain more about why I also dislike the term "natural language" being applied
to it. In the meantime, you can read Vanessa Fox’s excellent article from last
October on the service,
Social Networking Through Search: Hakia Helps You Meet Others. And if you
need a deflation of natural language hype, see
The Google Challengers:
2008 Edition. In the section on Powerset, I summarize a long rant I did on
the history and hype of natural language search.
For related discussion, see Techmeme.