Making Sense Of All The Data: Google, Hadoop & Cloudera
The NY Times article “Hadoop, a Free Software Program, Finds Uses Beyond Search” explains the very interesting history behind Hadoop. What is Hadoop? It’s distributed computing software that enables data mining and analysis on a huge scale. It is also, apparently, an open-source version of proprietary software that Google developed to process and analyze massive volumes of data for search. Here’s how the NY Times explains the problem Google was addressing:
By 2003, Google found it increasingly difficult to ingest and index the entire Internet on a regular basis. Adding to these woes, Google lacked a relatively easy to use means of analyzing its vast stores of information to figure out the quality of search results and how people behaved across its numerous online services.
To address those issues, a pair of Google engineers invented a technology called MapReduce that, when paired with the intricate file management technology the company uses to index and catalog the Web, solved the problem.
The MapReduce technology makes it possible to break large sets of data into little chunks, spread that information across thousands of computers, ask the computers questions and receive cohesive answers. Google rewrote its entire search index system to take advantage of MapReduce’s ability to analyze all of this information and its ability to keep complex jobs working even when lots of computers die.
MapReduce represented a couple of breakthroughs. The technology has allowed Google’s search software to run faster on cheaper, less-reliable computers, which means lower capital costs. In addition, it makes manipulating the data Google collects so much easier that more engineers can hunt for secrets about how people use the company’s technology instead of worrying about keeping computers up and running.
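To make the "break into chunks, ask questions, receive cohesive answers" idea concrete, here is a minimal sketch of the map and reduce pattern as a word count in plain Python. It is only an illustration of the general technique, not Google's or Hadoop's actual code: everything runs in a single process, and the function names (`split_into_chunks`, `map_chunk`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example. In a real cluster, each chunk would be handed to a different machine.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, chunk_size):
    """Break the input into small chunks (stand-ins for work sent to separate machines)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: turn one chunk into (word, 1) pairs, independently of all other chunks."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_chunks):
    """Group the pairs from every chunk by key so each word's counts end up together."""
    grouped = defaultdict(list)
    for word, count in chain.from_iterable(mapped_chunks):
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each word's list of counts into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, chunk_size=2):
    """Run the whole pipeline: split, map each chunk, shuffle, reduce."""
    chunks = split_into_chunks(lines, chunk_size)
    # Hypothetical simplification: a plain loop here; a cluster would map chunks in parallel.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    docs = [
        "search the web",
        "index the web",
        "analyze the data",
    ]
    print(word_count(docs))
```

Because each chunk is mapped on its own, a chunk whose machine dies can simply be re-run elsewhere without redoing the rest of the job, which is roughly the fault tolerance the article describes.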
Doug Cutting developed Hadoop as something of an open-source response to MapReduce, and Yahoo later hired him. According to the article, Yahoo then spent millions to further develop Hadoop. Other internet giants and companies, such as Facebook, IBM, Microsoft and Autodesk, have used Hadoop extensively to analyze huge volumes of data in ways that extend far beyond search.
Now former employees of Google, Yahoo and Facebook have come together to launch Cloudera, a company built around delivering data analysis with Hadoop:
“What if Google decided to sell the ability to do amazing things with data instead of selling advertising?” Mr. Hammerbacher asked.
The company has just released its own version of Hadoop. The software remains free, but Cloudera hopes to make money selling support and consulting services for the software. It has only a few customers, but it wants to attract biotech, oil and gas, retail and insurance customers to the idea of making more out of their information for less.
This is data mining on a gigantic scale, taking Google’s original techniques, as translated by Hadoop, and seeking to bring them to the masses (of enterprises, that is).