Big Data Technology Strategy: Is Hadoop Already Outdated?

 


[Figure: The Hadoop Ecosystem. Logical architecture of the Hadoop stack.]

In an article posted to Information Age on 18 February 2013, Teradata CTO Stephen Brobst highlights the schism between the traditional Decision Support camp and the new-age Big Data camp, recalling of a recent Stanford University very-large-database conference, “The Hadoop guys were saying, ‘relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce.’” (Swabey, 2013).  That Hadoop is singled out by name reflects the technology’s striking ascent: it progressed from initial release to a core service in multinational platforms in less than six years (Hadoop Releases, 2013), and it already holds the lion’s share of the commercial Big Data marketplace.  Fanatical zeal aside, should it be the sole platform for knowledge management and creation?

Much is made of the dimensions by which we assign special treatment to “Big Data”.  These facets are popularly known as “The Three V’s”, which Gartner defines as “high-volume, high-velocity and high-variety information assets”.  Additional V’s are sometimes added to suit the audience, including Veracity (What is big data?, 2013), Variability, and Value (Fan, 2012).  In the December 2012 issue of ACM SIGKDD Explorations, Wei Fan and Albert Bifet explore the current and future state of Big Data.  They suggest that adoption of the technology has outpaced the technical community’s ability to put it in proper perspective, and they list seven points they consider controversial (Fan, 2012):

  • There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again […]
  • Big Data may be a hype to sell Hadoop based computing systems. Hadoop is not always the best tool […]
  • In real time analytics, data may be changing. In that case, what it is important is not the size of the data, it is its recency […]
  • Claims to accuracy are misleading […]
  • Bigger data are not always better data.  It depends if the data is noisy or not, and if it is representative of what we are looking for […]
  • …[Is it] ethical that people can be analyzed without knowing it […]
  • Limited access to Big Data creates new digital divides […]

Further supporting Fan and Bifet’s arguments, Stephen Brobst notes, “A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new.  What’s new is the velocity of change in the structure of data.” (Swabey, 2013).

Google (noticeably silent in the Big Data marketplace) abandoned the batch processing approach underlying Hadoop in favor of a real-time, service-based processing architecture originally called Dremel and outlined in a paper from Google Research (Melnik, 2010).  Google’s BigQuery cloud service, which exposes the Dremel technology used extensively inside Google, takes a different tack that “builds on ideas from web search and parallel DBMSs”, both core competencies for the company.  Reporting on a January 2013 consortium organized by IBM and Arizona State University, Dr. K. Selcuk Candan (Candan, 2013) highlights six key outcomes, which may be summarized as a need for better data fusion, data analysis algorithms, data models, scalable architectures, and real-time analysis.  While several vendors are visibly out front with custom Hadoop builds for real-time analysis, two non-Hadoop projects, S4 in the Apache Incubator and the production-ready Storm (http://storm-project.net/), show promise as general-purpose parallel computing engines.
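To make the batch versus real-time distinction concrete, here is a minimal sketch in plain Python.  It does not use the Hadoop, Dremel, or Storm APIs; the names batch_count and StreamingCounter and the sample events are hypothetical, chosen only for illustration.  The batch function rescans everything stored so far each time a result is needed, while the streaming class folds each event into running state so an answer is available after every event.

```python
from collections import Counter


def batch_count(events):
    """Batch style: scan the full stored dataset whenever a result is needed."""
    return Counter(word for line in events for word in line.split())


class StreamingCounter:
    """Streaming style: fold each event into running state as it arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, line):
        # Update the running totals with just the new event.
        self.counts.update(line.split())
        return self.counts


# Hypothetical sample input, for illustration only.
events = ["hadoop batch", "storm stream", "hadoop stream"]

# Batch: one pass over everything collected so far.
batch_result = batch_count(events)

# Streaming: the totals are current after every single event.
stream = StreamingCounter()
for line in events:
    stream.on_event(line)

# Both approaches agree once all events have been seen.
assert batch_result == stream.counts
```

Both paths reach the same totals; the difference is when the answer becomes available and how much data must be reread to produce it, which is the trade-off the paragraph above describes.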

While the Apache Hadoop project has staged an impressive entrance, broken through the Relational and OLAP paradigms, and shown the viability of open source software, I intend to keep an eye on the companies that have avoided the hype, such as Google (Regalado, 2013), and to watch as the market polarizes into those who need real-time analysis and those who never did.

 

References:

Candan, K. Selcuk. (2013, June 25).  Hunting for the Value Gaps in Data Management, Services, and Analytics.  Retrieved from http://wp.sigmod.org/?p=904

Fan, Wei, and Albert Bifet. (2012, December).  Mining Big Data: Current Status, and Forecast to the Future.  SIGKDD Explorations, Vol. 14, Issue 2.  Retrieved from http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-01-Fan.pdf

Gilyadov, Camuel. (2013, July 2). OpenDremel: Google BigQuery / Dremel implementation.  Retrieved from http://bigdatacraft.com/opendremel

Hadoop Releases. (2013, June 14). Retrieved from http://hadoop.apache.org/releases.html

Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. (2010).  Dremel: Interactive Analysis of Web-Scale Datasets.  Proc. of the 36th International Conference on Very Large Data Bases (VLDB).

Regalado, Antonio. (2013, June 11).  Just Don’t Call it Big Data: Why Google fears the totalitarian connotations of the buzzword big data.  Retrieved from http://www.technologyreview.com/view/515941/just-dont-call-it-big-data/

Swabey, Pete. (2013, February 18).  Teradata seeks compromise in the big data Holy Wars.  Retrieved from http://www.information-age.com/technology/information-management/123456802/teradata-seeks-compromise-in-the-big-data-holy-wars-

What is big data? (n.d.). Retrieved from http://www-01.ibm.com/software/data/bigdata/