Example Big Data dev cluster topology

Below is an example topology for a Big Data development cluster that I have actually used with several customers.  It is composed of six Amazon Web Services (AWS) servers, each with a particular purpose.  Using this topology, along with Teiid for data abstraction, we have been able to run a full lambda architecture (batch and speed layers feeding a serving layer) over terabytes of data.  It is not sufficient for a production cluster, but it is a good starting point for a development group.  The total cost of this cluster as configured (less storage) is under $6/hour.
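
For readers unfamiliar with the pattern, "full lambda" here means serving queries by merging a precomputed batch view (rebuilt periodically from the data at rest in HDFS) with an incremental speed view (maintained from the live stream, e.g. by Storm with Redis as a fast store).  The sketch below is a minimal, hypothetical illustration of that merge step; the keys and counts are invented for the example and are not taken from any customer system.

```python
# A minimal, hypothetical sketch of the lambda-architecture serving step:
# a precomputed batch view (rebuilt from data at rest in HDFS) is merged with
# an incremental speed view (kept current from the stream, e.g. Storm + Redis).
# The keys and counts below are invented for illustration only.

def merged_count(key, batch_view, speed_view):
    """Answer a query by combining the batch layer's precomputed value
    with the speed layer's delta for events not yet batch-processed."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Example: page-view counts per URL
batch_view = {"/home": 1_204_331, "/pricing": 88_412}  # recomputed nightly
speed_view = {"/home": 512, "/pricing": 37}            # updated per event
print(merged_count("/home", batch_view, speed_view))   # 1204843
```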

Here’s a link to this dev_topology in Excel.

| Service | Category | Server 1 | Server 2 | Server 3 | Server 4 | Server 5 | Server 6 |
|---|---|---|---|---|---|---|---|
| Cloudera Manager | Cluster mgmt | Alert Publisher | Server | Host Monitor | Service Monitor | Event Server | Activity Monitor |
| HDFS | Infra | NameNode | SNN/DN/JN/HA | DN | DN/JN | DN | DN/JN |
| ZooKeeper | Infra | Server | Server | Server | | | |
| YARN | Infra | Node Manager | Node Manager | Job History | Node Manager | RM/NM | |
| Redis | Infra | Master | Slave | Slave | | | |
| Hive | Data | Hive Server | Metastore | HCatalog | | | |
| Impala | Data | App Master | Catalog Server | Daemon | Daemon | Daemon | |
| Storm | Data | Nimbus/UI | Supervisor | Supervisor | Supervisor | | |
| Hue | UI | Server | | | | | |
| Pentaho BI | UI | BI Server | | | | | |
| IP address | | | | | | | |

AWS details:

| | Server 1 | Server 2 | Server 3 | Server 4 | Server 5 | Server 6 |
|---|---|---|---|---|---|---|
| Instance type | m3.2xlarge | m3.2xlarge | m3.2xlarge | r3.4xlarge | r3.4xlarge | r3.4xlarge |
| vCPU | 8 | 8 | 8 | 16 | 16 | 16 |
| Memory (GB) | 30.0 | 30.0 | 30.0 | 122.0 | 122.0 | 122.0 |
| Instance storage | 2 x 80 GB SSD | 2 x 80 GB SSD | 2 x 80 GB SSD | 1 x 320 GB SSD | 1 x 320 GB SSD | 1 x 320 GB SSD |
| I/O performance | High | High | High | High | High | High |
| EBS-optimized option | Yes | Yes | Yes | Yes | Yes | Yes |
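
If you want to stand up something similar, the following is a hedged provisioning sketch (not the author's actual method) for the six servers above, using the boto3 library.  The AMI ID, key pair, security group, and region are placeholders you would substitute with your own, and m3/r3 are previous-generation instance types.

```python
# A hedged sketch of provisioning the six-server dev topology with boto3.
# ImageId, KeyName, SecurityGroupIds, and region_name are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is an assumption

topology = [
    ("server1", "m3.2xlarge"),
    ("server2", "m3.2xlarge"),
    ("server3", "m3.2xlarge"),
    ("server4", "r3.4xlarge"),
    ("server5", "r3.4xlarge"),
    ("server6", "r3.4xlarge"),
]

for name, instance_type in topology:
    ec2.create_instances(
        ImageId="ami-xxxxxxxx",            # placeholder AMI
        InstanceType=instance_type,
        KeyName="dev-cluster-key",         # placeholder key pair
        SecurityGroupIds=["sg-xxxxxxxx"],  # placeholder security group
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": name}],
        }],
    )
```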


NIST Big Data Working Group

The US National Institute of Standards and Technology (NIST) kicked off its Big Data Working Group on June 19, 2013.  The sessions have since been broken into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap.  From the working group's charter:

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and a technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life.

Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300–1500 EDT (unless announced otherwise) by teleconference. Please click here for the virtual meeting information.

Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov

To participate in helping the US Government in this effort, sign up at http://bigdatawg.nist.gov/home.php

Big Data Technology Strategy: Is Hadoop Already Outdated?

[Figure: The Hadoop Ecosystem, a logical architecture of the Hadoop stack]

In an article posted to Information Age on 18 February 2013, Teradata CTO Stephen Brobst highlights the schism between traditional decision support and the new-age Big Data camp, noting that at a recent Stanford University very-large-database conference, "The Hadoop guys were saying, 'relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce'" (Swabey, 2013).  Hadoop's rapid ascendancy is striking: in less than six years it has progressed from initial release to core services in multinational platforms (Hadoop Releases, 2013), and it now represents the lion's share of the commercial Big Data marketplace.  Fanatical zeal aside, should it be the sole platform for knowledge management and creation?

Much is made of the dimensions by which we assign special treatment to "Big Data."  These facets are popularly known as "the three V's," defined by Gartner as "high-volume, high-velocity and high-variety information assets."  Additional V's are sometimes added to suit the audience, including Veracity (What is big data?, n.d.), Variability, and Value (Fan, 2012).  In the December 2012 issue of ACM SIGKDD Explorations, Wei Fan and Albert Bifet explore the current and future state of Big Data.  They point to signals that adoption of the technology has outpaced the ecosystem's ability to put it in proper perspective, offering seven issues they consider controversial (Fan, 2012):

  •  There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again […]
  •  Big Data may be a hype to sell Hadoop based computing systems. Hadoop is not always the best tool […]
  • In real time analytics, data may be changing. In that case, what it is important is not the size of the data, it is its recency […]
  • Claims to accuracy are misleading […]
  • Bigger data are not always better data.  It depends if the data is noisy or not, and if it is representative of what we are looking for […]
  • …[Is it] ethical that people can be analyzed without knowing it […]
  • Limited access to Big Data creates new digital divides […]

Further supporting Fan and Bifet’s arguments, Stephen Brobst notes, “A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new.  What’s new is the velocity of change in the structure of data.” (Swabey, 2013).

Google (noticeably silent in the Big Data marketplace) abandoned the batch-processing approach underlying Hadoop in favor of a real-time, service-based processing architecture originally called Dremel, outlined in a paper from Google Research (Melnik, 2010).  Google's BigQuery cloud service, the externally available version of Dremel (which is used extensively inside Google), takes a differing tack that "builds on ideas from web search and parallel DBMSs", core competencies for the company.  In a January 2013 consortium organized by IBM and Arizona State University, Dr. K. Selcuk Candan (Candan, 2013) highlights six key outcomes, which may be summarized as the need for better data fusion, data analysis algorithms, data models, scalable architectures, and real-time analysis.  While several vendors are visibly out front with custom Hadoop builds for real-time analysis, two non-Hadoop projects, S4 (in the Apache Incubator) and the production-ready Storm (http://storm-project.net/), show promise as general-purpose parallel computing engines.
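
To make the contrast with batch MapReduce concrete, here is a hedged illustration of the interactive, SQL-as-a-service model the Dremel paper describes, using today's google-cloud-bigquery Python client against one of Google's public sample datasets.  This is not code from any system named above, and the client library, dataset name, and credentials setup are assumptions on my part.

```python
# A sketch of BigQuery's query-as-a-service model: no cluster to provision,
# results come back interactively rather than from a batch MapReduce job.
# Requires Google Cloud credentials; the public sample dataset is an assumption.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

# Submit the query as a service call and stream the result rows back.
for row in client.query(sql).result():
    print(row.corpus, row.total_words)
```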

While the Apache Hadoop project has staged an impressive entrance, broken through the relational and OLAP paradigms, and shown the viability of open-source software, I intend to keep an eye on the companies that have avoided the hype, such as Google (Regalado, 2013), and watch as the market polarizes into those who need real-time analysis and those who never needed it.

References:

Candan, K. Selcuk. (2013, June 25). Hunting for the Value Gaps in Data Management, Services, and Analytics.  Retrieved from http://wp.sigmod.org/?p=904

Fan, Wei, and Albert Bifet. (2012, December). Mining Big Data: Current Status, and Forecast to the Future. ACM SIGKDD Explorations, Vol. 14, Issue 2.  Retrieved from http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-01-Fan.pdf

Gilyadov, Camuel. (2013, July 2). OpenDremel: Google BigQuery / Dremel implementation.  Retrieved from http://bigdatacraft.com/opendremel

Hadoop Releases. (2013, June 14). Retrieved from http://hadoop.apache.org/releases.html

Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis (2010). “Dremel: Interactive Analysis of Web-Scale Datasets”. Proc. of the 36th International Conference on Very Large Data Bases (VLDB).

Regalado, Antonio. (2013, June 11).  Just Don’t Call it Big Data: Why Google fears the totalitarian connotations of the buzzword big data.  Retrieved from http://www.technologyreview.com/view/515941/just-dont-call-it-big-data/

Swabey, Pete. (2013, February 18).  Teradata seeks compromise in the big data Holy Wars.  Retrieved from http://www.information-age.com/technology/information-management/123456802/teradata-seeks-compromise-in-the-big-data-holy-wars-

What is big data? (n.d.). Retrieved from http://www-01.ibm.com/software/data/bigdata/