Below is an example dev cluster topology for a Big Data development cluster as I’ve actually used for some customers. It’s composed of 6 Amazon Web Service (AWS) servers, each with a particular purpose. We have been able to perform full lambda using this topology along with Teiid (for data abstraction) on terabytes of data. It’s not sufficient for a production cluster but is a good starting point for a development group. The total cost of this cluster as configured (less storage) is under $6/hour.
The US National Institute of Standards and Technology (NIST) kicked off their Big Data Working Group on June 19th 2013. The sessions have now been broken down into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap. The charter for the working group:
NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.
Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing a consensus definitions, taxonomies, reference architectures, and technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life. Deliverables:
Develop Big Data Definitions
Develop Big Data Taxonomies
Develop Big Data Reference Architectures
Develop Big Data Technology Roadmap
Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.
Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.
Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 – 1500 EDT (unless announce otherwise) by teleconference. Please click here for the virtual meeting information.> Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov
An article posted to Information Age 18 February, 2013 Teradata CTO Stephen Brobst highlights the schism that has overtaken traditional Decision Support and the new-age Big Data camp, noting at a recent Stanford University very-large-database conference “The Hadoop guys were saying, ‘relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce.'” (Swabey, 2013 ). The inclusion of the Hadoop platform by name and the technology’s rapid ascendancy is striking in its proliferation progressing from initial release to core services in multinational platforms in less than six years (Hadoop Releases, 2013), yet it represents the lion’s share of the commercial Big Data marketplace. Fanatical zeal aside, should it be the sole platform for knowledge management and creation?
Much is made of the dimensions by which we assign special treatment to “Big Data”. These facets are known popularly as “The Three V’s”, which are defined by Gartner as “high-volume, high-velocity and high-variety information assets”. Additional V’s are sometimes added to suit the audience as necessary including Veracity (What is big data?, 2013), Variability, and Value (Fan, 2013). In the December 2013 issue of the ACM SIGKDD, Wei Fan and Albert Bifet explore the current and future state of Big Data. They allude to signals that the technology adoption has overshot the technical ecosystem’s ability to give it proper perspective providing seven factors they consider to be controversial (Fan, 2012):
There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again […]
Big Data may be a hype to sell Hadoop based computing systems. Hadoop is not always the best tool […]
In real time analytics, data may be changing. In that case, what it is important is not the size of the data, it is its recency […]
Claims to accuracy are misleading […]
Bigger data are not always better data. It depends if the data is noisy or not, and if it is representative of what we are looking for […]
…[Is it] ethical that people can be analyzed without knowing it […]
Limited access to Big Data creates new digital divides […]
Further supporting Fan and Bifet’s arguments, Stephen Brobst notes, “A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new. What’s new is the velocity of change in the structure of data.” (Swabey, 2013).
Google (noticeably silent in the Big Data marketplace) abandoned the batch processing approach underlying Hadoop in favor of a real-time, service-based processing architecture originally called Dremel and outlined in a paper from Google research (Melnik, 3010). Google’s BigQuery cloud service, used extensively at Google internally, takes a differing tack that “builds on ideas from web search and parallel DBMSs”—core competencies for the company. In a January 2013 consortium organized by IBM and Arizona State University, Dr. K. Selcuk Candan (Candan, 2013) highlights six key outcomes which may be summarized as a need for better data fusion, data analysis algorithms, data models, scalable architectures, and real-time analysis. While several vendors are visibly out front with custom Hadoop builds for real-time analysis, two non-Hadoop projects, S4 in the Apache Incubator and the production-ready Storm (http://storm-project.net/) show promise a general-purpose parallel computing engines.
While Apache Hadoop project has staged an impressive entrance, broken through the Relational and OLAP paradigms, and shown the viability of open source software, I intend to keep an eye on the companies that have avoided the hype such as Google (Regalado, 2013) and observe as the market polarizes into real-time analysis and those who never needed it.
Candan, K. Selcuk. (2013, June 25). Hunting for the Value Gaps in Data Management, Services, and Analytics. Retrievedfrom http://wp.sigmod.org/?p=904 .