When creating a new Amazon Web Services (AWS) Hadoop cluster, putting together a configuration plan or topology can be overwhelming.
I’ve done this many times, and as part of my focus on tools and templates I thought I’d share a template you can use as a basic guideline for planning your Cloudera big data cluster. The template includes configurations for:
the cluster topology
metastore detail for Hive, YARN, Hue, Impala, Sqoop, Oozie, and Cloudera Manager
additional detail for custom service descriptors (CSDs) for Storm and Redis
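As a rough illustration, a topology plan like this can be captured in machine-readable form before provisioning. The sketch below is a hypothetical Python structure; the instance types, counts, and role assignments are illustrative placeholders, not values from the template itself:

```python
# Hypothetical sketch of a cluster topology plan.
# Instance types, counts, and service assignments are placeholders only.
topology = {
    "master": {"instance_type": "m4.2xlarge", "count": 2,
               "services": ["NameNode", "ResourceManager", "Cloudera Manager"]},
    "worker": {"instance_type": "d2.4xlarge", "count": 8,
               "services": ["DataNode", "NodeManager", "Impala Daemon"]},
    "edge":   {"instance_type": "m4.xlarge", "count": 1,
               "services": ["Hue", "Oozie", "Sqoop"]},
}

def total_nodes(plan):
    """Sum node counts across all roles in the plan."""
    return sum(role["count"] for role in plan.values())

print(total_nodes(topology))  # 11
```

Keeping the plan in a structure like this makes it easy to sanity-check counts and role assignments before anything is launched.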
No Warranty Expressed or Implied
It’s not meant to be exhaustive, as many items are not covered (AWS security groups, network optimization, dockerization, continuous integration, monitors, etc.), but it is an example of a real-world cluster in AWS (instance and Availability Zone details changed for security).
Cloudera Hadoop cluster configuration template for Amazon Web Services (AWS)
The US National Institute of Standards and Technology (NIST) kicked off its Big Data Working Group on June 19, 2013. The sessions have since been broken into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap. The charter for the working group:
NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.
Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and a technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life. Deliverables:
Develop Big Data Definitions
Develop Big Data Taxonomies
Develop Big Data Reference Architectures
Develop Big Data Technology Roadmap
Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.
Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.
Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300–1500 EDT (unless announced otherwise) by teleconference; virtual meeting information is available on the NIST site. Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov
This 9-part article will outline the elements of using Big Data technologies for the analysis of classified information. The topic is divided into the following parts:
Part 1 – applied big data architecture
Part 2 – information flow
Part 3 – organizational alignment
Part 4 – roles and responsibilities
Part 5 – principal phases and earned value benchmarks
Part 6 – data fusion
Part 7 – knowledge creation
Part 8 – visualization
Part 9 – summary and review
The challenges of multi-level secure operating systems have been taken up by several companies, arguably the largest effort being Sun’s Trusted Solaris 2.5.1, which is based on Solaris 2.5.1, Common Desktop Environment 1.1, and Solstice AdminSuite 2.1. The ITSEC certification granted by the UK is not presently accepted by the NSA, so it does not serve as a pre-built secure OS capability. General Dynamics C4S has also built a capability based on a Linux OS that does not mandate a SPARC architecture, making it friendlier to open-source platforms. These initiatives create the potential for data fusion, real-time analytics, and predictive analytics across government, NIPR, SIPR, JWICS, and coalition networks. The architecture is non-trivial in practice, but a generalized TOGAF Technical Reference Model based on Linux and open-source Hadoop, Mahout, OpenNLP, Cassandra, Hive, and Pig can now be constructed.
The TOGAF Architectural Model
A reference architecture is a useful starting point for building an enterprise-specific architecture and reduces the risk that any design facet is skipped. The Open Group Architecture Framework (TOGAF) is one of the more widely adopted frameworks and is the concept upon which many domain-specific architectural standards are built. Per TOGAF, “The TOGAF Foundation Architecture is an architecture of generic services and functions that provides a foundation on which more specific architectures and architectural components can be built. This Foundation Architecture is embodied within the Technical Reference Model (TRM), which provides a model and taxonomy of generic platform services. The TRM is universally applicable and, therefore, can be used to build any system architecture.” 1
At its most fundamental level, TOGAF is broken into Application Software, Application Platform, and Communications Infrastructure, connected by Application Platform Interfaces and Communications Infrastructure Interfaces, as depicted in Figure 1. This construct provides a structure for top-down planning of service catalog elements and lays the groundwork for follow-on ITIL Service Catalog construction. Service elements connect infrastructure to applications and are used to further visualize dependencies.
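To make the layering concrete, here is a minimal, hypothetical sketch of the TRM structure just described: three layers connected top-down through two named interfaces. The example services inside each layer are illustrative, not prescribed by TOGAF:

```python
# Hypothetical mapping of the TOGAF TRM layers described above.
# Layer contents are illustrative examples, not TOGAF-mandated services.
trm = {
    "Application Software": ["analyst dashboard", "search UI"],
    "Application Platform": ["data management services", "security services"],
    "Communications Infrastructure": ["network", "directory services"],
}

# Interfaces connect adjacent layers, top-down.
interfaces = [
    ("Application Software", "Application Platform",
     "Application Platform Interface"),
    ("Application Platform", "Communications Infrastructure",
     "Communications Infrastructure Interface"),
]

for upper, lower, iface in interfaces:
    print(f"{upper} -> {lower} via {iface}")
```

Walking the `interfaces` list top-down is one simple way to visualize which service elements each application depends on.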
Mapping the TRM to Open Source Big Data Technologies
Open Source Software (OSS) can be part of a cost-effective long-range strategy for many organizations. The US Government’s CIO declared in 2003, and again in 2009, that open source technologies should be considered closely when selecting technologies, and clarified the misconception that government-created versions of these technologies must be openly distributed to the public. Since these declarations, Apache Foundation technologies have figured prominently in the US Government’s strategic portfolio, especially within the big data and analytics domain. Widely adopted platforms with security accreditation include Hadoop, Mahout, OpenNLP, Hive, Pig, Cassandra, SOLR, Lucene, the Apache Web Server, and many others. A general mapping of these technologies against the target big data architecture, together with the capabilities of a secure operating system, indicates complete coverage of core, non-specialized capabilities.
A robust open source portfolio for the analysis of classified information in this design includes capabilities for structured data analysis, unstructured data analysis, knowledge discovery, and complete multi-level classification and caveat isolation:
Secure OS – The secure operating system. Examples are a multi-level secure Linux or multi-level secure Solaris.
Router – The multi-level secure router. This provides TCP packet extensions and routing based on security classification markings.
Apache – The Apache Web Server, which provides HTML and other rendering services.
HDFS – The Hadoop Distributed File System, the persistence (storage) layer that allows Hadoop to distribute data and operate on it.
Hadoop – A scalable, massively parallel processing framework for distributed computing applications.
Mahout – A machine learning platform implemented on Hadoop for classification and prediction of discrete and continuous data.
OpenNLP – A natural language processing platform on Hadoop for unstructured text analysis (sentence detection, tokenizing, part-of-speech tagging, entity extraction, etc.).
SOLR – An open-source Apache search platform built on Lucene.
Lucene – Full-text indexer and search engine. Lucene will accept outputs from Mahout and openNLP models to aid searching results of analysis.
Hive – Apache Hive is a data warehouse infrastructure that may be used for content storage, retrieval, indexing, and other core DBMS functions.
HiveQL – Hive’s SQL-like DDL/DML language (not ANSI SQL-92 compliant), used to warehouse massively parallel datasets and operate on them.
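To illustrate what the Lucene/SOLR tier in the list above contributes, here is a toy, from-scratch inverted index in Python. Real Lucene is far richer (analyzers, relevance scoring, segment files), so treat this purely as a conceptual sketch of full-text indexing and conjunctive search; the documents and terms are invented:

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: map each token to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, *terms):
    """Return doc IDs containing ALL of the given terms (conjunctive query)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Invented sample documents for illustration.
docs = {1: "fusion of sensor data",
        2: "sensor analytics report",
        3: "data fusion pipeline"}
idx = build_index(docs)
print(sorted(search(idx, "data", "fusion")))  # [1, 3]
```

In the real stack, Mahout and OpenNLP outputs would be fed to Lucene in exactly this spirit: model results become indexed fields so analysts can search over them.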
In the next part of the series we will look at the logical architecture and explore communication sequences within a few common scenarios.