A topology for a big data production environment

I’ve attached an Excel file describing a full-featured Big Data (Hadoop) production topology, a good starting place for an architecture that supports the full Lambda architecture: streaming for seconds-old recency, batch for the heavy lifting, and services that logically merge the two on demand.  The cluster is composed of 21 EBS-backed AWS instances.  The HDFS layer can be tiered so that older data (more than one year old, for example) resides on cheaper S3 storage while remaining fully queryable.
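The hot/cold tiering described above usually surfaces as external-table partition locations (e.g. in Hive), but the routing rule itself is just a date comparison. Here is a minimal sketch, assuming a one-year hot-retention cutoff and hypothetical `hdfs://` and `s3://` warehouse roots that are not taken from the attached topology:

```python
from datetime import date, timedelta

# Illustrative storage roots; real values depend on your cluster and bucket.
HDFS_ROOT = "hdfs://namenode:8020/warehouse"   # hot tier (assumed)
S3_ROOT = "s3://example-bucket/warehouse"      # cold tier (assumed)
HOT_RETENTION = timedelta(days=365)            # the "more than 1 year" cutoff

def partition_location(table: str, partition_date: date, today: date) -> str:
    """Return the storage URI for a daily table partition based on its age."""
    root = HDFS_ROOT if today - partition_date <= HOT_RETENTION else S3_ROOT
    return f"{root}/{table}/dt={partition_date.isoformat()}"
```

Because both tiers are registered as partition locations of the same logical table, a query engine scans HDFS and S3 partitions transparently; the split changes cost and latency, not queryability.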

The use cases covered by this architecture:

  1. Accessibility
    1. Data mining support through SQL and machine learning libraries against the raw data
    2. Ad-hoc querying through SQL in a dimensional model
    3. REST, Thrift, and other API access with load balancing, data merging (from any data technology), and efficient data source routing
    4. OLAP cubes with perspectives (through data marts) for business analysis
  2. Technical
    1. Open source, free licensing model
    2. Fault tolerance, with re-entrant job recovery on failure
    3. Scalable design with massive parallelism
    4. Cloud design for flexibility
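The data merging called out under Accessibility is the core Lambda-architecture move: a serving layer combines a batch view (computed over all data up to some cutoff) with a speed-layer view (increments since that cutoff). A minimal sketch of that merge, with illustrative names not taken from the attached topology:

```python
def merged_counts(batch_view: dict, speed_view: dict) -> dict:
    """Merge per-key counts: batch totals plus realtime increments.

    batch_view: key -> count from the batch layer (complete, hours old)
    speed_view: key -> count from the streaming layer (partial, seconds old)
    """
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

# Batch knows about "a" and "b"; the speed layer has fresher activity on
# "b" and a brand-new key "c". The merge presents one up-to-date answer.
current = merged_counts({"a": 10, "b": 5}, {"b": 2, "c": 1})
```

When the next batch run completes, its view absorbs the streamed data and the speed-layer view is discarded, which keeps the merge simple and the realtime state small.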