Big Data Virtualization

JBoss has a free data virtualization (NOT server virtualization) platform called Teiid. Its capabilities include serving data from multiple technologies (JDBC, ODBC, Thrift, REST, SOAP, etc.), merging and transforming data, fault tolerance, scalability, and the other features one would require of an enterprise service. It can sit in the technology portfolio as part of an Enterprise Service Bus (ESB) to abstract big data and make it APPEAR to be relational (among other benefits). To set up a Teiid server to expose Hive data:

Install JBoss EAP

  1. Download jboss-eap-6.4.0.zip (or latest) from the JBoss downloads page
  2. Unzip to c:\programfiles\jboss\ on Windows, /etc/jboss on Linux

Overlay Teiid on top of EAP

  1. Download teiid-9.0.1-jboss-dist.zip (or latest)
  2. Unzip on top of the JBoss install you just created: c:\programfiles\jboss\jboss-eap-6.4.0

Add the Teiid web console to JBoss

  1. Download teiid-console-dist-1.x.zip (or latest)
  2. Unzip on top of the JBoss install you just created: c:\programfiles\jboss\jboss-eap-6.4.0

In <jboss install dir>\standalone\configuration\standalone-teiid.xml, add the following to the drivers section:

<driver name="hive" module="org.apache.hadoop.hive">
     <driver-class>org.apache.hive.jdbc.HiveDriver</driver-class>
</driver>
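
The driver entry only registers the JDBC driver class. A datasource that uses it is normally declared in the datasources section of the same file. Here is a minimal sketch, assuming a HiveServer2 instance listening on its default port 10000. The JNDI name java:/HiveDS, the pool name, the host name hive-host, and the credentials are illustrative placeholders, not values from this setup.

```xml
<!-- Sketch only: HiveDS, hive-host, and the credentials are placeholders. -->
<datasource jndi-name="java:/HiveDS" pool-name="HiveDS" enabled="true" use-java-context="true">
    <!-- HiveServer2 JDBC URL form: jdbc:hive2://<host>:<port>/<database> -->
    <connection-url>jdbc:hive2://hive-host:10000/default</connection-url>
    <!-- Must match the name attribute of the <driver> element above -->
    <driver>hive</driver>
    <security>
        <user-name>hive</user-name>
        <password>changeit</password>
    </security>
</datasource>
```

The value inside the driver element is what ties the datasource to the Hive driver, so the two names have to stay in sync.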

Find the following files on your cluster and add them to <jboss install dir>\modules\org\apache\hive\main. This path is VERY important and is currently mis-documented on the JBoss site.

commons-logging-1.1.3.jar
hadoop-common-2.5.0-cdh5.3.0.jar
hadoop-core-2.5.0-mr1-cdh5.3.0.jar
hive_metastore.jar
hive_service.jar
hive-common-0.13.1-cdh5.3.0.jar
hive-jdbc-0.13.1-cdh5.3.0.jar
hive-metastore-0.13.1-cdh5.3.0.jar
hive-serde-0.13.1-cdh5.3.0.jar
hive-service-0.13.1-cdh5.3.0.jar
httpclient-4.2.5.jar
httpcore-4.2.5.jar
HiveJDBC4.jar
libfb303-0.9.0.jar
libthrift-0.9.0.jar
log4j-1.2.14.jar
ql.jar
slf4j-api-1.5.11.jar
slf4j-log4j12-1.5.11.jar
TCLIServiceClient.jar
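
JBoss Modules also expects a module.xml descriptor next to the jars, listing each one as a resource root. The sketch below is an assumption about what that file could look like for the list above, not a copy of a known-good file. The name attribute is copied from the module value of the driver element, and the two dependencies are the ones JDBC driver modules commonly declare. JBoss Modules normally maps a module name onto its directory path, so if the server reports that the module cannot be found, check that the name and the directory agree.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the name is taken from the driver element's module attribute. -->
<module xmlns="urn:jboss:module:1.1" name="org.apache.hadoop.hive">
    <resources>
        <!-- One resource-root per jar copied into this directory -->
        <resource-root path="commons-logging-1.1.3.jar"/>
        <resource-root path="hadoop-common-2.5.0-cdh5.3.0.jar"/>
        <resource-root path="hadoop-core-2.5.0-mr1-cdh5.3.0.jar"/>
        <resource-root path="hive_metastore.jar"/>
        <resource-root path="hive_service.jar"/>
        <resource-root path="hive-common-0.13.1-cdh5.3.0.jar"/>
        <resource-root path="hive-jdbc-0.13.1-cdh5.3.0.jar"/>
        <resource-root path="hive-metastore-0.13.1-cdh5.3.0.jar"/>
        <resource-root path="hive-serde-0.13.1-cdh5.3.0.jar"/>
        <resource-root path="hive-service-0.13.1-cdh5.3.0.jar"/>
        <resource-root path="httpclient-4.2.5.jar"/>
        <resource-root path="httpcore-4.2.5.jar"/>
        <resource-root path="HiveJDBC4.jar"/>
        <resource-root path="libfb303-0.9.0.jar"/>
        <resource-root path="libthrift-0.9.0.jar"/>
        <resource-root path="log4j-1.2.14.jar"/>
        <resource-root path="ql.jar"/>
        <resource-root path="slf4j-api-1.5.11.jar"/>
        <resource-root path="slf4j-log4j12-1.5.11.jar"/>
        <resource-root path="TCLIServiceClient.jar"/>
    </resources>
    <dependencies>
        <!-- Assumed: typical dependencies for a JDBC driver module -->
        <module name="javax.api"/>
        <module name="javax.transaction.api"/>
    </dependencies>
</module>
```

The file names in the resource-root entries must match the jars on disk exactly, including the CDH version suffixes from your own cluster.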

Navigate to the EAP bin directory and execute ./standalone.sh -c standalone-teiid.xml (on Windows, standalone.bat -c standalone-teiid.xml).

Additional versions:  http://tools.jboss.org/downloads/overview.html
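
Once the server is running, Hive tables are exposed to clients through a Teiid virtual database (VDB). Below is a minimal sketch of a VDB descriptor that uses Teiid's hive translator. The VDB name hivevdb, the model and source names, and the JNDI name java:/HiveDS are assumptions for illustration. The JNDI name has to match whatever Hive datasource is defined in standalone-teiid.xml.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: hivevdb, the model/source names, and java:/HiveDS are placeholders. -->
<vdb name="hivevdb" version="1">
    <description>Exposes Hive tables through Teiid</description>
    <!-- A physical model: Teiid imports the table metadata from the source -->
    <model name="hive">
        <source name="hive-source" translator-name="hive" connection-jndi-name="java:/HiveDS"/>
    </model>
</vdb>
```

Saved as, for example, hivevdb-vdb.xml in the standalone\deployments directory, a file like this should be picked up by the deployment scanner. Clients would then connect with a Teiid JDBC URL of the form jdbc:teiid:hivevdb@mm://<host>:31000, assuming the default Teiid JDBC port.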

A topology for a big data production environment

I’ve attached an Excel file with a full-featured Big Data (Hadoop) production topology. It is a good starting place for an architecture that supports the full Lambda architecture (streaming for seconds-old recency, batch for heavy lifting, and services to logically merge the two on demand). The cluster is composed of 21 AWS instances with EBS backing. The HDFS layer can be partitioned so that older data (more than 1 year old, for example) sits on cheaper S3 storage while remaining fully queryable.
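
One way the "merge the two on demand" service could be expressed is as a virtual view in a Teiid VDB that unions a batch source with a streaming source. This is only a sketch under stated assumptions. The VDB, model, and table names, the column list, both JNDI names, and the choice of the generic jdbc-ansi translator for the speed layer are all hypothetical. The real translator depends on where the streaming jobs write their results.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Sketch only: every name below is a placeholder. -->
<vdb name="lambda" version="1">
    <!-- Batch layer: historical data, e.g. Hive tables on HDFS or S3 -->
    <model name="batch">
        <source name="batch-source" translator-name="hive" connection-jndi-name="java:/BatchDS"/>
    </model>
    <!-- Speed layer: recent data written by the streaming jobs (assumed JDBC-accessible) -->
    <model name="speed">
        <source name="speed-source" translator-name="jdbc-ansi" connection-jndi-name="java:/SpeedDS"/>
    </model>
    <!-- Serving layer: a virtual model that merges both at query time -->
    <model name="serving" type="VIRTUAL">
        <metadata type="DDL"><![CDATA[
            CREATE VIEW events (event_id string, event_time timestamp, amount double) AS
                SELECT event_id, event_time, amount FROM batch.events
                UNION ALL
                SELECT event_id, event_time, amount FROM speed.events;
        ]]></metadata>
    </model>
</vdb>
```

A real deployment would also need a cutoff (for example, a timestamp predicate on each branch) so that rows already absorbed by the batch layer are not returned a second time from the speed layer.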

The use cases covered in this architecture:

  1. Accessibility
    1. Data miner support through SQL and machine learning libraries into the raw data
    2. Ad-hoc querying through SQL in a dimensional model
    3. REST, Thrift, and other API access with load balancing, data merging (from any data technology), and efficient data source routing
    4. OLAP cubes with perspectives (through data marts) for business analysis
  2. Technical
    1. Open source, free licensing model
    2. Fault tolerance and re-entrance on failure
    3. Scalable design with massive parallelism
    4. Cloud design for flexibility

 

Example Big Data dev cluster topology

Below is an example topology for a Big Data development cluster that I’ve actually used for some customers. It’s composed of 6 Amazon Web Services (AWS) servers, each with a particular purpose. We have been able to run a full Lambda architecture on terabytes of data using this topology along with Teiid (for data abstraction). It’s not sufficient for a production cluster, but it is a good starting point for a development group. The total cost of this cluster as configured (less storage) is under $6/hour.

Here’s a link to this dev_topology in Excel.

 

Service      | Category    | Server1     | Server2      | Server3    | Server4    | Server5   | Server6
Cloudera Mgr | Cluster Mgt | Alert pub   | Server       | Host mon   | Svc Mon    | Event Svr | Act Mon
HDFS         | Infra       | Namenode    | SNN/DN/JN/HA | DN         | DN/JN      | DN        | DN/JN
Zookeeper    | Infra       | Server      | Server       | Server
YARN         | Infra       | Node Mgr    | Node Mgr     | JobHist    | Node Mgr   | RM/NM
Redis        | Infra       | Master      | Slave        | Slave
Hive         | Data        | Hive server | Metastore    | Hcat
Impala       | Data        | App Master  | Cat Svr      | Daemon     | Daemon     | Daemon
Storm        | Data        | Nimbus/UI   | Supervisor   | Supervisor | Supervisor
Hue          | UI          | Server
Pentaho BI   | UI          | BI Server
IP ADDRESS
AWS details           | Server1    | Server2    | Server3    | Server4     | Server5     | Server6
Instance type         | m3.2xlarge | m3.2xlarge | m3.2xlarge | r3.4xlarge  | r3.4xlarge  | r3.4xlarge
vCPU                  | 8          | 8          | 8          | 16          | 16          | 16
Memory (GB)           | 30.0       | 30.0       | 30.0       | 122.0       | 122.0       | 122.0
Instance storage (GB) | SSD 2 x 80 | SSD 2 x 80 | SSD 2 x 80 | SSD 1 x 320 | SSD 1 x 320 | SSD 1 x 320
I/O                   | High       | High       | High       | High        | High        | High
EBS option            | Yes        | Yes        | Yes        | Yes         | Yes         | Yes