Category Archives: Technologies

Content on Knowledge Discovery in Databases (KDD), analytics, decision support, or data mining ranging from the user-approachable to the technically focused.

Big Data Virtualization

JBoss offers a free data virtualization (NOT server virtualization) platform called Teiid. Its capabilities include serving data from multiple technologies (JDBC, ODBC, Thrift, REST, SOAP, etc.), merging and transforming data, fault tolerance, scalability, and the other features one would require of an enterprise service. It can sit in the technology portfolio as part of an Enterprise Service Bus (ESB) to abstract big data and make it APPEAR to be relational (among other benefits). To set up a Teiid server to expose Hive data:

Install JBoss EAP

  1. Download (or latest) from JBoss downloads
  2. Unzip to c:\programfiles\jboss\ on Windows, /etc/jboss on Linux

Overlay Teiid on top of EAP

  1. Download (or latest)
  2. Unzip on top of the JBoss install you just created: c:\programfiles\jboss\jboss-eap-6.4.0

Add the Teiid web console to JBoss

  1. Download (or latest)
  2. Unzip on top of the same JBoss install: c:\programfiles\jboss\jboss-eap-6.4.0

In <jboss install dir>\standalone\configuration\standalone-teiid.xml, add to the drivers section:

<driver name="hive" module="org.apache.hadoop.hive"/>

Find the following files on your cluster and add them to <jboss install dir>\modules\org\apache\hive\main. This path is VERY important and is currently mis-documented on the JBoss site.


Navigate to the EAP bin directory and execute ./ -c standalone-teiid.xml


Example Big Data dev cluster topology

Below is an example topology for a Big Data development cluster that I’ve actually used for some customers. It’s composed of six Amazon Web Services (AWS) servers, each with a particular purpose. We have been able to run a full lambda architecture on this topology, along with Teiid (for data abstraction), on terabytes of data. It’s not sufficient for a production cluster but is a good starting point for a development group. The total cost of this cluster as configured (less storage) is under $6/hour.

Here’s a link to this dev_topology in Excel.


Service Category Server1 Server2 Server3 Server4 Server5 Server6
Cloudera Mgr Cluster Mgt Alert pub Server Host mon Svc Mon Event Svr Act Mon
Zookeeper Infra Server Server Server
YARN Infra Node Mgr Node Mgr JobHist Node Mgr RM/NM
Redis Infra Master Slave Slave
Hive Data Hive server Metastore Hcat
Impala Data App Master Cat Svr Daemon Daemon Daemon
Storm Data Nimbus/UI Supervisor Supervisor Supervisor
Hue UI Server
Pentaho BI UI BI Server
AWS details           | Server1     | Server2     | Server3     | Server4     | Server5     | Server6
Name                  | m3.2xlarge  | m3.2xlarge  | m3.2xlarge  | r3.4xlarge  | r3.4xlarge  | r3.4xlarge
vCPU                  | 8           | 8           | 8           | 16          | 16          | 16
Memory (GB)           | 30.0        | 30.0        | 30.0        | 122.0       | 122.0       | 122.0
Instance storage (GB) | SSD 2 x 80  | SSD 2 x 80  | SSD 2 x 80  | SSD 1 x 320 | SSD 1 x 320 | SSD 1 x 320
I/O                   | High        | High        | High        | High        | High        | High
EBS option            | Yes         | Yes         | Yes         | Yes         | Yes         | Yes
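
For quick capacity planning, the AWS details above can be captured as data and summed. This is only a small sketch: the `Server` class and `cluster_totals` function are names made up for illustration, and the figures are copied straight from the table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Server:
    """One row of hardware details from the topology table (hypothetical helper type)."""
    name: str
    instance_type: str
    vcpus: int
    memory_gb: float
    instance_storage_gb: int


# Servers 1-3 are m3.2xlarge (2 x 80 GB SSD); servers 4-6 are r3.4xlarge (1 x 320 GB SSD).
SERVERS: List[Server] = (
    [Server(f"Server{i}", "m3.2xlarge", 8, 30.0, 2 * 80) for i in (1, 2, 3)]
    + [Server(f"Server{i}", "r3.4xlarge", 16, 122.0, 1 * 320) for i in (4, 5, 6)]
)


def cluster_totals(servers: List[Server]) -> Dict[str, float]:
    """Sum vCPUs, memory, and instance storage across all servers."""
    return {
        "vcpus": sum(s.vcpus for s in servers),
        "memory_gb": sum(s.memory_gb for s in servers),
        "instance_storage_gb": sum(s.instance_storage_gb for s in servers),
    }


if __name__ == "__main__":
    print(cluster_totals(SERVERS))
```

With the table's figures this comes to 72 vCPUs, 456 GB of memory, and 1,440 GB of instance storage across the six servers.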


The Structure of an OpenNLP NameFinder Model

Named Entity Models

Research labs and product teams intent on building upon OpenNLP and Solr (which can consume an OpenNLP NameFinder model) frequently find it important to write their own model parser or model builder classes. OpenNLP has built-in capabilities for this, but to write a custom parser you need to know the structure of the OpenNLP NameFinder model.

The NameFinder model is defined by the GISModel class, which extends AbstractModel; the definition and the interfaces it exposes can be found in the OpenNLP API docs on the Apache site. The structure, shown below, is composed of a model type indicator, a correction constant, model outcomes, and model predicates. Models for NameFinder can be downloaded free from the OpenNLP project and are trained against generic corpora.

OpenNLP NameFinder Model Structure

  1. The type identifier, GIS (literal)
  2. The model correction constant (int)
  3. Model correction constant parameter (double)
  4. Outcomes
    1. The number of outcomes (int)
    2. The outcome names (string array, length of which is specified in 4.1. above)
  5. Predicates
    1. Outcome patterns
      1. The number of outcome patterns (int)
      2. The outcome pattern values (each stored in a space delimited string)
    2. The predicate labels
      1. The number of predicates (int)
      2. The predicate names (string array, length of which is specified in 5.2.1. above)
    3. Predicate parameters (double values)
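
To make the layout above concrete, here is a minimal sketch of a custom parser in Python. It rests on several assumptions that are not stated above: that the raw model stream is in the binary form produced by Java's `DataOutputStream` (big-endian ints and doubles, strings with a 2-byte length prefix), that the text is plain ASCII/UTF-8, and that the first number in each outcome pattern is the count of predicates sharing that pattern, followed by the outcome indices. The names `GisModel` and `parse_gis_model` are made up for illustration and are not part of OpenNLP.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, List


@dataclass
class GisModel:
    correction_constant: int
    correction_param: float
    outcomes: List[str]
    outcome_patterns: List[List[int]]
    predicates: List[str]
    parameters: List[List[float]]  # one list of weights per predicate


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def read_int(stream: BinaryIO) -> int:
    # 4-byte big-endian signed int (DataOutputStream.writeInt layout)
    return struct.unpack(">i", _read_exact(stream, 4))[0]


def read_double(stream: BinaryIO) -> float:
    # 8-byte big-endian IEEE 754 double (DataOutputStream.writeDouble layout)
    return struct.unpack(">d", _read_exact(stream, 8))[0]


def read_utf(stream: BinaryIO) -> str:
    # 2-byte unsigned length prefix, then the encoded bytes (writeUTF layout).
    # Java uses "modified UTF-8"; decoding as UTF-8 is an assumption that
    # holds for ASCII labels.
    (length,) = struct.unpack(">H", _read_exact(stream, 2))
    return _read_exact(stream, length).decode("utf-8")


def parse_gis_model(stream: BinaryIO) -> GisModel:
    # 1. Type identifier
    model_type = read_utf(stream)
    if model_type != "GIS":
        raise ValueError(f"unexpected model type: {model_type!r}")
    # 2-3. Correction constant and its parameter
    correction_constant = read_int(stream)
    correction_param = read_double(stream)
    # 4. Outcomes: count, then names
    outcomes = [read_utf(stream) for _ in range(read_int(stream))]
    # 5.1 Outcome patterns: count, then space-delimited integer strings
    outcome_patterns = [
        [int(tok) for tok in read_utf(stream).split()]
        for _ in range(read_int(stream))
    ]
    # 5.2 Predicate labels: count, then names
    predicates = [read_utf(stream) for _ in range(read_int(stream))]
    # 5.3 Parameters: for each pattern, one double per outcome index,
    #     repeated for every predicate that shares the pattern (assumption).
    parameters: List[List[float]] = []
    for pattern in outcome_patterns:
        predicate_count, outcome_ids = pattern[0], pattern[1:]
        for _ in range(predicate_count):
            parameters.append([read_double(stream) for _ in outcome_ids])
    return GisModel(correction_constant, correction_param, outcomes,
                    outcome_patterns, predicates, parameters)
```

Calling `parse_gis_model(open(path, "rb"))` on a raw model file returns a `GisModel` whose `parameters[i]` holds the weights for `predicates[i]`, assuming the file matches the layout described here.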