Tag Archives: predictive analytics architecture

NIST Big Data Working Group

The US National Institute of Standards and Technology (NIST) kicked off its Big Data Working Group on June 19, 2013. The sessions have now been broken down into subgroups for Definitions, Taxonomies, Reference Architecture, and Technology Roadmap. The charter for the working group reads:

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extendibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing a consensus definitions, taxonomies, reference architectures, and technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life.

Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 – 1500 EDT (unless announced otherwise) by teleconference.

Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov


To participate in helping the US Government in their efforts, sign up at http://bigdatawg.nist.gov/home.php

Meetup for Tampa Analytics Professionals

I’ve started a meetup for local professionals in the decision science field around the Tampa Bay area to come together and learn about what’s happening in our area.  If you are a data science professional, come join us and be a part of making the Tampa-St. Petersburg metro area the southeast center of excellence in big data and analytics.  Visit http://www.meetup.com/Analytics-Professionals-of-Tampa/ to find events and to join.  I hope to see you there.

Big Data Analysis of Classified Information (Part 1)

Big Data Analytics in a Secure Environment

This nine-part series will outline the elements of using Big Data technologies for the analysis of classified information. Each part addresses one facet of the topic:

Part 1 – Applied big data architecture

Part 2 – Information flow

Part 3 – Organizational alignment

Part 4 – Roles and responsibilities

Part 5 – Principal phases and earned value benchmarks

Part 6 – Data fusion

Part 7 – Knowledge creation

Part 8 – Visualization

Part 9 – Summary and review

Several companies have taken on the challenges of multi-level secure operating systems. Arguably the largest effort is Sun’s Trusted Solaris 2.5.1, which is based on Solaris 2.5.1, Common Desktop Environment 1.1, and Solstice AdminSuite 2.1. Its ITSEC certification, granted by the UK, is not presently accepted by the NSA, so it does not serve as a pre-built secure OS capability. General Dynamics C4S has also built a capability based on a Linux OS that does not require a SPARC architecture, making it friendlier to open-source platforms. These initiatives create the potential for data fusion, real-time analytics, and predictive analytics across gov, NIPR, SIPR, JWICS, and coalition networks. The architecture is non-trivial in practice, but it is now possible to construct a generalized TOGAF Technical Reference Model based on Linux and open-source Hadoop, Mahout, OpenNLP, Cassandra, Hive, and Pig.

Fig 1. Detailed Technical Reference Model (Showing Service Categories)

The TOGAF Architectural Model

A reference architecture is a useful starting point for building an enterprise-specific architecture, and it reduces the risk that any design facet is skipped. The Open Group Architecture Framework (TOGAF) is one of the more widely adopted frameworks and is the concept upon which many domain-specific architectural standards are built. Per TOGAF, “The TOGAF Foundation Architecture is an architecture of generic services and functions that provides a foundation on which more specific architectures and architectural components can be built. This Foundation Architecture is embodied within the Technical Reference Model (TRM), which provides a model and taxonomy of generic platform services.  The TRM is universally applicable and, therefore, can be used to build any system architecture.”  1

At its most fundamental, the TOGAF TRM is divided into Application Software, the Application Platform, and the Communications Infrastructure, connected by the Application Platform Interface and the Communications Infrastructure Interface, as depicted in Figure 1. This construct gives structure to top-down planning of service catalog elements and lays the groundwork for later construction of an ITIL Service Catalog. Service elements connect infrastructure to applications and help visualize the dependencies between them.
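
The layering described above can be written down as a small data structure. The following is only an illustrative sketch in Python: the class names (`Layer`, `Interface`) and the `dependency_chain` helper are hypothetical and are not part of TOGAF itself. It captures the three TRM layers and the two interfaces that connect them, then walks them top-down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One of the three top-level TRM building blocks (hypothetical class)."""
    name: str


@dataclass(frozen=True)
class Interface:
    """A TRM interface joining an upper layer to the layer beneath it."""
    name: str
    upper: Layer
    lower: Layer


app_software = Layer("Application Software")
app_platform = Layer("Application Platform")
comms_infra = Layer("Communications Infrastructure")

api = Interface("Application Platform Interface", app_software, app_platform)
cii = Interface("Communications Infrastructure Interface", app_platform, comms_infra)


def dependency_chain(interfaces):
    """Walk the interfaces top-down and return the layer names in order."""
    chain = [interfaces[0].upper.name]
    for iface in interfaces:
        # Each interface must begin at the layer the previous one ended on.
        if iface.upper.name != chain[-1]:
            raise ValueError(f"{iface.name} does not connect to {chain[-1]}")
        chain.append(iface.lower.name)
    return chain


TRM_CHAIN = dependency_chain([api, cii])
print(" -> ".join(TRM_CHAIN))
```

Keeping the interfaces as explicit objects, rather than implying them from layer order, mirrors the way the TRM treats them as named elements in their own right.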

Mapping the TRM to Open Source Big Data Technologies

Open Source Software (OSS) can be part of a cost-effective long-range strategy for many organizations. The US Department of Defense CIO declared in 2003, and again in 2009, that open source technologies should be considered closely when selecting technologies, and corrected the misconception that government-modified versions of these technologies must be distributed to the public. Since these declarations, Apache Software Foundation technologies have figured highly in the US Government’s strategic portfolio, especially within the big data and analytics domain. Widely adopted platforms with security accreditation include Hadoop, Mahout, OpenNLP, Hive, Pig, Cassandra, Solr, Lucene, the Apache Web Server, and many others. A general mapping of these technologies against the target big data architecture, together with the capabilities of a secure operating system, indicates complete coverage of core, non-specialized capabilities.
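
A coverage claim of this kind can be checked mechanically once the mapping is written down. The sketch below is a minimal Python illustration, not the mapping used in this series: the capability names and the assignment of each component to a capability are assumptions made for the example, and `uncovered` is a hypothetical helper.

```python
# Capability areas the portfolio is expected to cover (assumed for this example).
REQUIRED_CAPABILITIES = {
    "structured data analysis",
    "unstructured data analysis",
    "knowledge discovery",
    "classification and caveat isolation",
}

# Illustrative assignment of components to capabilities (an assumption,
# not an accredited or official mapping).
COMPONENT_CAPABILITIES = {
    "Secure OS": {"classification and caveat isolation"},
    "Router": {"classification and caveat isolation"},
    "Hive": {"structured data analysis"},
    "OpenNLP": {"unstructured data analysis"},
    "Mahout": {"knowledge discovery"},
    "Solr/Lucene": {"knowledge discovery"},
}


def uncovered(required, portfolio):
    """Return the required capabilities that no component in the portfolio provides."""
    covered = set().union(*portfolio.values())
    return required - covered


missing = uncovered(REQUIRED_CAPABILITIES, COMPONENT_CAPABILITIES)
print("Uncovered capabilities:", sorted(missing) if missing else "none")
```

Removing a component from the dictionary and re-running the check shows which capability area would be left without an open source provider.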

A robust open source portfolio for the analysis of classified information in this design provides capabilities for structured data analysis, unstructured data analysis, knowledge discovery, and complete multi-level classification and caveat isolation. It includes:

  • Secure OS – The secure operating system. Examples are a multi-level secure Linux or multi-level secure Solaris.
  • Router – The multi-level secure router.  This provides TCP packet extensions and routing based on security classification markings.
  • Apache – The Apache Web Server, which serves HTML and other rendered content.
  • HDFS – The Hadoop Distributed File System is the persistence (storage) layer that allows Hadoop to distribute data and operate on it.
  • Hadoop – A scalable, massively parallel processing architecture for distributed computing applications.
  • Mahout – A machine learning platform implemented on Hadoop for classification and prediction of discrete and continuous data.
  • OpenNLP – A natural language processing toolkit, run here on Hadoop, for unstructured text analysis (sentence detection, tokenizing, part-of-speech tagging, entity extraction, etc.).
  • Solr – The open-source Apache search platform built on Lucene.
  • Lucene – Full-text indexer and search engine.  Lucene will accept outputs from Mahout and OpenNLP models to aid in searching the results of analysis.
  • Hive – Apache Hive is a data warehouse infrastructure that may be used for content storage, retrieval, indexing, and other core DBMS functions.
  • HiveQL – The SQL-like Hive DDL/DML, not fully ANSI SQL-92 compliant, used to warehouse massively parallel datasets and operate upon them.
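
To make the idea of multi-level classification and caveat isolation concrete, here is a minimal Python sketch of the classic lattice "dominance" check (the "no read up" rule from the Bell-LaPadula model). It is not taken from Trusted Solaris or any other product named above; the level ordering, the `Label` class, and the caveat names ALPHA and BRAVO are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumed ordering of classification levels, lowest to highest.
LEVELS = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}


@dataclass(frozen=True)
class Label:
    """A security label: one classification level plus a set of caveats."""
    level: str
    caveats: frozenset = frozenset()


def dominates(subject: Label, obj: Label) -> bool:
    """True if the subject may read the object ("no read up").

    The subject's level must be at least the object's level, and the
    subject must hold every caveat attached to the object.
    """
    return (LEVELS[subject.level] >= LEVELS[obj.level]
            and subject.caveats >= obj.caveats)


# Hypothetical caveat names, used only for this example.
analyst = Label("SECRET", frozenset({"ALPHA"}))
report_a = Label("CONFIDENTIAL", frozenset({"ALPHA"}))
report_b = Label("SECRET", frozenset({"ALPHA", "BRAVO"}))
report_c = Label("TOP SECRET")

for name, doc in [("report_a", report_a), ("report_b", report_b), ("report_c", report_c)]:
    print(name, "readable" if dominates(analyst, doc) else "denied")
```

In this sketch a read is denied either because the document sits at a higher level or because it carries a caveat the reader does not hold, which is the isolation property a multi-level secure OS and router are expected to enforce on every access.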

In the next part of the series we will look at the logical architecture and explore communication sequences within a few common scenarios.