Big Data Analytics in a Secure Environment
This nine-part series outlines the elements of using big data technologies to analyze classified information. Each part addresses one facet of that analysis:
Part 1 – Applied big data architecture
Part 2 – Information flow
Part 3 – Organizational alignment
Part 4 – Roles and responsibilities
Part 5 – Principal phases and earned value benchmarks
Part 6 – Data fusion
Part 7 – Knowledge creation
Part 8 – Visualization
Part 9 – Summary and review
Several companies have taken on the challenge of building multi-level secure operating systems. Arguably the largest effort is Sun’s Trusted Solaris 2.5.1, which is based on Solaris 2.5.1, Common Desktop Environment 1.1, and Solstice AdminSuite 2.1. Its ITSEC certification, granted by the UK, is not presently accepted by NSA, so it does not serve as a pre-built secure OS capability. General Dynamics C4S has also built a capability based on a Linux OS that does not mandate a SPARC architecture, making it friendlier to open-source platforms. These initiatives create the potential for data fusion, real-time analytics, and predictive analytics across gov, NIPR, SIPR, JWICS, and coalition networks. The architecture is non-trivial in practice, but it is now possible to construct a generalized TOGAF Technical Reference Model based on Linux and the open-source Hadoop, Mahout, openNLP, Cassandra, Hive, and Pig.
The TOGAF Architectural Model
A reference architecture is a useful starting point for building an enterprise-specific architecture, and it reduces the risk that a design facet is skipped. The Open Group Architecture Framework (TOGAF) is one of the more widely adopted frameworks and is the concept upon which many domain-specific architectural standards are built. Per TOGAF, “The TOGAF Foundation Architecture is an architecture of generic services and functions that provides a foundation on which more specific architectures and architectural components can be built. This Foundation Architecture is embodied within the Technical Reference Model (TRM), which provides a model and taxonomy of generic platform services. The TRM is universally applicable and, therefore, can be used to build any system architecture.” 1
At its most fundamental, the TOGAF TRM is broken into Application Software, Application Platform, and Communications Infrastructure. These are connected by the Application Platform Interface and the Communications Infrastructure Interface, as depicted in Figure 1. This construct provides a structure for top-down planning of service catalog elements and prepares the ground for follow-on construction of an ITIL Service Catalog. Service elements connect infrastructure to applications and help to visualize the dependencies between them.
Mapping the TRM to Open Source Big Data Technologies
Open Source Software (OSS) can be part of a cost-effective long-range strategy for many organizations. The US Government’s CIO declared in 2003, and again in 2009, that open source technologies should be considered closely when selecting technologies, and corrected the misconception that government-created versions of these technologies must be openly distributable to the public. Since these declarations, the Apache Foundation technologies have figured prominently in the US Government’s strategic portfolio, especially within the big data and analytics domain. Widely adopted platforms with security accreditation include Hadoop, Mahout, openNLP, Hive, Pig, Cassandra, SOLR, Lucene, the Apache Web Server, and many others. A general mapping of these technologies, together with the capabilities of a secure operating system, against the target big data architecture indicates complete coverage of core, non-specialized capabilities.
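A mapping of this kind can be sketched as a small lookup table that is checked for gaps. The sketch below is illustrative only: the placement of each component in one of the three TRM tiers is an assumption made for the example, not a placement taken from TOGAF or from this article, and the names PLACEMENT, group_by_layer, and uncovered_layers are hypothetical. The multi-level secure router comes from the component list that follows.

```python
from collections import defaultdict

# The three TRM tiers named in the article.
TRM_LAYERS = (
    "Application Software",
    "Application Platform",
    "Communications Infrastructure",
)

# ASSUMPTION: illustrative placement of each component in a TRM tier.
# A real architecture would justify each placement individually.
PLACEMENT = {
    "Mahout": "Application Software",
    "openNLP": "Application Software",
    "SOLR": "Application Software",
    "Hadoop": "Application Platform",
    "Hive": "Application Platform",
    "Pig": "Application Platform",
    "Cassandra": "Application Platform",
    "Lucene": "Application Platform",
    "Apache Web Server": "Application Platform",
    "Secure OS": "Application Platform",
    "Multi-level secure router": "Communications Infrastructure",
}


def group_by_layer(placement):
    """Return {layer: [components]}, rejecting unknown layer names."""
    grouped = defaultdict(list)
    for component, layer in placement.items():
        if layer not in TRM_LAYERS:
            raise ValueError(f"unknown TRM layer: {layer}")
        grouped[layer].append(component)
    return dict(grouped)


def uncovered_layers(placement):
    """Return the TRM tiers that have no component mapped to them."""
    grouped = group_by_layer(placement)
    return [layer for layer in TRM_LAYERS if layer not in grouped]


if __name__ == "__main__":
    for layer, components in group_by_layer(PLACEMENT).items():
        print(f"{layer}: {', '.join(sorted(components))}")
    print("Uncovered layers:", uncovered_layers(PLACEMENT) or "none")
```

Keeping the placement in plain data makes the coverage claim easy to re-check when a component is added or dropped; a finer-grained version could map components to individual TRM platform services instead of whole tiers.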
A robust open source portfolio for the analysis of classified information in this design provides capabilities for structured data analysis, unstructured data analysis, knowledge discovery, and complete multi-level classification and caveat isolation. It includes:
- Secure OS – The secure operating system. Examples are a multi-level secure Linux or multi-level secure Solaris.
- Router – The multi-level secure router. This provides TCP packet extensions and routing based on security classification markings.
- Apache – The Apache Web Server, which provides HTML and other rendering services.
- HDFS – The Hadoop Distributed File System is the persistence (storage) layer that allows Hadoop to distribute data and operate on it.
- Hadoop – Scalable, massively parallel processing architecture for distributed computing applications.
- Mahout – A machine learning platform implemented on Hadoop for classification and prediction of discrete and continuous data.
- openNLP – A natural language processing platform on Hadoop for unstructured text analysis (sentence detection, tokenizing, part-of-speech tagging, entity extraction, etc.).
- SOLR – Open source Apache search platform built on Lucene.
- Lucene – Full-text indexer and search engine. Lucene will accept outputs from Mahout and openNLP models to aid searching results of analysis.
- Hive – Apache Hive is a data warehouse infrastructure that may be used for content storage, retrieval, indexing, and other core DBMS functions.
- HiveQL – The SQL-like Hive DDL/DML language, not fully ANSI SQL-92 compliant, used to warehouse massively parallel datasets and operate upon them.
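The multi-level classification and caveat isolation that the secure OS and router provide is commonly modeled as a dominance check between security labels: a reader may see a record only if the reader's level is at least the record's level and the reader holds every caveat on the record. The following is a minimal sketch of that general idea, not the mechanism of any particular secure OS or router. The level ordering, the caveat names ALPHA and BRAVO, and the names Label, dominates, and readable are all hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative ordering of levels, lowest to highest.
LEVELS = ("UNCLASSIFIED", "CONFIDENTIAL", "SECRET", "TOP SECRET")


@dataclass(frozen=True)
class Label:
    """A security label: one level plus a set of caveats."""
    level: str
    caveats: frozenset = frozenset()

    def rank(self) -> int:
        # Raises ValueError if the level is not in LEVELS.
        return LEVELS.index(self.level)


def dominates(subject: Label, obj: Label) -> bool:
    """True if subject's level is >= obj's level and subject holds all of obj's caveats."""
    return subject.rank() >= obj.rank() and subject.caveats >= obj.caveats


def readable(subject: Label, records):
    """Return the names of the (name, label) records the subject may read."""
    return [name for name, label in records if dominates(subject, label)]


if __name__ == "__main__":
    analyst = Label("SECRET", frozenset({"ALPHA"}))  # hypothetical clearance
    records = [
        ("open-report", Label("UNCLASSIFIED")),
        ("alpha-summary", Label("SECRET", frozenset({"ALPHA"}))),
        ("bravo-summary", Label("SECRET", frozenset({"BRAVO"}))),
        ("ts-cable", Label("TOP SECRET")),
    ]
    print(readable(analyst, records))
```

In an analytics pipeline, the same check would be applied wherever results cross a boundary, for example when filtering search hits or fused records before they are returned to a user.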
In the next part of the series, we will look at the logical architecture and explore communication sequences within a few common scenarios.
1 The Open Group, “43. Foundation Architecture: Technical Reference Model”, http://pubs.opengroup.org/architecture/togaf9-doc/arch/chap43.html. Retrieved 29 May 2013.