Big Data Analysis of Classified Information (Part 1)

Big Data Analytics in a Secure Environment

This nine-part series will outline the elements of using Big Data technologies for the analysis of classified information.  The topic is divided to address each facet of big data analysis of classified information:

Part 1 – Applied big data architecture

Part 2 – Information flow

Part 3 – Organizational alignment

Part 4 – Roles and responsibilities

Part 5 – Principal phases and earned value benchmarks

Part 6 – Data fusion

Part 7 – Knowledge creation

Part 8 – Visualization

Part 9 – Summary and review

The challenge of multi-level secure operating systems has been undertaken by several companies, arguably the largest being Sun with Trusted Solaris 2.5.1, which is based on Solaris 2.5.1, Common Desktop Environment 1.1, and Solstice AdminSuite 2.1. The ITSEC certification granted by the UK is not presently accepted by NSA, so it does not serve as a pre-built secure OS capability. General Dynamics C4S has also built a capability based on a Linux OS that does not mandate a SPARC architecture, making it friendlier to open-source platforms. These initiatives create the potential for data fusion, real-time analytics, and predictive analytics across government, NIPR, SIPR, JWICS, and coalition networks. The architecture is non-trivial in practice, but a generalized TOGAF Technical Reference Model based on Linux and open-source Hadoop, Mahout, openNLP, Cassandra, Hive, and Pig can now be constructed.

Fig 1. Detailed Technical Reference Model (Showing Service Categories)

The TOGAF Architectural Model

A reference architecture is useful as a starting point for building an enterprise-specific architecture, and it reduces the risk that any design facet is skipped.  The Open Group Architecture Framework (TOGAF) is one of the more widely adopted frameworks and is the concept upon which many domain-specific architectural standards are built.  Per TOGAF, “The TOGAF Foundation Architecture is an architecture of generic services and functions that provides a foundation on which more specific architectures and architectural components can be built. This Foundation Architecture is embodied within the Technical Reference Model (TRM), which provides a model and taxonomy of generic platform services.  The TRM is universally applicable and, therefore, can be used to build any system architecture.” [1]

At its most fundamental, TOGAF is broken into Application Software, Application Platform, and Communications Infrastructure, connected by Application Platform Interfaces and Communications Infrastructure Interfaces, as depicted in Figure 1. This construct provides a structure for top-down planning of service catalog elements and pre-positions the organization for follow-on ITIL Service Catalog construction. Service elements connect infrastructure to applications and are used to further visualize dependencies.
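As a concrete illustration, this layering can be seeded directly into a service-catalog data model. Below is a minimal Java sketch; the layer names follow the TRM, but the example service entries are assumptions chosen to foreshadow the open-source mapping in the next section, not normative TOGAF content.

import java.util.List;
import java.util.Map;

/** Minimal model of the TOGAF TRM layering for service-catalog planning. */
public class TrmCatalog {

    /** The three TRM layers, joined by the two TRM interfaces. */
    enum Layer { APPLICATION_SOFTWARE, APPLICATION_PLATFORM, COMMUNICATIONS_INFRASTRUCTURE }

    public static void main(String[] args) {
        // Illustrative (non-normative) mapping of service categories to candidate
        // components; see the open-source mapping in the next section.
        Map<Layer, List<String>> catalog = Map.of(
            Layer.APPLICATION_SOFTWARE,
                List.of("Analytic dashboards", "Search UI"),
            Layer.APPLICATION_PLATFORM,
                List.of("Data management (Hive)", "Search (SOLR/Lucene)", "Distributed computing (Hadoop)"),
            Layer.COMMUNICATIONS_INFRASTRUCTURE,
                List.of("Multi-level secure OS", "MLS routing"));
        catalog.forEach((layer, services) -> System.out.println(layer + " -> " + services));
    }
}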

Mapping the TRM to Open Source Big Data Technologies

Open Source Software (OSS) can be part of a cost-effective long-range strategy for many organizations. The US Government’s CIO in 2003 and again in 2009 declared that open source technologies should be closely considered when selecting technologies, and clarified the misconception that government-created versions of these technologies must be openly distributed to the public. Since these declarations, Apache Foundation technologies have figured prominently in the US Government’s strategic portfolio, especially within the big data and analytics domain. Widely adopted platforms with security accreditation include Hadoop, Mahout, openNLP, Hive, Pig, Cassandra, SOLR, Lucene, the Apache Web Server, and many others. A general mapping of these technologies against the target big data architecture, along with the capabilities of a secure operating system, indicates complete coverage of core, non-specialized capabilities.

A robust open source portfolio for the analysis of classified information in this design includes capabilities for structured data analysis, unstructured data analysis, knowledge discovery, and complete multi-level classification and caveat isolation:

  • Secure OS – The secure operating system, for example a multi-level secure Linux or multi-level secure Solaris.
  • Router – The multi-level secure router, which provides TCP packet extensions and routing based on security classification markings.
  • Apache – The Apache Web Server, which provides HTML and other rendering services.
  • HDFS – The Hadoop Distributed File System, the persistence (storage) layer that allows Hadoop to distribute data and operate on it.
  • Hadoop – Scalable massively parallel processing architecture for distributed and scalable computing applications.
  • Mahout – A machine learning platform implemented on Hadoop for classification and prediction of discrete and continuous data.
  • openNLP – A natural language processing platform on Hadoop for unstructured text analysis (sentence detection, tokenizing, part-of-speech tagging, entity extraction, etc.).
  • SOLR – Open-source Apache search platform built on Lucene.
  • Lucene – Full-text indexer and search engine.  Lucene will accept outputs from Mahout and openNLP models to aid searching the results of analysis (a minimal indexing sketch follows this list).
  • Hive – Apache Hive is a data warehouse infrastructure that may be used for content storage, retrieval, indexing, and other core DBMS functions.
  • HiveQL – The SQL-like Hive DDL/DML (not ANSI SQL-92 compliant), used to warehouse massively parallel datasets and operate upon them.
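The indexing role Lucene plays in this portfolio can be sketched briefly. The minimal Java example below assumes a recent Lucene core API; the field names (docId, entityType, entity) and the index path are invented for illustration.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class EntityIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // One Lucene document per extracted entity occurrence.
            Document doc = new Document();
            doc.add(new StringField("docId", "234cba3231", Field.Store.YES));   // source document key
            doc.add(new StringField("entityType", "person", Field.Store.YES));  // openNLP model type
            doc.add(new TextField("entity", "Rick Marshall", Field.Store.YES)); // analyzed for full-text search
            writer.addDocument(doc);
        }
    }
}

Analysts can then query the entity field with the standard Lucene search API, making model output searchable alongside the original content.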

In the next part of the series we will look at the logical architecture and explore communication sequences in a few common scenarios.

 

References:

1. The Open Group, The Open Group Architecture Framework (TOGAF), “Foundation Architecture: Technical Reference Model.”

SOFs and Big Data – Not a Cultural Shift

NOTE: This is a repost by permission of an article by Mr. Richard Marshall. Mr. Marshall provides big data and analytics capabilities to the Special Operations community through his company, Blackstorm International. His website is http://blackstorm-int.com.

SOF Warriors

 

When you think of Special Operations Forces, you think of the hard men who stormed the Osama bin Laden compound in the middle of the night, successfully delivering justice and honor. You do not think of a tall, thin kid barely out of college, with a European man-bag and Converse shoes, drinking a vanilla latte, as the next warrior against the enemies of freedom.

Special Operations has always looked to gain the advantage in every action, seeking out especially adept groups to secure competitive advantage. Too often these groups focus on the bleeding edge of operations, and they are scarce resources used for a limited purpose.

In a situation that is not unique to SOF, the supportive functions of the organization do not benefit from the same attention the primary mission holders receive. While this is to be expected, organizations also need to ensure that the supporting elements’ business systems and processes are improved over time to avoid organizational drag.

In essence, the lack of qualified data scientists at all levels of the organization results in inconsistent business practices, and a myopic focus on isolated business areas severely limits the value big data and analytics can bring to SOF.  What is needed is a set of practices and processes that is repeatable, can be expanded upon, and is easily translated across organizational boundaries. The ability of subordinate units to leverage Headquarters practices and resources, thereby lowering the barriers to successful analytics utilization, is not yet realized in most commands.

In fact, many consultants in this space will assert that commoditization is not possible within the discipline of BI/BA because every problem is different and takes different skills and approaches to solve. This is a fallacy, and the stance is usually designed to prolong consulting engagements and profitability.

It is a simple fact that much of the technology needed to develop an analytics program already exists within the organizations desiring the capability. There are benefits to purchasing scalable distributed storage solutions supporting big data applications; however, these need to be balanced against the benefits of license optimization within the current infrastructure. Seldom is scalability a driving issue in COCOMs the way it is in other industries such as banking. The data are simply not that large.

Eventually we will learn to utilize the deluge of data coming off our sensor platforms, necessitating a scalable infrastructure; however, the practice of working with the data must come first. Most likely, the big data sets available in DoD will be focused on efficiencies and utilization (performance management) rather than finding a bad guy. In fact, much of the data that fits the big data profile will be platform-specific data that has little to do with SOF’s eight primary mission areas.

So what will DoD organizations such as the Combatant Commands and their subordinate organizations need to change to take advantage of this emergent approach to competitive advantage? SOF need only do what they have always done, which is operate outside their comfort zone:

  1. Realize that the contracting groups most likely to assist in this field will not come from their old ops buddies. The groups that will bring this success will have little or no knowledge of SOF missions; they will have a deep knowledge of data, statistical analysis, and presentation.
  2. Look to develop a set of business practices and policies that support decision making for the command and can be shared with subordinate units.
  3. Question solutions. Look critically at the offerings within the community. Many organizations are trying to sell applications and hardware as bundled sets; analyze the benefits of these platforms and the capability they will bring. Most organizations running a Microsoft infrastructure already have all the tools they need to develop an analytics capability.
  4. Focus on the practice. Build a framework and integrate the capability into every J-Code/staff section. Hire personnel who can train and guide Command staff in asking the questions that will lead to analytics solutions.
  5. Focus on the data. The practice of working with data has academically been reserved for a small group of science majors and professionals. As the data sets expand, staff members can assist the command in being mindful of the importance of all data and in ensuring that the organization’s information is properly constructed and cared for.
  6. Knowledge Management. Knowledge Management offers a unique position for developing a global analytics solution due to the scope of its reach within the CCMDs. Though underutilized now, KMs will mature into the focal point for future analytics operations, as keepers of the index.

There are plenty of opportunities for SOF warriors to squeeze more out of their data and current systems. The habit of consistently reaching outside existing comfort zones is a hallmark of the profession. What SOF needs is a practice and a framework that can be shared and grown, and a vehicle to deliver the tools needed by the new generation of leaders and operations specialists.  The nondescript, European man-bag-carrying warrior will be on point in our unconventional war against our enemies, with enhanced, analytics-driven information as a key weapon in her arsenal.

Automated Metadata Extraction for Competitive Intelligence

Artificial Intelligence for the Creation of Competitive Intelligence Tools

Introduction

Often in prioritizing business development activities it is helpful to determine who is able to influence a decision and how they are related to others in the market space.  To build a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for this analysis must be obtained by identifying and extracting target attribute values from unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, a distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and network analysis by machine learning algorithm (naïve Bayes or perceptron, for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  The identification scheme may or may not employ stemming, but it will always require tokenizing, part-of-speech tagging, and a predefined model of attribute patterns to properly recognize and extract the required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project (http://stanbol.apache.org/) shows promise at semantic-based extraction and content enhancement but has not yet been promoted out of the Apache Incubator.

Apache openNLP attribute recognition models are available in only a few languages, the original and largest set being English.  The community publishes English models for the Name Finder interface covering dates, locations, money, organizations, percentages, persons, and times.  Each is an appropriate candidate for term extraction for competitive intelligence analysis.
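A minimal Java sketch of the Name Finder interface follows, assuming the community’s pre-trained en-token.bin and en-ner-person.bin models have been downloaded locally (the file locations and the sample sentence are illustrative):

import java.io.FileInputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NameFindDemo {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained English tokenizer and person-name models.
        try (FileInputStream tokenIn = new FileInputStream("en-token.bin");
             FileInputStream personIn = new FileInputStream("en-ner-person.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokenIn));
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(personIn));

            String[] tokens = tokenizer.tokenize(
                "FYI, Rick Marshall unofficially approved a 3-day trip to Jacksonville.");
            for (Span span : finder.find(tokens)) {
                // Each Span carries token offsets plus the model's type, e.g. "person".
                System.out.println(span.getType() + ": " + String.join(" ",
                    java.util.Arrays.copyOfRange(tokens, span.getStart(), span.getEnd())));
            }
            finder.clearAdaptiveData(); // reset per document to avoid cross-document bleed
        }
    }
}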

Logical Architecture

openNLP in a four-node Hadoop cluster (Natural Language Processing for Competitive Intelligence)

The controlling requirement for this task is the processing of massive data stores to extract the target information.  Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports these natural language processing needs.  The logical architecture for a small (<1 TB), four-node clustered Hadoop solution is depicted above.

 

Process Flow

As shown below, the process to execute is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing using a Graph Processing pattern in a distinctly separate map phase would likely hasten any Influence Analysis to be performed post-process.
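A minimal Hadoop job driver for the Distributed Task Execution pattern might look like the following sketch; the NameFindMapper and EntityReducer class names are hypothetical placeholders, sketched later in this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NameFindJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "opennlp-namefind");
        job.setJarByClass(NameFindJob.class);
        job.setMapperClass(NameFindMapper.class);   // hypothetical mapper; sketched below
        job.setReducerClass(EntityReducer.class);   // hypothetical reducer; sketched later
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // unstructured documents on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // entity tuple result set
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}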

 

Multi-node operations sequence diagram for openNLP with MapReduce on Hadoop (competitive intelligence)

The primary NameNode initiates work and passes the data and the map/reduce execution program to the TaskTrackers, which in turn distribute it among the worker nodes.  The worker nodes execute the map on HDFS-stored data and provide health and status to the TaskTracker, which reports to the primary NameNode.  On completion of a node’s map work, the primary NameNode may redistribute map work to the worker node or order the reduce task, each by way of the TaskTracker.  The reduce task selects data from the HDFS interim result set, aggregates it, and streams it to a result file.  The result file is then used for later analysis by the machine learning algorithm of choice.
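A sketch of such a map task is below, wrapping openNLP in a Hadoop Mapper. Loading the models in setup() is a deliberate choice so each worker JVM pays the model-load cost once; the doc.id configuration key is an assumption made for illustration (a production job would derive the document ID from the input split or file name):

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/** Emits (documentId, "entity TAB type") pairs for each recognized name. */
public class NameFindMapper extends Mapper<LongWritable, Text, Text, Text> {
    private TokenizerME tokenizer;
    private NameFinderME personFinder;

    @Override
    protected void setup(Context context) throws IOException {
        // Models shipped to each worker node, e.g. via the distributed cache.
        try (FileInputStream tokenIn = new FileInputStream("en-token.bin");
             FileInputStream personIn = new FileInputStream("en-ner-person.bin")) {
            tokenizer = new TokenizerME(new TokenizerModel(tokenIn));
            personFinder = new NameFinderME(new TokenNameFinderModel(personIn));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String docId = context.getConfiguration().get("doc.id", "unknown"); // illustrative assumption
        String[] tokens = tokenizer.tokenize(value.toString());
        for (Span span : personFinder.find(tokens)) {
            String entity = String.join(" ",
                java.util.Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
            context.write(new Text(docId), new Text(entity + "\t" + span.getType()));
        }
        personFinder.clearAdaptiveData(); // reset adaptive features between documents
    }
}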

File Structures

The input file is machine-readable, unstructured ASCII text.  Example:

 

From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison

 

Hugo,

 

FYI, Rick Marshall unofficially approved a 3-day trip starting 14 November for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

Amir Soofi

 

Principal Software Engineer, Enterprise

 

 

The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date>

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison

 

Hugo,

 

FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

<namefind/person>Amir Soofi</namefind/person>

 

Principal Software Engineer, Enterprise

 

The output of an example reduce task on this map output (a reducer sketch follows the tuples):

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday, December 06, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville, FL, Location}

{234cba3231, Amir Soofi, Person}
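A reduce task matching this output might be as simple as the sketch below, which pairs with the hypothetical mapper above and streams each {DocumentUniqueID, EntityKey, EntityType} tuple to the result file:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Streams {DocumentUniqueID, EntityKey, EntityType} tuples to the result file. */
public class EntityReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text docId, Iterable<Text> entities, Context context)
            throws IOException, InterruptedException {
        for (Text entityAndType : entities) {
            // entityAndType arrives as "EntityKey TAB EntityType" from the mapper,
            // so each output record is a {docId, entity, type} tuple.
            context.write(docId, entityAndType);
        }
    }
}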

 

A second reduce pass might yield combinations for network analysis, with the link strength below calculated on instances of co-occurrence across unique documents (an assumed implementation is sketched after the example tuples):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville, FL, 1}
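The article specifies only the output shape of this second pass; one assumed implementation is to key on entity pairs and count the unique documents in which both appear, as in the sketch below:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Input: key = "EntityKey TAB LinkedEntity"; values = IDs of documents where both occur. */
public class LinkStrengthReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        Set<String> uniqueDocs = new HashSet<>();
        for (Text docId : docIds) {
            uniqueDocs.add(docId.toString()); // count each document only once
        }
        // LinkStrength = number of unique documents containing both entities
        context.write(pair, new IntWritable(uniqueDocs.size()));
    }
}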

 

The data may then be consumed by the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services, for further analysis.

Conclusion

openNLP on Hadoop can provide good metadata extraction for key information in unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The data pre-processing and preparation effort in metadata extraction for competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this paper demonstrate a high-level overview of a logical architecture and the key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.
