US Government Spying on American Citizens

From Whence are the Dangers of Intelligence Collection on American Citizens?

Corporate spying, not US Government spying and what you should know
Consumer Intelligence

The ubiquity of cell phones has made capturing actual behavior (versus behavior stated in surveys) a multi-billion dollar enterprise.  Although this is a boon to legitimate enterprises and researchers, there are nagging questions regarding the ethics of collecting personal information solely on the basis of an (unapproachably worded) legal disclaimer.  The powerful sensor package carried in our pockets may rival that of a military drone aircraft or a manufacturing robot.  Further, collection and resale of sensor data from mobile devices will continue to expand as more sensors are added.  A highly insightful article posted recently in the blog section of the Wall Street Journal offers a fantastic analysis, hinting at one small aspect of surreptitious consumer information gathering.  Many emotional arguments have been expressed by those concerned with the US Government’s PRISM project, the law and ethics of which are well-controlled and well-understood, the individual risk of disclosure negligible, and the threat imposed by disclosure minimal.  The sheer volume of information, the cost of analysis, the lack of actionable intelligence, and the high degree of noise are a huge barrier to actual violation of individual liberties and in practice likely preclude US Government spying on citizens.  In contrast, the WSJ article highlights a few of the major commercial collectors of consumer intelligence, the contents of which are very typically acted upon.  Perhaps articles like these may do something to educate people about one reason some mobile services are free.  Then consumers may choose by whom and to what degree they wish to be surveilled.

Automated Metadata Extraction for Competitive Intelligence

Artificial Intelligence for the Creation of Competitive Intelligence Tools


Often in prioritizing business development activities it is helpful to determine who is able to influence a decision and how they are related to others in the market space.  To build a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for analysis must be obtained by identifying and extracting target attribute values from unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and Network Analysis by machine learning algorithm (naïve Bayes or perceptron, for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  This identification scheme may or may not employ stemming but will always require tokenizing, part-of-speech tagging, and the acquisition of a predefined model of attribute patterns to properly recognize and extract the required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project in the incubator shows promise at semantic-based extraction and content enhancement but hasn’t been promoted outside the incubator yet.

Apache openNLP attribute recognition models are available in only a few languages, the original and largest set being English.  The community publishes English models for the Name Finder interface for date, location, money, organization, percentage, person, and time.  Each is an appropriate candidate for term extraction for competitive intelligence analysis.

Logical Architecture

Figure: Natural Language Processing for Competitive Intelligence (openNLP in a four-node Hadoop cluster)

The controlling requirement for metadata extraction is the ability to process massive data sources.  For this, Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports the natural language processing needs.  The logical architecture for a small (<1TB) four-node clustered Hadoop solution is as follows:


Process Flow

As shown below, the process is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing with a Graph Processing pattern in a distinctly separate map phase would likely speed up any Influence Analysis performed afterward.
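To make the Selection, Union, and Intersection patterns concrete, here is a minimal in-memory sketch in Python.  It is not Hadoop code: run_map_reduce is a hypothetical stand-in for the framework's map, shuffle, and reduce phases, and the source names "emails" and "filings" are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple

# One input record: (source name, entity text, entity type).
Record = Tuple[str, str, str]
Mapper = Callable[[Record], Iterator[Tuple[str, str]]]
Reducer = Callable[[str, List[str]], Iterator[str]]


def run_map_reduce(records: Iterable[Record], mapper: Mapper, reducer: Reducer) -> List[str]:
    """Hypothetical single-process stand-in for map, shuffle, and reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):      # map phase
            grouped[key].append(value)         # shuffle: group values by key
    results: List[str] = []
    for key in sorted(grouped):                # reduce phase, in key order
        results.extend(reducer(key, grouped[key]))
    return results


def select_people(record: Record) -> Iterator[Tuple[str, str]]:
    """Selection: emit only records whose entity type is Person."""
    source, entity, entity_type = record
    if entity_type == "Person":
        yield entity, source


def union_reducer(entity: str, sources: List[str]) -> Iterator[str]:
    """Union: emit each entity once, whichever sources mention it."""
    yield entity


def intersection_reducer(required_sources: Iterable[str]) -> Reducer:
    """Intersection: emit an entity only if every required source mentions it."""
    required = frozenset(required_sources)

    def reducer(entity: str, sources: List[str]) -> Iterator[str]:
        if required <= set(sources):
            yield entity

    return reducer


records = [
    ("emails", "Amir Soofi", "Person"),
    ("emails", "Jacksonville, FL", "Location"),
    ("filings", "Amir Soofi", "Person"),
    ("filings", "Rick Marshall", "Person"),
]
people_in_any_source = run_map_reduce(records, select_people, union_reducer)
people_in_both = run_map_reduce(
    records, select_people, intersection_reducer({"emails", "filings"})
)
```

In this sketch the Selection happens in the mapper, so filtered records never reach the shuffle, while Union and Intersection differ only in the reducer applied to the grouped keys.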


Figure: Multi-node sequence diagram for openNLP with Map Reduce on Hadoop for Competitive Intelligence

The primary namenode initiates work and passes the data and the map/reduce execution program to the task trackers, which in turn distribute it among the worker nodes.  (In Hadoop 1.x this scheduling role belongs to the JobTracker daemon, which on a small cluster commonly runs on the same master host as the namenode.)  The worker nodes execute the map on HDFS-stored data and provide health and status to the task tracker, which reports it to the primary namenode.  On completion of a node's map, the primary namenode may redistribute map work to the worker node or order the reduce task, each by way of the task tracker.  The reduce task selects data from the interim result set in HDFS, aggregates it, and streams it to a result file.  The result file is then used later for analysis by the machine learning algorithm of choice.

File Structures

The input file is of a machine-readable ASCII text type and is unstructured.  Example:


From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison




FYI, Rick Marshall unofficially approved a 3-day trip starting 14 November for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.


I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.


I think together we’ll be able to push through the environment differences better in person than over the phone.


Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.




Amir Soofi


Principal Software Engineer, Enterprise



The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date>

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison




FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.


I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.


I think together we’ll be able to push through the environment differences better in person than over the phone.


Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.




<namefind/person>Amir Soofi</namefind/person>


Principal Software Engineer, Enterprise


The output of an example reduce task on this output:

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday/, December 06/, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville/, FL, Location}

{234cba3231, Amir Soofi, Person}
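
The step from tagged text to {DocumentUniqueID, EntityKey, EntityType} records can be sketched in Python as follows.  This is only an illustration under stated assumptions: the regular expression targets the tag notation used in this article rather than any particular openNLP output format, the names extract_entities and escape_commas are hypothetical, and escaping embedded commas as "/," follows the Jacksonville record above.

```python
import re
from typing import Iterator, NamedTuple

# Matches the tag notation used in this article, for example
# <namefind/person>Amir Soofi</namefind/person>.  Whitespace before the
# closing ">" is tolerated.
TAG_PATTERN = re.compile(r"<namefind/(\w+)>(.*?)</namefind/\1\s*>", re.DOTALL)


class EntityRecord(NamedTuple):
    document_id: str
    entity_key: str
    entity_type: str


def escape_commas(value: str) -> str:
    """Escape embedded commas as '/,' so records stay comma-delimited (assumed convention)."""
    return value.replace(",", "/,")


def extract_entities(document_id: str, tagged_text: str) -> Iterator[EntityRecord]:
    """Yield one record per tagged entity, in document order (duplicates kept)."""
    for match in TAG_PATTERN.finditer(tagged_text):
        entity_type = match.group(1).capitalize()
        entity_key = escape_commas(match.group(2).strip())
        yield EntityRecord(document_id, entity_key, entity_type)


sample = (
    "From: <namefind/person>Amir Soofi</namefind/person>\n"
    "Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date >\n"
    "down to <namefind/location>Jacksonville, FL</namefind/location> to assist\n"
)
sample_records = list(extract_entities("234cba3231", sample))
```

Duplicates are deliberately kept here, as in the listing above where Amir Soofi appears twice; de-duplication, if wanted, would belong to a later aggregation step.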


A second reduce pass might yield combinations for network analysis (link strength below is calculated from instances of co-occurrence across unique documents):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville/, FL, 1}
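
One possible reading of this link strength, sketched in Python: count, for each unordered pair of entities, the number of distinct documents in which both appear.  The function name link_strengths and the document IDs d1 and d2 are hypothetical, and counting a pair once per document (regardless of how often each entity is mentioned) is an assumption about the intended metric.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# (DocumentUniqueID, EntityKey, EntityType), as produced by the first reduce.
Record = Tuple[str, str, str]


def link_strengths(records: Iterable[Record]) -> Counter:
    """Count the distinct documents in which each pair of entities co-occurs."""
    # Group the distinct entities seen in each document.
    entities_by_document: Dict[str, Set[str]] = defaultdict(set)
    for document_id, entity_key, _entity_type in records:
        entities_by_document[document_id].add(entity_key)

    # Count every unordered pair once per document; sorting makes the
    # pair key independent of the order in which entities were seen.
    strengths: Counter = Counter()
    for entities in entities_by_document.values():
        for pair in combinations(sorted(entities), 2):
            strengths[pair] += 1
    return strengths


example_records = [
    ("d1", "Amir Soofi", "Person"),
    ("d1", "Donald Krapohl", "Person"),
    ("d1", "Amir Soofi", "Person"),  # repeated mention, counted once for d1
    ("d2", "Amir Soofi", "Person"),
    ("d2", "Donald Krapohl", "Person"),
    ("d2", "Jacksonville/, FL", "Location"),
]
example_strengths = link_strengths(example_records)
```

On a real cluster the grouping by document would be handled by the shuffle, with the document ID as the key, and the pair counting by a reducer; the single-process version above only shows the arithmetic.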


The data may then be consumed into the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services for further analysis.
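
Most of these tools can import delimited text, so one low-effort hand-off is a CSV file.  The sketch below uses Python's standard csv module, which quotes fields containing commas instead of relying on a custom escape; the function name links_to_csv is hypothetical and the column names simply reuse the header shown earlier.

```python
import csv
import io
from typing import Iterable, Tuple

# (EntityKey, LinkedEntity, LinkStrength)
Link = Tuple[str, str, int]


def links_to_csv(links: Iterable[Link]) -> str:
    """Serialize link records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["EntityKey", "LinkedEntity", "LinkStrength"])
    writer.writerows(links)
    return buffer.getvalue()


csv_text = links_to_csv(
    [
        ("Amir Soofi", "Donald Krapohl", 6),
        ("Amir Soofi", "Aaron Macarthur", 15),
        ("Amir Soofi", "Jacksonville, FL", 1),
    ]
)
```

Because the writer quotes "Jacksonville, FL" as a single field, the file round-trips through any standards-aware CSV reader without a bespoke comma escape.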


openNLP on Hadoop can provide good metadata extraction for key information in unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The effort required for data pre-processing and preparation in metadata extraction for competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this article give a very high-level overview of the logical architecture and key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.
