
Content on Knowledge Discovery in Databases (KDD), analytics, decision support, or data mining ranging from the user-approachable to the technically focused.

Big Data Technology Strategy: Is Hadoop Already Outdated?

 

Is Hadoop Already Outdated?

Logical architecture of the Hadoop stack
The Hadoop Ecosystem

In an article posted to Information Age on 18 February 2013, Teradata CTO Stephen Brobst highlights the schism between traditional Decision Support and the new-age Big Data camp, noting that at a recent Stanford University very-large-database conference “The Hadoop guys were saying, ‘relational databases are dead, SQL programming is for dinosaurs, long live the new kings Hadoop and MapReduce.’” (Swabey, 2013).  That Hadoop is singled out by name reflects the technology’s rapid ascendancy: it progressed from initial release to a core service in multinational platforms in less than six years (Hadoop Releases, 2013) and now represents the lion’s share of the commercial Big Data marketplace.  Fanatical zeal aside, should it be the sole platform for knowledge management and creation?

Much is made of the dimensions by which we assign special treatment to “Big Data”.  These facets are known popularly as “The Three V’s”, which Gartner defines as “high-volume, high-velocity and high-variety information assets”.  Additional V’s are sometimes added to suit the audience, including Veracity (What is big data?, n.d.), Variability, and Value (Fan, 2012).  In the December 2012 issue of ACM SIGKDD Explorations, Wei Fan and Albert Bifet explore the current and future state of Big Data.  They allude to signs that adoption of the technology has outpaced the technical community’s ability to put it in proper perspective, and they list seven points they consider controversial (Fan, 2012):

  •  There is no need to distinguish Big Data analytics from data analytics, as data will continue growing, and it will never be small again […]
  •  Big Data may be a hype to sell Hadoop based computing systems. Hadoop is not always the best tool […]
  • In real time analytics, data may be changing. In that case, what it is important is not the size of the data, it is its recency […]
  • Claims to accuracy are misleading […]
  • Bigger data are not always better data.  It depends if the data is noisy or not, and if it is representative of what we are looking for […]
  • …[Is it] ethical that people can be analyzed without knowing it […]
  • Limited access to Big Data creates new digital divides […]

Further supporting Fan and Bifet’s arguments, Stephen Brobst notes, “A lot of people are talking about the ‘velocity of big data’ but if that just means that data values are updating quickly, it’s nothing new.  What’s new is the velocity of change in the structure of data.” (Swabey, 2013).

Google (noticeably silent in the Big Data marketplace) abandoned the batch processing approach underlying Hadoop in favor of a real-time, service-based processing architecture originally called Dremel and outlined in a paper from Google Research (Melnik, 2010).  Google’s BigQuery cloud service, used extensively at Google internally, takes a different tack that “builds on ideas from web search and parallel DBMSs”, both core competencies for the company.  In a January 2013 consortium organized by IBM and Arizona State University, Dr. K. Selcuk Candan (Candan, 2013) highlights six key outcomes, which may be summarized as a need for better data fusion, data analysis algorithms, data models, scalable architectures, and real-time analysis.  While several vendors are visibly out front with custom Hadoop builds for real-time analysis, two non-Hadoop projects, S4 in the Apache Incubator and the production-ready Storm (http://storm-project.net/), show promise as general-purpose parallel computing engines.

While the Apache Hadoop project has staged an impressive entrance, broken through the Relational and OLAP paradigms, and shown the viability of open source software, I intend to keep an eye on the companies that have avoided the hype, such as Google (Regalado, 2013), and observe as the market polarizes into those who need real-time analysis and those who never did.

 

References:

Candan, K. Selcuk. (2013, June 25). Hunting for the Value Gaps in Data Management, Services, and Analytics.  Retrieved from http://wp.sigmod.org/?p=904 .

Fan, Wei, and Albert Bifet. (2012, December). Mining Big Data: Current Status, and Forecast to the Future.  SIGKDD Explorations, Vol. 14, Issue 2.  Retrieved from http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-01-Fan.pdf

Gilyadov, Camuel. (2013, July 2). OpenDremel: Google BigQuery / Dremel implementation.  Retrieved from http://bigdatacraft.com/opendremel

Hadoop Releases. (2013, June 14). Retrieved from http://hadoop.apache.org/releases.html

Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis (2010). “Dremel: Interactive Analysis of Web-Scale Datasets”. Proc. of the 36th International Conference on Very Large Data Bases (VLDB).

Regalado, Antonio. (2013, June 11).  Just Don’t Call it Big Data: Why Google fears the totalitarian connotations of the buzzword big data.  Retrieved from http://www.technologyreview.com/view/515941/just-dont-call-it-big-data/

Swabey, Pete. (2013, February 18).  Teradata seeks compromise in the big data Holy Wars.  Retrieved from http://www.information-age.com/technology/information-management/123456802/teradata-seeks-compromise-in-the-big-data-holy-wars-

What is big data? (n.d.). Retrieved from http://www-01.ibm.com/software/data/bigdata/

 

US Government Business Capture Data Mining in Microsoft Excel

View agency activity clustering on geography in Excel using Excel Data Mining Add-ins

By Don Krapohl

1.       Ensure you have downloaded the Excel Data Mining Add-ins from Microsoft at http://www.microsoft.com/en-us/download/details.aspx?id=35578 .  The article assumes you have a working version of the DM Addins and a default Analysis Services (SSAS) instance defined.  Search for getting started with SQL Server Data Mining Add-ins for Excel if you are not familiar with this process.

2.       Open the Excel sample file for Federal contract acquisitions in Wyoming (2012) from http://www.augmentedintel.com/content/datasets/government_contracts_data_mining_addins.xlsx

3.       On the wy_data_feed tab, select all the data.

4.       In the Home tab on the ribbon in the Styles section select “Format as Table”.  Pick any format you wish.

5.       A new tab will appear on the ribbon for Table Tools with menus for Analyze and Design as below.

Microsoft Excel Table Tools menu
Table tools to format data as a table in Excel

 

6.       On the Analyze menu, select “Detect Categories”.  This will group (cluster) your information on common attributes, particularly commonalities that are not obvious or immediately observable.

7.       Deselect all checkboxes except the following:

a.       Dollars Obligated

b.      Award Type

c.       Contract Pricing

d.      Funding Agency

e.      Product Or Service Code

f.        Category

8.       Click ‘run’

9.       The output will show you categories of information with strong affinities.  Explore the model by filtering the charts and tables on the categories generated.  Do this by selecting the filter icon (funnel) next to Category on the table or the Category label at the lower left of the graph.

10.   The groups with fewer rows often show the most interesting correlations for a targeted campaign.  For example, filter the table and chart on Category 6.  This group shows an affinity for the attribute values ProductOrServiceCode = “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, fundingAgency = “Veterans Affairs, Department Of”, and a contract award value of $61,148 to $1,173,695 as shown below:

 

Importance of data categories in Excel Data Mining Add-ins
Factor Analysis in Microsoft Excel

For my organization’s business development activities, if I am in the heating and air business I may elect to focus efforts on medium-sized contracts with Veterans Affairs.
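For readers who want to repeat the step 10 inspection outside Excel, here is a minimal Python sketch of profiling one category once the rows have been labeled.  It does not reproduce the add-in's clustering; it only summarizes rows that already carry a Category value.  The rows and the “MAINTENANCE OF BUILDINGS” value are made up for illustration, and the function name profile_category is my own.

```python
from collections import Counter

# Hypothetical rows standing in for the worksheet after "Detect Categories"
# has labeled each row. The values are illustrative, not taken from the dataset.
ROWS = [
    {"Category": "Category 6", "Funding Agency": "Veterans Affairs, Department Of",
     "Product Or Service Code": "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
     "Dollars Obligated": 61148.0},
    {"Category": "Category 6", "Funding Agency": "Veterans Affairs, Department Of",
     "Product Or Service Code": "REFRIGERATION AND AIR CONDITIONING COMPONENTS",
     "Dollars Obligated": 250000.0},
    {"Category": "Category 6", "Funding Agency": "Veterans Affairs, Department Of",
     "Product Or Service Code": "MAINTENANCE OF BUILDINGS",  # made-up value
     "Dollars Obligated": 1173695.0},
    {"Category": "Category 1", "Funding Agency": "Interior, Department Of The",
     "Product Or Service Code": "MAINTENANCE OF BUILDINGS",  # made-up value
     "Dollars Obligated": 5000.0},
]


def profile_category(rows, category, attributes):
    """Summarize one category: the dominant value of each attribute (with the
    share of rows holding it) and the range of dollars obligated."""
    members = [row for row in rows if row["Category"] == category]
    if not members:
        raise ValueError(f"no rows in {category!r}")
    profile = {}
    for attribute in attributes:
        counts = Counter(row[attribute] for row in members)
        value, count = counts.most_common(1)[0]
        profile[attribute] = (value, count / len(members))
    dollars = [row["Dollars Obligated"] for row in members]
    profile["Dollars Obligated"] = (min(dollars), max(dollars))
    return profile


profile = profile_category(
    ROWS, "Category 6", ["Funding Agency", "Product Or Service Code"]
)
for attribute, summary in profile.items():
    print(attribute, summary)
```

A dominant value with a high share, paired with a narrow dollar range, is the kind of affinity the Category 6 example describes.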


Automated Metadata Extraction for Competitive Intelligence

Artificial Intelligence for the Creation of Competitive Intelligence Tools

Introduction

In prioritizing business development activities it is often helpful to determine who is able to influence a decision and how they are related to others in the market space.  To build a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy.  The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores.  This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.  Herein I will demonstrate a target logical architecture and methodology for accomplishing the task.  Influence and Network Analysis by machine learning algorithm (naïve Bayes or perceptron, for example) will be covered in a later supporting article.

Recognizing Significance

Named-Entity Recognition is required for unstructured content extraction in this scenario.  This identification scheme may or may not employ stemming but will always require tokenizing, part-of-speech tagging, and the acquisition of a predefined model of attribute patterns to properly recognize and extract required metadata.  A powerful platform with these built-in capabilities is the Apache openNLP project, which includes typed attribute models for the name finder, an extensible name finder algorithm, an API that exposes a Lucene index consumer, and a scalable, distributed architecture.  The Apache Stanbol project in the incubator (http://stanbol.apache.org/) shows promise at semantic-based extraction and content enhancement but hasn’t been promoted outside the incubator yet.

Apache openNLP attribute recognition models are available in only a few languages with the original and largest being English.  The community publishes models in English for the Name Finder interface for dates, location, money, organization, percentage, person, and time (date).  Each is an appropriate candidate for term extraction for competitive intelligence analysis.

Logical Architecture

Natural Language Processing for Competitive Intelligence
openNLP in four node Hadoop cluster

The controlling requirement for metadata extraction is the ability to process massive datasets.  For this, Hadoop provides a flexible, fault-tolerant framework and processing model that readily supports the natural language processing needs.  The logical architecture for a small (<1TB), 4-node clustered Hadoop solution is as follows:

 

Process Flow

As shown below, processing is standardized on the map/reduce patterns Distributed Task Execution, Union, Selection, and Intersection.  Pre-processing with a Graph Processing pattern in a distinctly separate map phase would likely speed up any Influence Analysis performed afterward.

 

Operations Sequence Diagram of openNLP with Map Reduce on Hadoop for Competitive Intelligence
Multi-node Sequence Diagram for openNLP with Map Reduce on Hadoop

The primary namenode initiates work and passes the data and map/reduce execution program to the task trackers, which in turn distribute it among worker nodes.  The worker nodes execute the map on HDFS-stored data and report health and status to the task tracker, which relays it to the primary namenode.  When a node completes its map work, the primary namenode may assign further map work to that worker node or order the reduce task, in each case by way of the task tracker.  The reduce task selects data from the interim result set in HDFS, aggregates it, and streams it to a result file.  The result file is later used for analysis by the machine learning algorithm of choice.
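The map, group, and reduce stages described above can be imitated in a single process to make the data flow concrete.  This is a minimal Python sketch and not Hadoop code: there is no scheduling, HDFS, or fault tolerance, and the function names are my own.  A simple token count stands in for the real map logic.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (key, value) pair per whitespace-separated token."""
    for token in text.split():
        yield token.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)


def run_job(documents):
    """Run map over every document, then shuffle, then reduce each key."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(run_job({"doc1": "Hadoop map reduce", "doc2": "map map"}))
```

On a real cluster the map calls run on the worker nodes holding the data, and the grouping happens across the network, but the contract between the phases is the same.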

File Structures

The input file is unstructured, machine-readable ASCII text.  Example:

 

From: Amir Soofi

Sent: Thursday, December 06, 2012 2:37 AM

To: Aaron Macarthur; Hugo Cruz

Cc: Donald Krapohl

Subject: RE: Language Comparison

 

Hugo,

 

FYI, Rick Marshall unofficially approved a 3-day trip starting 14 November for one person from the Enterprise team down to Jacksonville, FL to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

Amir Soofi

 

Principal Software Engineer, Enterprise

 

 

The output of the openNLP Name Find algorithm map task on this input:

From: <namefind/person>Amir Soofi</namefind/person>

Sent: <namefind/date>Thursday, December 06, 2012 2:37 AM</namefind/date>

To: <namefind/person>Aaron Macarthur</namefind/person>; <namefind/person>Hugo Cruz</namefind/person>

Cc: <namefind/person>Donald Krapohl</namefind/person>

Subject: RE: Language Comparison

 

Hugo,

 

FYI, <namefind/person>Rick Marshall</namefind/person> unofficially approved a 3-day trip starting <namefind/date>14 November</namefind/date> for one person from the Enterprise team down to <namefind/location>Jacksonville, FL</namefind/location> to assist in the catalog reinstall.

 

I’ll be placing it in the travel portal soon for the official process, so that the option becomes officially available to us.

 

I think together we’ll be able to push through the environment differences better in person than over the phone.

 

Let us know whether your site can even accommodate a visitor, and when you’d like to exercise this option.

 

Respectfully,

 

<namefind/person>Amir Soofi</namefind/person>

 

Principal Software Engineer, Enterprise

 

The output of an example reduce task on this output:

{DocumentUniqueID, EntityKey, EntityType}

{234cba3231, Amir Soofi, Person}

{234cba3231, Thursday, December 06, 2012 2:37 AM, Date}

{234cba3231, Aaron Macarthur, Person}

{234cba3231, Hugo Cruz, Person}

{234cba3231, Donald Krapohl, Person}

{234cba3231, Rick Marshall, Person}

{234cba3231, 14 November, Date}

{234cba3231, Jacksonville/, FL, Location}

{234cba3231, Amir Soofi, Person}
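As a rough illustration of how tagged text could be flattened into {DocumentUniqueID, EntityKey, EntityType} tuples, here is a minimal Python sketch.  It assumes the <namefind/type> markup shown in this article, and the function name extract_records is my own.  It does not handle the comma escaping that a delimited output file would need.

```python
import re

# Matches <namefind/TYPE>text</namefind/TYPE>, tolerating stray whitespace
# before the closing '>' of the end tag.
TAG = re.compile(r"<namefind/(\w+)>(.*?)</namefind/\1\s*>", re.DOTALL)


def extract_records(doc_id, tagged_text):
    """Return one (doc_id, entity, entity_type) tuple per tagged span, in order."""
    return [
        (doc_id, match.group(2).strip(), match.group(1).capitalize())
        for match in TAG.finditer(tagged_text)
    ]


sample = (
    "From: <namefind/person>Amir Soofi</namefind/person>\n"
    "Cc: <namefind/person>Donald Krapohl</namefind/person>\n"
    "trip to <namefind/location>Jacksonville, FL</namefind/location>"
)
for record in extract_records("234cba3231", sample):
    print(record)
```

Duplicates such as the sender's name appearing in both the header and the signature are kept here, as in the listing above; deduplication can be left to a later pass.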

 

A second reduce pass might yield combinations for network analysis (link strength below being calculated on instances of co-existence across unique documents):

{EntityKey, LinkedEntity, LinkStrength}

{Amir Soofi, Donald Krapohl, 6}

{Amir Soofi, Aaron Macarthur, 15}

{Amir Soofi, Jacksonville/, FL, 1}
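The link-strength calculation can be sketched in a few lines of Python.  This is a minimal sketch, assuming the input is the (document ID, entity, type) tuples from the first pass and that strength means the number of distinct documents in which two entities co-occur.  The function name and the sample records are illustrative only.

```python
from collections import Counter
from itertools import combinations


def link_strength(records):
    """Count, for each unordered pair of entities, the number of distinct
    documents in which both appear."""
    entities_by_doc = {}
    for doc_id, entity, _entity_type in records:
        # A set per document, so repeated mentions count once.
        entities_by_doc.setdefault(doc_id, set()).add(entity)
    strength = Counter()
    for entities in entities_by_doc.values():
        # Sorting gives each pair a single canonical ordering.
        for first, second in combinations(sorted(entities), 2):
            strength[(first, second)] += 1
    return strength


records = [
    ("doc1", "Amir Soofi", "Person"),
    ("doc1", "Donald Krapohl", "Person"),
    ("doc1", "Amir Soofi", "Person"),  # repeated mention in the same document
    ("doc2", "Amir Soofi", "Person"),
    ("doc2", "Donald Krapohl", "Person"),
    ("doc2", "Jacksonville, FL", "Location"),
]
for (first, second), count in sorted(link_strength(records).items()):
    print(first, "|", second, "|", count)
```

Counting per document rather than per mention keeps one long email thread from inflating a relationship, which matches the "unique documents" wording above.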

 

The data may then be consumed into the analysis tool of choice, such as RapidMiner, WEKA, PowerPivot, or SQL Server/SQL Server Analysis Services for further analysis.

Conclusion

openNLP on Hadoop can provide good metadata extraction for key information in unstructured data.  The information may be retrieved from competitor websites, SEC filings, Twitter activity, employee social network activity, or many other sources.  The effort required for data pre-processing and preparation in metadata extraction for competitive intelligence applications can be low relative to that of other analytical problems (contract semantic analysis, social analysis trending, etc.).  The steps outlined in this paper give a very high-level overview of a logical architecture and the key execution activities required to gather metadata for Influence Analysis and Network Analysis for competitive advantage.
