Category Archives: Technologies

Content on Knowledge Discovery in Databases (KDD), analytics, decision support, or data mining ranging from the user-approachable to the technically focused.

Predicting Federal Contracts using Machine Learning Classification in WEKA

Can I predict which contracts will likely be awarded in my area?

By Don Krapohl

  1. Open WEKA explorer
  2. On the Preprocess tab, open the government_contracts.arff file.
  3. Perform pre-processing
    1. Escape non-enclosure single and double quotes (', ") if using a delimited text version.
    2. Check ‘UniqueTransactionID’ and click ‘Remove’.  A random transaction ID has no predictive value, and discretizing or locally smoothing it can lead to overfitting.
    3. If you have saved the arff back into a csv, you will have to convert the ZIP code fields RecipientZipCode and PlaceOfPerformanceZipCode back to nominal with the unsupervised attribute filter StringToNominal, and DollarsObligated back to numeric.
  4. On the ‘Select Attributes’ tab, explore attribute merit with the ClassifierSubsetEval evaluator, using Naïve Bayes as the classifier and RandomSearch as the search method, with the Product or Service Code (PSC) as the class to predict.  This yields:

Selected attributes: 2,3,4,6 : 4






This indicates that, if you know these four contract attributes, the best Naïve Bayes prediction of a Product or Service Code has a subset merit of 0.407, or roughly 40% predictive ability.
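The quote-escaping step in the pre-processing list can be sketched in C#. This is a minimal sketch that assumes a backslash is the escape character expected by the loader and that the value has already been split out of its enclosing quotes; the class and method names are hypothetical and not part of WEKA.

```csharp
using System;

// Hypothetical helper, not part of WEKA: escapes quotes found inside a field value.
public static class DelimitedTextCleaner
{
    // Assumption: a backslash is the escape character. The field passed in
    // must NOT include the enclosing quotes that delimit it in the file.
    public static string EscapeQuotes(string fieldValue)
    {
        if (fieldValue == null)
        {
            return null;
        }
        return fieldValue
            .Replace("'", "\\'")    // single quote  -> \'
            .Replace("\"", "\\\""); // double quote  -> \"
    }
}
```

Calling DelimitedTextCleaner.EscapeQuotes on each field before writing the row back out leaves the enclosing quotes of the file untouched.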

  5. Using those attributes to predict PSC, select the Classify tab, choose the bayes classifier -> Naïve Bayes with 10-fold cross validation, set PSC as the class, and click ‘Start’.  The output will report F-measure and other accuracy measures by class.  An example of a single class result is:

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0         0.014     0           0        0           0.972      REFRIGERATION AND AIR CONDITIONING COMPONENTS

  6. View the threshold curve for the prediction by right-clicking the result buffer entry at the left and hovering over Threshold Curve.  Select “REFRIGERATION AND AIR CONDITIONING COMPONENTS”, for example.  The curve is as follows:


[Figure: Classifier accuracy]


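This shows an area under the ROC curve of 0.972 for this class, meaning the classifier ranks these contracts well across thresholds.  It is not the same as 97% accuracy: the TP rate and F-measure of 0 in the table show that no instance is assigned to this class at the default threshold.  The lift chart shows the classifier coverage: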


[Figure: Lift chart showing classifier coverage]
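To make the per-class figures more concrete, here is a minimal C# sketch of how precision, recall, and F-measure follow from confusion counts. The class name, method names, and any counts you pass in are illustrative assumptions, not WEKA code or WEKA output. It shows why a class with zero true positives has an F-measure of 0 even when its ROC area is high.

```csharp
using System;

// Hypothetical helper, not part of WEKA: per-class metrics from confusion counts.
public static class ClassMetrics
{
    // Precision = TP / (TP + FP); reported as 0 when nothing was predicted positive.
    public static double Precision(int tp, int fp)
    {
        return (tp + fp) == 0 ? 0.0 : (double)tp / (tp + fp);
    }

    // Recall (TP rate) = TP / (TP + FN); reported as 0 when the class has no instances.
    public static double Recall(int tp, int fn)
    {
        return (tp + fn) == 0 ? 0.0 : (double)tp / (tp + fn);
    }

    // F-measure = harmonic mean of precision and recall.
    public static double FMeasure(int tp, int fp, int fn)
    {
        double p = Precision(tp, fp);
        double r = Recall(tp, fn);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

For example, ClassMetrics.FMeasure(0, 14, 5) is 0 because no true positives were found, whatever the ranking quality measured by the ROC area.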

To see an analogous cluster visualization using Excel and the SQL Server 2008 R2 add-ins, see my quick article on Activity Clustering on Geography.


Automatic Entity Extraction using openNLP in C#

Entity Extraction and Competitive Intelligence

I have been approached by multiple companies wishing to perform entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, they want to expand their company, or they are just performing market research for a proposal. The targets are typically newspaper stories, SEC filings, blogs, social media, and other unstructured content. Another frequent goal is to create intellectual property by way of a branded product. Frequently these are Microsoft .net-driven organizations, characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and a large sunk cost in existing systems, all of which make a .net platform more amenable to their resource base and portfolio.

Making possible a quick-hit entity extractor in this environment are the opensource projects openNLP (open Natural Language Processing) and IKVM, a free Java virtual machine implemented on .net that lets Java libraries be compiled into and run as .net assemblies. openNLP provides entity extraction through pre-trained models for several common entity types: person, organization, date, time, location, percentage, and money. openNLP also provides for training and refinement of user-created models.

This article won’t undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors which should be considered prior to development for a production application.


This article assumes the user has .net development skill and knowledge of the fundamentals of natural language processing. Download the latest version of openNLP from The Apache Foundation website and extract it to a directory of your choice. You will also need to download models for tokenization, sentence detection, and the entity model of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt and navigate to the ikvmbin-(yourProductVersion)/bin directory and build the openNLP dll with the command (change the versions to match yours):
ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add a project reference to:

Create your Class

Copy the code below and paste it into a blank C# class file. Change the paths to the models to match where you downloaded them. Compile your application and call EntityExtractor.ExtractEntities with the content to be searched and the entity type to extract.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace NaturalLanguageProcessingCSharp
{
    public class EntityExtractor
    {
        /// <summary>
        /// Entity Extraction for the entity types available in openNLP.
        /// TODO:
        /// try/catch/exception handling
        /// filestream closure
        /// model training if desired
        /// Regex or dictionary entity extraction
        /// clean up the setting of the Name Finder model path
        /// Implement entity extraction in other languages
        /// Implement entity extraction for other entity types
        ///
        /// Call syntax: myList = ExtractEntities(myInText, EntityType.Person);
        /// </summary>

        private string sentenceModelPath = "c:\\models\\en-sent.bin"; //path to the model for sentence detection
        private string nameFinderModelPath; //NameFinder model path, set from the entity type
        private string tokenModelPath = "c:\\models\\en-token.bin"; //model path for English tokens

        public enum EntityType
        {
            Date = 0,
            Location,
            Money,
            Organization,
            Person,
            Time
        }

        public List<string> ExtractEntities(string inputData, EntityType targetType)
        {
            /*required steps to detect names are:
            * downloaded sentence, token, and name models from
            * 1. Parse the input into sentences
            * 2. Parse the sentences into tokens
            * 3. Find the entity in the tokens
            */

            //------ Preparation: set Name Finder model path based upon entity type ------
            switch (targetType)
            {
                case EntityType.Date:
                    nameFinderModelPath = "c:\\models\\en-ner-date.bin";
                    break;
                case EntityType.Location:
                    nameFinderModelPath = "c:\\models\\en-ner-location.bin";
                    break;
                case EntityType.Money:
                    nameFinderModelPath = "c:\\models\\en-ner-money.bin";
                    break;
                case EntityType.Organization:
                    nameFinderModelPath = "c:\\models\\en-ner-organization.bin";
                    break;
                case EntityType.Person:
                    nameFinderModelPath = "c:\\models\\en-ner-person.bin";
                    break;
                case EntityType.Time:
                    nameFinderModelPath = "c:\\models\\en-ner-time.bin";
                    break;
                default:
                    throw new ArgumentOutOfRangeException("targetType");
            }

            //------ Preparation: load models into objects ------
            //initialize the sentence detector
            opennlp.tools.sentdetect.SentenceDetectorME sentenceParser = prepareSentenceDetector();

            //initialize the name finder for the selected entity type
            opennlp.tools.namefind.NameFinderME nameFinder = prepareNameFinder();

            //initialize the tokenizer, used to break our sentences into words (tokens)
            opennlp.tools.tokenize.TokenizerME tokenizer = prepareTokenizer();

            //------ Make sentences, then tokens, then get names ------
            string[] sentences = sentenceParser.sentDetect(inputData); //detect the sentences and load into an array of strings
            List<string> results = new List<string>();

            foreach (string sentence in sentences)
            {
                //now tokenize the input.
                //"Don Krapohl enjoys warm sunny weather" would tokenize as
                //"Don", "Krapohl", "enjoys", "warm", "sunny", "weather"
                string[] tokens = tokenizer.tokenize(sentence);

                //do the find
                opennlp.tools.util.Span[] foundNames = nameFinder.find(tokens);

                //important: clear adaptive data in the feature generators or the detection rate will decrease over time.
                nameFinder.clearAdaptiveData();

                results.AddRange(opennlp.tools.util.Span.spansToStrings(foundNames, tokens).AsEnumerable());
            }

            return results;
        }

        #region private methods
        private opennlp.tools.tokenize.TokenizerME prepareTokenizer()
        {
            java.io.FileInputStream tokenInputStream = new java.io.FileInputStream(tokenModelPath); //load the token model into a stream
            opennlp.tools.tokenize.TokenizerModel tokenModel = new opennlp.tools.tokenize.TokenizerModel(tokenInputStream); //load the token model
            return new opennlp.tools.tokenize.TokenizerME(tokenModel); //create the tokenizer
        }

        private opennlp.tools.sentdetect.SentenceDetectorME prepareSentenceDetector()
        {
            java.io.FileInputStream sentModelStream = new java.io.FileInputStream(sentenceModelPath); //load the sentence model into a stream
            opennlp.tools.sentdetect.SentenceModel sentModel = new opennlp.tools.sentdetect.SentenceModel(sentModelStream); //load the model
            return new opennlp.tools.sentdetect.SentenceDetectorME(sentModel); //create sentence detector
        }

        private opennlp.tools.namefind.NameFinderME prepareNameFinder()
        {
            java.io.FileInputStream modelInputStream = new java.io.FileInputStream(nameFinderModelPath); //load the name model into a stream
            opennlp.tools.namefind.TokenNameFinderModel model = new opennlp.tools.namefind.TokenNameFinderModel(modelInputStream); //load the model
            return new opennlp.tools.namefind.NameFinderME(model); //create the namefinder
        }
        #endregion
    }
}
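
The name finder reports each entity as a span of token positions rather than as text, so the last step of the loop is to join the covered tokens back into a string. The sketch below illustrates that idea only. TokenSpan and SpanText are hypothetical stand-ins written for this example, not openNLP types, and the sketch assumes a start index and an exclusive end index over the token array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a name finder span: token positions [Start, End).
public class TokenSpan
{
    public int Start { get; private set; }
    public int End { get; private set; } // exclusive

    public TokenSpan(int start, int end)
    {
        Start = start;
        End = end;
    }
}

public static class SpanText
{
    // Joins the tokens covered by each span with single spaces.
    public static List<string> SpansToStrings(IEnumerable<TokenSpan> spans, string[] tokens)
    {
        var results = new List<string>();
        foreach (TokenSpan span in spans)
        {
            // string.Join(separator, values, startIndex, count)
            results.Add(string.Join(" ", tokens, span.Start, span.End - span.Start));
        }
        return results;
    }
}
```

With the tokens "Don", "Krapohl", "enjoys", "warm", "sunny", "weather" and a single span covering positions 0 to 2, SpansToStrings returns one entry, "Don Krapohl".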

Hadoop on Azure and HDInsight integration
4/2/2013 – HDInsight doesn’t seem to support openNLP or any other natural language processing algorithm. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which do provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future for algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing–semantic query) in SQL Server 2012.
