Automatic Entity Extraction using openNLP in C#

Entity Extraction and Competitive Intelligence

I have been approached by multiple companies wishing to perform entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, whether they are looking to expand the company or are simply performing market research for a proposal. The targets are typically newspaper stories, SEC filings, blogs, social media, and other unstructured content. Another frequent goal is to create intellectual property by way of a branded product. These are often Microsoft .NET-driven organizations, characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and a large sunk cost in existing systems, all of which make a .NET platform more amenable to their resource base and portfolio.

Making a quick-hit entity extractor possible in this environment are two open-source projects: openNLP (Open Natural Language Processing) and IKVM, a free implementation of the Java Virtual Machine for .NET that can compile Java libraries into .NET assemblies. openNLP provides entity extraction through pre-trained models for several common entity types: person, organization, date, time, location, percentage, and money. openNLP also provides for training and refinement of user-created models.

This article won’t undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors that should be considered before developing a production application.


This article assumes the reader has .NET development skills and knows the fundamentals of natural language processing. Download the latest version of openNLP from the Apache Foundation website and extract it to a directory of your choice. You will also need to download models for tokenization, sentence detection, and the entity type of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt, navigate to the ikvmbin-(yourProductVersion)/bin directory, and build the openNLP dll with the following command (change the jar versions to match yours):

ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add a project reference to the openNLP.dll you built in the previous step. Depending on your IKVM version, you may also need references to the IKVM runtime assemblies (e.g., IKVM.OpenJDK.Core.dll) from the IKVM bin directory.

Create your Class

Copy the code below and paste it into a blank C# class file. Change the model paths to match where you downloaded the models. Compile your application and call EntityExtractor.ExtractEntities with the content to be searched and the entity type to extract.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using java.io;                  //IKVM-mapped java.io, used for FileInputStream
using opennlp.tools.sentdetect; //sentence detection
using opennlp.tools.tokenize;   //tokenization
using opennlp.tools.namefind;   //name finding
using opennlp.tools.util;       //Span

namespace NaturalLanguageProcessingCSharp
{
    /// <summary>
    /// Entity Extraction for the entity types available in openNLP.
    /// TODO:
    ///     try/catch/exception handling
    ///     filestream closure
    ///     model training if desired
    ///     Regex or dictionary entity extraction
    ///     clean up the setting of the Name Finder model path
    ///     Implement entity extraction in other languages
    ///     Implement entity extraction for other entity types
    /// Call syntax: myList = ExtractEntities(myInText, EntityType.Person);
    /// </summary>
    public class EntityExtractor
    {
        private string sentenceModelPath = "c:\\models\\en-sent.bin";  //path to the model for sentence detection
        private string nameFinderModelPath;                            //NameFinder model path, set by entity type
        private string tokenModelPath = "c:\\models\\en-token.bin";    //model path for English tokens

        public enum EntityType
        {
            Date = 0,
            Location,
            Money,
            Organization,
            Person,
            Time
        }

        public List<string> ExtractEntities(string inputData, EntityType targetType)
        {
            /*required steps to detect names are:
             * 1. Parse the input into sentences
             * 2. Parse the sentences into tokens
             * 3. Find the entity in the tokens
            */

            //------------Preparation -- set Name Finder model path based upon entity type------------
            switch (targetType)
            {
                case EntityType.Date:
                    nameFinderModelPath = "c:\\models\\en-ner-date.bin";
                    break;
                case EntityType.Location:
                    nameFinderModelPath = "c:\\models\\en-ner-location.bin";
                    break;
                case EntityType.Money:
                    nameFinderModelPath = "c:\\models\\en-ner-money.bin";
                    break;
                case EntityType.Organization:
                    nameFinderModelPath = "c:\\models\\en-ner-organization.bin";
                    break;
                case EntityType.Person:
                    nameFinderModelPath = "c:\\models\\en-ner-person.bin";
                    break;
                case EntityType.Time:
                    nameFinderModelPath = "c:\\models\\en-ner-time.bin";
                    break;
            }

            //------------Preparation -- load models into objects------------
            SentenceDetectorME sentenceParser = prepareSentenceDetector(); //initialize the sentence detector
            NameFinderME nameFinder = prepareNameFinder();                 //initialize the name finder model
            TokenizerME tokenizer = prepareTokenizer();                    //initialize the tokenizer--used to break our sentences into words (tokens)

            //------------Make sentences, then tokens, then get names------------
            String[] sentences = sentenceParser.sentDetect(inputData); //detect the sentences and load into an array of strings
            List<string> results = new List<string>();

            foreach (string sentence in sentences)
            {
                //now tokenize the input.
                //"Don Krapohl enjoys warm sunny weather" would tokenize as
                //"Don", "Krapohl", "enjoys", "warm", "sunny", "weather"
                string[] tokens = tokenizer.tokenize(sentence);

                //do the find
                Span[] foundNames = nameFinder.find(tokens);

                //convert the spans (positions within the token array) to the matching strings
                results.AddRange(Span.spansToStrings(foundNames, tokens).AsEnumerable());
            }

            //important: clear adaptive data in the feature generators after each document
            //or the detection rate will decrease over time.
            nameFinder.clearAdaptiveData();

            return results;
        }

        #region private methods
        private TokenizerME prepareTokenizer()
        {
            FileInputStream tokenInputStream = new FileInputStream(tokenModelPath); //load the token model into a stream
            TokenizerModel tokenModel = new TokenizerModel(tokenInputStream);       //load the token model
            return new TokenizerME(tokenModel);                                     //create the tokenizer
        }

        private SentenceDetectorME prepareSentenceDetector()
        {
            FileInputStream sentModelStream = new FileInputStream(sentenceModelPath); //load the sentence model into a stream
            SentenceModel sentModel = new SentenceModel(sentModelStream);             //load the model
            return new SentenceDetectorME(sentModel);                                 //create the sentence detector
        }

        private NameFinderME prepareNameFinder()
        {
            FileInputStream modelInputStream = new FileInputStream(nameFinderModelPath); //load the name model into a stream
            TokenNameFinderModel model = new TokenNameFinderModel(modelInputStream);     //load the model
            return new NameFinderME(model);                                              //create the namefinder
        }
        #endregion
    }
}
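Assuming the class above compiles against the openNLP.dll built earlier and the models live under c:\models, calling it from a console application looks like the sketch below. The sample sentence and the Program class are illustrative only; the entities actually returned depend on the pre-trained models.

```csharp
using System;

namespace NaturalLanguageProcessingCSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            //hypothetical input text for illustration
            string content = "Don Krapohl met with Acme Corporation in Jacksonville on Tuesday.";

            EntityExtractor extractor = new EntityExtractor();

            //one pass per entity type; each pass loads its own NameFinder model
            foreach (string person in extractor.ExtractEntities(content, EntityExtractor.EntityType.Person))
                Console.WriteLine("Person: " + person);

            foreach (string place in extractor.ExtractEntities(content, EntityExtractor.EntityType.Location))
                Console.WriteLine("Location: " + place);
        }
    }
}
```

Note that each call to ExtractEntities reloads all three models from disk; for anything beyond a demo you would cache the tokenizer, sentence detector, and name finders rather than rebuild them per call.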

Hadoop on Azure and HDInsight integration
4/2/2013 – HDInsight doesn’t appear to support openNLP or any other natural language processing library. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future, with algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing) semantic queries available in SQL Server 2012.
