Automatic entity extraction / term extraction for competitive analysis and social media

Automatic Entity Extraction using openNLP in C#

By Don Krapohl

Introduction

Multiple companies have approached me about entity extraction for competitive intelligence. Simply put, executives want to know what their competition is up to, want to expand their company, or are performing market research for a proposal. The targets are typically newspaper stories, SEC filings, blogs, social media, and other unstructured content. A frequent secondary goal is to create intellectual property in the form of a branded product. Many of these companies are Microsoft .NET-driven organizations, characterized by robust enterprise licensing with Microsoft, a mature product ecosystem, and large sunk costs in existing systems, all of which make a .NET platform a better fit for their resource base and portfolio.

Two open source projects make a quick-hit entity extractor possible in this environment: openNLP (open Natural Language Processing) and IKVM, a free Java virtual machine implemented on .NET that can also compile Java libraries into .NET assemblies. openNLP provides entity extraction through pre-trained models for several common entity types: person, organization, date, time, location, percentage, and money. openNLP also supports training and refining user-created models.

This article won't undertake to answer the questions of requirements gathering, fitness measurement, statistical analysis, model internals, platform architecture, operational support, or release management, but these are factors which should be considered prior to development for a production application.

Preparation

This article assumes the reader has .NET development skills and knows the fundamentals of natural language processing. Download the latest version of openNLP from the Apache Software Foundation website and extract it to a directory of your choice. You will also need to download the models for tokenization and sentence detection, plus the entity model of your choice (person, date, etc.). Likewise, download the latest version of IKVM from SourceForge and extract it to a directory of your choice.

Create the openNLP dll

Open a command prompt, navigate to the ikvmbin-(yourProductVersion)/bin directory, and build the openNLP dll with the following command (change the versions to match yours):
ikvmc -target:library -assembly:openNLP opennlp-maxent-3.0.2-incubating.jar jwnl-1.3.3.jar opennlp-tools-1.5.2-incubating.jar

Create your .net Project

Create a project of your choice at a known location. Add project references to the following assemblies:

IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.Jdbc.dll
IKVM.OpenJDK.Text.dll
IKVM.OpenJDK.Util.dll
IKVM.OpenJDK.XML.API.dll
IKVM.Runtime.dll
openNLP.dll

Create your Class

Copy the code below and paste it into a blank C# class file. Change the model paths to match where you downloaded the models. Compile your application and call EntityExtractor.ExtractEntities with the content to be searched and the entity type to extract.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 
 
namespace NaturalLanguageProcessingCSharp
{
    
    /// <summary>
    /// Extractor for the entity types available in openNLP.
    /// TODO:
    ///     try/catch/exception handling
    ///     filestream closure
    ///     model training if desired
    ///     Regex or dictionary entity extraction
    ///     clean up the setting of the Name Finder model path
    /// </summary>
    /// <remarks>
    /// Call syntax:  var myList = new EntityExtractor().ExtractEntities(myInText, EntityExtractor.EntityType.Person);
    /// </remarks>
    public class EntityExtractor
    {
 
 
        private string sentenceModelPath = "c:\\models\\en-sent.bin";   //path to the model for sentence detection
        private string nameFinderModelPath;                              //NameFinder model path for English names
        private string tokenModelPath = "c:\\models\\en-token.bin";     //model path for English tokens
        public enum EntityType
        {
            Date = 0,
            Location,
            Money,
            Organization,
            Person,
            Time
        }
 
        public List<string> ExtractEntities(string inputData, EntityType targetType)
        {
            /*required steps to detect names are:
             * 0. Download the sentence, token, and name models from http://opennlp.sourceforge.net/models-1.5/
             * 1. Parse the input into sentences
             * 2. Parse the sentences into tokens
             * 3. Find the entity in the tokens
             */
 
            //------------------Preparation -- Set Name Finder model path based upon entity type-----------------
            switch (targetType)
            {
                case EntityType.Date:
                    nameFinderModelPath = "c:\\models\\en-ner-date.bin";
                    break;
                case EntityType.Location:
                    nameFinderModelPath = "c:\\models\\en-ner-location.bin";
                    break;
                case EntityType.Money:
                    nameFinderModelPath = "c:\\models\\en-ner-money.bin";
                    break;
                case EntityType.Organization:
                    nameFinderModelPath = "c:\\models\\en-ner-organization.bin";
                    break;
                case EntityType.Person:
                    nameFinderModelPath = "c:\\models\\en-ner-person.bin";
                    break;
                case EntityType.Time:
                    nameFinderModelPath = "c:\\models\\en-ner-time.bin";
                    break;
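                default:
                    //fail fast: without a model path the name finder cannot be loaded
                    throw new ArgumentOutOfRangeException("targetType");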
            }
 
            //----------------- Preparation -- load models into objects-----------------
            //initialize the sentence detector
            opennlp.tools.sentdetect.SentenceDetectorME sentenceParser = prepareSentenceDetector();
 
            //initialize the name finder with the model for the requested entity type
            opennlp.tools.namefind.NameFinderME nameFinder =  prepareNameFinder();
 
            //initialize the tokenizer--used to break our sentences into words (tokens)
            opennlp.tools.tokenize.TokenizerME tokenizer = prepareTokenizer();
 
            //------------------  Make sentences, then tokens, then get names--------------------------------
 
            String[] sentences = sentenceParser.sentDetect(inputData) ; //detect the sentences and load into sentence array of strings
            List<string> results = new List<string>();
 
            foreach (string sentence in sentences)
            {
                //now tokenize the input.
                //"Don Krapohl enjoys warm sunny weather" would tokenize as
                //"Don", "Krapohl", "enjoys", "warm", "sunny", "weather"
                string[] tokens = tokenizer.tokenize(sentence);
 
                //do the find
                opennlp.tools.util.Span[] foundNames = nameFinder.find(tokens);
 
                results.AddRange( opennlp.tools.util.Span.spansToStrings(foundNames, tokens).AsEnumerable());
            }

            //important:  clear adaptive data in the feature generators once the whole document has been processed,
            //or the detection rate will decrease over time if the name finder is reused.
            nameFinder.clearAdaptiveData();
 
            return results;
        }
 
#region private methods
        private opennlp.tools.tokenize.TokenizerME prepareTokenizer()
        {
            java.io.FileInputStream tokenInputStream = new java.io.FileInputStream(tokenModelPath);     //load the token model into a stream
            opennlp.tools.tokenize.TokenizerModel tokenModel = new opennlp.tools.tokenize.TokenizerModel(tokenInputStream); //load the token model
            tokenInputStream.close();   //the model is fully loaded, so release the file handle
            return new opennlp.tools.tokenize.TokenizerME(tokenModel);  //create the tokenizer
        }
        private opennlp.tools.sentdetect.SentenceDetectorME prepareSentenceDetector()
        {
            java.io.FileInputStream sentModelStream = new java.io.FileInputStream(sentenceModelPath);       //load the sentence model into a stream
            opennlp.tools.sentdetect.SentenceModel sentModel = new opennlp.tools.sentdetect.SentenceModel(sentModelStream);// load the model
            sentModelStream.close();    //the model is fully loaded, so release the file handle
            return new opennlp.tools.sentdetect.SentenceDetectorME(sentModel); //create sentence detector
        }
        private opennlp.tools.namefind.NameFinderME prepareNameFinder()
        {
            java.io.FileInputStream modelInputStream = new java.io.FileInputStream(nameFinderModelPath); //load the name model into a stream
            opennlp.tools.namefind.TokenNameFinderModel model = new opennlp.tools.namefind.TokenNameFinderModel(modelInputStream); //load the model
            modelInputStream.close();   //the model is fully loaded, so release the file handle
            return new opennlp.tools.namefind.NameFinderME(model);                   //create the namefinder
        }
#endregion 
    }
}
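
Calling the extractor

The listing above can be exercised from a small console application. What follows is a minimal sketch, assuming the project references from the earlier section are in place and the model files exist at the paths set in the class. The Program class and the sample text are illustrative only and are not part of the original listing.

```csharp
using System;
using System.Collections.Generic;
using NaturalLanguageProcessingCSharp;

// Hypothetical console host for the EntityExtractor class shown above.
public static class Program
{
    public static void Main()
    {
        // Sample text made up for illustration.
        string text = "Don Krapohl met with analysts in Chicago on Tuesday. " +
                      "The company reported revenue of $4 million.";

        EntityExtractor extractor = new EntityExtractor();

        // EntityType is nested inside EntityExtractor, so it is qualified with the class name.
        List<string> people = extractor.ExtractEntities(text, EntityExtractor.EntityType.Person);
        List<string> places = extractor.ExtractEntities(text, EntityExtractor.EntityType.Location);

        // ToArray() keeps string.Join compatible with older .NET Framework versions.
        Console.WriteLine("People: " + string.Join(", ", people.ToArray()));
        Console.WriteLine("Locations: " + string.Join(", ", places.ToArray()));
    }
}
```

Note that each call to ExtractEntities reloads the sentence, token, and name finder models from disk. When processing many documents, it may be worth loading the models once and reusing them.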

Hadoop on Azure and HDInsight integration

4/2/2013 - HDInsight doesn't seem to support openNLP or any other natural language processing algorithm. It does integrate well with SQL Server Analysis Services and the rest of the Microsoft business intelligence stack, which do provide excellent views within and across data islands. I hope to see NLP on HDInsight in the near future for algorithms stronger than the LSA/LSI (latent semantic analysis/latent semantic indexing--semantic query) in SQL Server 2012.
