The Structure of an OpenNLP NameFinder Model

Named Entity Models

Research labs and product teams building on OpenNLP and Solr (which can consume an OpenNLP NameFinder model) frequently need to write their own model parser or model builder classes. OpenNLP has built-in capabilities for this, but writing a custom parser requires knowing the structure of the OpenNLP NameFinder model.

The NameFinder model is defined by the GISModel class, which extends AbstractModel; the definition and exposed interfaces can be found in the OpenNLP API docs on the Apache site. The structure, outlined below, consists of a model type indicator, a correction constant, model outcomes, and model predicates. Pre-trained NameFinder models, built against generic corpora, can be downloaded free from the OpenNLP project. A minimal reader sketch follows the outline.

OpenNLP NameFinder Model Structure

  1. The type identifier, GIS (literal)
  2. The model correction constant (int)
  3. Model correction constant parameter (double)
  4. Outcomes
    1. The number of outcomes (int)
    2. The outcome names (string array whose length is given in 4.1 above)
  5. Predicates
    1. Outcome patterns
      1. The number of outcome patterns (int)
      2. The outcome pattern values (each stored in a space delimited string)
    2. The predicate labels
      1. The number of predicates (int)
      2. The predicate names (string array whose length is given in 5.2.1 above)
    3. Predicate parameters (double values)
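
To make this layout concrete, here is a minimal sketch of a reader for the structure above. It assumes the binary variant of the format (fields written with Java's DataOutputStream, so readUTF/readInt/readDouble apply) and follows the parameter grouping used by OpenNLP's own GISModelReader; OpenNLP also ships plain-text and gzipped readers, so treat this as an illustration of field order rather than a drop-in replacement.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Minimal sketch of a reader for the binary GIS model layout
    // outlined above. Error handling and the plain-text/gzipped
    // variants that OpenNLP also supports are omitted.
    public class GISModelSketchReader {

        public static void main(String[] args) throws IOException {
            try (DataInputStream in =
                    new DataInputStream(new FileInputStream(args[0]))) {

                // 1. The model type identifier; "GIS" for NameFinder
                String modelType = in.readUTF();

                // 2.-3. Correction constant and its parameter
                int correctionConstant = in.readInt();
                double correctionParam = in.readDouble();

                // 4. Outcomes: a count, then one name per outcome
                String[] outcomes = new String[in.readInt()];
                for (int i = 0; i < outcomes.length; i++) {
                    outcomes[i] = in.readUTF();
                }

                // 5.1 Outcome patterns: a count, then each pattern as a
                // space-delimited string of ints. By convention, element 0
                // is the number of predicates sharing the pattern and the
                // remaining elements are outcome indices.
                int[][] patterns = new int[in.readInt()][];
                for (int i = 0; i < patterns.length; i++) {
                    String[] parts = in.readUTF().split(" ");
                    patterns[i] = new int[parts.length];
                    for (int j = 0; j < parts.length; j++) {
                        patterns[i][j] = Integer.parseInt(parts[j]);
                    }
                }

                // 5.2 Predicate labels: a count, then one name each
                String[] predicates = new String[in.readInt()];
                for (int i = 0; i < predicates.length; i++) {
                    predicates[i] = in.readUTF();
                }

                // 5.3 Predicate parameters: for each pattern, each of its
                // predicates carries one double per outcome in the pattern
                // (a real reader would store these; we just consume them)
                for (int[] pattern : patterns) {
                    for (int p = 0; p < pattern[0]; p++) {
                        for (int o = 1; o < pattern.length; o++) {
                            in.readDouble();
                        }
                    }
                }

                System.out.printf(
                        "%s model: %d outcomes, %d predicates, correction constant %d%n",
                        modelType, outcomes.length, predicates.length,
                        correctionConstant);
            }
        }
    }

The outcome patterns exist to keep the parameter table compact: most predicates contribute to only a few outcomes, so parameters are stored only for the (predicate, outcome) pairs a pattern names.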

Is Big Data Just Data?

Richard Marshall at Decision and Data Sciences recently posted a blog entry based in part on conversations during the NIST Big Data Taxonomy and Definitions group meeting. He brings up a few good points, especially that volume is not a new problem in data and that velocity is a niche one. As initially stated by the chair, Nancy Grady of SAIC, variety of data may be the true litmus test and the factor that changes the management and analysis lifecycles. Read his full blog at http://ddsciences.com/why-big-data-is-different/

NIST Big Data Definitions and Taxonomy Group Generates Probing Questions

Many interesting questions came up in the NIST Big Data Definitions and Taxonomy group meeting today. Brilliant minds are hard at work to stabilize the language around Big Data, but some fundamental questions have been posed that the marketplace seems to believe we have already solved.

  • How do we differentiate Big Data from traditionally large data such as sensor feeds, credit card processing, and financial transactions? What makes it different? One noted professional taxonomist asserted that a basic differentiator may lie in the variability and variety of the data.
  • Has the data lifecycle changed with Big Data? The subgroup lead, Nancy Grady, made a compelling argument that the position of storage in the workstream may be of interest. She pointed out that traditional decision support transforms and stores data prior to analysis, whereas the Big Data paradigm frequently stores data raw and applies structure later (schema on read); a toy sketch after this list illustrates the contrast.
  • Should there be an obsolescence characteristic attached to data definitions? Ubiquitous sensors (the Internet of Things) may produce disposable data with immediate obsolescence, whereas climate-monitoring sensors only provide value at a future date.
  • Data cleanliness may be less important than it is in traditional BI.
  • Are there enablers of Big Data, such as (perhaps) cloud computing, that should be assumed in planning?
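
The schema-on-read contrast above is easy to see in code. The toy sketch below stores events raw and applies structure (a simple regex standing in for a schema) only at analysis time; the log format, field names, and pattern here are all hypothetical.

    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Toy illustration of schema-on-read: raw events are stored
    // untouched and a structure is applied only when the data is
    // analyzed. Everything about the "schema" is hypothetical.
    public class SchemaOnReadDemo {

        // The structure we choose to impose at read time:
        // sensorId timestamp value
        private static final Pattern SCHEMA =
                Pattern.compile("(\\S+) (\\S+) (\\d+)");

        public static void main(String[] args) {
            // Ingest: store raw, with no transform. Traditional decision
            // support would parse and validate before storage instead.
            List<String> rawStore = List.of(
                    "sensor-7 2013-06-19T10:00:00Z 42",
                    "sensor-9 2013-06-19T10:00:05Z 17",
                    "not-a-reading at all");

            // Read: apply the schema at analysis time; records that do
            // not fit simply fall out of this particular analysis.
            for (String raw : rawStore) {
                Matcher m = SCHEMA.matcher(raw);
                if (m.matches()) {
                    Map<String, String> record = Map.of(
                            "sensorId", m.group(1),
                            "timestamp", m.group(2),
                            "value", m.group(3));
                    System.out.println(record);
                }
            }
        }
    }

The design point is that a malformed or unanticipated record costs nothing at ingest; it simply fails to match whichever schema a later analysis chooses to apply.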

At this point it is obvious there is no consensus, but what do we as a community of practitioners think about these questions?

The working group meetings are highly compelling, and I encourage anyone who wishes to become involved to visit the group site at http://bigdatawg.nist.gov/