Mar 21

Approachable Data Mining Tutorials for the Non-Data Miner

A list of hands-on sources for learning data science.

https://www.coursera.org/course/ml - The most approachable machine learning course available, and it’s free.

https://www.kaggle.com/wiki/Tutorials - Provides data sources, forums, scenarios, and real-world competitions to teach data mining.

http://deeplearning.net/tutorial/ - Deep learning tutorial: an introduction to machine learning algorithms for image analysis.

http://tryr.codeschool.com/ - Interactive introduction to the R language.

Jan 01

A Structured Methodology for Group Decision Making

Refreshed from my June 2013 post.

Abstract:

A simple but powerful method for structured, transparent decision making within a group is demonstrated using a supplied template and the supporting process.  The approach employs a weighted decision matrix with authoritative attributes, leads to an individual decision outcome, and composes a group decision using basic statistical methods.  The goal is a simple-to-convey means by which to discover factors, results, agreement, and influencers in group decision making.

Download the group decision making template (filled with example data) in:

Excel 2010 format (*.xlsx)

OpenDocument format (*.odf)


Decision Making

A common dilemma in organizations is the need for rapid decision making on complex subjects.  The default unstructured, open-dialog approach appears to respect the judgment of the individual while actually putting a quick, fit, and defensible solution beyond reach.  Complicating factors such as decision makers distributed across multiple locations, differences in controlling standards among group members, and competing organizational interests add further noise to an unstructured process.  Focused analysis of a 10-factor decision with 3 options, for example, requires each individual to evaluate 30 factor-option combinations in real time while debating on- and off-topic aspects.

The use of a weighted decision matrix and rudimentary analysis provides a simple tool set for rapid group decision making on complex subjects.  Including strategic-level experts in each affected area, documenting the aspects that impact the decision, and approaching the problem methodically reduce the time to complete the exercise.  The steps in the process also enable instant identification of differences and commonality of opinion for targeted debate.  This method is particularly suited to analyzing multiple options and their measures of value across decision spaces such as (not exhaustive):

  • finance, emphasizing ROI, NPV, FV, and opportunity cost
  • engineering, focused on technical obstacles, multi-year plans, legacy technology issues, cost of skilled resources, and operational complexity
  • sales, stressing time to market, product features, product relevance, and “seize-the-moment” opportunity capture support
  • business capture decision making and negotiations
  • portfolio strategy, concerned with SWOT analysis, market trends, and integration into the enterprise ecosystem

Weighted Decision Matrix Concept

The weighted decision matrix is a tool commonly used to reach a quantifiable (but still subjective) answer to almost any type of question, from “What university should I attend?” to “Should I invest in a business intelligence services department for my multi-national company?”  Only a few elements are required for analysis:

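Factor    | Weight | Option 1 Score on Factor | ... | Option n Score on Factor
----------+--------+--------------------------+-----+-------------------------
Factor 1  |        |                          |     |
Factor 2  |        |                          |     |
...       |        |                          |     |
Factor n  |        |                          |     |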

The required elements to complete the matrix are the list of factors involved in the decision, the weight to apply to each factor, a column for each option under study, and, for each option, a number representing how well it supports each factor.

Definitions

Factor – An attribute that supports a positive outcome in the final product.  Examples might be “Uses only open-source software” for a software platform acquisition, “Close to home” for selection of a college, “Regular work hours” for a new job.

Fitness – The product of weight and score, a simple measure indicating the relative strength of an option when measured against a factor.

Weight – The absolute importance of each factor individually.  It is not a ranking or other direct competition between factors.

Option – A potential outcome in the decision.  Options for selecting a university might be “Middle Tennessee State University, University of Tennessee, and University of Virginia”.  Options for a statistical analysis desktop application might be “WEKA, Excel, RapidMiner, and Statistica”.

Option score – The measure of how well the option satisfies the factor.  A factor “Software is open source” might score 0 for Microsoft Excel but 7 for RapidMiner.  The score is determined by the analysis and expert opinion of each decision maker.
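The definitions above can be sketched in a few lines of Python.  This is a minimal illustration for a single rater, not the template's own formulas: the factor names, option names, and numbers are hypothetical, and summing the fitness values into one total per option is assumed here as the usual weighted-matrix convention.

```python
# Hypothetical weights for one rater: factor -> importance, 0 (low) to 10 (high).
weights = {
    "Close to home": 8,
    "Low tuition": 10,
    "Strong engineering program": 6,
}

# Hypothetical option scores: option -> factor -> how well the option
# satisfies the factor, 0 (low) to 10 (high).
scores = {
    "University A": {"Close to home": 9, "Low tuition": 7, "Strong engineering program": 5},
    "University B": {"Close to home": 3, "Low tuition": 4, "Strong engineering program": 9},
}


def fitness(weight, score):
    """Fitness of one option on one factor: weight times score."""
    return weight * score


def option_total(option_scores, weights):
    """Sum of fitness values across all factors for one option (assumed convention)."""
    return sum(fitness(weights[factor], score) for factor, score in option_scores.items())


totals = {option: option_total(s, weights) for option, s in scores.items()}
best = max(totals, key=totals.get)  # option with the highest total for this rater
print(totals, best)
```

Keeping the weights separate from the scores mirrors the process described below, where the weight column is hidden while the options are scored.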

The Decision Making Process

It is important to follow the process as closely as possible as it reduces the possibility of biasing the final decision and optimizes the ability to make quick decisions.

Phase I – Group Factor Identification

  1. Identify a single facilitator.  This individual is responsible for keeping the conversations on-point, negotiating the phrasing of the factors when disagreements arise, distributing the final factor list, collecting everyone’s results, and publishing the final results.
  2. Call all required individuals together at one time to define the factors in the decision.  Consider no other aspect of the decision at this point and do not pre-suppose any options.
  3. Make a list of the factors involved in the decision.  Phrase the factors in a way that supports a positive result and is definitive.  “Must add value,” for example, is not definitive and would not be a good factor as it is too relative.  Likewise, “Does not integrate with our current technology” is negative and will undermine the measurements.
  4. Define the options or courses of action that could be selected.  The options should be well understood and specific.  If selecting a school, for example, a good option might be “Middle Tennessee State University” instead of “A college in Tennessee,” as it is not realistic to rate a broad and non-specific option.
  5. Write the finalized options and factors on the decision matrix template and create one matrix per rater.  The remaining steps are performed individually.

Phase II – Individual Scoring

  1. Cover or hide the Options columns.  This is done so the process is not skewed by previewing the results.
  2. Rate the factors beginning with the first factor.  Rate the factor’s importance from zero (low) to ten (high).  Zero indicates the factor is absolutely not important in your individual estimation; ten indicates the factor is absolutely vital.  Any whole number between zero and ten is valid.
  3. Hide the weight column and uncover the score column for the first option only.
  4. Score the option against each factor, zero (low) to ten (high).  If the option is a perfect fit for the factor it may score a ten; if the option provides no support at all for the factor it may score zero.  Any whole number between zero and ten is valid.  Repeat for each remaining option, uncovering one score column at a time.
  5. Show all the columns and view the results.  A sample result appears in Figure 1.

Figure 1. Weighted Decision Matrix-Individual

Phase III – Facilitator Compiles Results

Copy each individual matrix and paste it into a new tab in the template.  Adjust the formulas in the Results tab to reflect the locations and numbers of the factors and results in each matrix.  A sample result appears in Figure 2.

Figure 2. Group Decision Results

Interpretation

The decision making result measures shown in the template are not a complete list of those that could be created.  Further analysis could determine whether a single idea, such as geographic location or known-concept bias, is skewing the results (through language processing and/or a dependency parser, if data mining resources are available).  Most of the agreement/disagreement measures rely on standard deviation to show how broadly the data are distributed, and on data skew to show whether individual strong opinions affected the outcome significantly.

The Final Result is a Compromise

The consensus result is the option with the highest final score.  In the example above, option 2 received the highest final score at 632.  This value represents how well the group (through its combined individual assessments) determined the option fit the requirements (the factors).
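As a rough sketch of how individual results might be combined, the Python below adds each rater's option totals into a group score and picks the highest.  The exact formula used in the template's Results tab is not reproduced here; simple summation is an assumption, and the rater names and numbers are hypothetical.

```python
# Hypothetical per-rater option totals (each is a sum of weight * score over factors).
rater_totals = {
    "rater_1": {"Option 1": 180, "Option 2": 215, "Option 3": 150},
    "rater_2": {"Option 1": 200, "Option 2": 190, "Option 3": 170},
    "rater_3": {"Option 1": 160, "Option 2": 205, "Option 3": 140},
}


def group_scores(rater_totals):
    """Add every rater's total for each option (assumed aggregation rule)."""
    combined = {}
    for totals in rater_totals.values():
        for option, total in totals.items():
            combined[option] = combined.get(option, 0) + total
    return combined


final_scores = group_scores(rater_totals)
consensus = max(final_scores, key=final_scores.get)  # highest final score wins
print(final_scores, consensus)
```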

Relative Strength of Disagreement

This metric uses the standard deviation of the population (STDEV.P in Excel) to determine how widely distributed the individual scores are.  With perfect agreement the standard deviation is zero; the more widely the scores vary, the larger this figure, and the bar in the Excel cell, will be.
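The same measure can be sketched outside Excel.  Python's statistics.pstdev computes the population standard deviation, the counterpart of STDEV.P; the factor names and rater scores below are hypothetical.

```python
from statistics import pstdev

# Hypothetical scores given by four raters to one option, per factor.
scores_by_factor = {
    "Factor 1": [7, 7, 7, 7],    # perfect agreement
    "Factor 2": [2, 4, 4, 6],    # mild disagreement
    "Factor 3": [0, 10, 0, 10],  # strong disagreement
}

# Population standard deviation of the raters' scores for each factor.
disagreement = {factor: pstdev(values) for factor, values in scores_by_factor.items()}

for factor, value in disagreement.items():
    print(f"{factor}: {value:.2f}")
```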

Disagreement Heat Map

This section displays the degree of contention within the results and uses the values from the Relative Strength of Disagreement section.  The higher the intensity of the red coloration, the greater the degree of disagreement on that element.  The color intensity and range are configurable in the Excel template through Conditional Formatting on the Home tab of the ribbon.

Points of Contention

Points of contention show only the few items on which the individual scores agree the least.
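One way to pick out those few items is to keep the entries with the largest disagreement values.  The sketch below assumes the disagreement values have already been computed; the cutoff n is a hypothetical parameter, not something the template prescribes.

```python
import heapq

# Hypothetical disagreement values (population standard deviation per factor).
disagreement = {"Factor 1": 0.0, "Factor 2": 1.4, "Factor 3": 5.0, "Factor 4": 2.9}


def points_of_contention(disagreement, n=2):
    """Return the n factors with the largest disagreement, most contested first."""
    return heapq.nlargest(n, disagreement, key=disagreement.get)


contested = points_of_contention(disagreement)
print(contested)
```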

Agreement Heat Map

Agreement in this instance is denoted by using the inverse of the disagreement calculation (STDEV.P), or 1/(1+STDEV.P).  The agreement heat map shows the points on which the individual scores most agree, which can therefore be set aside in negotiation.
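The agreement formula translates directly into code.  In this small sketch with hypothetical scores, adding 1 in the denominator avoids division by zero under perfect agreement and keeps the result between 0 and 1.

```python
from statistics import pstdev


def agreement(values):
    """Agreement as 1 / (1 + population standard deviation)."""
    return 1.0 / (1.0 + pstdev(values))


unanimous = agreement([7, 7, 7, 7])  # standard deviation 0, so agreement is 1
split = agreement([0, 10, 0, 10])    # standard deviation 5, so agreement is 1/6
print(unanimous, split)
```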

Optimistic/Pessimistic Disagreement

The degree of optimism or pessimism in this case is based on the skew of the data (the non-parametric skew, to be precise, which compares the mean with the median).  If the mean (average) is higher than the median, the data is skewed RIGHT: a few individual, strong positive opinions pulled the average up and weighed heavily on the outcome.  Likewise, if the skew is LEFT (the mean is lower than the median), a few strong negative opinions had a disproportionate influence on the outcome.
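Non-parametric skew is defined as (mean − median) / standard deviation.  A positive value means the mean sits above the median (right skew), and a negative value means it sits below (left skew).  The sketch below uses hypothetical scores and assumes the population standard deviation; returning zero when all scores are identical is a choice made here to avoid dividing by zero.

```python
from statistics import mean, median, pstdev


def nonparametric_skew(values):
    """(mean - median) / population standard deviation; 0.0 if all values are equal."""
    spread = pstdev(values)
    if spread == 0:
        return 0.0
    return (mean(values) - median(values)) / spread


right = nonparametric_skew([5, 5, 5, 5, 10])  # one high score pulls the mean above the median
left = nonparametric_skew([1, 8, 8, 8, 8])    # one low score pulls the mean below the median
flat = nonparametric_skew([6, 6, 6, 6])       # no spread at all
print(right, left, flat)
```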

Optimistic/Pessimistic Support of the Final Score

As with the individual item optimism or pessimism, negative opinion dragging on the consensus is indicated by a down arrow, while positive support overinflating the result for an option is indicated by an up arrow.

Conclusion

A group decision may be reached quickly and transparently through a structured process of analysis.  Individual weighted decision matrices, coalesced and analyzed through a simple process, can quantify the group’s assessment of approaches to a problem.  They also provide a means to discover and de-conflict individual interests and to demonstrate when individually held strong opinions are influencing a decision.  Further advancement of this technique through workflow automation to gather inputs, master factor lists for global factor analysis, decision trending, and term extraction for content analysis would add dimensions for the broader enterprise and provide data for a supervised model of the success or failure rates of decision outcomes.

Keywords: decision making, group decision making, group consensus building, structured decision making

Jul 20

The Structure of an OpenNLP NameFinder Model

Named Entity Models

Research labs and product teams building on openNLP and SOLR (which can consume an openNLP NameFinder model) frequently need to write their own model parser or model builder classes.  openNLP has built-in capabilities for this, but writing a custom parser requires knowing the structure of the openNLP NameFinder model.

The NameFinder model is defined by the GISModel class, which extends AbstractModel; the definition and the interfaces it exposes can be found in the openNLP API docs on the Apache site.  The structure, listed below, is composed of a model type indicator, a correction constant, the model outcomes, and the model predicates.  NameFinder models can be downloaded free from the openNLP project and are trained against generic corpora.

openNLP NameFinder Model Structure

  1. The type identifier, GIS (literal)
  2. The model correction constant (int)
  3. Model correction constant parameter (double)
  4. Outcomes
    1. The number of outcomes (int)
    2. The outcome names (string array, length of which is specified in 4.1. above)
  5. Predicates
    1. Outcome patterns
      1. The number of outcome patterns (int)
      2. The outcome pattern values (each stored in a space delimited string)
    2. The predicate labels
      1. The number of predicates (int)
      2. The predicate names (string array, length of which is specified in 5.2.1. above)
    3. Predicate parameters (double values)
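
The field order above can be illustrated with a small reader, sketched here in Python rather than as the openNLP implementation.  It assumes a hypothetical plain-text form with one value per line; the encoding of a real model file is not covered.  It also assumes that the first integer in each outcome pattern is the number of predicates sharing that pattern, that the remaining integers are outcome indices, and that each predicate stores one parameter per outcome index.  The class name, function name, and sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class GISModelData:
    correction_constant: int
    correction_param: float
    outcomes: List[str]
    outcome_patterns: List[List[int]]
    predicates: List[str]
    parameters: List[List[float]]  # one list of doubles per predicate


def read_gis_model(lines: Iterable[str]) -> GISModelData:
    """Read the fields in the listed order from a one-value-per-line text form."""
    it = iter(line.strip() for line in lines)

    if next(it) != "GIS":                               # 1. type identifier
        raise ValueError("not a GIS model")
    correction_constant = int(next(it))                 # 2. correction constant
    correction_param = float(next(it))                  # 3. correction constant parameter

    n_outcomes = int(next(it))                          # 4.1 number of outcomes
    outcomes = [next(it) for _ in range(n_outcomes)]    # 4.2 outcome names

    n_patterns = int(next(it))                          # 5.1.1 number of outcome patterns
    patterns = [[int(tok) for tok in next(it).split()]  # 5.1.2 space-delimited values
                for _ in range(n_patterns)]

    n_predicates = int(next(it))                        # 5.2.1 number of predicates
    predicates = [next(it) for _ in range(n_predicates)]  # 5.2.2 predicate names

    # 5.3 ASSUMPTION: pattern[0] is the count of predicates sharing the pattern and
    # pattern[1:] are outcome indices; each predicate has one double per index.
    parameters = []
    for pattern in patterns:
        for _ in range(pattern[0]):
            parameters.append([float(next(it)) for _ in pattern[1:]])

    return GISModelData(correction_constant, correction_param, outcomes,
                        patterns, predicates, parameters)


# Hypothetical sample: two outcomes, two outcome patterns, two predicates.
sample = """GIS
1
0.0
2
person-start
other
2
1 0 1
1 1
2
w=john
w=the
0.8
-0.3
0.5
""".splitlines()

model = read_gis_model(sample)
print(model.outcomes, model.predicates, model.parameters)
```

Reading the parameters last, grouped by outcome pattern, is why the pattern counts have to be parsed before the predicate names in this layout.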