
A pragmatic guide to Key Drivers Analysis – How to have your cake and eat it

Author: Gary Bennett

One of the slightly confusing aspects of Key Drivers Analysis for researchers is the variety of alternative methodologies. The pros and cons of the approaches are often poorly communicated. The choice of method often comes down to the preferences of a particular Analyst and/or the Researcher’s familiarity with outputs.

The various approaches can broadly be classified in terms of the following characteristics:

1. A Regression Model or an Importance Index

Regression Models

Regression Models are models where many predictor variables are used to make predictions of one dependent variable (DV) of interest, such as likelihood to recommend. The family of model used depends on the type of scale being predicted but essentially we can use the model to make a forecast of the expected DV value using the predictors. This can be very useful for making predictions of the DV under various “what-if” scenarios.

The raw outputs of these models are not very useful for inferring the “relative importance” of each predictor to the model. The raw effect for each predictor is as much a function of its correlation with the other predictors as it is of its relationship with the DV. A simple importance index, known as the Pratt Importance Measure, can easily be obtained for each predictor by multiplying the predictor’s (standardised) effect size under the model by its correlation with the DV. Rebasing these products so that they sum to 100% shows the relative contribution of each predictor.
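As a rough illustration of the Pratt calculation described above, the sketch below fits an ordinary linear regression with numpy on synthetic data. The function name pratt_importance and the example data are assumptions made for illustration; they are not taken from any of the packages mentioned in this article.

```python
import numpy as np

def pratt_importance(X, y):
    """Return Pratt importance shares (summing to 1) for a linear model."""
    # Standardise predictors and DV so the fitted coefficients are standardised betas.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # Standardised regression coefficients via least squares.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    # Zero-order correlation of each predictor with the DV.
    r = Xz.T @ yz / len(yz)
    raw = beta * r            # these products sum to the model R-squared
    return raw / raw.sum()    # rebase so the shares sum to 100%

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)   # correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 0.5 * x1 + 0.3 * x2 + 0.1 * x3 + rng.normal(size=n)

shares = pratt_importance(X, y)
print(np.round(100 * shares, 1))  # percentage contribution per predictor
```

Because each raw value is a product of a coefficient and a correlation, a predictor whose coefficient and correlation have opposite signs would receive a negative share under this measure.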

One of the advantages of Regression Models is that many algorithms allow you to select an optimal set of predictors from a larger candidate set, often using stepwise selection. This can be useful to focus the outputs on the predictors making the strongest unique contributions, but it can sometimes arbitrarily exclude good predictors which happen to correlate with other variables in the model, appearing to assign them an importance of zero.
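The exclusion effect can be seen in a simplified forward-selection sketch. This is an assumption-laden illustration: it adds predictors by R-squared gain with a hypothetical min_gain threshold, whereas the stepwise routines in statistical packages typically use significance tests. The data are synthetic, with x1 and x2 built as near-duplicates that both genuinely drive the DV.

```python
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on the given columns (plus intercept)."""
    cols = list(cols)
    if len(cols) == 0:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def forward_select(X, y, min_gain=0.02):
    """Add the predictor with the largest R-squared gain until gains fall below min_gain."""
    remaining = list(range(X.shape[1]))
    chosen, best_r2 = [], 0.0
    while remaining:
        gain, j = max((r_squared(X, y, chosen + [j]) - best_r2, j) for j in remaining)
        if gain < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 += gain
    return chosen, best_r2

# Synthetic data: x1 and x2 are almost the same variable and both affect the DV.
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + rng.normal(size=n)

chosen, r2 = forward_select(X, y)
print("selected columns:", chosen, "R-squared:", round(r2, 3))
```

Only one of the two near-duplicate predictors is retained, even though both were built with the same true effect, which is the "importance of zero" impression described above.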

Examples of this type of approach:

Linear / Logistic / Ordinal Regression (SPSS / SAS / R), including Correlated Component Regression (CCR) and Partial Least Squares (PLS) variants

Advantages:

Flexible / adaptable to different scale types (not just linear)
Can perform variable selection
Can make predictions and derive a simple importance measure

Disadvantages:

Tend to become unstable with small samples and many correlated predictors, though this can be mitigated by using the CCR framework
More complex to explain / summarise
Can sometimes give a counter-intuitive “negative importance” index (Pratt)
Different implementations available (standard OLS regression, Ridge regression, PLS regression, CCR regression), some more widely used than others
Can over-fit the sample (with the exception of CCR/PLS), giving misleading assurance of a good model

Importance Indices

Another broad label for this type of approach is “average-over-orderings”. These approaches use various algorithms to assess the contribution of each predictor over and above the other variables being evaluated. The pure version of this method performs thousands of iterations, entering the predictors in every possible order (permutation). For each permutation, the “additional contribution to prediction” of the DV made by each predictor as it enters is assessed and stored. When all permutations have been run, the contribution of each predictor is averaged over permutations. These simple “average-over-orderings” contributions, which can be interpreted as the unique contribution of each predictor to explaining the DV, are the only output. The idea is that a predictor making a unique contribution to prediction will on average make a higher contribution regardless of whether or not it is given precedence over other predictors.

The most widely used of these methods is Shapley Value Analysis (sometimes known as General Dominance Analysis). Under this method, Linear Regression is performed at each iteration, the change in R-squared as each predictor enters is stored, and these changes are then averaged over iterations. The details of each model, such as effect sizes and directions of effect, are discarded and only the average contribution, rebased to sum to 100% across all predictors, is reported. The main disadvantage of Shapley is that it becomes too computationally intensive with more than about 10 predictors, because the number of permutations grows factorially (10 predictors = 3,628,800 orderings, 11 predictors = 39,916,800, 12 predictors = 479,001,600 etc.).
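The averaging over orderings can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, assuming plain OLS and R-squared as the measure of fit; the function names r_squared and shapley_r2 are hypothetical and do not refer to any particular package.

```python
import math
from itertools import permutations
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on the given columns (plus intercept)."""
    cols = list(cols)
    if len(cols) == 0:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def shapley_r2(X, y):
    """Average each predictor's R-squared gain over every possible entry order."""
    p = X.shape[1]
    totals = np.zeros(p)
    cache = {}  # R-squared depends only on the set entered, so cache by subset
    for order in permutations(range(p)):
        entered, prev = [], 0.0
        for j in order:
            entered.append(j)
            key = frozenset(entered)
            if key not in cache:
                cache[key] = r_squared(X, y, sorted(key))
            totals[j] += cache[key] - prev   # gain from adding predictor j
            prev = cache[key]
    return totals / math.factorial(p)

# Synthetic data with four correlated predictors, purely illustrative.
rng = np.random.default_rng(2)
n = 300
common = rng.normal(size=n)
X = np.column_stack([0.7 * common + rng.normal(size=n) for _ in range(4)])
y = X @ np.array([0.6, 0.3, 0.2, 0.0]) + rng.normal(size=n)

contrib = shapley_r2(X, y)
shares = contrib / contrib.sum()          # rebased to sum to 100%
print(np.round(100 * shares, 1))
print(math.factorial(10), math.factorial(12))  # orderings for 10 and 12 predictors
```

The contributions add up to the R-squared of the model containing all predictors, which is why rebasing them to 100% is straightforward.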

Another variant of these methods is Kruskal Analysis, which uses partial correlations, rather than regression analysis, at its core. The results are mostly indistinguishable from Shapley Value Analysis. We have modified Kruskal to allow estimation with a “large sample” of permutations, rather than ALL permutations, which results in very stable estimates if you pick a big enough number. This modification allows us to estimate contributions for as many as 50 or even 100 predictors.
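The idea of sampling permutations can be sketched as follows. Note the assumptions: this illustration averages R-squared gains from OLS over randomly drawn orderings, not the partial correlations that Kruskal Analysis uses, and it is not the modified algorithm described above. The function names and the n_orders parameter are hypothetical.

```python
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on the given columns (plus intercept)."""
    cols = list(cols)
    if len(cols) == 0:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def sampled_order_importance(X, y, n_orders=200, seed=0):
    """Average R-squared gains over a random sample of entry orders."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    totals = np.zeros(p)
    for _ in range(n_orders):
        order = rng.permutation(p)       # one random ordering of the predictors
        prev = 0.0
        for k in range(1, p + 1):
            r2 = r_squared(X, y, order[:k])
            totals[order[k - 1]] += r2 - prev   # gain from the predictor just entered
            prev = r2
    return totals / n_orders

# Synthetic data with eight correlated predictors (8! = 40,320 full orderings).
rng = np.random.default_rng(3)
n = 300
common = rng.normal(size=n)
X = np.column_stack([0.7 * common + rng.normal(size=n) for _ in range(8)])
y = X @ np.linspace(0.5, 0.0, 8) + rng.normal(size=n)

imp = sampled_order_importance(X, y, n_orders=200)
print(np.round(100 * imp / imp.sum(), 1))  # shares rebased to 100%
```

Raising n_orders trades run time for stability; the cost grows with the number of sampled orderings rather than with the factorial of the number of predictors.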

These methods provide nice, simple, easy to use outputs, but have a number of drawbacks, as specified below.

Advantages:

Simple outputs, easy to explain
Widely used and accepted
With modifications (sampling permutations) can estimate contribution for many predictors

Disadvantages:

Can only be used for linear models (not logistic or ordinal)
No underlying model, so no guidance on effect size or direction of effect and cannot build simulator
Can’t screen predictors from a candidate set. Analyst / Researcher decides what to include
Also prone to over-fitting for small samples
Takes a very long time to estimate for large samples / large numbers of predictors

2. To screen or not to screen?

Most of the algorithms based around regression analysis allow variable screening. Screening is often a sensible approach as many of the potential predictors may not contribute anything unique to the DV. Even if they appear to correlate with the DV, this may just be due to correlations with other predictors rather than any unique explanatory power.

Advantages of Screening

Provides subset of variables with the greatest unique predictive power
Eliminates noise / redundant items with little unique explanatory power
Provides simpler set of variables for client to act on
Small sets of screened variables tend to result in much more robust models which predict better to new cases

Disadvantages of Screening

There are often only small differences between the predictors selected and those left out
Gives impression that all other variables explain nothing which isn’t usually the case
Makes comparison across subgroups difficult (due to inconsistent variables selected)

In an ideal world the researcher should be able to review both screened and unscreened results and be able to make comparisons between the two. For large numbers of runs and subgroup comparisons, it is likely that running “unscreened only” for a set of possible predictors is more practical and economic.

3. To Cross-validate or Trust your Sample?

Cross-validation (CV) measures how variable a model’s estimates are likely to be under similar sized independent samples drawn from the population being studied. Done properly, cross-validation can provide guidance as to which type of model performs best and also on the optimal number of predictors to retain.

Very few Drivers Analysis methods use cross-validation. Most of the Regression-based methods instead rely on significance tests in the sample being used to build the model (F-tests, t-tests and associated p-values). This is fine provided the sample is very large compared to the number of potential predictors (a minimum of 10-20 cases per predictor is recommended) and there are no extremely high correlations between predictors. However, in many real key drivers applications, these assumptions fail to hold.

Even worse, most of the Importance Index methods such as Shapley Value have no significance tests. The algorithms converge with no indication as to the reliability of their output. Naïve analysts might assume that as R-squared tends to 100% they are getting a good model. In fact, with an inadequate sample, a model whose R-squared tends to 100% is probably getting worse (see The Curse of Overfitting). It is possible to obtain “bootstrap” confidence intervals (a resampling technique related to CV) for these methods, but these are rarely implemented due to lack of technical know-how, and they can be time consuming and expensive to run.
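The gap between in-sample fit and out-of-sample performance is easy to demonstrate. The sketch below is an illustration under stated assumptions: synthetic data with 40 cases, one real driver and 19 pure-noise predictors, plain OLS, and a simple 5-fold cross-validation written by hand. The function names are hypothetical.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit OLS with an intercept on the training data and predict the test data."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def in_sample_r2(X, y):
    """R-squared measured on the same cases used to fit the model."""
    resid = y - fit_predict(X, y, X)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def cross_validated_r2(X, y, k=5, seed=0):
    """R-squared measured only on cases held out from each fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sse = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        sse += np.sum((y[test_idx] - pred) ** 2)
    return 1.0 - sse / np.sum((y - y.mean()) ** 2)

# Small sample, many predictors: only the first column truly drives the DV.
rng = np.random.default_rng(4)
n, p = 40, 20
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)

r2_in = in_sample_r2(X, y)
r2_cv = cross_validated_r2(X, y)
print("in-sample R-squared:      ", round(r2_in, 3))
print("cross-validated R-squared:", round(r2_cv, 3))
```

A cross-validated R-squared well below the in-sample figure (it can even be negative) is a sign that the model is fitting noise in the sample rather than structure in the population.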

Only a handful of methodologies, such as Correlated Component Regression (CCR) and PLS Regression, have cross-validation built into their core algorithm.

Advantages of Cross-validation

Gives reassurance that the model specification used is robust
Gives the simplest / best fitting model
Eliminates in-sample noise and overfitting

Disadvantages

None

CCR-Johnsons: Bringing it all together – A Unified Approach

At the Stats People we have developed a new approach which fuses together the best of all worlds:

A regression model approach which delivers a Shapley-Value-like index, for as many predictors as we need, that works for extreme situations: Small samples, many highly correlated predictors.
Works within all common types of modelling framework: Logistic and ordinal, as well as linear models.
Can deliver, if needed, two sets of results: One with optimal variables screening, and one without.
Uses cross-validation at its core to select the most stable model specification and, if screening, the optimal number of predictors.

This uses a combination of our Correlated Component Regression (CCR) methodology and a newer Importance Index that is being widely adopted, Johnson’s Relative Weights. This method approximates Shapley-Value coefficients very closely, but has many advantages over the Shapley-Value method:

Quicker to estimate for large data sets, so faster turnaround.
No limit on the number of predictors (Shapley becomes cumbersome after about 10)
Can be applied to Logistic and Ordinal as well as Linear Regression models
Unlike the Pratt Measure (used on our original CCR output), doesn’t result in negative importance for suppressor variables (where the effect size has the opposite sign to the correlation).

The basic Johnson’s method suffers from the usual overfitting problems we mentioned for small samples and many highly correlated predictors. We have overcome this by integrating it with the CCR algorithm, giving much more stable results. Rather than forcing clients to choose between (a) screening predictors and (b) including all predictors, we now offer the option of obtaining the best model for both.
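For readers who want to see what the basic Johnson’s Relative Weights calculation involves, here is a minimal sketch for the linear case only. It is an illustration on synthetic data and does not include the CCR integration, cross-validation, or the logistic and ordinal extensions described above; the function name johnson_relative_weights is hypothetical.

```python
import numpy as np

def johnson_relative_weights(X, y):
    """Basic Johnson's relative weights for a linear model.

    Returns the raw weights (which sum to R-squared) and their shares.
    """
    p = X.shape[1]
    # Correlations among predictors (Rxx) and between predictors and DV (rxy).
    R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    Rxx, rxy = R[:p, :p], R[:p, p]
    # Symmetric square root of Rxx: links the predictors to a set of
    # uncorrelated variables that are as close to them as possible.
    evals, evecs = np.linalg.eigh(Rxx)
    lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    # Standardised coefficients of the DV on the uncorrelated variables.
    beta = np.linalg.solve(lam, rxy)
    # Weight per predictor: squared links times squared coefficients.
    raw = (lam ** 2) @ (beta ** 2)
    return raw, raw / raw.sum()

# Synthetic data with five correlated predictors, purely illustrative.
rng = np.random.default_rng(5)
n = 400
common = rng.normal(size=n)
X = np.column_stack([0.8 * common + rng.normal(size=n) for _ in range(5)])
y = X @ np.array([0.5, 0.4, 0.2, 0.1, 0.0]) + rng.normal(size=n)

raw, shares = johnson_relative_weights(X, y)
print(np.round(100 * shares, 1))  # percentage contribution per predictor
```

Every raw weight is a sum of squared terms, so none can be negative, and the calculation needs one eigen-decomposition rather than a regression for every ordering of the predictors.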

For more information email info@statspeople.com