
Correlation Computing in Shifu

Zhang Pengshan (David) edited this page Apr 13, 2017 · 6 revisions

Starting from Shifu 0.10.0, Shifu supports computing correlation values for all feature pairs. From 0.11.0, the correlation MapReduce job is optimized so that computing correlations for 3400 columns takes about 2 hours.

How to Enable Correlation Computing in Shifu

  shifu stats -correlation(c)

Run correlation only after 'shifu stats' has finished; otherwise you will get an alert.

Where is the Correlation Output and What is its Format?

Correlation values are written to correlation.csv in the current working directory. The format is:

  column_index,,0,1,2,3,4,5
  ,column_name,column_a,column_b,column_c,column_d,column_e,column_f
  0,column_a,0.111,0.223,0.354,0.412,0.5123
  ...
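As a rough illustration of how this layout could be consumed, the sketch below parses the two header rows and the value rows into a dictionary keyed by column-name pairs. It assumes the file looks exactly like the sample above (an index row, a name row, then one row per column). The function name `read_correlation_csv` and the sample values are made up for this example and are not part of Shifu.

```python
import csv
import io


def read_correlation_csv(text):
    """Parse the two-header-row layout into {(row_name, col_name): value}.

    Hypothetical helper: assumes row 0 holds column indexes, row 1 holds
    column names, and every later row is "index,name,value,value,...".
    """
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[1][2:]  # column names start at the third field
    pairs = {}
    for row in rows[2:]:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        row_name = row[1]
        # zip stops at the shorter side, so short rows are tolerated
        for col_name, cell in zip(names, row[2:]):
            pairs[(row_name, col_name)] = float(cell)
    return pairs


# Made-up two-column sample in the same layout as above.
sample = (
    "column_index,,0,1\n"
    ",column_name,column_a,column_b\n"
    "0,column_a,1.0,0.223\n"
    "1,column_b,0.223,1.0\n"
)
corr = read_correlation_csv(sample)
print(corr[("column_a", "column_b")])
```

To read a real file, pass the contents of correlation.csv as `text`.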

What Kind of Correlation is Computed in Shifu?

Shifu supports Pearson correlation. For categorical features, each category is first transformed to a numerical value by its positive ratio, and then Pearson correlation values are computed for all feature pairs.
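The idea can be sketched in a few lines of Python. This is not Shifu's MapReduce implementation, only a minimal single-machine illustration. It assumes a binary target (1 for positive, 0 for negative) and that "positive ratio" means the fraction of positive records within each category. The function names `positive_ratio_encode` and `pearson` are hypothetical.

```python
import math
from collections import defaultdict


def positive_ratio_encode(categories, targets):
    """Replace each category with the share of positive targets seen for it."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += 1
        positives[category] += target  # target is assumed to be 0 or 1
    return [positives[c] / totals[c] for c in categories]


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Made-up data: one categorical feature, one numeric feature, binary target.
categories = ["a", "a", "b", "b", "c", "c"]
targets = [1, 0, 1, 1, 0, 0]
numeric = [3.0, 2.5, 4.0, 4.5, 1.0, 0.5]

encoded = positive_ratio_encode(categories, targets)
print(encoded)
print(pearson(encoded, numeric))
```

After encoding, the categorical column is an ordinary numeric column, so the same `pearson` routine applies to every pair.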

VarSel Check for Correlation in Shifu

Starting from 0.11.0, Shifu checks correlation values in the varsel step. If the absolute correlation of two columns is larger than the configured threshold, the column with the smaller IV value is dropped.

You can enable this feature by setting 'correlationThreshold'. The default is 1, and since Pearson correlation lies in [-1, 1], the default never triggers any variable selection.

  "varSelect" : {
    ...
    "correlationThreshold" : 0.96,
    ...
  },

The varsel output will look like this:

2017-04-12 02:50:28: WARN VarSelectModelProcessor - Absolute corrlation value 0.9997951177551206 in (570, 597) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 597 with smaller IV value will not be selected, set finalSelect to false.
2017-04-12 02:50:28: WARN VarSelectModelProcessor - Absolute corrlation value 0.9993956801597748 in (570, 598) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 598 with smaller IV value will not be selected, set finalSelect to false.
2017-04-12 02:50:28: WARN VarSelectModelProcessor - Abslolute corrlation value 1.0 in (579, 588) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 579 with smaller IV value will not be selected, set finalSelect to false.

Shifu also checks the target column against the other columns. If a column's correlation with the target is larger than the threshold, that column is dropped directly, since it is highly correlated with the target column.

Be careful: this rule removes the column in the end whether or not it is force-selected.
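The rules in this section can be sketched as follows. This is an illustrative reading of the text, not Shifu's actual VarSelectModelProcessor code. The function name `correlation_varsel` and the IV values are hypothetical. Skipping pairs in which one column is already dropped, and comparing the absolute value against the threshold in the target check, are both assumptions.

```python
def correlation_varsel(corr, iv, target, threshold):
    """Return the set of columns the correlation check would deselect.

    corr: {(col_a, col_b): Pearson value}
    iv: {col: information value}
    target: id of the target column
    """
    dropped = set()
    for (a, b), value in sorted(corr.items()):
        if a == b or abs(value) <= threshold:
            continue  # strictly "larger than", so a threshold of 1 never fires
        if target in (a, b):
            # Highly correlated with the target: drop the non-target column,
            # with no exception for force-selected columns.
            dropped.add(b if a == target else a)
        elif a not in dropped and b not in dropped:
            # Ordinary pair: drop the column with the smaller IV.
            dropped.add(a if iv[a] < iv[b] else b)
    return dropped


# Correlations for (570, 597) and (570, 598) are taken from the log above;
# the IV values and the target pair (0, 12) are made up for the example.
corr = {
    (570, 597): 0.9997951177551206,
    (570, 598): 0.9993956801597748,
    (0, 12): 0.98,
}
iv = {570: 0.8, 597: 0.3, 598: 0.2}

print(correlation_varsel(corr, iv, target=0, threshold=0.96))
print(correlation_varsel(corr, iv, target=0, threshold=1.0))
```

With a threshold of 1.0 the second call returns an empty set, which matches the note that the default value performs no selection.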
