Correlation Computing in Shifu
Starting from Shifu 0.10.0, Shifu supports computing correlation values for all feature pairs. Since 0.11.0, the correlation MapReduce job has been optimized so that correlation computing for 3400 columns finishes in 2 hours.
shifu stats -correlation(c)
Run correlation only after 'shifu stats' has finished; otherwise you will get an alert.
Correlation values are written to correlation.csv in the current working directory. The format is:
column_index,,0,1,2,3,4,5
,column_name,column_a,column_b,column_c,column_d,column_e,column_f
0,column_a,0.111,0.223,0.354,0.412,0.5123
...
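To work with the dumped file, a small reader along the following lines may help. This is a minimal sketch, not part of Shifu: it assumes the layout shown above (an index row, a name row, then one row per column), and the `read_correlation` helper and the three-column sample values are made up for illustration.

```python
import csv
import io

# Made-up sample in the layout shown above; values are illustrative only.
SAMPLE = """column_index,,0,1,2
,column_name,column_a,column_b,column_c
0,column_a,1.0,0.25,-0.4
1,column_b,0.25,1.0,0.1
2,column_c,-0.4,0.1,1.0
"""


def read_correlation(handle):
    """Return (column names, {(row_name, col_name): correlation})."""
    rows = list(csv.reader(handle))
    names = rows[1][2:]  # second row holds the column names
    corr = {}
    for row in rows[2:]:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        row_name = row[1]
        # zip stops at the shorter side, so short rows are tolerated
        for col_name, value in zip(names, row[2:]):
            corr[(row_name, col_name)] = float(value)
    return names, corr


names, corr = read_correlation(io.StringIO(SAMPLE))
print(names)
print(corr[("column_a", "column_c")])
```

For a real run, pass an open file handle for correlation.csv instead of the in-memory sample.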
Shifu supports Pearson correlation. For categorical features, each category is first transformed to a numerical value, its positive ratio, and Pearson correlation values are then computed for all column pairs.
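The idea of encoding categories by positive ratio and then computing Pearson correlation can be sketched as below. This is an illustration under assumptions, not Shifu's actual implementation: the helper names, the toy data, and the choice to return 0.0 for a constant column are all hypothetical.

```python
from collections import defaultdict
from math import sqrt


def positive_ratio_encode(categories, targets):
    """Replace each category by the fraction of its rows with target == 1."""
    pos = defaultdict(int)
    total = defaultdict(int)
    for cat, y in zip(categories, targets):
        total[cat] += 1
        pos[cat] += y
    ratio = {cat: pos[cat] / total[cat] for cat in total}
    return [ratio[cat] for cat in categories]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


# Made-up toy data: one categorical column, one numerical column, binary target.
colors = ["red", "red", "blue", "blue", "green", "green"]
amounts = [10.0, 12.0, 3.0, 4.0, 7.0, 8.0]
targets = [1, 1, 0, 0, 1, 0]

encoded = positive_ratio_encode(colors, targets)
corr = pearson(encoded, amounts)
print(encoded)
print(corr)
```

Once the categorical column is encoded this way, it can be paired with any other numerical or encoded column in the same Pearson computation.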
Starting from 0.11.0, Shifu checks correlation values in the varsel step. If the absolute correlation of two columns is larger than the configured threshold, the column with the smaller IV value is dropped.
You can enable this feature by setting 'correlationThreshold'. The default is 1; since Pearson correlation lies in [-1, 1], no value can exceed it, so the default does not do any variable selection.
"varSelect" : {
...
"correlationThreshold" : 0.96,
...
},
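The pairwise rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in VarSelectModelProcessor: the `filter_correlated` function, the IV values, the tie handling, and the skipping of already dropped columns are assumptions.

```python
def filter_correlated(pairs, iv, threshold):
    """Return the set of columns dropped by the correlation rule.

    pairs: {(col_a, col_b): correlation}, iv: {col: IV value}.
    """
    dropped = set()
    for (a, b), value in sorted(pairs.items()):
        if a in dropped or b in dropped:
            continue  # assumption: ignore pairs with an already dropped column
        if abs(value) > threshold:
            # Drop the column with the smaller IV (assumption: on a tie, drop b).
            loser = a if iv[a] < iv[b] else b
            dropped.add(loser)
    return dropped


# Hypothetical correlation and IV values for a handful of columns.
pairs = {(570, 597): 0.9998, (570, 598): 0.9994, (579, 588): 1.0, (1, 2): 0.5}
iv = {570: 0.9, 597: 0.3, 598: 0.2, 579: 0.1, 588: 0.4, 1: 0.6, 2: 0.7}

print(filter_correlated(pairs, iv, 0.96))  # columns dropped at threshold 0.96
print(filter_correlated(pairs, iv, 1.0))   # default threshold drops nothing
```

With the default threshold of 1, the strict comparison never succeeds, which matches the statement that the default performs no selection.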
The varsel output will look like this:
2017-04-12 02:50:28: WARN VarSelectModelProcessor - Absolute corrlation value 0.9997951177551206 in (570, 597) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 597 with smaller IV value will not be selected, set finalSelect to false.
2017-04-12 02:50:28: WARN VarSelectModelProcessor - Absolute corrlation value 0.9993956801597748 in (570, 598) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 598 with smaller IV value will not be selected, set finalSelect to false.
2017-04-12 02:50:28: WARN VarSelectModelProcessor - Abslolute corrlation value 1.0 in (579, 588) are larger than correlationThreshold value 0.96 set in VarSelect#correlationThreshold, column 579 with smaller IV value will not be selected, set finalSelect to false.
Shifu also checks the correlation between the target column and every other column. If the correlation value is larger than the threshold, the column is dropped directly because it is highly correlated with the target column.
Be careful: this rule removes the column from the final selection whether or not it is in forceSelect.
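The effect of the target-column rule on force-selected columns can be illustrated with a short sketch. It is hypothetical rather than Shifu's code: the `final_selection` function and all values are made up, and it assumes the comparison uses the absolute correlation value, as the pairwise check does.

```python
def final_selection(selected, force_selected, target_corr, threshold):
    """Keep only columns whose correlation with the target is within the threshold.

    Force-selected columns get no exemption from this rule.
    """
    candidates = set(selected) | set(force_selected)
    return {
        col for col in candidates
        if abs(target_corr.get(col, 0.0)) <= threshold
    }


# Hypothetical columns and their correlation with the target column.
selected = {"a", "b"}
force_selected = {"c"}
target_corr = {"a": 0.1, "b": 0.99, "c": -0.98}

kept = final_selection(selected, force_selected, target_corr, 0.96)
print(kept)
```

In this toy case both "b" and the force-selected "c" are removed, leaving only "a".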