
Tutorial Build Your First WDL Model

Liu-Delin edited this page Jan 12, 2021 · 17 revisions

For an overview of the Shifu pipeline and how to install Shifu, see: https://github.com/ShifuML/shifu/wiki/Tutorial---Build-Your-First-ML-Model

How to Run Shifu Pipeline with WDL Model

Shifu parses your Hadoop platform settings and sets all Hadoop configuration needed by the Shifu runtime. All of this logic lives in the bash script ${SHIFU_HOME}/bin/shifu.
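The steps this page walks through can be chained from a small driver script. The sketch below is only an illustration: `build_commands` and `run_pipeline` are hypothetical helper names, it assumes `shifu` is on your PATH, and in practice you would normally stop between steps to edit ModelConfig.json and the files in the columns folder.

```python
import subprocess

# Step order as presented on this page. "new" runs in the parent folder,
# every later step runs inside the model folder.
STEPS = ["init", "stats", "norm", "varsel", "train", "eval"]


def build_commands(model_name, shifu_bin="shifu"):
    """Return (working_dir, argv) pairs for the whole pipeline."""
    commands = [(".", [shifu_bin, "new", model_name])]
    commands += [(model_name, [shifu_bin, step]) for step in STEPS]
    return commands


def run_pipeline(model_name, shifu_bin="shifu"):
    """Run each step in order and stop at the first failing command."""
    for cwd, argv in build_commands(model_name, shifu_bin):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `build_commands("WDLSampleModel")` only builds the command list, which is handy for a dry run before `run_pipeline` actually launches any jobs.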

  • shifu new <ModelName>

    [15:50]:[wuhaifeng@sample-host:bin]$ ./shifu new WDLSampleModel
    SHIFU_HOME is not set. Using /home/haifwu/shifu/shifu-0.13.0-SNAPSHOT/bin/.. as SHIFU_HOME
    2020-07-18 03:50:24: INFO CreateModelProcessor [main] - Creating ModelSet Folder: /home/haifwu/shifu/shifu-0.13.0-SNAPSHOT/bin/WDLSampleModel...
    2020-07-18 03:50:24: INFO CreateModelProcessor [main] - Creating Initial ModelConfig.json ...
    2020-07-18 03:50:25: INFO CreateModelProcessor [main] - Enable DIST/MAPRED mode because Hadoop cluster is detected.
    2020-07-18 03:50:28: INFO CreateModelProcessor [main] - Step Finished: new
    2020-07-18 03:50:28: INFO ShifuCLI [main] - ModelSet WDLSampleModel is created successfully with ModelConfig.json in WDLSampleModel folder.
    2020-07-18 03:50:28: INFO ShifuCLI [main] - Please change your folder to WDLSampleModel and then configure your ModelConfig.json or directly do initialization step by 'shifu init.'
    

    This command creates a new ModelName folder for training. In the new folder you will find some auto-created files:

    1. ModelConfig.json: Input and model pipeline configuration, discussed in more detail below.

        "basic" : {
            "name" : "WDLSampleModel",
            "author" : "haifwu",
            "description" : "Created at 2020-07-18 15:50:25",
            "version" : "0.13.0",
            "runMode" : "DIST",
            "postTrainOn" : false,
            "customPaths" : { }
        },
        "dataSet" : {
           "source" : "HDFS",
           "dataPath" : "/user/pengzhang/cam2015/rawtrain",
           "validationDataPath" : null,
           "dataDelimiter" : "|",
           "headerPath" : "/user/pengzhang/cam2015/header",
           "headerDelimiter" : "|",
           "filterExpressions" : "",
           "weightColumnName" : "wgt_column_name",
           "targetColumnName" : "tagging_column_name",
           "posTags" : [ "1" ],
           "negTags" : [ "0" ],
           "missingOrInvalidValues" : [ "", "*", "#", "?", "null", "~" ],
           "metaColumnNameFile" : "columns/meta.column.names",
           "categoricalColumnNameFile" : "columns/categorical.column.names"
         },
         ...
      • basic::name is the name of your model and is the same as ModelName.
      • basic::runMode can be 'local' or 'mapred'/'dist'. The default is 'local', which runs jobs on the local machine; 'mapred'/'dist' runs jobs on the Hadoop platform.
      • dataSet::source has two types, 'local' or 'hdfs', meaning the data is in the local or the Hadoop file system.
      • dataSet::dataPath is the data path for model training. With the 'hdfs' source, dataPath should point to files or folders in HDFS. HDFS glob expressions are supported, for example: hdfs:/user/shifu/{2016/01,2016/02}/trainingdata. For testing, you can take the example data in ${SHIFU_HOME}/example/cancer-judgement/ and push it into your HDFS.
      • dataSet::headerPath: the file that holds the data header. If it is null, the first line of your dataPath is parsed as the header.
      • dataSet::dataDelimiter & dataSet::headerDelimiter: the delimiters of the data and of the data header.
      • dataSet::filterExpressions: User-specified expressions such as ' columnA == '2' ' are supported to filter data in stats and training; more complicated ones such as " population=='NSF' or population=='eCHQ' " also work. Details of the syntax are at http://commons.apache.org/proper/commons-jexl/reference/syntax.html. A newer feature lets you verify this parameter; details are at https://github.com/ShifuML/shifu/wiki/Filter-Expressions-Testing-for-Train-Dataset-or-Eval-Dataset.
      • dataSet::weightColumnName: set this if your training or stats should be weighted by a column. For example, in our risk training it is a dollar column, which means the target is to reduce dollar-wise loss. If not set, every record counts equally (unit-wise).
      • dataSet::targetColumnName: the name of your target column. Please make sure it is configured correctly.
      • dataSet::posTags: elements in this list are treated as positive, like 1 in binary classification.
      • dataSet::negTags: elements in this list are treated as negative, like 0 in binary classification.
      • dataSet::missingOrInvalidValues: values in this list are treated as missing or invalid.
      • dataSet::metaColumnNameFile: the meta column config file, created by default in the columns folder.
      • dataSet::categoricalColumnNameFile: the categorical column config file, which lists all categorical features and is read in the init step.
    2. columns/meta.column.names: Empty file which specifies columns, such as ID or date columns, that cannot be used for building models

    3. columns/categorical.column.names: Empty file which specifies categorical columns

    4. columns/forceremove.column.names: Empty file which specifies columns which must be removed in model training

    5. columns/forceselect.column.names: Empty file which specifies columns which must be selected in model training

    6. columns/Eval1score.meta.column.names: Empty file which specifies evaluation meta columns

    In this part, the user should mainly make sure the basic and dataSet sections are configured correctly, because all later steps depend on valid data paths and run modes.
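Before moving on, it can help to sanity-check the fields described above. The following is a minimal sketch, not part of Shifu: `check_model_config` is a hypothetical helper, and treating `wgt_column_name` and `tagging_column_name` as unedited placeholders is an assumption based on the sample config shown on this page.

```python
# Values that appear in the sample ModelConfig.json above and look like
# placeholders (assumption: they must be replaced with real column names).
PLACEHOLDERS = {"wgt_column_name", "tagging_column_name"}


def check_model_config(config):
    """Return a list of human-readable problems found in a ModelConfig dict."""
    problems = []
    basic = config.get("basic", {})
    data = config.get("dataSet", {})

    if str(basic.get("runMode", "")).upper() not in {"LOCAL", "MAPRED", "DIST"}:
        problems.append("basic.runMode should be local, mapred or dist")
    if str(data.get("source", "")).upper() not in {"LOCAL", "HDFS"}:
        problems.append("dataSet.source should be local or hdfs")
    if not data.get("dataPath"):
        problems.append("dataSet.dataPath is empty")

    target = data.get("targetColumnName")
    if not target or target in PLACEHOLDERS:
        problems.append("dataSet.targetColumnName is still a placeholder")
    if data.get("weightColumnName") in PLACEHOLDERS:
        problems.append("dataSet.weightColumnName is still a placeholder")
    if not data.get("posTags") or not data.get("negTags"):
        problems.append("dataSet.posTags and dataSet.negTags must both be set")
    return problems
```

To use it on a real model folder, load the file with `json.load(open("ModelConfig.json"))` and pass the resulting dict to `check_model_config`; an empty list means none of these particular checks failed.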

  • cd <ModelName>;shifu init

    All steps from init onward should be run inside the ModelName folder. This design lets users build different models in different folders in parallel.

    The init step creates another important file, ColumnConfig.json, from ModelConfig.json. ColumnConfig.json is a JSON file that holds the statistical information for every column; most of it is filled in later by the 'stats' step.

    So far, categorical columns must be specified by users in columns/categorical.column.names. This is essential for computing the right column stats and transforms, so please make sure the categorical columns are configured correctly here. Any variable that is not listed in columns/categorical.column.names is treated as a numerical variable by default.
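As an illustration of preparing that file, the sketch below writes a list of categorical column names to columns/categorical.column.names. It assumes the file format is one column name per line, and both helper names and the example column names in the usage note are hypothetical.

```python
from pathlib import Path


def normalize_names(names):
    """Strip whitespace, drop blanks and duplicates, and keep the first-seen order."""
    cleaned = []
    for name in names:
        name = name.strip()
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned


def write_column_names(path, names):
    """Write one column name per line (assumed file format) and return what was written."""
    cleaned = normalize_names(names)
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(cleaned) + "\n")
    return cleaned
```

For example, `write_column_names("columns/categorical.column.names", ["country", "device_type"])` run inside the model folder would list two made-up columns as categorical before `shifu init`.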

  • shifu stats

    The stats step collects statistics such as mean, stddev, KS and other information by using MapReduce/Pig/Spark jobs.

          "stats" : {
             "maxNumBin" : 20,
             "cateMaxNumBin" : 0,
             "binningMethod" : "EqualPositive",
             "sampleRate" : 1.0,
             "sampleNegOnly" : false,
             "binningAlgorithm" : "SPDTI",
             "psiColumnName" : ""
          },
    • stats::maxNumBin: how many bins (buckets) are computed for each numerical column. More bins give better results but need more computation; a value between 10 and 50 works well. For categorical features
    • stats::binningMethod: the binning method. With 'EqualPositive', each bin holds the same number of positive records; other options include 'EqualNegative', 'EqualTotal' and 'EqualInterval' ...
    • stats::sampleRate: the sampling rate for stats; sampling is a common way to speed this step up.
    • stats::sampleNegOnly: whether to sample only negative records. This is useful when there are many more negative than positive records.
    • stats::binningAlgorithm: by default it is 'SPDTI', which computes histogram-based statistics.

    After stats has run, ColumnConfig.json in the ModelName folder is updated with mean, KS, binning, and other stats information that is used in the next steps.
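To make the difference between the binning methods concrete, here is a small in-memory sketch. It only illustrates the ideas behind 'EqualInterval', 'EqualTotal' and 'EqualPositive'; it is not Shifu's histogram-based implementation, and all function names are hypothetical.

```python
def equal_interval_bounds(values, num_bins):
    """Split [min, max] into num_bins ranges of equal width; return the inner cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + width * i for i in range(1, num_bins)]


def equal_count_bounds(values, num_bins):
    """Pick cut points so each bin holds roughly the same number of the given values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // num_bins] for i in range(1, num_bins)]


def equal_positive_bounds(values, tags, num_bins, pos_tag="1"):
    """EqualPositive idea: balance only the positive records across the bins."""
    positives = [v for v, t in zip(values, tags) if t == pos_tag]
    return equal_count_bounds(positives, num_bins)
```

On skewed data the two approaches diverge sharply: equal-width cut points are pulled toward outliers, while count-based cut points follow where the records actually are.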

  • shifu norm

    For logistic regression or neural network models, training input data should be normalized, for example with z-score, max-min or WOE normalization. All of these normalization methods are supported in this step.

    For tree ensemble models like Random Forest or Gradient Boosted Trees, no norm step is needed after shifu 0.10.x, while in shifu 0.9.x norm is still needed; for these models norm just generates clean data for training. Starting from shifu 0.10.0, running norm generates both the real normalization outputs and the cleaned data outputs used as tree model input.

       "normalize" : {
          "stdDevCutOff" : 9.0,
          "sampleRate" : 1.0,
          "sampleNegOnly" : false,
          "normType" : "ZSCALE_INDEX"
       },
    • normalize::stdDevCutOff: the stddev cutoff for z-score. If the absolute value after z-score is still larger than this value, it is cut off to this value.
    • normalize::sampleRate: the sampling rate for the data used in the next training step.
    • normalize::sampleNegOnly: whether to sample only negative records. This is useful when there are many more negative than positive records.
    • normalize::normType: can be 'zscale'/'zscore', 'maxmin', 'woe', 'woe_zscale', case insensitive. For a WDL model, use 'ZSCALE_INDEX' here.

    The 'woe' norm type is very important: it leverages binning information to transform numerical values into discrete values, which often improves model performance.
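The effect of stdDevCutOff on z-score normalization can be sketched as follows. This is a minimal illustration of the clipping rule described above, not Shifu's code; the function name is hypothetical, and returning 0.0 for a zero standard deviation is an assumption made only to keep the example safe to run.

```python
def zscore_with_cutoff(value, mean, std_dev, cutoff=9.0):
    """Z-score a value and clip the result to the range [-cutoff, cutoff]."""
    if std_dev == 0:
        # Assumption: a constant column carries no signal, so map it to 0.
        return 0.0
    z = (value - mean) / std_dev
    return max(-cutoff, min(cutoff, z))
```

With mean 10 and stddev 5, a value of 15 maps to 1.0, while an extreme value such as 1000 is clipped to the cutoff of 9.0 instead of producing a huge input for the network.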

  • shifu varsel

    After stats and norm, the varsel step performs feature selection based on statistical information such as KS or IV values. For a WDL model, you can pick the embedded columns from the result of the 'varsel' step. If you want to force-select some columns as embedded columns, list their names in the file 'columns/forceselect.column.names'.

    "varSelect" : {
          "forceEnable" : true,
          "candidateColumnNameFile" : null,
          "forceSelectColumnNameFile" : "columns/forceselect.column.names",
          "forceRemoveColumnNameFile" : "columns/forceremove.column.names",
          "filterEnable" : true,
          "filterNum" : -1,
          "filterBy" : "SE",
          "filterOutRatio" : 0.05,
          "missingRateThreshold" : 0.98,
          "correlationThreshold" : 0.96,
          "minIvThreshold" : 0.0,
          "minKsThreshold" : 0.0,
          "postCorrelationMetric" : "SE",
          "params" : null
    }
    • varSelect::forceEnable: whether force-remove and force-select features are enabled.
    • varSelect::filterEnable: whether filtering is enabled in variable selection.
    • varSelect::filterNum: the number of variables to select for model training. filterNum has higher priority than filterOutRatio; in other words, once filterNum is set, filterOutRatio is ignored.
    • varSelect::filterOutRatio: the ratio of variables to filter out when running shifu varselect.
    • varSelect::filterBy: the variable selection method, such as 'KS', 'IV', 'SE', 'ST', 'FI'.

    Feature selection by 'KS' or 'IV' is coarse-grained and based only on feature quality. 'SE', 'ST' and 'FI' are feature selection methods based on model training. For detailed information, see Variable Selection in Shifu: https://github.com/ShifuML/shifu/wiki/Variable-Selection-in-Shifu
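The priority of filterNum over filterOutRatio can be sketched for a simple metric-based ranking such as 'KS' or 'IV'. This is an assumption-heavy illustration, not Shifu's selection logic: the helper names are hypothetical, a non-positive filterNum (such as the -1 in the config above) is assumed to mean "not set", and filterOutRatio is assumed to apply to the total number of candidate variables.

```python
def num_to_keep(total, filter_num=-1, filter_out_ratio=0.05):
    """Decide how many variables survive; filter_num wins when it is set."""
    if filter_num > 0:
        return min(filter_num, total)
    # Assumption: drop the given fraction of all candidates, rounded down.
    return total - int(total * filter_out_ratio)


def select_by_metric(scores, filter_num=-1, filter_out_ratio=0.05):
    """scores maps column name to a metric such as KS or IV; return kept names, best first."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:num_to_keep(len(ranked), filter_num, filter_out_ratio)]
```

Model-based methods such as 'SE' work differently, so treat this only as a way to reason about how the two settings interact.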

  • shifu train

    One of Shifu's strengths is that its training step is very powerful:

    • Distributed Logistic Regression / Neural Network / Tree Ensemble training is supported if runMode is 'dist'.
    • Bagging and validation are natively supported with just a configuration change.
    • All distributed training is fault tolerant and well tested in a busy shared Hadoop cluster. The straggler issue is handled well so that training runs smoothly in the cluster.
    • Bagging can be based on different parameters and different bagging data, enabled simply by setting baggingSampleRate.
    "train" : {
        "baggingNum" : 1,
        "baggingWithReplacement" : false,
        "baggingSampleRate" : 1.0,
        "validSetRate" : 0.1,
        "numTrainEpochs" : 1000,
        "isContinuous" : true,
        "workerThreadCount" : 4,
        "algorithm" : "WDL",
        "params" : {
            "wideEnable" : true,
            "deepEnable" : true,
            "embedEnable" : true,
            "Propagation" : "B",
            "LearningRate": 0.5,
            "NumHiddenLayers": 1,
            "NumEmbedColumnIds": [2,3,5,6,7,9,10,11,12,13,14,16,17,20,21,22,23,24,25,32,35,37,40,124,125,128,129,238,266,296,298,300,319,384,390,391,402,403,406,407,408,409,410,411,412,413,414,415,416,417,418,420,421,441,547,549,550,552,553,554,555,627,629],        
            "ActivationFunc": ["tanh"],
            "NumHiddenNodes": [30],
            "WDLL2Reg": 0.01
        }
    },
    • train::baggingNum: how many models are trained. In DIST mode, this is the number of training jobs; each job trains one model.
    • train::baggingWithReplacement: whether bagging uses sampling with replacement, as in Random Forest.
    • train::baggingSampleRate: the fraction of the data used for training and validation. By default it is 1.
    • train::validSetRate: the fraction of the data used for validation; the rest is used for training.
    • train::numTrainEpochs: how many iterations are used to train NN/LR models.
    • train::isContinuous: if models already exist in the models folder and this is set to true, training starts from the existing NN/LR/GBT models. This feature is not supported for Random Forest.
    • train::workerThreadCount: data is distributed across Hadoop tasks; this sets how many threads each task uses to train the model in parallel, which can accelerate training. By default it is 4. In a shared cluster, 4-8 is a good range. Higher values can cause CPU issues in a shared cluster that does not have CPU isolation configured well.
    • train::algorithm: 'NN', 'LR', 'GBT', 'RF' are supported so far in Shifu, along with 'WDL' as used in this tutorial. Each algorithm needs its own train::params settings.
    • train::params::NumHiddenLayers: the number of hidden layers in the neural network.
    • train::params::ActivationFunc: the activation function of each hidden layer.
    • train::params::NumHiddenNodes: the number of hidden nodes in each layer.
    • train::params::LearningRate: the learning rate for neural network training.
    • train::params::NumEmbedColumnIds: a list of embed column ids chosen from the categorical features. You can get these column ids from ColumnConfig.json: the column type should be 'C' and finalSelect should be true.
    • train::params::Propagation: 'R', 'Q', 'B' are supported. 'B' is BackPropagation, 'Q' is QuickPropagation and 'R' is ResilientPropagation. By default it is 'Q'.
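Collecting the ids for NumEmbedColumnIds by hand is error-prone, so a small script can help. The sketch below assumes ColumnConfig.json is a JSON array of column objects and that the relevant fields are named `columnNum`, `columnType` and `finalSelect`; these field names are assumptions, so check them against your own file. The helper name is hypothetical.

```python
import json


def embed_column_ids(column_configs):
    """Return ids of columns that are categorical ('C') and finally selected."""
    return [
        column["columnNum"]
        for column in column_configs
        if column.get("columnType") == "C" and column.get("finalSelect")
    ]


def embed_column_ids_from_file(path="ColumnConfig.json"):
    """Load ColumnConfig.json from the model folder and extract the candidate ids."""
    with open(path) as handle:
        return embed_column_ids(json.load(handle))
```

Running `print(embed_column_ids_from_file())` inside the model folder after the varsel step gives a list that can be pasted into train::params::NumEmbedColumnIds.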

    After training is finished, you can find the trained models in the local folder /models/. They can be used in production or in the evaluation step.

  • shifu eval

    The evaluation step evaluates the models you just trained. If multiple models are found in the models folder, all of them are evaluated and the 'mean' model score is used for the final performance report.

     "evals" : [ {
       "name" : "Eval-1",
       "dataSet" : {
         "source" : "HDFS",
         "dataPath" : "/user/${USER}/data.csv",
         "dataDelimiter" : "|",
         "headerPath" : "",
         "headerDelimiter" : "",
         "filterExpressions" : "",
         "weightColumnName" : "wgt_column_name",
         "targetColumnName" : "tagging_column_name",
         "posTags" : [ "1" ],
         "negTags" : [ "0" ],
         "missingOrInvalidValues" : [ "", "*", "#", "?", "null", "~" ],
         "metaColumnNameFile" : "columns/eval.meta.column.names",
         "categoricalColumnNameFile" : "columns/categorical.column.names"
       },
       "performanceBucketNum" : 100,
       "performanceScoreSelector" : "mean",
       "scoreMetaColumnNameFile" : "columns/Eval1score.meta.column.names",
       "customPaths" : { }
     } ]
    • Evaluation supports multiple evaluation data set settings.
    • evals::dataSet: usually the same as the dataSet section, but the data path and schema can be specified per eval and may differ from the training data set.
    • evals::performanceBucketNum: the number of buckets (checkpoints) in the final performance report.
    • evals::performanceScoreSelector: by default it is the mean score across all bagging models.
    • evals::scoreMetaColumnNameFile: a file name. The file specifies the champion model score field name in the eval data set, and the champion's performance is plotted together with the new model's in the eval performance chart for comparison.

    Evaluation results such as AUC, the Gain Chart and the Precision-Recall chart are printed to the console, and an HTML report can be found in the local evaluation folder, so you get your models and their performance in JSON or HTML format. Multiple evaluations are supported by specifying multiple eval data sets with different data folders or different schemas. These eval data sets are run in parallel to speed up evaluation.
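The 'mean' score selector can be pictured as averaging, record by record, the scores produced by each bagging model. The sketch below is only an illustration of that idea with a hypothetical helper name; it assumes every model scored the same records in the same order.

```python
def mean_scores(per_model_scores):
    """Average scores across models; per_model_scores holds one score list per model."""
    return [sum(record) / len(record) for record in zip(*per_model_scores)]
```

For two bagging models that score two records as `[100, 200]` and `[300, 400]`, the combined scores are `[200.0, 300.0]`, and the performance report would then be built from these averaged values.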
