
Shifu Best Practices FAQ


Multiple Folders as Input

Sometimes data is organized into monthly folders such as /a/b/2016/02. If you would like to use, say, 3 months of data for training and the following 2 months for testing, Shifu supports this well through glob-style expressions in dataPath:

 "dataPath" : "/a/b/{2016/02,2016/03,2016/04}"
 "dataPath" : "/a/b/*/*.data"
 "dataPath" : "/a/b/{2016, 2017}/*/part-*.gz"

Use a Specific Queue Instead of the Default Queue

Resources in the default queue are usually limited, so try another one:

File: $SHIFU_HOME/conf/shifuconfig
hadoopJobQueue=<your queue>

'workerThreadCount' in Shifu:

You can tune this parameter in the train part of your ModelConfig.json. Increasing the count to 8 or more often does not speed up training jobs, because training is CPU-bound: a higher value means more threads per worker, and with many tasks on one machine CPU usage becomes very high and training may even slow down. The default value of 4 is recommended.
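
For reference, a minimal sketch of where this parameter sits in ModelConfig.json (other train fields omitted):

  "train" : {
    ...
    "workerThreadCount" : 4,
    ...
  }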

Decrease 'guagua.min.workers.ratio' (by default 0.99)

In file $SHIFU_HOME/conf/shifuconfig. Shifu trains models with bulk synchronization per epoch: with, say, 100 workers, the master must wait for all 100 workers to finish before starting the next epoch. To mitigate straggler workers there is the config 'guagua.min.workers.ratio'; its default of 0.99 in Shifu means each epoch waits for only 99 of the 100 workers and ignores the slowest one. Setting it lower, such as 0.95 or even 0.90, can be very helpful, especially on a busy cluster, but do not set it too small, as that may hurt accuracy.
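
For example, a shifuconfig entry that tolerates the slowest 5% of workers would look like (the value is illustrative):

guagua.min.workers.ratio=0.95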

Increase 'guagua.split.maxCombinedSplitSize' (by default 268435456)

In file $SHIFU_HOME/conf/shifuconfig. If your training job has many workers (e.g. when the bagging num is high), you may not be able to get enough workers on a busy cluster. Try increasing this value to 536870912 to cut the number of workers roughly in half.
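
For example, doubling the combined split size in $SHIFU_HOME/conf/shifuconfig roughly halves the number of workers:

guagua.split.maxCombinedSplitSize=536870912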

Shuffle Input Data Source

In production scenarios, data is usually generated ordered by timestamp. Most of the time, shuffling the dataset is good practice for better model generalization. The shuffle feature can be invoked like:

  shifu norm -shuffle -Dshifu.norm.shuffle.size=150

The parameter 'shifu.norm.shuffle.size' is the final number of shuffled part files. It can be tuned according to the size of each part file; for gzip files, 256MB-512MB per file is a good target, considering a compression ratio of about 1:5.

Sometimes the input consists of very large gzip files, e.g. 5GB per raw input file; since the output is also gzip, each norm part file is still very large after norm. Shuffling can also be leveraged to tune each norm output file to a reasonable size.

Each worker in Shifu is given 2GB by default in training; if each norm output file is around a 1GB gzip file, 2GB of memory is usually not enough. Worker memory can of course be increased with the tips below. On the other hand, if there are only a few files, say dozens, shuffling can also be used to raise the number of workers to the one you want. Say the norm output originally has 15 files but each one is as large as 2GB: the command below reshuffles this into 150 output files of about 200MB each, and by default Shifu training will then start 150 workers to accelerate training.

  shifu norm -shuffle -Dshifu.norm.shuffle.size=150

This is a tradeoff, so please be aware: if you set 1500 output files, and therefore 1500 workers, you need 1500 * 2G = 3T of memory available at that time. If there is not enough memory in your cluster, you can also decrease the number of workers by setting a smaller 'shifu.norm.shuffle.size'.

Continuous training: isContinuous

In the train part of your ModelConfig.json: if you'd like to train 1000 iterations and the whole job would take around 2 hours, then for stability you can train 500 iterations first and then start the train step again to continue training, provided isContinuous is set to true.
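
A minimal sketch of the relevant train fields, assuming the epoch count field is named numTrainEpochs in your Shifu version (values are illustrative, other fields omitted):

  "train" : {
    ...
    "numTrainEpochs" : 500,
    "isContinuous" : true,
    ...
  }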

How to Support Two Types of Running with the Same Input?

By specifying customPaths under the train part of your ModelConfig.json, the training input path can be changed to a dedicated folder. This is flexible if you already have a data set that ran through training and you'd like to trigger another training job with different parameters: copy ModelConfig and ColumnConfig to the other model set and point the training input to the same data path, which lets the new Shifu run start directly from training.

LR/NN Customized Training Inputs

    "customPaths" : { 
          "normalizedDataPath" : "/demo/tmp/NormalizedData/",
          "normalizedValidationDataPath" : "/demo/tmp/NormalizedValidationData/"
    }
  • If 'dataSet#validationDataPath' is not set, there is no need to set 'normalizedValidationDataPath'.

GBDT/RF Customized Training Inputs

    "customPaths" : { 
          "cleanedDataPath" : "/demo/tmp/CleanedData/",
          "cleanedValidationDataPath" : "/demo/tmp/CleanedValidationData/"
    }
  • If 'dataSet#validationDataPath' is not set, there is no need to set 'cleanedValidationDataPath'.

Can I Specify a Different Validation Set Instead of Using Sampling to Get the Validation Set?

By setting validationDataPath in the dataSet part (its schema should be the same as the training data header), that data set will be used as the validation set no matter how you set baggingSampleRate in training.

  "dataSet" : {
    "source" : "hdfs",
    "dataPath" : "/apps/risk/data/dev/",
    "dataDelimiter" : "\u0007",
    "headerPath" : "/apps/risk/data/dev/.pig_header",
    "headerDelimiter" : "\u0007",
    "validationDataPath" : "/apps/risk/data/validation/"
    ...
  }

Where Can I Find All Configuration Supported in Shifu?

  1. User-specified properties in ModelConfig.json: please check the Meta Configuration File page.

  2. System-related properties in ${SHIFU_HOME}/conf/shifuconfig.

How to Verify if 'filterExpressions' Works?

'filterExpressions' is a very important feature that lets users operate flexibly on their input data set, but issues with it are not easy to debug. A dedicated feature can be used to test whether it works: https://github.com/ShifuML/shifu/wiki/Filter-Expressions-Testing-for-Train-Dataset-or-Eval-Dataset
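
For context, 'filterExpressions' is configured in the dataSet part of ModelConfig.json; a hypothetical example (the column name and expression below are purely illustrative, and the exact expression syntax is covered on the linked page) could look like:

  "dataSet" : {
    ...
    "filterExpressions" : "txn_amount > 0",
    ...
  }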

Model Checkpoint in Training

Two kinds of model checkpoint mechanisms are supported natively in Shifu training:

1. Temporary models by epoch

Temporary models are stored under hdfs://<model sets folder>/<model set name>/tmp/modelsTmp/; for example, model0-100.nn is the model after 100 epochs of training.

hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-100.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-120.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-140.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-160.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-180.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-20.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-200.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-40.nn
hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-60.nn

2. Final model checkpointed by epochs

Final models are stored in the final model location below:

  • For neural network training, the model is checkpointed to its final location hdfs://<model sets folder>/<model set name>/models/model0.nn every (train#epochs)/25 epochs.
  • For GBDT, the final GBDT model is stored in hdfs://<model sets folder>/<model set name>/models/model0.gbt every (train#params#trees/10) trees; see the worked example below.
hdfs://user/<id>/ModelSets/demo/models/model0.nn
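
For example, with train#epochs set to 1000, the NN final model would be refreshed roughly every 1000/25 = 40 epochs; with train#params#trees set to 200, the GBDT final model would be refreshed about every 20 trees.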

Q: How to evaluate one temporary model or one checkpointed model?

A: Take the demo data set as an example:

  1. Replace the final model hdfs://user/<id>/ModelSets/demo/models/model0.nn with the temporary model hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-180.nn;
  2. Download the temporary model to the local models folder and rename it to model0.nn;
[/tmp/demo]$ ls -al models
total 504K
drwxr-xr-x 2 pengzhang users   22 Aug  8 01:38 ./
drwxr-xr-x 4 pengzhang users   82 Aug  8 01:39 ../
-rw-r--r-- 1 pengzhang users 502K Aug  8 01:38 model0.nn
  3. If using bagging, be careful whether the file is model0.nn or model1.nn; the 0 or 1 here is the bagging index;
  4. When you download a model from HDFS to local, you will probably see a Checksum Exception; please remove the HDFS checksum file (.model0.nn.crc) in both the local and HDFS model paths.
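
A rough sketch of the download-and-rename step using standard HDFS shell commands; the paths follow the demo layout above, and the epoch-180 checkpoint is only an example:

  # download the epoch-180 checkpoint and rename it to the final model name
  hdfs dfs -get hdfs://user/<id>/ModelSets/demo/tmp/modelsTmp/model0-180.nn ./models/model0.nn
  # if a Checksum Exception shows up, remove the stale checksum files in both places
  rm -f ./models/.model0.nn.crc
  hdfs dfs -rm hdfs://user/<id>/ModelSets/demo/models/.model0.nn.crc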

How to Re-generate GBDT/RF Training Data?

GBDT/RF in Shifu has no 'shifu norm' requirement because GBDT uses raw data. However, Shifu supports filter logic internally, so before Shifu GBDT training the data still needs to be filtered to simplify the training input. This is an implicit step; if the filter logic or the raw data changes, such training input needs to be regenerated, and 'shifu.tree.regeninput' is the parameter for that case: shifu train -Dshifu.tree.regeninput=true

Another implicit piece of functionality for GBDT/RF clean (training) data generation: if a GBDT/RF model is configured, just run 'shifu norm' to get both the norm data and the clean data.

How to Check All Properties in ModelConfig.json?

Please check ModelConfig.json metadata file: https://github.com/ShifuML/shifu/blob/develop/src/main/resources/store/ModelConfigMeta.json

Max Categories Support in Shifu

By default, the maximum number of categories in a categorical variable is 10000, to keep the stats computation from being skewed. If a feature has more categories than this maximum, no stats are produced for it, and it is ignored in the following norm and training steps.

Ideally Shifu would cut such features off at 10000 automatically; meanwhile, there is also a config that users can set: shifu stats -Dshifu.max.category.size=40000

To choose such a max value, you can reference distinctCount in ColumnConfig.json.

Memory Tuning in Each Step

Below are the default memory settings in $SHIFU_HOME/conf/shifuconfig:

mapreduce.map.memory.mb=2048
mapreduce.map.java.opts=-Xms1700m -Xmx1700m -server -XX:MaxPermSize=64m -XX:PermSize=64m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
mapreduce.reduce.memory.mb=2048
mapreduce.reduce.java.opts=-Xms1700m -Xmx1700m -server -XX:MaxPermSize=64m -XX:PermSize=64m

Reducer memory is well tuned for most big data scenarios except skewed data sets; memory tuning is more often needed for mapper memory. In the Shifu pipeline, the 'norm' and 'eval' steps are usually fine with the default memory settings, while 'stats' and 'train' (and 'varsel', which sometimes calls the train step) can be tuned by increasing or decreasing these two:

mapreduce.map.memory.mb=2048
mapreduce.map.java.opts=-Xms1700m -Xmx1700m -server -XX:MaxPermSize=64m -XX:PermSize=64m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Most of the time a memory issue can be solved by directly increasing the memory, but be careful: if it is increased too much, the overall resources of the training job may not be enough, causing containers to wait, time out, and fail.

Another point: 'mapreduce.map.memory.mb' is the container memory, while '-Xms -Xmx' in 'mapreduce.map.java.opts' is for the JVM process inside that container. It is better to keep a 200MB-400MB gap between them for non-heap usage of the container, like the 4096MB versus 3700MB below:

mapreduce.map.memory.mb=4096
mapreduce.map.java.opts=-Xms3700m -Xmx3700m -server -XX:MaxPermSize=64m -XX:PermSize=64m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

Please be aware that such memory settings are better applied per step rather than globally; you can use the command below to limit them to the current step:

shifu train -Dmapreduce.map.memory.mb=... -Dmapreduce.map.java.opts='...'

This means the memory settings only apply to the train step; other steps can be tuned accordingly. If not set, the defaults in $SHIFU_HOME/conf/shifuconfig are applied.

Please note that Shifu's default mode leverages guagua's MapReduce mode to run distributed master-worker (mapper-only) model training. In this default mode, mapreduce.map.memory.mb is the total memory set for a master or worker container; for example, 2G means at most 2GB of physical memory. 'mapreduce.map.java.opts' is the memory used by the JVM process; because the JVM has some additional memory consumption, it is better to keep a 300MB-500MB buffer between the two parameters, e.g. -Xmx1700m if the container is set to 2G. Conversely, if the container is set to 4G while -Xmx is only 1700m, about 2GB of memory is wasted.

(Non-TensorFlow) Training Job Failure Because of Master OOM

In Shifu's default distributed training mode, memory is set by 'mapreduce.map.memory.mb' and 'mapreduce.map.java.opts'; the limitation is that master and workers are set to the same memory. Sometimes, with a huge number of workers (1000+) and messages that are not small (say 10G), you will see a master OOM issue because the master consumes the worker messages at the same time. This issue has been fixed in the recent https://github.com/ShifuML/guagua/tree/develop. As a temporary solution, guagua-core-0.8.0-SNAPSHOT.jar can be built manually to replace the guagua-core-*.jar in $SHIFU_HOME/lib.
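
A rough sketch of that temporary workaround, assuming a standard Maven build of the guagua develop branch (module layout and artifact version may differ):

  git clone https://github.com/ShifuML/guagua.git && cd guagua
  git checkout develop
  mvn clean package -DskipTests
  # replace the guagua-core jar bundled with Shifu (target path is assumed)
  rm $SHIFU_HOME/lib/guagua-core-*.jar
  cp guagua-core/target/guagua-core-*.jar $SHIFU_HOME/lib/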

(Non-TensorFlow) Training Job Failure Because of Worker OOM

With worker OOM you may see a Java OOM exception or GuaguaRuntimeException("List over size limit."). Both mean there is too much data to load into worker memory. The solution is to reshuffle the worker data or to tune the worker memory.
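
Both mitigations reuse commands shown earlier on this page; the shuffle size and memory values below are only illustrative:

  # more, smaller parts so each worker loads less data
  shifu norm -shuffle -Dshifu.norm.shuffle.size=300
  # or give each worker container more memory for the train step only
  shifu train -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts='-Xms3700m -Xmx3700m'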

How to Run Correlation Computation?

Correlation computation between each pair of columns is supported by

shifu stats -c

Please note that a prerequisite of this feature is to run

shifu stats

You can then find correlation.csv in your local folder. By default it is used in variable selection to remove correlated columns; the default correlation threshold in variable selection is 0.99, and you can change it to enable correlation-based variable selection.

How to Run PSI Computation?

PSI (Population Stability Index) is a very important metric for variable stability. It is enabled by configuring psiColumnName in ModelConfig.json:

  "dataSet" : {
    ...
    "psiColumnName" : "daily_column"
    ...
  }

'psiColumnName' should be a column in your dataset that identifies records by date/month or other flags like 'dev,oot', etc. If you don't have such a column, you need to pre-generate one for PSI computation. PSI values are then computed by

shifu stats
shifu stats -psi

'shifu stats' must be called before 'shifu stats -psi' to populate some basic stats metrics. The final results can be found in ColumnConfig.json for all columns.
