scikit-learn (sklearn) streamlined workflow and command line interface
Report Bug
·
Request Feature
SK Factor is a framework and CLI used to factor out scikit-learn's repetitive, common development tasks so you can quickly access and reuse them.
Inspired by module-based software projects, SK Factor's goal is to streamline the process of developing machine learning projects with sklearn.
Many development tasks are repetitive when building a sklearn project, especially when it comes to pipelines.
From the get-go (in a single command-line run), you will be able to do many advanced things:
- Unlimited split training in one run
- Train on different estimators (classifiers, regressors) for each split
- Compare predictions from different models at once
- Easily save and export models and plots
- Standardized, readily available report / scoring templates (roc curve, classification report, confusion matrix, precision recall, feature permutation ...)
- Enable and disable any step you choose (sampler, transformer)
To achieve this, SK Factor uses .toml configuration files reflecting each part of the workflow:
Each step can be customized via a convenient plugin system (loaders, estimators ...).
By default, running sk_factor.py runs the complete workflow specified in your .toml configuration file. You can restrict execution to a single part of the workflow using one of these arguments:
- --explore : preprocess the data and display plots (uses only the dataset, preprocess and eda config)
- --train : train on the given estimators and display plots (uses only the dataset, preprocess and training config)
- --predict : predict with the chosen model(s) and display reports (uses only the dataset, preprocess and predictions config)
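For example, to run only the training phase of a configuration (here the credit card example detailed below):
python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml --train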
See the practical .toml configuration examples below.
python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml
The credit_card_fraud.toml config file predicts credit card fraud using open_ml's CreditCardFraudDetection dataset.
[preprocess]
# preprocessing section is always required
- The dataset is scaled and shuffled, and the 'Time' column is removed.
preprocessors.shuffle = 1 # shuffle's value is a random_state
transformers.scaler = [] # empty array means all columns
preprocessors.drop_columns = ['Time']
- 25000 rows are removed at the end of the dataset and used for predictions.
preprocessors.drop_rows = -25000
drop_rows_to_predict_file = true
[eda]
enabled = true
- The target distribution is displayed
show_plots = true
plots = [
'distribution_y',
]
[training]
enabled = true
- The dataset imbalance is mitigated with the tomek links algorithm.
pipeline = 'imblearn.pipeline'
samplers = [
'sampler/tomek_links',
]
- The kfold_stratified splitting method is applied with 3 splits.
splitting_method.kfold_stratified = 3
- Two estimators are used on each split: kneighbors_classifier and sgd_classifier
estimators = [
'classifier/kneighbors_classifier',
'classifier/sgd_classifier',
]
- Training score is f1.
runners = [
'score',
]
scoring = 'f1'
- Models are written to the models directory
save_model = true
model_timestamp = false
models_directory = 'models'
[predictions]
objective = 'binary'
- Extracts test data from original dataset (@see drop_rows_to_predict_file)
loader = 'csv' # Loads test data from predict_file below
predict_file = 'tests/credit_card_fraud/test.csv'
preprocess = false # preprocessing is skipped because the test data was already preprocessed by the [preprocess] section.
enabled = true
- Uses the models generated during the training phase, with a threshold of 0.5
models = [
'models/credit_card_fraud-classifier/kneighbors_classifier.pkl',
'models/credit_card_fraud-classifier/sgd_classifier.pkl'
]
threshold = 0.5
- Predictions are saved to .csv files for each model and displayed in the console
predictions_directory = 'tests/credit_card_fraud/predictions'
save_predictions = true
predictions_timestamp = false
# Keep the original feature columns in the final predictions output.
keep_data = false
python sk_factor.py -c examples/toy_datasets/config/iris.toml
The iris.toml config file loads data from sklearn's toy datasets. This config file predicts a plant class from sepal attributes.
- The dataset is shuffled
- 5 rows are removed at the end for predictions
- A pair plot and a heatmap are displayed
- DPI resolution is set to 200
- The dataset imbalance is mitigated with the near miss sampler, and a yeo johnson power transform is applied
- Two kfold_shuffle and two kfold_stratified splits are made
- The linear svc, xgboost and lgbm random forest estimators are used on each split
- The accuracy score is calculated and printed
- Predictions reuse the models trained during the current script execution
- Predictions are saved to .csv files for each model and displayed in the console
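As a rough sketch, such a configuration could combine the options documented in the reference sections below. The shipped iris.toml may differ, and the estimator, sampler and dpi identifiers here are pattern-based guesses, not verified names:
[dataset]
loader = 'toy_datasets'
files = ['iris']
plugins = 'examples.toy_datasets'
[preprocess]
preprocessors.shuffle = 1 # shuffle's value is a random_state
preprocessors.drop_rows = -5 # keep the last 5 rows for predictions
drop_rows_to_predict_file = true
[eda]
enabled = true
plots = ['pairplot', 'heatmap']
dpi = 200 # illustrative key name for the DPI resolution
[training]
samplers = ['sampler/near_miss'] # identifier follows the 'sampler/tomek_links' pattern
splitting_method.kfold_shuffle = 2
splitting_method.kfold_stratified = 2
estimators = ['classifier/linear_svc'] # illustrative identifier
runners = ['score']
scoring = 'accuracy'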
python sk_factor.py -c examples/open_ml/config/happiness_rank.toml
The happiness_rank.toml config file extracts data from open_ml and predicts a happiness score based on demographics and lifestyle attributes.
- Passes through 7 attributes ('Economy', 'Family', 'Health', 'Freedom', ...)
- Applies a one hot encoder to the 'Region' column
- Scales the 'Standard Error' column
- Shuffles the dataset and keeps 5 rows at the end for predictions
- Plots the heatmap with a 35*35 figure size and 200 DPI
- Creates 5 kfold and 5 shuffled kfold splits
- Applies the xgboost, lgbm regressor and hgbr estimators on each split
- Prints the r2 score for each estimator
- Saves model files
- Predicts from the previously saved model files (see the sketch below)
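A minimal sketch of the splitting and scoring portion of this regression setup, using the documented splitting_method syntax; the 'regressor/...' identifiers are illustrative (built on the plugins/estimators/regressor path), not the exact names from the shipped config:
[training]
splitting_method.kfold = 5
splitting_method.kfold_shuffle = 5
# the identifiers below are illustrative, following the 'classifier/...' naming pattern
estimators = ['regressor/xgboost', 'regressor/lgbm_regressor', 'regressor/hgbr']
runners = ['score']
scoring = 'r2'
save_model = true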
SK Factor is a standard Python OOP script organized in packages and modules.
By default, running sk_factor requires the following dependencies:
- sklearn -> core functionality such as pipelines is based on sklearn
- imblearn -> provides advanced samplers to mitigate dataset imbalance
- toml -> standard format for the configuration files
- argparse -> CLI argument handling (part of the Python standard library)
- pandas -> advanced dataset array operations
- matplotlib -> data visualization
- seaborn -> diagram plots
The standard Python package installer pip is required (scikit-learn and imbalanced-learn are the PyPI names of sklearn and imblearn; argparse needs no install):
pip install scikit-learn imbalanced-learn toml pandas matplotlib seaborn
Additional dependencies:
- lightgbm -> additional gradient boosting estimators
- xgboost -> additional gradient boosting estimators
- shap -> advanced feature analysis (such as permutation)
- openml -> access to machine learning datasets (instead of csv)
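They can be installed the same way when needed:
pip install lightgbm xgboost shap openml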
- Clone the repo
git clone https://github.com/B2F/sk-factor.git
- Grab one of the examples below
The [dataset] section is used to describe the data source and how to parse it.
Example from happiness_rank.toml:
[dataset]
loader = 'open_ml'
files = ['HappinessRank_2015']
show_columns = true
plugins = 'examples.open_ml'
- loader : Data parser from which one or multiple files are read. Options: csv, open_ml, toy_datasets. @see plugins/loader
- files : Array of arguments to be passed to the loader. Can be replaced by the --train_files CLI argument.
- show_columns : Displays all of the available dataset columns at the beginning of the CLI output.
- plugins : Package or directory used to override plugin definitions. @see the plugins system.
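For comparison, a plain csv source (the file path here is illustrative) only swaps the loader and files values:
[dataset]
loader = 'csv'
files = ['data/train.csv']
show_columns = true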
The [preprocess] section is used to apply transformations to the dataset (drop, shuffle, encode, passthrough). @see plugins/preprocess
- Column name used as the target label.
- Boolean specifying whether the label must be encoded (use true for string labels).
- List of columns left unchanged; use an empty [] for all.
- Encode values into a new categorical column.
- Encode values by replacing them in the same column.
- transformers.scaler : Applies the sklearn StandardScaler.
- preprocessors.shuffle : Shuffles the DataFrame, with the random state as value.
- preprocessors.drop_rows : Drops n rows from the beginning (positive integer) or from the end (negative integer).
- drop_rows_to_predict_file : Use the dropped rows for predictions (@see predictions).
- Set to false to remove suffixes from the one hot encoder.
- Choose the DataFrame merge axis when using multiple files.
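Putting the documented keys together, the preprocess block from the credit card example reads:
[preprocess]
preprocessors.drop_columns = ['Time']
preprocessors.shuffle = 1
transformers.scaler = [] # empty array means all columns
preprocessors.drop_rows = -25000 # keep the last 25000 rows for predictions
drop_rows_to_predict_file = true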
The [eda] section (Exploratory Data Analysis) covers matplotlib and seaborn plots, or anything else printed with Python.
- enabled : If the EDA phase is disabled there is no output and no file save (default: true). Use the --explore CLI option to restrict script execution to the eda config only.
- show_plots : To skip diagrams and printed output, use show_plots = false.
- Write plot visuals to files.
- Append a timestamp suffix to the saved files.
- Extension of the saved files.
- Directory of the saved plot files.
- plots : Plot plugins to use, @see plugins/plots. Options: heatmap, pairplot, distribution_y, distribution_x
- Array of column names to be used with the plot plugins above.
- Figure size width and height in inches.
- Figure resolution in DPI.
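For instance, the eda block from the credit card example, extended with a second documented plot plugin:
[eda]
enabled = true
show_plots = true
plots = [
'distribution_y',
'heatmap',
]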
The [training] section is used to train on splits and to create models.
- enabled : If you want to skip the training section, set it to false (default: true). Use the --train option to restrict script execution to the training config only.
- pipeline : The training pipeline module (default: 'imblearn.pipeline').
- samplers : Imblearn samplers ('sampler/smote', 'sampler/tomek_links').
- estimators : Classifier and regressor estimators (@see plugins/classifiers). Ex: estimators = [ 'classifier/logistic_regression', 'classifier/ridge_classifier', 'classifier/kneighbors_classifier', 'classifier/sgd_classifier', 'classifier/lgbm_classifier' ]
- runners : Training score runners: 'score', 'classification_report', 'confusion_matrix', 'precision_recall' ... @see plugins/training: https://github.com/B2F/sk-factor/blob/main/plugins/training
- scoring : Scoring metric passed as an argument to the score runner plugin ('f1', 'r2' ...).
- splitting_method : Specify an unlimited number of sklearn splitting methods (one per line), with the value as n_splits. Ex:
splitting_method.kfold = 5
splitting_method.kfold_shuffle = 5
- save_model : Save the trained models.
- model_timestamp : Append a timestamp suffix to the model filenames.
- models_directory : Directory where the models are saved.
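Assembled from the keys above, the complete training block of the credit card example:
[training]
enabled = true
pipeline = 'imblearn.pipeline'
samplers = ['sampler/tomek_links']
splitting_method.kfold_stratified = 3
estimators = ['classifier/kneighbors_classifier', 'classifier/sgd_classifier']
runners = ['score']
scoring = 'f1'
save_model = true
model_timestamp = false
models_directory = 'models'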
The [predictions] section is used to predict from training data or from model files.
- enabled : Enable or disable the predictions section altogether. Use the --predict option to restrict CLI execution to the predictions section only.
- loader : The loader plugin used to retrieve the data to predict on. Ex: 'csv'
- preprocess : Choose whether or not to re-use the preprocess section rules for the prediction data. If you used preprocessors.drop_rows with drop_rows_to_predict_file enabled in the preprocess section, then your prediction data is already preprocessed and you'll want to set preprocess = false.
- predict_file : Path of the file used to make predictions (test data). If you set drop_rows_to_predict_file = true, then this file will be written with the number of rows from the original dataset specified in preprocessors.drop_rows.
- models : An array of model files to use for predictions.
- objective : Options: 'binary', 'multiclass', 'regressor'.
- threshold : Threshold parameter passed to the objective plugin to filter probabilities.
- save_predictions : If set to true, predictions will be saved to the predictions_directory.
- predictions_directory : Where to save the predictions.
- predictions_timestamp : Set to true to append a timestamp to the prediction filenames.
- keep_data : Set to true to keep all of the input data columns in the prediction files.
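For reference, the predictions block from the credit card example gathers most of these keys:
[predictions]
enabled = true
objective = 'binary'
loader = 'csv'
predict_file = 'tests/credit_card_fraud/test.csv'
preprocess = false
models = [
'models/credit_card_fraud-classifier/kneighbors_classifier.pkl',
'models/credit_card_fraud-classifier/sgd_classifier.pkl',
]
threshold = 0.5
save_predictions = true
predictions_directory = 'tests/credit_card_fraud/predictions'
predictions_timestamp = false
keep_data = false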
Debugger options:
- Enable the Python CLI debugger.
- Debug port (usually 5678).
- Host address (usually '127.0.0.1').
- Set to true to start the debugger with execution.
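A sketch of what such a block might look like; all key names here are illustrative guesses (check the project source for the real ones):
[debug]
enabled = true # illustrative key name
port = 5678 # usual debug port
host = '127.0.0.1'
start = true # start the debugger with execution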
Default plugins are located in the plugins directory:
- plugins/loader -> inherits BaseLoader. Used to implement specific loading methods, like the provided CSV loader.
- plugins/preprocess/preprocessor -> inherits BasePreprocessor. Used to add data transformations applied to the whole dataset (drop column, drop na, drop n rows ...).
- plugins/preprocess/transformer -> inherits BaseTransformer. Data transformers applied per column (scaler, discretizer, encoder ...).
- plugins/preprocess/selector -> inherits BaseSelector. Used to apply a transformer on the columns matched by a selector (numbers, strings, k best).
- plugins/plots -> inherits Report. Plots heatmaps, pairplots ...
- plugins/estimators/classifier -> inherits BaseEstimator. Adds classifier algorithms (linear svc, lgbm ...).
- plugins/estimators/regressor -> inherits BaseEstimator. Adds regressor algorithms (ridge cv, hgbr, xgboost ...).
- plugins/estimators/sampler -> inherits BaseEstimator. Sampling methods for the imblearn pipeline (smote, near miss, tomek links, instance hardness ...).
- plugins/estimators/transformer -> inherits BaseEstimator. Power transforms (yeo johnson).
- plugins/split -> inherits BaseCv. Splitting methods (kfold, leave one out, shuffle ...).
- plugins/training -> inherits TrainingPlot. Training reports (confusion matrix, classification report, shap permutation ...).
- plugins/predictions -> inherits BasePredictor. Handles the prediction objective with its output format and threshold (binary, multiclass, regression ...).
You can override or add functionality by putting your plugin class files in a package containing a plugins/ directory whose hierarchy mirrors the project's base plugins structure.
Plugin file names must match their class name: the class is CamelCase and the file name uses an underscore to signal each uppercase character (e.g. a csv_loader.py file would hold a CsvLoader class).
This package is specified by the plugins key in your toml config's dataset section.
From the examples/toy_datasets/config/iris.toml file:
[dataset]
...
plugins = 'examples.toy_datasets'
If you look into examples/toy_datasets/plugins you will find a custom plugins structure.
- clean OOP code architecture with a plugin system
- custom loader
- exploratory data analysis
- training
- predictions
- Sphinx documentation (complete list of configuration options in the external documentation)
- Manage default values for unspecified config elements
- Additional plugins (roc curve with threshold display on both roc and precision / recall)
- Stacking estimators
- Example of time series forecasting with tsfresh and sktime
- More example use cases and tests
- Using src.engine to build a GUI, dynamic access to plugins
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can help with the following tags:
- plugins
- loader
- preprocess
- plots
- estimators
- training
- predictions
- split
- engine
Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
B2F - LinkedIn
Project Link: https://github.com/B2F/sk-factor
SK Factor relies on all of these awesome Data Science projects!
- Python
- Visualization
- Machine learning