
SK Factor: one-liner CLI to factor and reuse code from sklearn projects

A streamlined workflow and command-line interface for scikit-learn (sklearn)

Report Bug · Request Feature

Table of Contents
  1. What is SK Factor?
  2. Usage examples
  3. Getting Started
  4. Configuration sections
  5. Plugin system
  6. Roadmap
  7. Contributing
  8. License
  9. Contact
  10. Acknowledgments

What is SK Factor?


python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml

SK Factor is a framework and CLI used to factor out repetitive, common development tasks in scikit-learn projects so you can quickly access and reuse them.

Inspired by module-based software projects, SK Factor's goal is to streamline the process of developing machine learning projects with sklearn.

Many development tasks are redundant when building an sklearn project, especially when it comes to pipelines.

From the get-go (in a single command-line run), you can perform many advanced actions:

  • Unlimited split training in one run

  • Train on different estimators (classifiers, regressors) for each split

  • Compare predictions from different models at once

  • Easily save and export models and plots

  • Standardized, readily available report / scoring templates

    (ROC curve, classification report, confusion matrix, precision-recall, feature permutation ...)

  • Enable and disable any step you choose (sampler, transformer)

To achieve this, SK Factor uses .toml configuration files reflecting each part of the workflow:

SK Factor workflow

Each step can be customized via a convenient plugin system (loaders, estimators ...).

By default, running sk_factor.py will run the complete workflow specified in your .toml configuration file. You can choose to filter only one part of the workflow using one of these arguments:

  • --explore : preprocess the data and display plots (uses only the dataset, preprocess and eda config)
  • --train : train on the given estimators and display plots (uses only the dataset, preprocess and training config)
  • --predict : predict with the chosen model(s) and display reports (uses only the dataset, preprocess and predictions config)

See the practical .toml configuration examples below.

Usage examples


Binary target


python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml

The credit_card_fraud.toml config file predicts credit card fraud using open_ml's CreditCardFraudDetection dataset.

Preprocessing:

[preprocess]
# the preprocess section is always required
  • The dataset is scaled, shuffled, and the 'Time' column is removed.
preprocessors.shuffle = 1 # shuffle's value is a random_state
transformers.scaler = [] # empty array means all columns
preprocessors.drop_columns = ['Time']
  • 25000 rows are removed at the end of the dataset and used for predictions.
preprocessors.drop_rows = -25000
drop_rows_to_predict_file = true

EDA:

[eda]
enabled = true
  • The target distribution is displayed
show_plots = true
plots = [
  'distribution_y',
]

Training:

[training]
enabled = true
  • The dataset imbalance is mitigated with the Tomek links algorithm.
pipeline = 'imblearn.pipeline'
samplers = [
    'sampler/tomek_links',
]
  • The kfold_stratified splitting method is applied with 3 splits.
splitting_method.kfold_stratified = 3
  • Two estimators are used on each split: kneighbors_classifier and sgd_classifier
estimators = [
    'classifier/kneighbors_classifier',
    'classifier/sgd_classifier',
]
  • Training score is f1.
runners = [
    'score',
]
scoring = 'f1'
  • Models are written to the models directory
save_model = true
model_timestamp = false
models_directory = 'models'

Predictions:

[predictions]
objective = 'binary'
loader = 'csv' # Loads test data from predict_file below
predict_file = 'tests/credit_card_fraud/test.csv'
preprocess = false # preprocess is skipped because the test data was extracted in the [preprocess] section.
enabled = true
  • Uses the models generated during the training phase, with a threshold of 0.5
models = [
  'models/credit_card_fraud-classifier/kneighbors_classifier.pkl',
  'models/credit_card_fraud-classifier/sgd_classifier.pkl'
]
threshold = 0.5
  • Predictions are saved to .csv files for each model and displayed in the console
predictions_directory = 'tests/credit_card_fraud/predictions'
save_predictions = true
predictions_timestamp = false
# Keep the original feature columns in the final predictions output.
keep_data = false

Multiclass target


python sk_factor.py -c examples/toy_datasets/config/iris.toml

The iris.toml config file extracts data from the sklearn toy datasets. This config file predicts a plant class from sepal attributes.

Preprocessing:

  • The dataset is shuffled
  • 5 rows are removed at the end for predictions

EDA:

  • A pair plot and a heatmap are displayed
  • DPI resolution is set to 200

Training:

  • The dataset imbalance is mitigated with the near miss sampler, and the Yeo-Johnson power transform is applied
  • Two kfold_shuffle and two kfold_stratified splits are made
  • The linear svc, xgboost and lgbm random forest estimators are used on each split
  • The accuracy score is calculated and printed (see the sketch below)
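
For reference, here is a minimal sketch of what the corresponding [training] block could look like. The sampler and estimator plugin keys are illustrative guesses following the 'sampler/...' and 'classifier/...' naming pattern used elsewhere in this README; the real values live in examples/toy_datasets/config/iris.toml.

[training]
enabled = true
pipeline = 'imblearn.pipeline'
# 'sampler/near_miss' is a guessed plugin key following the 'sampler/...' pattern
samplers = [
    'sampler/near_miss',
]
splitting_method.kfold_shuffle = 2
splitting_method.kfold_stratified = 2
# estimator keys below are illustrative guesses; check the real config for the exact names
estimators = [
    'classifier/linear_svc',
    'classifier/lgbm_classifier',
]
runners = [
    'score',
]
scoring = 'accuracy'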

Predictions:

  • Uses the models produced by the training phase of the current script execution
  • Predictions are saved to .csv files for each model and displayed in the console

Regression target


python sk_factor.py -c examples/open_ml/config/happiness_rank.toml

The happiness_rank.toml config file extracts data from open_ml and predicts a happiness score based on demographics and lifestyle attributes.

Preprocessing:

  • Passes through 7 attributes ('Economy', 'Family', 'Health', 'Freedom', ...)
  • Applies a one-hot encoder to the 'Region' column
  • Scales the 'Standard Error' column
  • Shuffles the dataset and keeps 5 rows at the end for predictions (see the sketch below)
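
As a rough sketch (the label column name is an assumption and the passthrough list is abbreviated), the corresponding [preprocess] block could look like this:

[preprocess]
label = 'Happiness Score'                # assumed target column name
transformers.passthrough = ['Economy', 'Family', 'Health', 'Freedom']  # abbreviated; 7 attributes in the real config
transformers.one_hot_encoder = ['Region']
transformers.scaler = ['Standard Error']
preprocessors.shuffle = 1                # value is a random_state
preprocessors.drop_rows = -5             # keep the last 5 rows for predictions
drop_rows_to_predict_file = true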

EDA:

  • Plots the heatmap with a 35x35 figure size and 200 DPI

Training:

  • Creates 5 kfold and 5 shuffled kfold splits
  • Applies the xgboost, lgbm regressor and hgbr estimators on each split
  • Prints the r2 score for each estimator
  • Saves model files

Predictions:

  • Predicts from previously saved model files

(back to top)

Getting started


SK Factor is a standard object-oriented Python project organized in packages and modules.

By default, running sk_factor requires the following dependencies:

Prerequisites

  • sklearn -> core functionality such as pipelines is based on sklearn
  • imblearn -> provides advanced samplers to mitigate dataset imbalance
  • toml -> standard format for configuration files
  • argparse -> CLI argument handling (part of the Python standard library)
  • pandas -> advanced dataset array operations
  • matplotlib -> data visualisation
  • seaborn -> diagram plots

The standard Python package installer pip is required:

pip install scikit-learn imbalanced-learn toml pandas matplotlib seaborn

Additional dependencies:

  • lightgbm -> additional gradient boosting estimators
  • xgboost -> additional gradient boosting estimators
  • shap -> advanced feature analysis (such as permutation)
  • openml -> access to OpenML machine learning datasets (instead of csv files)

Installation

  1. Clone the repo
    git clone https://github.com/B2F/sk-factor.git
  2. Grab one of the example configurations from the examples directory

(back to top)

Configuration sections


[dataset]


The [dataset] section is used to describe the data source and how to parse it.

Example from happiness_rank.toml (a csv-loader variant is sketched after the option list below):

[dataset]
loader = 'open_ml'
files = ['HappinessRank_2015']
show_columns = true
plugins = 'examples.open_ml'
  • loader

    Data parser from which one or multiple files are read.

    Options: csv, open_ml, toy_datasets.

    @see plugins/loader

  • files

    Array of arguments to be passed to the loader.

    Can be replaced by the --train_files CLI argument.

  • show_columns

    Displays all available dataset columns at the beginning of the CLI output.

  • plugins

    Package or directory used to override plugin definitions. @see the plugin system.
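
For comparison, a hedged sketch of a [dataset] section using the csv loader (the file path is a placeholder):

[dataset]
loader = 'csv'
files = ['data/train.csv']   # placeholder path passed to the csv loader
show_columns = true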

[preprocess]


The preprocess section is used to apply transformations to the dataset (drop, shuffle, encode, passthrough); a combined sketch follows the option list below.

@see plugins/preprocess

  • label

    Column name used as the target label.

  • label_encode

    Boolean; specifies whether the label must be encoded (use true for string labels).

  • transformers.passthrough

    List of columns left unchanged, use an empty [] for all.

  • transformers.one_hot_encoder

    Encodes values as new categorical columns (one-hot encoding).

  • transformers.ordinal_encoder

    Encode values by replacing them in the same column.

  • transformers.scaler

    Applies sklearn StandardScaler.

  • transformers.shuffle

    Shuffles the DataFrame; the value is used as the random state.

  • preprocessors.drop_rows

    Drops n rows at the beginning (positive integer), or from the end (negative integer).

  • drop_rows_to_predict_file

    Uses the dropped rows for predictions (@see predictions).

  • verbose_feature_names_out

    Set to false to remove the suffixes added by the one-hot encoder.

  • files_axis

    Chooses the DataFrame merge axis when using multiple files.
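
A minimal sketch combining several of the options above (column names and values are placeholders, not taken from a real example config):

[preprocess]
label = 'target'                          # placeholder label column
label_encode = true                       # encode a string label
transformers.ordinal_encoder = ['grade']  # placeholder column
transformers.passthrough = []             # empty array: keep all remaining columns
preprocessors.drop_rows = -100            # drop the last 100 rows (used for predictions via the flag below)
drop_rows_to_predict_file = true
verbose_feature_names_out = false
files_axis = 0                            # assumed: merge multiple files along the row axis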

[eda]


The eda (Exploratory Data Analysis) section configures matplotlib and seaborn plots, or anything else printed with Python; a combined sketch follows the option list below.

  • enabled

    Set to false to skip the EDA phase entirely: no output, no saved files (default: true).

    Use the --explore CLI option to filter script execution on eda config only.

  • show_plots

    To skip diagram or printed output, use show_plots = false.

  • save_images

    Writes plot images to files.

  • save_timestamp

    Append a timestamp suffix to saved files.

  • images_extension

    Extension of saved files.

  • images_directory

    Directory of saved plot files.

  • plots

    Plot plugins to use, @see plugins/plots

    Options: heatmap, pairplot, distribution_y, distribution_x

  • features

    Specifies an array of column names to be used with the plot plugins above.

  • figsize

    Figure width and height in inches.

  • dpi

    Figure resolution in DPI.
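
Putting the options together, a minimal [eda] sketch (values, column names and the figsize format are illustrative assumptions):

[eda]
enabled = true
show_plots = true
save_images = true
save_timestamp = false
images_extension = 'png'       # illustrative extension
images_directory = 'plots'     # illustrative directory
plots = [
  'heatmap',
  'distribution_y',
]
features = ['col_a', 'col_b']  # placeholder column names used by the plots above
figsize = [10, 8]              # assumed format: width and height in inches
dpi = 200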

[training]


The training section is used to train estimators on splits and to create models; a regression-oriented sketch follows the option list below.

  • enabled

    Set to false to force-skip the training section (default: true).

    Use the --train option to filter script execution on training config only.

  • pipeline

    The training pipeline module (default: 'imblearn.pipeline').

  • samplers

    Imblearn samplers ('sampler/smote', 'sampler/tomek_links')

  • estimators

    Classifier and regressor estimators (see plugins/classifiers).

    # Ex:
    estimators = [
       'classifier/logistic_regression',
       'classifier/ridge_classifier',
       'classifier/kneighbors_classifier',
       'classifier/sgd_classifier',
       'classifier/lgbm_classifier',
    ]
  • runners

    Training score runners: 'score', 'classification_report', 'confusion_matrix', 'precision_recall' ... (@see [plugins/training](https://github.com/B2F/sk-factor/blob/main/plugins/training))

  • scoring

    Scoring metric passed as argument to the score runner plugin ('f1', 'r2' ...)

  • splitting_method.kfold_stratified

    Specify any number of sklearn splitting methods (one per line), with the value used as n_splits. Ex:

    splitting_method.kfold = 5
    splitting_method.kfold_shuffle = 5
  • save_model

    Save trained model

  • model_timestamp

    Append timestamp suffix to model filename

  • models_directory

    Saved models directory
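
The credit card fraud example above shows a full classification [training] block; here is a hedged regression-flavoured variant. The regressor plugin keys are guesses following the 'regressor/...' pattern; the real names are in examples/open_ml/config/happiness_rank.toml.

[training]
enabled = true
pipeline = 'imblearn.pipeline'
splitting_method.kfold = 5
splitting_method.kfold_shuffle = 5
estimators = [
    'regressor/lgbm_regressor',   # guessed key
    'regressor/hgbr',             # guessed key
]
runners = [
    'score',
]
scoring = 'r2'
save_model = true
model_timestamp = false
models_directory = 'models'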

[predictions]


The predictions section is used to predict from training data or model files; a minimal sketch follows the option list below.

  • enabled

    Enables or disables the predictions section altogether.

    Use the --predict option to filter CLI execution on the predictions section only.

  • loader

    The loader plugin used to retrieve data for prediction.

    Ex: 'csv'

  • preprocess

    Chooses whether or not to re-use the preprocess section rules for the prediction data.

    If you used preprocessors.drop_rows with drop_rows_to_predict_file enabled in the preprocess section, then your prediction data is already preprocessed and you'll want to set preprocess = false.

  • predict_file

    Path used to make predictions (test data).

    If you set drop_rows_to_predict_file = true, then this file will be written with the rows dropped from the original dataset, as specified in preprocessors.drop_rows.

  • models

    An array of model files to use for predictions.

  • objective

    Options: 'binary', 'multiclass', 'regressor'.

  • threshold

    Threshold parameter passed to the objective plugin to filter probabilities.

  • save_predictions

    If set to true, predictions will be saved to the predictions_directory.

  • predictions_directory

    Where to save predictions.

  • predictions_timestamp

    Set to true to append the timestamp to predictions filenames.

  • keep_data

    Set to true to keep the original data columns in the prediction files.
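
As a variation on the credit card fraud example above, a minimal regression-oriented [predictions] sketch (paths are placeholders):

[predictions]
enabled = true
objective = 'regressor'
loader = 'csv'
predict_file = 'data/predict.csv'   # placeholder path
preprocess = true                   # re-apply the [preprocess] rules to this file
models = [
  'models/my_regressor.pkl',        # placeholder model path
]
save_predictions = true
predictions_directory = 'predictions'
predictions_timestamp = false
keep_data = true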

[debug]


  • enabled

    Enables the Python CLI debugger (a sketch follows this list).

  • port

    Debug port (Usually 5678)

  • host

    Host address (Usually '127.0.0.1')

  • wait_for_client

    Set to true to make execution wait for a debugger client to attach before starting.
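
None of the example configs shown in this README enables debugging, so here is a sketch using the values suggested above:

[debug]
enabled = true
host = '127.0.0.1'
port = 5678
wait_for_client = true   # wait for a debugger client to attach before running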

(back to top)

Plugin system


Default plugins are located in the plugins directory:

  • plugins/loader -> inherits BaseLoader

    Used to implement specific loading methods, like the provided CSV loader.

  • plugins/preprocess/preprocessor -> inherits BasePreprocessor

    Used to add data transformations applied to the whole dataset (drop column, drop NA, drop n rows ...)

  • plugins/preprocess/transformer -> inherits BaseTransformer

    Per-column data transformers (scaler, discretizer, encoder ...)

  • plugins/preprocess/selector -> inherits BaseSelector

    Used to apply a transformer to columns chosen by a selector (numbers, strings, k best)

  • plugins/plots -> inherits Report

    Plotting: heatmaps, pair plots ...

  • plugins/estimators/classifier -> inherits BaseEstimator

    Adds classifier algorithms (linear svc, lgbm ...)

  • plugins/estimators/regressor -> inherits BaseEstimator

    Adds regressor algorithms (ridge cv, hgbr, xgboost ...)

  • plugins/estimators/sampler -> inherits BaseEstimator

    Sampling methods with the imblearn pipeline (smote, near miss, tomek links, instance hardness ...)

  • plugins/estimators/transformer -> inherits BaseEstimator

    Power transforms (yeo johnson)

  • plugins/split -> inherits BaseCv

    Splitting methods (Kfold, leave one out, shuffle ...)

  • plugins/training -> inherits TrainingPlot

    Training reports (confusion matrix, classification report, SHAP permutation ...)

  • plugins/predictions -> inherits BasePredictor

    Handles prediction objective with output format and threshold (binary, multiclass, regression ...)

You can override or add functionality by putting your plugin class files in a package containing a plugins/ directory whose hierarchy mirrors the project's base plugins structure.

Plugin file names must match the class name in CamelCase, with an underscore signalling each uppercase character (for example, a hypothetical my_loader.py file would hold a MyLoader class).

This package is specified by the plugins key in your toml config's dataset section.

From the examples/toy_datasets/config/iris.toml file:

[dataset]
...
plugins = 'examples.toy_datasets'

If you look into examples/toy_datasets/plugins you will find a custom plugin structure.

Roadmap


  • clean OOP code architecture with a plugin system
  • custom loader
  • exploratory data analysis
  • training
  • predictions
  • Sphinx documentation (complete list of configuration options in the external documentation)
  • Manage default values for unspecified config elements
  • Additional plugins (roc curve with threshold display on both roc and precision / recall)
  • Stacking estimators
  • Time series forecasting example with tsfresh and sktime
  • More example use cases and tests
  • Use src.engine to build a GUI, with dynamic access to plugins

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing


Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can help with the following tags:

  • plugins
  • loader
  • preprocess
  • plots
  • estimators
  • training
  • predictions
  • split
  • engine

Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License


Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact


B2F - Linkedin

Project Link: https://github.com/B2F/sk-factor

(back to top)

Acknowledgments


SK Factor relies on all of these awesome data science projects!

(back to top)
