scikit-learn (sklearn) streamlined workflow and command line interface
Report Bug
·
Request Feature
SK Factor is a framework and CLI used to factor out scikit-learn's repetitive, common development tasks so you can quickly access and reuse them.
Inspired by module-based software projects, SK Factor's goal is to streamline the process of developing machine learning projects with sklearn.
Many development tasks are repetitive when building a sklearn project, especially when it comes to pipelines.
From the get-go (in a single command-line run), you will be able to do many advanced things:
- Unlimited split training in one run
- Train on different estimators (classifiers, regressors) for each split
- Compare predictions from different models at once
- Easily save and export models and plots
- Standardized, readily available report / scoring templates (roc curve, classification report, confusion matrix, precision recall, feature permutation ...)
- Enable and disable any step you choose (sampler, transformer)
To achieve this, SK Factor uses .toml configuration files reflecting each part of the workflow:
Each step can be customized via a convenient plugin system (loaders, estimators ...).
By default, running sk_factor.py runs the complete workflow specified in your .toml configuration file. You can restrict execution to a single part of the workflow using one of these arguments:
- --explore : preprocess the data and display plots (uses only the dataset, preprocess and eda config)
- --train : train on the given estimators and display plots (uses only the dataset, preprocess and training config)
- --predict : predict with the chosen model(s) and display reports (uses only the dataset, preprocess and predictions config)
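For example, to run only the training phase of a configuration (here the credit card example detailed below):
python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml --train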
See the practical .toml configuration examples below.
python sk_factor.py -c examples/open_ml/config/credit_card_fraud.toml
The credit_card_fraud.toml config file predicts credit card fraud using open_ml's CreditCardFraudDetection dataset.
[preprocess]
# preprocessing section is always required
- The dataset is scaled and shuffled, and the 'Time' column is removed.
preprocessors.shuffle = 1 # shuffle's value is a random_state
transformers.scaler = [] # empty array means all columns
preprocessors.drop_columns = ['Time']
- 25000 rows are removed at the end of the dataset and used for predictions.
preprocessors.drop_rows = -25000
drop_rows_to_predict_file = true
[eda]
enabled = true
- The target distribution is displayed
show_plots = true
plots = [
'distribution_y',
]
[training]
enabled = true
- The dataset imbalance is mitigated with the tomek links algorithm.
pipeline = 'imblearn.pipeline'
samplers = [
'sampler/tomek_links',
]
- The kfold_stratified splitting method is applied with 3 splits.
splitting_method.kfold_stratified = 3
- Two estimators are used on each split: kneighbors_classifier and sgd_classifier
estimators = [
'classifier/kneighbors_classifier',
'classifier/sgd_classifier',
]
- Training score is f1.
runners = [
'score',
]
scoring = 'f1'
- Models are written to the models directory
save_model = true
model_timestamp = false
models_directory = 'models'
[predictions]
objective = 'binary'
- Extracts test data from original dataset (@see drop_rows_to_predict_file)
loader = 'csv' # Loads test data from predict_file below
predict_file = 'tests/credit_card_fraud/test.csv'
preprocess = false # preprocessing is skipped because the test data was already preprocessed by the [preprocess] section.
enabled = true
- Uses the models generated during the training phase, with a threshold of 0.5
models = [
'models/credit_card_fraud-classifier/kneighbors_classifier.pkl',
'models/credit_card_fraud-classifier/sgd_classifier.pkl'
]
threshold = 0.5
- Predictions are saved to .csv files for each model and displayed in the console
predictions_directory = 'tests/credit_card_fraud/predictions'
save_predictions = true
predictions_timestamp = false
# Keep the original feature columns in the final predictions output.
keep_data = false
python sk_factor.py -c examples/toy_datasets/config/iris.toml
The iris.toml config file loads data from sklearn's toy datasets. This config file predicts a plant class from sepal attributes.
- The dataset is shuffled
- 5 rows are removed at the end for predictions
- A pair plot and a heatmap are displayed
- DPI resolution is set to 200
- The dataset imbalance is mitigated with the near miss sampler, and a yeo johnson power transform is applied
- Two kfold_shuffle and two kfold_stratified splits are made
- The linear svc, xgboost and lgbm random forest estimators are used on each split
- The accuracy score is calculated and printed
- Predictions reuse the models trained during the current script execution
- Predictions are saved to .csv files for each model and displayed in the console
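As a rough sketch, such a configuration could combine the options documented in the reference sections below. The shipped iris.toml may differ, and the estimator, sampler and dpi identifiers here are pattern-based guesses, not verified names:
[dataset]
loader = 'toy_datasets'
files = ['iris']
plugins = 'examples.toy_datasets'
[preprocess]
preprocessors.shuffle = 1 # shuffle's value is a random_state
preprocessors.drop_rows = -5 # keep the last 5 rows for predictions
drop_rows_to_predict_file = true
[eda]
enabled = true
plots = ['pairplot', 'heatmap']
dpi = 200 # illustrative key name for the DPI resolution
[training]
samplers = ['sampler/near_miss'] # identifier follows the 'sampler/tomek_links' pattern
splitting_method.kfold_shuffle = 2
splitting_method.kfold_stratified = 2
estimators = ['classifier/linear_svc'] # illustrative identifier
runners = ['score']
scoring = 'accuracy'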
python sk_factor.py -c examples/open_ml/config/happiness_rank.toml
The happiness_rank.toml config file extracts data from open_ml and predicts a happiness score based on demographics and lifestyle attributes.
- Passes through 7 attributes ('Economy', 'Family', 'Health', 'Freedom', ...)
- Applies a one hot encoder to the 'Region' column
- Scales the 'Standard Error' column
- Shuffles the dataset and keeps 5 rows at the end for predictions
- Plots the heatmap with a 35*35 figure size and 200 DPI
- Creates 5 kfold and 5 shuffled kfold splits
- Applies the xgboost, lgbm regressor and hgbr estimators on each split
- Prints the r2 score for each estimator
- Saves model files
- Predicts from the previously saved model files (see the sketch below)
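A minimal sketch of the splitting and scoring portion of this regression setup, using the documented splitting_method syntax; the 'regressor/...' identifiers are illustrative (built on the plugins/estimators/regressor path), not the exact names from the shipped config:
[training]
splitting_method.kfold = 5
splitting_method.kfold_shuffle = 5
# the identifiers below are illustrative, following the 'classifier/...' naming pattern
estimators = ['regressor/xgboost', 'regressor/lgbm_regressor', 'regressor/hgbr']
runners = ['score']
scoring = 'r2'
save_model = true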
SK Factor is a standard Python OOP script organized in packages and modules.
By default, running sk_factor requires the following dependencies:
- sklearn -> core functionality such as pipelines is based on sklearn
- imblearn -> provides advanced samplers to mitigate dataset imbalance
- toml -> standard format for the configuration files
- argparse -> CLI argument handling (part of the Python standard library)
- pandas -> advanced dataset array operations
- matplotlib -> data visualization
- seaborn -> diagram plots
The standard Python package installer pip is required (scikit-learn and imbalanced-learn are the PyPI names of sklearn and imblearn; argparse needs no install):
pip install scikit-learn imbalanced-learn toml pandas matplotlib seaborn
Additional dependencies:
- lightgbm -> additional gradient boosting estimators
- xgboost -> additional gradient boosting estimators
- shap -> advanced feature analysis (such as permutation)
- openml -> access to machine learning datasets (instead of csv)
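They can be installed the same way when needed:
pip install lightgbm xgboost shap openml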
- Clone the repo
git clone https://github.com/B2F/sk-factor.git
- Grab one of the examples below
The [dataset] section is used to describe the data source and how to parse it.
Example from happiness_rank.toml:
[dataset]
loader = 'open_ml'
files = ['HappinessRank_2015']
show_columns = true
plugins = 'examples.open_ml'
- loader : Data parser from which one or multiple files are read. Options: csv, open_ml, toy_datasets. @see plugins/loader
- files : Array of arguments to be passed to the loader. Can be replaced by the --train_files CLI argument.
- show_columns : Displays all of the available dataset columns at the beginning of the CLI output.
- plugins : Package or directory used to override plugin definitions. @see the plugins system.
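For comparison, a plain csv source (the file path here is illustrative) only swaps the loader and files values:
[dataset]
loader = 'csv'
files = ['data/train.csv']
show_columns = true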
The [preprocess] section is used to apply transformations to the dataset (drop, shuffle, encode, passthrough). @see plugins/preprocess
- Column name used as the target label.
- Boolean specifying whether the label must be encoded (use true for string labels).
- List of columns left unchanged; use an empty [] for all.
- Encode values into a new categorical column.
- Encode values by replacing them in the same column.
- transformers.scaler : Applies the sklearn StandardScaler.
- preprocessors.shuffle : Shuffles the DataFrame, with the random state as value.
- preprocessors.drop_rows : Drops n rows from the beginning (positive integer) or from the end (negative integer).
- drop_rows_to_predict_file : Use the dropped rows for predictions (@see predictions).
- Set to false to remove suffixes from the one hot encoder.
- Choose the DataFrame merge axis when using multiple files.
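Putting the documented keys together, the preprocess block from the credit card example reads:
[preprocess]
preprocessors.drop_columns = ['Time']
preprocessors.shuffle = 1
transformers.scaler = [] # empty array means all columns
preprocessors.drop_rows = -25000 # keep the last 25000 rows for predictions
drop_rows_to_predict_file = true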
The [eda] section (Exploratory Data Analysis) covers matplotlib and seaborn plots, or anything else printed with Python.
- enabled : If the EDA phase is disabled there is no output and no file save (default: true). Use the --explore CLI option to restrict script execution to the eda config only.
- show_plots : To skip diagrams and printed output, use show_plots = false.
- Write plot visuals to files.
- Append a timestamp suffix to the saved files.
- Extension of the saved files.
- Directory of the saved plot files.
- plots : Plot plugins to use, @see plugins/plots. Options: heatmap, pairplot, distribution_y, distribution_x
- Array of column names to be used with the plot plugins above.
- Figure size width and height in inches.
- Figure resolution in DPI.
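For instance, the eda block from the credit card example, extended with a second documented plot plugin:
[eda]
enabled = true
show_plots = true
plots = [
'distribution_y',
'heatmap',
]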
The [training] section is used to train on splits and to create models.
- enabled : If you want to skip the training section, set it to false (default: true). Use the --train option to restrict script execution to the training config only.
- pipeline : The training pipeline module (default: 'imblearn.pipeline').
- samplers : Imblearn samplers ('sampler/smote', 'sampler/tomek_links').
- estimators : Classifier and regressor estimators (@see plugins/classifiers). Ex: estimators = [ 'classifier/logistic_regression', 'classifier/ridge_classifier', 'classifier/kneighbors_classifier', 'classifier/sgd_classifier', 'classifier/lgbm_classifier' ]
- runners : Training score runners: 'score', 'classification_report', 'confusion_matrix', 'precision_recall' ... @see plugins/training: https://github.com/B2F/sk-factor/blob/main/plugins/training
- scoring : Scoring metric passed as an argument to the score runner plugin ('f1', 'r2' ...).
- splitting_method : Specify an unlimited number of sklearn splitting methods (one per line), with the value as n_splits. Ex:
splitting_method.kfold = 5
splitting_method.kfold_shuffle = 5
- save_model : Save the trained models.
- model_timestamp : Append a timestamp suffix to the model filenames.
- models_directory : Directory where the models are saved.
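Assembled from the keys above, the complete training block of the credit card example:
[training]
enabled = true
pipeline = 'imblearn.pipeline'
samplers = ['sampler/tomek_links']
splitting_method.kfold_stratified = 3
estimators = ['classifier/kneighbors_classifier', 'classifier/sgd_classifier']
runners = ['score']
scoring = 'f1'
save_model = true
model_timestamp = false
models_directory = 'models'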
The [predictions] section is used to predict from training data or from model files.
- enabled : Enable or disable the predictions section altogether. Use the --predict option to restrict CLI execution to the predictions section only.
- loader : The loader plugin used to retrieve the data to predict on. Ex: 'csv'
- preprocess : Choose whether or not to re-use the preprocess section rules for the prediction data. If you used preprocessors.drop_rows with drop_rows_to_predict_file enabled in the preprocess section, then your prediction data is already preprocessed and you'll want to set preprocess = false.
- predict_file : Path of the file used to make predictions (test data). If you set drop_rows_to_predict_file = true, then this file will be written with the number of rows from the original dataset specified in preprocessors.drop_rows.
- models : An array of model files to use for predictions.
- objective : Options: 'binary', 'multiclass', 'regressor'.
- threshold : Threshold parameter passed to the objective plugin to filter probabilities.
- save_predictions : If set to true, predictions will be saved to the predictions_directory.
- predictions_directory : Where to save the predictions.
- predictions_timestamp : Set to true to append a timestamp to the prediction filenames.
- keep_data : Set to true to keep all of the input data columns in the prediction files.
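For reference, the predictions block from the credit card example gathers most of these keys:
[predictions]
enabled = true
objective = 'binary'
loader = 'csv'
predict_file = 'tests/credit_card_fraud/test.csv'
preprocess = false
models = [
'models/credit_card_fraud-classifier/kneighbors_classifier.pkl',
'models/credit_card_fraud-classifier/sgd_classifier.pkl',
]
threshold = 0.5
save_predictions = true
predictions_directory = 'tests/credit_card_fraud/predictions'
predictions_timestamp = false
keep_data = false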
Debugger options:
- Enable the Python CLI debugger.
- Debug port (usually 5678).
- Host address (usually '127.0.0.1').
- Set to true to start the debugger with execution.
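A sketch of what such a block might look like; all key names here are illustrative guesses (check the project source for the real ones):
[debug]
enabled = true # illustrative key name
port = 5678 # usual debug port
host = '127.0.0.1'
start = true # start the debugger with execution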
Default plugins are located in the plugins directory:
- plugins/loader -> inherits BaseLoader. Used to implement specific loading methods, like the provided CSV loader.
- plugins/preprocess/preprocessor -> inherits BasePreprocessor. Used to add data transformations applied to the whole dataset (drop column, drop na, drop n rows ...).
- plugins/preprocess/transformer -> inherits BaseTransformer. Data transformers applied per column (scaler, discretizer, encoder ...).
- plugins/preprocess/selector -> inherits BaseSelector. Used to apply a transformer on the columns matched by a selector (numbers, strings, k best).
- plugins/plots -> inherits Report. Plots heatmaps, pairplots ...
- plugins/estimators/classifier -> inherits BaseEstimator. Adds classifier algorithms (linear svc, lgbm ...).
- plugins/estimators/regressor -> inherits BaseEstimator. Adds regressor algorithms (ridge cv, hgbr, xgboost ...).
- plugins/estimators/sampler -> inherits BaseEstimator. Sampling methods for the imblearn pipeline (smote, near miss, tomek links, instance hardness ...).
- plugins/estimators/transformer -> inherits BaseEstimator. Power transforms (yeo johnson).
- plugins/split -> inherits BaseCv. Splitting methods (kfold, leave one out, shuffle ...).
- plugins/training -> inherits TrainingPlot. Training reports (confusion matrix, classification report, shap permutation ...).
- plugins/predictions -> inherits BasePredictor. Handles the prediction objective with its output format and threshold (binary, multiclass, regression ...).
You can override or add functionality by putting your plugin class files in a package containing a plugins/ directory whose hierarchy mirrors the project's base plugins structure.
Plugin file names must match their class name: the class is CamelCase and the file name uses an underscore to signal each uppercase character (e.g. a csv_loader.py file would hold a CsvLoader class).
This package is specified by the plugins key in your toml config's dataset section.
From the examples/toy_datasets/config/iris.toml file:
[dataset]
...
plugins = 'examples.toy_datasets'
If you look into examples/toy_datasets/plugins you will find a custom plugins structure.
- clean OOP code architecture with a plugin system
- custom loader
- exploratory data analysis
- training
- predictions
- Sphinx documentation (complete list of configuration options in the external documentation)
- Manage default values for unspecified config elements
- Additional plugins (roc curve with threshold display on both roc and precision / recall)
- Stacking estimators
- Example of time series forecasting with tsfresh and sktime
- More example use cases and tests
- Using src.engine to build a GUI, dynamic access to plugins
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can help with the following tags:
- plugins
- loader
- preprocess
- plots
- estimators
- training
- predictions
- split
- engine
Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
B2F - LinkedIn
Project Link: https://github.com/B2F/sk-factor
SK Factor relies on all of these awesome Data Science projects!
- Python
- Visualization
- Machine learning