This directory contains machine learning model prototypes that help in predicting whether an ISO20022 pacs.008
XML message will be successfully processed (Success) or fail processing (Failure), leading to exception processing.
Amazon SageMaker built-in machine learning algorithms XGBoost and Linear Learner are used to train two different models.
Amazon SageMaker Autopilot is used to demonstrate automated machine learning (AutoML), which reduces the effort required to
build, train, tune and deploy a model.
Each prototype model's prediction is a tuple [(1=Success, 0=Failure), Probability Score], where the probability score is the probability of the predicted outcome.
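As a minimal sketch of the prediction format above, the snippet below turns a Success probability into a (label, probability of the predicted outcome) pair. It assumes the model emits the probability of the Success class and uses a 0.5 decision threshold; both are assumptions for illustration, and the helper names `to_prediction` and `describe` are hypothetical, not taken from the notebooks.

```python
from typing import Tuple

LABELS = {1: "Success", 0: "Failure"}


def to_prediction(p_success: float, threshold: float = 0.5) -> Tuple[int, float]:
    """Turn a Success probability into (label, probability of that label).

    Assumption: the model output is P(Success); the threshold is illustrative.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if p_success >= threshold:
        return 1, p_success
    # For a Failure prediction, report the probability of Failure itself.
    return 0, 1.0 - p_success


def describe(prediction: Tuple[int, float]) -> str:
    """Render a prediction tuple as readable text."""
    label, probability = prediction
    return f"{LABELS[label]} (probability {probability:.2f})"


print(describe(to_prediction(0.93)))
print(describe(to_prediction(0.20)))
```

The second call illustrates why the score is described as the probability of the predicted outcome: a Success probability of 0.20 is reported as a Failure prediction with probability 0.80.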
The project consists of Python notebooks in the following directories:
- pacs008/synthetic-data: This directory contains two notebooks for generating synthetic data for the ML prototype.
  - iso20022_lei_bic_datasets.ipynb: This notebook generates a fake BIC database as a CSV file and an LEI database, also as a CSV file. Read the notebook for details. The two generated CSV files can be used for the ML prototype. This prototype used them when generating pacs.008 XML messages with the ISO20022 Message Generator tool.
  - gen_pacs008_synthetic_dataset.ipynb: This notebook creates synthetic raw and labeled datasets from ISO20022 pacs.008 XML messages generated by the ISO20022 Message Generator tool.
- pacs008/automl: This directory contains notebooks that use the Amazon SageMaker Autopilot service to train ML models:
  - pacs008_automl_model_training.ipynb: Prototypes a model using the Amazon SageMaker Autopilot service. It uses labeled synthetic data generated by gen_pacs008_synthetic_dataset.ipynb to train several models with SageMaker Autopilot and then selects the best performing model for deployment.
  - pacs008_automl_model_deployment.ipynb: This notebook deploys the best performing ML model from the SageMaker Autopilot training job.
  - automl_batch_transform_example.ipynb: Demonstrates batch inference using the Amazon SageMaker Batch Transform service.
- pacs008/xgboost: This directory contains notebooks that use the Amazon SageMaker XGBoost built-in algorithm to train an ML model.
  - pacs008_xgboost_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker XGBoost built-in algorithm. After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses Amazon SageMaker Inference Pipeline to deploy a scikit-learn container for data transformation and the XGBoost model for inference.
  - pacs008_xgboost_local.ipynb: This notebook demonstrates data analysis and feature engineering of a text feature (InstrForNxtAgt). The text feature is transformed into a numeric representation using text preprocessing techniques such as word frequency counts, term frequency-inverse document frequency (TF-IDF) and a Multinomial Naive Bayes model, to understand how text can help in predictions. This approach was used to write custom scikit-learn transformers that transform text into numeric features (feature engineering).
  - xgb_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference using the XGBoost model.
- pacs008/linear-learner: This directory contains notebooks that use the Amazon SageMaker Linear Learner built-in algorithm to train ML models.
  - pacs008_linear_learner_inference_pipeline.ipynb: A notebook that trains an ML model using the Amazon SageMaker Linear Learner built-in algorithm. After training, the model is deployed to an Amazon SageMaker Inference Endpoint. It uses Amazon SageMaker Inference Pipeline to deploy a scikit-learn container for data transformation and the Linear Learner model for inference.
  - ll_batch_transform_example.ipynb: A notebook that demonstrates the use of Amazon SageMaker Batch Transform for batch inference using a Linear Learner model.
- pacs008/sklearn-transformers: This directory contains custom scikit-learn transformers that are used in data preprocessing and feature engineering tasks. These transformers are deployed in scikit-learn containers in the SageMaker Inference Pipeline that is used to deploy the trained XGBoost and Linear Learner models.
  - pacs008_sklearn_featurizer.py: Implements data preprocessing and feature extraction using a scikit-learn Pipeline and ColumnTransformer. It transforms and prepares data before training jobs and before features from a pacs.008 XML message are used for inference, whether real-time inference via an Inference Endpoint or batch inference via Batch Transform.
  - pacs008_sklearn_transformer.py: Implements scikit-learn custom transformers.
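To make the text feature engineering idea concrete, here is a minimal, standard-library-only sketch of a TF-IDF featurizer that follows the fit/transform convention used by scikit-learn transformers. It is not the code in pacs008_sklearn_transformer.py: the class name `SimpleTfidf`, the tokenizer, and the sample instruction texts are all hypothetical. It uses the plain formula idf = ln(N / df), whereas scikit-learn's TfidfVectorizer applies idf smoothing and L2 normalization by default, so its numbers will differ.

```python
import math
import re
from collections import Counter
from typing import List

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class SimpleTfidf:
    """Hypothetical fit/transform-style TF-IDF featurizer (illustration only)."""

    def fit(self, documents: List[str]) -> "SimpleTfidf":
        n_docs = len(documents)
        doc_freq = Counter()
        for doc in documents:
            # Count each term once per document (document frequency).
            doc_freq.update(set(tokenize(doc)))
        self.vocabulary_ = sorted(doc_freq)
        # Plain idf = ln(N / df); a term found in every document gets weight 0.
        self.idf_ = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
        return self

    def transform(self, documents: List[str]) -> List[List[float]]:
        rows = []
        for doc in documents:
            tokens = tokenize(doc)
            counts = Counter(tokens)
            total = len(tokens) or 1  # avoid division by zero on empty text
            # Terms outside the fitted vocabulary are ignored.
            rows.append([counts[t] / total * self.idf_[t] for t in self.vocabulary_])
        return rows


# Made-up InstrForNxtAgt-style texts, for illustration only.
docs = [
    "please call beneficiary before payment",
    "urgent payment",
    "call before release",
]
tfidf = SimpleTfidf().fit(docs)
features = tfidf.transform(docs)
print(tfidf.vocabulary_)
print(len(features), "rows x", len(features[0]), "columns")
```

Each input text becomes one numeric row with one column per vocabulary term, which is the shape a downstream model such as XGBoost or Linear Learner can consume.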
These notebooks should be run using a Python 3 Jupyter notebook kernel. This can be the Python 3 (Data Science) kernel in
SageMaker Studio or the conda_python3 kernel in a SageMaker notebook instance.
The diagram below captures the ML lifecycle used to develop the ML prototype:
To get started with the ML prototype model, follow the steps below:
Note: The GitHub repository includes iso20022-raw-messages.tar.gz to get you started quickly. If you want to use that set of raw ISO20022 pacs.008 XML messages, you can skip steps 1 and 2.
1. Generate ISO20022 pacs.008 XML messages using the ISO20022 Message Generator tool:

       rapide-iso20022 -n 50000 -d messages

2. Gzip the messages directory and upload it to an Amazon SageMaker notebook (either an Amazon SageMaker Studio notebook or an Amazon SageMaker notebook instance) in the iso20022-message-generator/ml-models/pacs008/synthetic-data/iso20022-data directory, i.e. under the directory where you cloned this GitHub repository.
3. Now you have raw ISO20022 pacs.008 XML messages. The next step is to generate synthetic raw and labeled raw datasets for use in ML model training. To do this, use the synthetic dataset notebook to generate the raw and labeled raw datasets. These raw datasets are further split into training and test datasets in each of the model training notebooks.
4. You can use the Amazon SageMaker Autopilot, Amazon SageMaker XGBoost, or Amazon SageMaker Linear Learner notebooks, or all of them, to train and deploy an ML model. After deployment, you can test the model by sending a test message to the Inference Endpoint. Training all models allows you to compare and evaluate model performance across the approaches.
5. You can also use the SageMaker Batch Transform notebooks to perform batch inference using the trained model and evaluate the model's performance.
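The dataset generation step turns raw XML messages into tabular rows. The sketch below shows one way this could look, flattening the leaf elements of a message into a dictionary keyed by element path. It is an illustration under stated assumptions, not the notebook's actual parsing code: the function names are hypothetical, the sample is a trimmed fragment rather than a complete valid pacs.008 message, and the namespace version suffix is assumed.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten_xml(xml_text: str) -> Dict[str, str]:
    """Map each leaf element to 'Path_To_Element' -> text.

    Simplifications: XML attributes (such as Ccy) are ignored, and repeated
    sibling elements with the same path overwrite each other.
    """
    root = ET.fromstring(xml_text)
    row: Dict[str, str] = {}

    def walk(element: ET.Element, path: str) -> None:
        children = list(element)
        if not children:
            row[path] = (element.text or "").strip()
            return
        for child in children:
            walk(child, f"{path}_{local_name(child.tag)}")

    walk(root, local_name(root.tag))
    return row


# Trimmed, hypothetical fragment; the namespace version is an assumption.
sample = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:pacs.008.001.08">
  <FIToFICstmrCdtTrf>
    <CdtTrfTxInf>
      <IntrBkSttlmAmt Ccy="USD">1000.00</IntrBkSttlmAmt>
      <InstrForNxtAgt>
        <InstrInf>Call beneficiary before payment</InstrInf>
      </InstrForNxtAgt>
    </CdtTrfTxInf>
  </FIToFICstmrCdtTrf>
</Document>"""

row = flatten_xml(sample)
for column, value in row.items():
    print(column, "=", value)
```

Applying such a function to every message in the messages directory yields one row per message, to which a Success or Failure label can then be attached.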
For Amazon SageMaker Autopilot, use the included notebooks in the following order:
- Train ML models using the pacs008_automl_model_training.ipynb notebook. It prototypes a model using the Amazon SageMaker Autopilot service: it uses labeled synthetic data generated by gen_pacs008_synthetic_dataset.ipynb to train several models with SageMaker Autopilot and then selects the best performing model for deployment.
- Deploy and test the trained model using the pacs008_automl_model_deployment.ipynb notebook. This notebook deploys the best performing ML model from the SageMaker Autopilot training job.
- Perform batch inference with the trained model using automl_batch_transform_example.ipynb. This notebook evaluates the model by computing a confusion matrix. You can compare the Autopilot-generated model to the other two models, which are feature engineered by hand and then trained.
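As a rough sketch of the evaluation mentioned above, a binary confusion matrix can be computed from true labels and batch predictions as follows. The function name and the sample labels are hypothetical; the notebooks may compute it differently, for example with a library helper.

```python
from typing import Dict, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count outcomes for binary labels, where 1 = Success and 0 = Failure."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts


# Made-up labels for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))
# -> {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
```

Here a false positive is a message predicted as Success that actually failed, which is typically the costlier mistake when the goal is to catch messages headed for exception processing.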
For Amazon SageMaker XGBoost, use the included notebooks in the following order:
- Train an ML model using the pacs008_xgboost_inference_pipeline.ipynb notebook. It prototypes a model using the Amazon SageMaker XGBoost built-in algorithm, trained on labeled synthetic data generated by gen_pacs008_synthetic_dataset.ipynb.
- Deploy and test the trained model using the same pacs008_xgboost_inference_pipeline.ipynb notebook. The training notebook also deploys the trained model using SageMaker Inference Pipeline.
- Perform batch inference with the trained model using xgb_batch_transform_example.ipynb. This notebook evaluates the model by computing a confusion matrix. You can compare the XGBoost trained model to the Autopilot and Linear Learner models. The Linear Learner model is feature engineered by hand and then trained using the same feature engineering code.
- You can also read and execute the pacs008_xgboost_local.ipynb notebook. It runs locally (it does not use SageMaker XGBoost training). The notebook demonstrates text preprocessing and feature engineering for text features using scikit-learn transformers, and checks along the way whether new features derived from the text feature help improve prediction.
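One simple way to check whether a text feature carries signal is to compare word frequencies between the two outcome classes. The sketch below is an assumption about how such an exploratory check could look, not the notebook's code; the function name and the sample texts and labels are made up.

```python
import re
from collections import Counter
from typing import Dict, Sequence


def word_counts_by_label(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, Counter]:
    """Count word occurrences separately for Failure (0) and Success (1) rows."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


# Made-up instruction texts and labels, for illustration only.
texts = [
    "hold payment until confirmation",
    "pay on receipt",
    "hold for compliance review",
    "pay beneficiary on receipt",
]
labels = [0, 1, 0, 1]

counts = word_counts_by_label(texts, labels)
print("Failure:", counts[0].most_common(3))
print("Success:", counts[1].most_common(3))
```

Words that occur mostly in one class (here "hold" for Failure and "pay" for Success) are candidates for derived features, which can then be validated by retraining and comparing model performance.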
For Amazon SageMaker Linear Learner, use the included notebooks in the following order:
- Train an ML model using the pacs008_linear_learner_inference_pipeline.ipynb notebook. It prototypes a model using the Amazon SageMaker Linear Learner built-in algorithm, trained on labeled synthetic data generated by gen_pacs008_synthetic_dataset.ipynb.
- Deploy and test the trained model using the same pacs008_linear_learner_inference_pipeline.ipynb notebook. The training notebook also deploys the trained model using SageMaker Inference Pipeline.
- Perform batch inference with the trained model using ll_batch_transform_example.ipynb. This notebook evaluates the model by computing a confusion matrix. You can compare the Linear Learner trained model to the Autopilot and XGBoost models. As mentioned, the XGBoost and Linear Learner models are feature engineered by hand and then trained using the same feature engineering code.
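When comparing the three approaches, it can help to reduce each model's confusion matrix counts to a few standard metrics. The following is a minimal sketch; the function name is hypothetical and the counts are placeholder values for illustration, not results from this prototype.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Derive accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Placeholder counts for two unnamed models; these are NOT measured results.
example_counts = {
    "model_a": {"tp": 80, "fp": 10, "tn": 95, "fn": 15},
    "model_b": {"tp": 85, "fp": 20, "tn": 85, "fn": 10},
}

for name, c in example_counts.items():
    metrics = classification_metrics(**c)
    summary = ", ".join(f"{k}={v:.3f}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

Which metric matters most depends on the use case: if missing a failing message is costly, compare the models on how rarely they predict Success for messages that actually fail, not only on overall accuracy.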