From a2c99991622a0660511d010279ea2cc872ea6dae Mon Sep 17 00:00:00 2001
From: "Anthony D. Blaom"
Date: Thu, 17 Aug 2023 14:36:09 +1200
Subject: [PATCH] polish readme and extend readme example

---
 README.md | 131 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 104 insertions(+), 27 deletions(-)

diff --git a/README.md b/README.md
index 1f2eb19..a16846a 100644
--- a/README.md
+++ b/README.md
@@ -6,13 +6,14 @@
 [MLJ](https://github.com/alan-turing-institute/MLJ.jl) is a Julia framework
 for combining and tuning machine learning models. MLJFlow is a package that extends
-the MLJ capabilities to use [mlflow](https://mlflow.org/) as a backend for
+the MLJ capabilities to use [MLflow](https://mlflow.org/) as a backend for
 model tracking and experiment management. To be specific, MLJFlow provides a
-close to zero-preparation to use mlflow with MLJ; by the usage of function
-extensions that automate the mlflow cycle (create experiment, create run, log
+close-to-zero-preparation way to use MLflow with MLJ, via function
+extensions that automate the MLflow cycle (create experiment, create run, log
 metrics, log parameters, log artifacts, etc.).
 
 ## Background
+
 This project is part of the GSoC 2023 program. The proposal description can be
 found [here](https://summerofcode.withgoogle.com/programs/2023/projects/iRxuzeGJ).
 The entire workload is divided into three different repositories:
@@ -20,30 +21,106 @@ The entire workload is divided into three different repositories:
 [MLFlowClient.jl](https://github.com/JuliaAI/MLFlowClient.jl) and this one.
 
 ## Features
-- [x] mlflow cycle automation (create experiment, create run, log metrics, log
-  parameters, log artifacts, etc.)
-- [x] Wrapper type used by MLJ to store mlflow metadata and client instance
-  from MLFlowClient.jl
-- [x] MLJ extended functions to allow mlflow logging
-- [x] Polished compatibility with composed models
-- [ ] Polished compatibility with tuned models
-- [ ] Polished compatibility with iterative models
-
-## Example
-```julia
-# We first define a logger instance, providing the mlflow server address.
-# The experiment name and artifact location are optional.
-logger = MLFlowLogger("http://localhost:5000";
-    experiment_name="MLJFlow tests",
-    artifact_location="./mlj-test")
-
-X, y = make_moons(100) # X is a 100x2 matrix, y is a 100-element vector
-
-# Writing a normal MLJ workflow
+
+- [x] MLflow cycle automation (create experiment, create run, log metrics, log parameters,
+  log artifacts, etc.)
+
+- [x] Provides a wrapper `MLFlowLogger` for MLFlowClient.jl clients and associated
+  metadata; instances of this type are valid "loggers", which can be passed to MLJ
+  functions supporting the `logger` keyword argument.
+
+- [x] Provides MLflow integration with MLJ's `evaluate!`/`evaluate` methods (model
+  **performance evaluation**)
+
+- [x] Extends MLJ's `MLJ.save` method, to save trained machines as retrievable MLflow
+  client artifacts
+
+- [ ] Provides MLflow integration with MLJ's `TunedModel` wrapper (to log **hyper-parameter
+  tuning** workflows)
+
+- [ ] Provides MLflow integration with MLJ's `IteratedModel` wrapper (to log **controlled
+  iteration** of tree gradient boosters, neural networks, and other iterative models)
+
+- [x] Plays well with **composite models** (pipelines, stacks, etc.)
+
+
+## Examples
+
+### Logging a model performance evaluation
+
+The example below assumes the user is familiar with basic MLflow concepts. We suppose an
+MLflow API service is running on a local server, with address "http://127.0.0.1:5000". (In a
+shell/console, run `mlflow server` to launch such an MLflow service locally.)
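+
+For instance, assuming the `mlflow` Python package is installed and its executable is on
+the system `PATH`, the service can be started from a shell with
+`mlflow server --host 127.0.0.1 --port 5000`, or, as a convenience sketch (not part of
+MLJFlow), directly from Julia:
+
+```julia
+# Launch a local MLflow tracking server; this blocks the current task until the
+# server process is killed, so run it in a separate shell or Julia session.
+run(`mlflow server --host 127.0.0.1 --port 5000`)
+```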
+
+Refer to the [MLflow documentation](https://www.mlflow.org/docs/latest/index.html) for
+necessary background.
+
+In addition to the packages listed on the first line below, we assume
+MLJDecisionTreeInterface is in the user's active Julia package environment.
+
+```julia
+using MLJBase, MLJFlow, MLJModels
+```
+
+We first define a logger, providing the address of our running MLflow service. The
+experiment name and artifact location are optional.
+
+```julia
+logger = MLFlowLogger(
+    "http://127.0.0.1:5000";
+    experiment_name="MLJFlow test",
+    artifact_location="./mlj-test"
+)
+```
+
+Next, grab some synthetic data and choose an MLJ model:
+
+```julia
+X, y = make_moons(100) # a table and a vector with 100 rows
 DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
-dtc_machine = machine(dtc, X, y)
+model = DecisionTreeClassifier(max_depth=4)
+```
+
+Now we call `evaluate` as usual but provide the `logger` as a keyword argument:
+
+```julia
+evaluate(model, X, y, resampling=CV(nfolds=5), measures=[LogLoss(), Accuracy()], logger=logger)
+```
+
+Navigate to "http://127.0.0.1:5000" in your browser and select the "Experiment" matching
+the name above ("MLJFlow test"). Select the single run displayed to see the logged results
+of the performance evaluation.
+
+
+### Saving and retrieving trained machines as MLflow artifacts
+
+Let's train the model on all data and save the trained machine as an MLflow artifact:
+
+```julia
+mach = machine(model, X, y) |> fit!
+run = MLJBase.save(logger, mach)
+```
 
-# Passing the logger to the machine is enough to enable mlflow logging
-e1 = evaluate!(dtc_machine, resampling=CV(),
-    measures=[LogLoss(), Accuracy()], verbosity=1, logger=logger)
+Notice that in this case `MLJBase.save` returns a run (an instance of `MLFlowRun` from
+MLFlowClient.jl).
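+
+As a rough sketch, the returned run carries the metadata MLflow assigned to it; the field
+names below are assumptions based on MLFlowClient.jl and may change:
+
+```julia
+# Inspect the run's MLflow metadata (assumed MLFlowClient.jl field names):
+run.info.run_id  # unique identifier assigned by the MLflow service
+run.info.status  # lifecycle status of the run
+```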
+
+To retrieve an artifact we need to use the MLFlowClient.jl API, and for that we need to
+know the MLflow service that our `logger` wraps:
+
+```julia
+service = MLJFlow.service(logger) # DOESN'T WORK YET!
+```
+
+And we reconstruct our trained machine thus:
+
+```julia
+using MLFlowClient
+artifacts = MLFlowClient.listartifacts(service, run)
+mach2 = machine(artifacts[1].filepath)
+```
+
+We can predict using the deserialized machine:
+
+```julia
+predict(mach2, X)
+```
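+
+As a quick sanity check (a sketch, assuming the steps above succeeded), the deserialized
+machine should reproduce the predictions of the original, so the comparison below should
+return `true` for a faithful round trip:
+
+```julia
+# Compare point predictions of the original and restored machines:
+predict_mode(mach, X) == predict_mode(mach2, X)
+```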