Merge pull request #10 from JuliaAI/readme-polish
Polish readme and extend readme example
pebeto authored Aug 17, 2023
2 parents 3957399 + a2c9999 commit 0deb47c
131 changes: 104 additions & 27 deletions README.md

[MLJ](https://github.com/alan-turing-institute/MLJ.jl) is a Julia framework for
combining and tuning machine learning models. MLJFlow is a package that extends
the MLJ capabilities to use [MLflow](https://mlflow.org/) as a backend for
model tracking and experiment management. Specifically, MLJFlow makes MLflow
usable with MLJ with close to zero preparation, via function extensions that
automate the MLflow cycle (create experiment, create run, log
metrics, log parameters, log artifacts, etc.).

## Background

This project is part of the GSoC 2023 program. The proposal description can be
found [here](https://summerofcode.withgoogle.com/programs/2023/projects/iRxuzeGJ).
The work is divided across three repositories:
[MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl),
[MLFlowClient.jl](https://github.com/JuliaAI/MLFlowClient.jl) and this one.

## Features

- [x] MLflow cycle automation (create experiment, create run, log metrics, log parameters,
log artifacts, etc.)

- [x] Provides a wrapper `MLFlowLogger` for MLFlowClient.jl clients and associated
metadata; instances of this type are valid "loggers", which can be passed to MLJ
functions supporting the `logger` keyword argument.

- [x] Provides MLflow integration with MLJ's `evaluate!`/`evaluate` method (model
**performance evaluation**)

- [x] Extends MLJ's `MLJ.save` method, to save trained machines as retrievable MLflow
client artifacts

- [ ] Provides MLflow integration with MLJ's `TunedModel` wrapper (to log **hyper-parameter
tuning** workflows)

- [ ] Provides MLflow integration with MLJ's `IteratedModel` wrapper (to log **controlled
iteration** of tree gradient boosters, neural networks, and other iterative models)

- [x] Plays well with **composite models** (pipelines, stacks, etc.); a pipeline sketch
  appears at the end of the examples below


## Examples

### Logging a model performance evaluation

The example below assumes the user is familiar with basic MLflow concepts. We suppose an
MLflow API service is running on a local server at "http://127.0.0.1:5000". (In a
shell/console, run `mlflow server` to launch such a service.)
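
The server can also be launched from a Julia session; the snippet below is a convenience
sketch only, assuming the `mlflow` executable (installed, for example, with
`pip install mlflow`) is on the PATH:

```julia
# Convenience sketch: start a local MLflow tracking server from Julia.
# `wait=false` returns immediately, leaving the server running in the background.
mlflow_server = run(`mlflow server --host 127.0.0.1 --port 5000`; wait=false)
```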

Refer to the [MLflow documentation](https://www.mlflow.org/docs/latest/index.html) for
necessary background.

In addition to the packages listed on the first line below, we assume
MLJDecisionTreeInterface (which provides the `DecisionTreeClassifier` model loaded
further below) is in the user's active Julia package environment.

```julia
using MLJBase, MLJFlow, MLJModels
```
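
For completeness, a hypothetical one-off environment setup might look like this:

```julia
# Hypothetical setup: add the packages used in this example to the active environment.
using Pkg
Pkg.add(["MLJBase", "MLJFlow", "MLJModels", "MLJDecisionTreeInterface"])
```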

We first define a logger, providing the address of our running MLflow. The experiment
name and artifact location are optional.

```julia
logger = MLFlowLogger(
    "http://127.0.0.1:5000";
    experiment_name="MLJFlow test",
    artifact_location="./mlj-test"
)
```

Next, grab some synthetic data and choose an MLJ model:

```julia
X, y = make_moons(100) # a table and a vector with 100 rows
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
model = DecisionTreeClassifier(max_depth=4)
```

Now we call `evaluate` as usual but provide the `logger` as a keyword argument:

```julia
evaluate(model, X, y, resampling=CV(nfolds=5), measures=[LogLoss(), Accuracy()], logger=logger)
```

Navigate to "http://127.0.0.1:5000" on your browser and select the "Experiment" matching
the name above ("MLJFlow test"). Select the single run displayed to see the logged results
of the performance evaluation.
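
The `evaluate` call also returns its results directly, so the logged values can be
cross-checked in the Julia session. A small sketch, using standard MLJ
`PerformanceEvaluation` properties:

```julia
# Bind the result of the same call to inspect it locally:
e = evaluate(model, X, y, resampling=CV(nfolds=5),
             measures=[LogLoss(), Accuracy()], logger=logger)
e.measurement   # aggregated log loss and accuracy, in the order given above
e.per_fold      # per-fold values for each measure
```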


### Saving and retrieving trained machines as MLflow artifacts

Let's train the model on all data and save the trained machine as an MLflow artifact:

```julia
mach = machine(model, X, y) |> fit!
run = MLJBase.save(logger, mach)
```

Notice that in this case `MLJBase.save` returns a run (an instance of `MLFlowRun` from
MLFlowClient.jl).

To retrieve an artifact we need to use the MLFlowClient.jl API, and for that we need to
know the MLflow service that our `logger` wraps:

```julia
service = MLJFlow.service(logger) # DOESN'T WORK YET!
```

And we reconstruct our trained machine thus:

```julia
using MLFlowClient
artifacts = MLFlowClient.listartifacts(service, run)
mach2 = machine(artifacts[1].filepath)
```

We can predict using the deserialized machine:

```julia
predict(mach2, X)
```
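
### Logging evaluation of a composite model (sketch)

As noted in the features list, the logger also works with composite models. Here is a
minimal sketch under the setup above; it assumes `Standardizer` (a transformer shipped
with MLJModels) and the `|>` pipeline syntax from MLJBase:

```julia
# Sketch only: wrap the classifier in a pipeline and evaluate it with the same logger.
pipe = Standardizer() |> model
evaluate(pipe, X, y, resampling=CV(nfolds=5),
         measures=[LogLoss(), Accuracy()], logger=logger)
```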
