Add examples for MLflow #1772

Merged: 15 commits, May 31, 2023
12 changes: 6 additions & 6 deletions docs/examples/tensorboard_streaming.rst
@@ -9,7 +9,7 @@ Introduction
In this exercise, you will learn how to stream TensorBoard events from the clients
to the server in order to visualize live training metrics from a central place on the server.

-This exercise will be working with the ``tensorboard-streaming`` example in the examples folder,
+This exercise will be working with the ``tensorboard-streaming`` example in the advanced examples folder under experiment-tracking,
which builds upon :doc:`hello_pt` by adding TensorBoard streaming.

The setup of this exercise consists of one **server** and two **clients**.
@@ -42,7 +42,7 @@ Adding TensorBoard Streaming to Configurations

Inside the config folder there are two files, ``config_fed_client.json`` and ``config_fed_server.json``.

-.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard-streaming/jobs/tensorboard-streaming/app/config/config_fed_client.json
+.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming/app/config/config_fed_client.json
:language: json
:linenos:
:caption: config_fed_client.json
@@ -60,7 +60,7 @@ which converts local events to federated events.
This changes the event ``analytix_log_stats`` into a fed event ``fed.analytix_log_stats``,
which will then be streamed from the clients to the server.

-.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard-streaming/jobs/tensorboard-streaming/app/config/config_fed_server.json
+.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming/app/config/config_fed_server.json
:language: json
:linenos:
:caption: config_fed_server.json
@@ -83,7 +83,7 @@ In this exercise, all of the TensorBoard code additions will be made in ``pt_lea

First we must initialize our TensorBoard writer to the ``AnalyticsSender`` we defined in the client config:

-.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard-streaming/jobs/tensorboard-streaming/app/custom/pt_learner.py
+.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming/app/custom/pt_learner.py
:language: python
:lines: 103-106
:lineno-start: 103
@@ -98,7 +98,7 @@ but we can also define it in the client config to be passed into the constructor
Now that our TensorBoard writer is set to ``AnalyticsSender``,
we can write and stream training metrics to the server in ``local_train()``:

-.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard-streaming/jobs/tensorboard-streaming/app/custom/pt_learner.py
+.. literalinclude:: ../../examples/advanced/experiment-tracking/tensorboard/jobs/tensorboard-streaming/app/custom/pt_learner.py
:language: python
:lines: 144-174
:lineno-start: 144
@@ -160,4 +160,4 @@ Congratulations!
Now you will be able to see the live training metrics of each client from a central place on the server.

The full source code for this exercise can be found in
-`examples/advanced/experiment-tracking/tensorboard-streaming <https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/experiment-tracking/tensorboard-streaming>`_.
+`examples/advanced/experiment-tracking/tensorboard <https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/experiment-tracking/tensorboard>`_.
84 changes: 84 additions & 0 deletions examples/advanced/experiment-tracking/mlflow/README.md
@@ -0,0 +1,84 @@
# Hello PyTorch with MLflow

Example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier
using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/)
as the deep learning training framework.

This example also highlights the MLflow streaming capability from the clients to the server.

> **_NOTE:_** This example uses the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and will load its data within the trainer code.

### 1. Install requirements and configure PYTHONPATH

Install additional requirements:

```bash
python -m pip install -r requirements.txt
```

Set `PYTHONPATH` to include custom files of this example:
```bash
export PYTHONPATH=${PWD}/../pt
```

### 2. Run the experiment

Use the NVFlare simulator to run the example:

```bash
nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-pt-mlflow
```

### 3. Access the logs and results

You can find the running logs and results inside the simulator's workspace in a directory named "simulate_job".

```bash
$ ls /tmp/nvflare/simulate_job/
app_server app_site-1 app_site-2 log.txt tb_events

```

By default, MLflow will create an experiment log directory under a directory named "mlruns" in the simulator's workspace.

### 4. MLflow Streaming

For the job `hello-pt-mlflow`, on the client side, the client code in `PTLearner` uses MLflow-style logging syntax (making it easy to reuse code that already tracks experiments with MLflow):

```python
self.writer.log_metrics({"train_loss": cost.item(), "running_loss": running_loss}, current_step)

self.writer.log_metric("validation_accuracy", metric, epoch)

self.writer.log_text(f"last running_loss reset at '{len(self.train_loader) * epoch + i}' step", "running_loss_reset.txt")
```
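Conceptually, the writer keeps MLflow's logging signatures while queuing NVFlare events instead of talking to a tracking server. A minimal sketch of that idea (illustrative only; this is not NVFlare's actual `MLflowWriter` implementation, and the class name and event list are made up here):

```python
class EventQueuingMLflowWriter:
    """Toy writer exposing MLflow-style methods but queuing events
    instead of calling a tracking server. Illustrative sketch only."""

    def __init__(self, event_type="analytix_log_stats"):
        self.event_type = event_type
        self.events = []  # stand-in for NVFlare's event delivery

    def log_metric(self, key, value, step):
        self.events.append((self.event_type, {key: value}, step))

    def log_metrics(self, metrics, step):
        self.events.append((self.event_type, dict(metrics), step))

    def log_text(self, text, artifact_file):
        self.events.append((self.event_type, {artifact_file: text}, None))


writer = EventQueuingMLflowWriter()
writer.log_metrics({"train_loss": 0.42, "running_loss": 1.3}, 10)
writer.log_metric("validation_accuracy", 0.91, 1)
print(len(writer.events))  # -> 2
```

Because the method signatures match MLflow's, training code written against MLflow tracking needs no changes to stream through NVFlare.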

The `MLflowWriter` mimics the MLflow API, but instead of logging locally it sends the information to the server through NVFlare events
of type `analytix_log_stats`, so the server can write the data to the MLflow tracking server.

The `ConvertToFedEvent` widget turns the event `analytix_log_stats` into a fed event `fed.analytix_log_stats`,
which will be delivered to the server side.
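The conversion itself is just an event-type rewrite; a toy sketch of the idea (illustrative, not the actual `ConvertToFedEvent` widget):

```python
def to_fed_event(event_type, events_to_convert, fed_event_prefix="fed."):
    """Rewrite a local event type into its federated counterpart."""
    if event_type in events_to_convert:
        return fed_event_prefix + event_type
    return event_type  # other events pass through unchanged


print(to_fed_event("analytix_log_stats", ["analytix_log_stats"]))
# -> fed.analytix_log_stats
```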

On the server side, the `MLflowReceiver` is configured to process `fed.analytix_log_stats` events,
which writes received data from these events to the MLflow tracking server.
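For reference, a server-side component entry registering the `MLflowReceiver` looks like the following (mirroring the registration shown in the accompanying notebook; the experiment name is just an example):

```json
{
  "id": "mlflow_receiver",
  "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver",
  "args": {
    "kwargs": {"experiment_name": "hello-pt-experiment"},
    "artifact_location": "artifacts"
  }
}
```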

This way, the server is the only party that needs to handle authentication for the MLflow tracking server, and the server
can buffer the events from many clients to better manage the request load on the tracking server.

Note that the server also has `TBAnalyticsReceiver` configured, which also listens for `fed.analytix_log_stats` events by default,
so the data is also written to TensorBoard event files on the server.

### 5. TensorBoard Streaming with MLflow

For the job `hello-pt-tb-mlflow`, on the client side, the client code in `PTLearner` uses TensorBoard syntax:

```python
self.writer.add_scalar("train_loss", cost.item(), current_step)

self.writer.add_scalar("validation_accuracy", metric, epoch)
```

The `TBWriter` mimics the TensorBoard `SummaryWriter` API and streams the metrics to the server instead of writing local event files.

Note that in this job, the server still has `MLflowReceiver` and `TBAnalyticsReceiver` configured the same as in the job with `MLflowWriter`
on the client side, and the events are converted by the `MLflowReceiver` to write to the MLflow tracking server.
180 changes: 180 additions & 0 deletions examples/advanced/experiment-tracking/mlflow/experiment_tracking.ipynb
@@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e129ede5",
"metadata": {},
"source": [
" # Hello PyTorch with MLflow"
]
},
{
"cell_type": "markdown",
"id": "9bf7e391",
"metadata": {},
"source": [
"In this example, we demonstrate that the example code used in hello-pt-tb with PyTorch TensorBoard tracking can be switched to an MLflow tracking server without changing the code.\n"
]
},
{
"cell_type": "markdown",
"id": "18ec76f4",
"metadata": {},
"source": [
"\n",
"Example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/) as the deep learning training framework. This example also highlights the streaming capability from the clients to the server with TensorBoard SummaryWriter sender syntax, but with an MLflow receiver.\n",
"\n",
"> **_NOTE:_** This example uses the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and will load its data within the trainer code.\n"
]
},
{
"cell_type": "markdown",
"id": "bbca0050",
"metadata": {},
"source": [
"### 1. Install NVIDIA FLARE\n",
"\n",
"Follow the [Installation](https://nvflare.readthedocs.io/en/main/getting_started.html#installation) instructions.\n",
"Install additional requirements:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2b5579b",
"metadata": {},
"outputs": [],
"source": [
"%pip install torch torchvision tensorboard mlflow"
]
},
{
"cell_type": "markdown",
"id": "b8226dd7",
"metadata": {},
"source": [
"### 2. Change Configuration\n",
"\n",
"In `fed_server_config.json`, add the following entry to the `components` list:\n",
"\n",
"```\n",
"{\n",
" \"id\": \"mlflow_receiver\",\n",
" \"path\": \"nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLflowReceiver\",\n",
" \"args\": {\n",
" \"kwargs\": {\"experiment_name\": \"hello-pt-experiment\"},\n",
" \"artifact_location\": \"artifacts\"\n",
" }\n",
"}\n",
"```\n",
"This registers the `MLflowReceiver` in addition to the TensorBoard receiver.\n",
"\n",
"Note that the job hello-pt-mlflow uses MLflow syntax with the MLflowWriter on the client side, while\n",
"hello-pt-tb-mlflow has the learner using TensorBoard syntax. Both work with MLflowReceiver.\n"
]
},
{
"cell_type": "markdown",
"id": "6fe3165d",
"metadata": {},
"source": [
"\n",
"### 3. Run the experiment\n",
"\n",
"Use the NVFlare simulator to run the examples with the additional common Python files included in the Python path:\n",
"\n",
"```\n",
"export PYTHONPATH=${PWD}/../pt\n",
"```\n",
"\n",
"```\n",
"nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-pt-tb-mlflow\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8f08cef",
"metadata": {},
"outputs": [],
"source": [
"!nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 ./jobs/hello-pt-tb-mlflow"
]
},
{
"cell_type": "markdown",
"id": "7b1fe44d",
"metadata": {},
"source": [
"### 4. TensorBoard Tracking\n",
"\n",
"On the client side, we are still using the TensorBoard SummaryWriter as the `AnalyticsSender`. \n",
"\n",
"Instead of writing to TB files, it actually generates NVFLARE events of type `analytix_log_stats`.\n",
"The `ConvertToFedEvent` widget will turn the event `analytix_log_stats` into a fed event `fed.analytix_log_stats`,\n",
"which will be delivered to the server side.\n",
"\n",
"On the server side, the `TBAnalyticsReceiver` is configured to process `fed.analytix_log_stats` events,\n",
"which writes received TB data into appropriate TB files on the server.\n",
"\n",
"To view training metrics that are being streamed to the server, run:\n",
"\n",
"```\n",
"tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "534d7879",
"metadata": {},
"source": [
"### 5. MLflow Tracking\n",
"\n",
"On the server side, we also configured `MLflowReceiver` to process `fed.analytix_log_stats` events,\n",
"which writes received events to the MLflow tracking server.\n",
"\n",
"To view training metrics that are being streamed to the server, run:\n",
"\n",
"```\n",
"mlflow ui --backend-store-uri=/tmp/nvflare/mlruns\n",
"```\n",
"\n",
"Then open http://localhost:5000/ in your browser."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da1e7952-c3e6-4e90-a42e-648a823ede78",
"metadata": {},
"outputs": [],
"source": [
"!mlflow ui --backend-store-uri=/tmp/nvflare/mlruns"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "nvflare_example",
"language": "python",
"name": "nvflare_example"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,45 @@
{
"format_version": 2,

"executors": [
{
"tasks": [
"train",
"submit_model",
"validate"
],
"executor": {
"id": "Executor",
"path": "nvflare.app_common.executors.learner_executor.LearnerExecutor",
"args": {
"learner_id": "pt_learner"
}
}
}
],
"task_result_filters": [
],
"task_data_filters": [
],
"components": [
{
"id": "pt_learner",
"path": "pt_learner.PTLearner",
"args": {
"lr": 0.01,
"epochs": 5,
"analytic_sender_id": "log_writer"
}
},
{
"id": "log_writer",
"path": "nvflare.app_opt.tracking.mlflow.mlflow_writer.MLflowWriter",
"args": {"event_type": "analytix_log_stats"}
},
{
"id": "event_to_fed",
"name": "ConvertToFedEvent",
"args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
}
]
}
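The `analytic_sender_id` wiring above means the learner resolves its writer component by id at runtime rather than constructing it directly; a toy sketch of that indirection (illustrative only, not NVFlare's engine API; the registry dict and class name here are made up):

```python
# Toy component registry illustrating lookup by id, as wired in the
# config above ("pt_learner" points at "log_writer"). Sketch only.
components = {
    "log_writer": {"type": "MLflowWriter", "event_type": "analytix_log_stats"},
}


class PTLearnerSketch:
    def __init__(self, analytic_sender_id="log_writer"):
        self.analytic_sender_id = analytic_sender_id
        self.writer = None

    def initialize(self, registry):
        # resolve the configured id to the concrete writer component
        self.writer = registry[self.analytic_sender_id]


learner = PTLearnerSketch()
learner.initialize(components)
```

Keeping the writer behind an id makes it swappable in config: pointing `analytic_sender_id` at a TensorBoard-style writer instead requires no learner code changes.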