add mlflow_sender and examples

1. formatting and unit tests fix
2. update tb_receiver.py to ignore events from mlflow
3. add notebook for interactive examples

git basic code working with documentation

update README.md
NVIDIA · Jan 9, 2023 · 02ba157
1 parent 2628d7c
commit 02ba157
Showing 55 changed files with 2,484 additions and 803 deletions.
diff --git a/examples/README.md b/examples/README.md
@@ -19,11 +19,21 @@ To get started with these examples, please follow the [Quickstart](https://nvfla
 ### 1.2 Deep Learning
 * [Hello PyTorch](./hello-pt/README.md)
   * Example using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) an image classifier using [FedAvg]([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/) as the deep learning training framework.
-* [Hello PyTorch with TensorBoard](./hello-pt-tb/README.md)
-  * Example building upon [Hello PyTorch](./hello-pt/README.md) showcasing the [TensorBoard](https://tensorflow.org/tensorboard) streaming capability from the clients to the server.
 * [Hello TensorFlow](./hello-tf2/README.md)
   * Example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) an image classifier using [FedAvg]([FedAvg](https://arxiv.org/abs/1602.05629)) and [TensorFlow](https://tensorflow.org/) as the deep learning training framework.
 
+## 2. Federated ML Experiment Tracking
+* [Machine Learning Experiment Tracking via different tools](experiment-tracking/README.md)
+
+  All examples build upon [Hello PyTorch](./hello-pt/README.md) and showcase the streaming capability from the clients to the server.
+
+  * [Hello PyTorch with TensorBoard](experiment-tracking/hello-pt-tb)
+    * showcasing the [TensorBoard](https://tensorflow.org/tensorboard) support
+  * [Hello PyTorch with MLFlow](experiment-tracking/hello-pt-mlflow)
+    * showcasing the [MLFlow](https://mlflow.org/) support
+  * [Hello PyTorch with Weights & Biases](experiment-tracking/hello-pt-wandb)
+    * showcasing the [WandB](https://wandb.ai) support
+
 ## 2. FL algorithms
 * [Federated Learning with CIFAR-10](./cifar10/README.md)
   * [Simulated Federated Learning with CIFAR-10](./cifar10/cifar10-sim/README.md)

diff --git a/examples/experiment-tracking/README.md b/examples/experiment-tracking/README.md
@@ -0,0 +1,117 @@
+# Experiment Tracking
+
+## Tools, Senders, and Receivers
+
+Through several examples, we show how to track and visualize experiments in real time and compare results
+by leveraging several experiment tracking tools:
+* TensorBoard
+* MLFlow
+* Weights and Biases
+  * **Note**: users need to sign up at Weights and Biases to access the service; NVFLARE does not provide access.
+
+During the federated learning phase, users can choose an API syntax that they are used to,
+such as the API of one of the tools above. NVFLARE provides components that mimic these tools' APIs.
+These components are called experiment tracking LogWriters. All clients' experiment logs are streamed to the FL server,
+where the actual experiment logs are recorded. The components that receive these logs are called Receivers.
+A receiver component leverages the experiment tracking tool to record the logs during the experiment run.
+
+In a normal setting, we would have pairs of senders and receivers, such as
+* TBWriter  <-> TBReceiver
+* MLFlowWriter <-> MLFlowReceiver
+* WandBWriter <-> WandBReceiver
+
+We could also mix and match any pair. This allows one to write the ML code using one API
+but still switch to different experiment tracking tool(s). As a matter of fact, one can use many receivers for the
+same log data sent from one sender.
+
+![Experiment Tracking writers and Receivers](experiment_tracking.jpg)
+
+## Experiment logs streaming
+
+On the client side, when a writer writes the metrics, instead of writing to files, it actually generates NVFLARE events
+of type `analytix_log_stats`. The `ConvertToFedEvent` widget turns the local event `analytix_log_stats` into a
+fed event `fed.analytix_log_stats`, which is delivered to the server side.
+
+On the server side, the `Receiver` is configured to process `fed.analytix_log_stats` events
+and writes the received log data to the appropriate endpoints.
+
+## Support custom experiment tracking tools
+
+There are many different experiment tracking tools, and you might want to write a different writer and/or receiver for your needs.
+
+There are three things to consider when developing a custom experiment tracking tool.
+
+**Data Type**
+
+Currently, we support the metrics, params, and text data types. If you require other data types, make sure you add them to
+the AnalyticsDataType.
+
+**Writer**
+
+Implement the LogWriter interface to specify the API syntax.
+
+**Receiver**
+
+Implement the AnalyticsReceiver interface and determine how to represent different sites' logs. In all three implementations
+(TensorBoard, MLFlow, WandB), each site's log is represented as one run. Depending on the individual tool, the implementation
+can be different. For example, for both TensorBoard and MLFlow, we simply create a different run for each client and map it to the
+site name. In the WandB implementation, we have to leverage multiprocessing and let each run execute in a different process.
+
+## Examples Overview
+
+We illustrate how to leverage different writers and receivers via different examples.
+All examples build on the hello-pt example.
+
+[hello-pt-tb](hello-pt-tb)
+* The first example shows how to use the TensorBoard tracking tool (both sender and receiver)
+* The second example shows how to use the TensorBoard sender only, while the receiver is MLFlow
+![tb](tb.png)
+
+[hello-pt-mlflow](hello-pt-mlflow)
+* The first example shows how to use the MLFlow tracking tool (both sender and receiver)
+* The second example shows how to use the MLFlow sender only, while the receiver is TensorBoard
+
+![mlflow_2](mlflow_2.png)
+![mlflow_1](mlflow_1.png)
+
+[hello-pt-wandb](hello-pt-wandb)
+* The example shows how to use the Weights & Biases tracking tool (both sender and receiver)
+
+
+![wandb_1](wandb_1.png)
+
+## Setup for all examples
+
+The examples use [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/) as the deep learning training framework. They also highlight the experiment tracking streaming capability from the clients to the server.
+
+> **_NOTE:_** These examples use the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and load its data within the trainer code.
+
+### 1. Prepare venv.
+
+```
+python -m venv ~/nvflare-hello
+
+source ~/nvflare-hello/bin/activate
+```
+
+
+### 2. Install NVIDIA FLARE
+
+Follow the [Installation](https://nvflare.readthedocs.io/en/main/quickstart.html) instructions.
+
+### 3. Install Jupyter Notebook
+
+```
+pip install notebook
+```
+
+### 4. Examples in Notebook
+
+For detailed explanations of the examples, please check the notebook:
+
+```
+jupyter notebook experiment_tracking.ipynb
+```
+
+
+
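Note on the "Experiment logs streaming" section of `examples/experiment-tracking/README.md` in the diff above: the flow it describes (a writer fires a local `analytix_log_stats` event, a converter re-fires it as `fed.analytix_log_stats`, and one or more receivers on the server record one run per site) can be sketched with a toy in-process event bus. This is only an illustration and does not use the NVFLARE API. The two event type names come from the README; `ToyEventBus`, `ToyWriter`, `ToyFedEventConverter`, `ToyReceiver`, and `run_demo` are hypothetical names invented for this sketch.

```python
from collections import defaultdict

# Event type names are taken from the README; everything else is hypothetical.
LOCAL_EVENT = "analytix_log_stats"
FED_EVENT = "fed." + LOCAL_EVENT


class ToyEventBus:
    """Minimal publish/subscribe bus standing in for a real event system."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def fire(self, event_type, data):
        for handler in self._handlers[event_type]:
            handler(event_type, data)


class ToyWriter:
    """Client-side sender: fires a local event instead of writing to a file."""

    def __init__(self, client_bus, site_name):
        self._bus = client_bus
        self._site = site_name

    def log_metric(self, key, value, step):
        record = {"site": self._site, "key": key, "value": value, "step": step}
        self._bus.fire(LOCAL_EVENT, record)


class ToyFedEventConverter:
    """Forwards local events to the server bus under the 'fed.' prefix."""

    def __init__(self, client_bus, server_bus):
        self._server_bus = server_bus
        client_bus.subscribe(LOCAL_EVENT, self._forward)

    def _forward(self, event_type, data):
        self._server_bus.fire("fed." + event_type, dict(data))


class ToyReceiver:
    """Server-side receiver: keeps one run (a list of records) per site."""

    def __init__(self, server_bus, name):
        self.name = name
        self.runs = defaultdict(list)
        server_bus.subscribe(FED_EVENT, self._record)

    def _record(self, event_type, data):
        self.runs[data["site"]].append((data["step"], data["key"], data["value"]))


def run_demo():
    server_bus = ToyEventBus()
    # Two receivers listen to the same stream, mirroring the mix-and-match idea.
    receivers = [ToyReceiver(server_bus, "toy-tb"), ToyReceiver(server_bus, "toy-mlflow")]
    for site in ("site-1", "site-2"):
        client_bus = ToyEventBus()
        ToyFedEventConverter(client_bus, server_bus)
        writer = ToyWriter(client_bus, site)
        for step, loss in enumerate([0.9, 0.5]):
            writer.log_metric("loss", loss, step)
    return receivers


if __name__ == "__main__":
    for receiver in run_demo():
        print(receiver.name, dict(receiver.runs))
```

Because the receivers subscribe only to the fed event type, the writer code does not change when a receiver is added or swapped, which is the property the README relies on when it pairs one sender with several receivers.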