Add Learning What to Defer algorithm for MIS problem example #592

Open · wants to merge 6 commits into base: master
150 changes: 150 additions & 0 deletions examples/mis/lwd/README.md
@@ -0,0 +1,150 @@
# LWD: Learning What to Defer for Maximum Independent Sets

This example applies the [LwD](http://proceedings.mlr.press/v119/ahn20a.html) method (published at ICML 2020) to the Maximum Independent Set (MIS) problem.
Some of the code is adapted from the original GitHub [repository](https://github.com/sungsoo-ahn/learning_what_to_defer)
maintained by the authors.

## Algorithm Introduction

### Deferred Markov Decision Process

<div style="text-align: center;">

![MDP Illustration](./figures/fig1_mdp.png)

</div>

#### State

Each state of the MDP is represented as a *vertex-state* vector:

<div style="text-align: center;">

$s = [s_i: i \in V] \in \{0, 1, *\}^V$

</div>

where *0*, *1*, and *\** indicate that vertex *i* is *excluded*, *included*, or *deferred*
(the decision is postponed and expected to be made in a later iteration), respectively.
The MDP is *initialized* with all vertex-states deferred, i.e., $s_i = *, \forall i \in V$,
and *terminates* when (a) no deferred vertex-state is left or (b) the time limit is reached.
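
Concretely, the vertex-state vector can be stored as an integer tensor, as in the sketch below. The encoding, names, and shapes are illustrative assumptions, not the example's actual data structures:

```python
import torch

# Hypothetical integer encoding of the vertex-state vector: 0 = excluded, 1 = included, 2 = deferred (*).
EXCLUDED, INCLUDED, DEFERRED = 0, 1, 2

num_vertices = 6
state = torch.full((num_vertices,), DEFERRED, dtype=torch.long)  # initial state: every vertex deferred


def is_terminal(state: torch.Tensor, tick: int, max_tick: int) -> bool:
    """Terminate when no deferred vertex remains or the time limit is reached."""
    return bool((state != DEFERRED).all()) or tick >= max_tick
```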

#### Action

Actions assign new vertex-states for the next step and are defined only on the deferred vertices:

<div style="text-align: center;">

$a_* = [a_i: i \in V_*] \in \{0, 1, *\}^{V_*}$

</div>

where $V_* = \{i: i \in V, s_i = *\}$.

#### Transition

The transition $P_{a_*}(s, s')$ consists of two deterministic phases:

- *update phase*: applies the action $a_*$ to obtain an intermediate vertex-state $\hat{s}$,
  i.e., $\hat{s}_i = a_i$ if $i \in V_*$ and $\hat{s}_i = s_i$ otherwise.
- *clean-up phase*: modifies $\hat{s}$ to yield a valid vertex-state $s'$:

  - Whenever a pair of included vertices are adjacent to each other,
    they are both mapped back to the deferred vertex-state.
  - Any deferred vertex that neighbors an included vertex is excluded.

Here is an illustration of the transition function:

<div style="text-align: center;">

![Transition](./figures/fig2_transition.png)

</div>
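
Below is a minimal sketch of the two-phase transition, continuing the integer encoding from the state sketch above. It assumes `adj` is a dense boolean adjacency matrix and `action` is an integer tensor over all vertices (entries outside $V_*$ are ignored); this is illustrative only, not the business engine's actual implementation:

```python
def transition(state: torch.Tensor, action: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    deferred = state == DEFERRED

    # Update phase: copy the action onto the deferred vertices; keep all other vertices unchanged.
    new_state = torch.where(deferred, action, state)

    # Clean-up phase, step 1: any two adjacent included vertices are both reverted to deferred.
    included = new_state == INCLUDED
    has_included_neighbor = (adj & included.unsqueeze(0)).any(dim=1)
    new_state[included & has_included_neighbor] = DEFERRED

    # Clean-up phase, step 2: a deferred vertex adjacent to an included vertex is excluded.
    included = new_state == INCLUDED
    has_included_neighbor = (adj & included.unsqueeze(0)).any(dim=1)
    new_state[(new_state == DEFERRED) & has_included_neighbor] = EXCLUDED

    return new_state
```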

#### Reward

The *cardinality reward* is defined as:

<div style="text-align: center;">

$R(s, s') = \sum_{i \in V_* \setminus V_*'}{s_i'}$

</div>

where $V_*$ and $V_*'$ are the sets of vertices with the deferred vertex-state in $s$ and $s'$, respectively.
With this definition, the cumulative reward of the MDP corresponds to the cardinality of the returned independent set.
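
Continuing the sketch above, the per-step cardinality reward simply counts the vertices that move from deferred to included (a hedged sketch, not the repository's exact reward code):

```python
def cardinality_reward(state: torch.Tensor, next_state: torch.Tensor) -> int:
    # V_* \ V_*': vertices deferred in s but determined in s'; the reward counts the newly included ones.
    newly_determined = (state == DEFERRED) & (next_state != DEFERRED)
    return int((next_state[newly_determined] == INCLUDED).sum())
```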

### Diversification Reward

Two copies of the MDP defined on an identical graph $G$ are coupled into a new MDP.
The new MDP is thus associated with a pair of distinct vertex-state vectors $(s, \bar{s})$;
let the resulting solutions be $(x, \bar{x})$.
The deviation between the coupled solutions is rewarded directly in terms of the $l_1$-norm, i.e., $\|x-\bar{x}\|_1$.
Specifically, the deviation is decomposed into per-iteration rewards of the MDP, defined by:

<div style="text-align: center;">

$R_{div}(s, s', \bar{s}, \bar{s}') = \sum_{i \in \hat{V}}|s_i'-\bar{s}_i'|$, where $\hat{V}=(V_* \setminus V_*')\cup(\bar{V}_* \setminus \bar{V}_*')$

</div>

Here is an example of the diversity reward:

<div style="text-align: center;">

![Diversity Reward](./figures/fig3_diversity_reward.png)

</div>
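
A sketch of the per-step diversification reward for the two coupled rollouts, following the formula above. It assumes the entries of both next states restricted to $\hat{V}$ are already in $\{0, 1\}$; a complete implementation would handle the deferred symbol explicitly:

```python
def diversification_reward(
    s: torch.Tensor, s_next: torch.Tensor, s_bar: torch.Tensor, s_bar_next: torch.Tensor,
) -> int:
    # V_hat: vertices newly determined in either of the two coupled copies at this step.
    v_hat = ((s == DEFERRED) & (s_next != DEFERRED)) | ((s_bar == DEFERRED) & (s_bar_next != DEFERRED))
    # l1 deviation between the coupled next states, restricted to V_hat.
    return int((s_next[v_hat] - s_bar_next[v_hat]).abs().sum())
```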

*Entropy regularization plays a role similar to the diversification reward introduced above.
Note, however, that entropy regularization only encourages diverse trajectories of the same MDP,
which does not necessarily lead to diverse final solutions,
since many different trajectories can result in the same solution.*

### Design of the Neural Network

The policy network $\pi(a|s)$ and the value network $V(s)$ are designed to follow the
[GraphSAGE](https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html) architecture,
which is a general inductive framework that leverages node feature information
to efficiently generate node embeddings by sampling and aggregating features from a node's local neighborhood.
Each network consists of multiple layers $h^{(n)}$ with $n = 1, ..., N$,
where the $n$-th layer with weights $W_1^{(n)}$ and $W_2^{(n)}$ performs the following transformation on the input $H$:

<div style="text-align: center;">

$h^{(n)} = \mathrm{ReLU}(HW_1^{(n)}+D^{-\frac{1}{2}}BD^{-\frac{1}{2}}HW_2^{(n)})$.

</div>

Here $B$ and $D$ correspond to the adjacency and degree matrices of the graph $G$, respectively. At the final layer,
instead of ReLU, the policy network applies a softmax function to generate actions, while the value network applies a
graph readout function with sum pooling to generate value estimates.
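
A minimal PyTorch sketch of one such layer is given below; the class name and dense normalization are illustrative assumptions, while the example's actual networks are built via `get_ppo_policy` and `get_ppo_trainer` in `examples.mis.lwd.ppo`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SageLayer(nn.Module):
    """One layer of the update above: h = ReLU(H W1 + D^{-1/2} B D^{-1/2} H W2)."""

    def __init__(self, in_dim: int, out_dim: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization D^{-1/2} B D^{-1/2}; clamping avoids division by zero for isolated nodes.
        adj = adj.float()
        deg = adj.sum(dim=-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
        return F.relu(self.w1(h) + norm_adj @ self.w2(h))
```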

### Input of the Neural Network

- The networks take the subgraph induced on the deferred vertices $V_*$ as input,
  since the determined part of the graph no longer affects the future rewards of the MDP.
- Input features (a small feature-construction sketch follows this list):

- Vertex degrees;
- The current iteration-index of the MDP, normalized by the maximum number of iterations.
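
The sketch below shows one way to assemble these two features for the deferred subgraph, matching `input_dim = 2` in *config.py*; it continues the encoding constants from the state sketch above, and the names and shapes are illustrative assumptions:

```python
def build_node_features(adj: torch.Tensor, state: torch.Tensor, tick: int, max_tick: int) -> torch.Tensor:
    deferred = state == DEFERRED
    sub_adj = adj[deferred][:, deferred]                    # subgraph induced on the deferred vertices V_*
    degree = sub_adj.sum(dim=-1, keepdim=True).float()      # feature 1: vertex degree within the subgraph
    progress = torch.full_like(degree, tick / max_tick)     # feature 2: normalized iteration index
    return torch.cat([degree, progress], dim=-1)            # shape: (|V_*|, 2)
```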

### Training Algorithm

Proximal Policy Optimization (PPO) is used to train the policy in this solution.
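
For reference, the clipped surrogate objective that PPO optimizes (with `clip_ratio` as in *config.py*) has the standard form sketched below; this is a generic illustration, not the repository's exact loss code:

```python
import torch


def ppo_clip_loss(
    logp_new: torch.Tensor, logp_old: torch.Tensor, advantage: torch.Tensor, clip_ratio: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # Maximize the clipped surrogate objective, i.e., minimize its negation.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```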

## Quick Start

Please make sure the environment is correctly set up; refer to
[MARO](https://github.com/microsoft/maro#install-maro-from-source) for installation guidance.
To try the example, simply run:

```sh
python examples/rl/run.py examples/mis/lwd/config.yml
```

The default log path is set to *examples/mis/lwd/log/test*; the recorded metrics and training curves can be found there.

To adjust the configuration of the training workflow, edit *examples/mis/lwd/config.yml*.
To adjust the problem formulation, network settings, and other detailed configurations,
edit *examples/mis/lwd/config.py*.
99 changes: 99 additions & 0 deletions examples/mis/lwd/__init__.py
@@ -0,0 +1,99 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import torch

from maro.rl.rl_component.rl_component_bundle import RLComponentBundle
from maro.rl.utils.common import get_env
from maro.simulator import Env

from examples.mis.lwd.config import Config
from examples.mis.lwd.env_sampler.mis_env_sampler import MISEnvSampler, MISPlottingCallback
from examples.mis.lwd.simulator.mis_business_engine import MISBusinessEngine
from examples.mis.lwd.ppo import get_ppo_policy, get_ppo_trainer


config = Config()

# Environments
learn_env = Env(
business_engine_cls=MISBusinessEngine,
durations=config.max_tick,
options={
"graph_batch_size": config.train_graph_batch_size,
"num_samples": config.train_num_samples,
"device": torch.device(config.device),
"num_node_lower_bound": config.num_node_lower_bound,
"num_node_upper_bound": config.num_node_upper_bound,
"node_sample_probability": config.node_sample_probability,
},
)

test_env = Env(
business_engine_cls=MISBusinessEngine,
durations=config.max_tick,
options={
"graph_batch_size": config.eval_graph_batch_size,
"num_samples": config.eval_num_samples,
"device": torch.device(config.device),
"num_node_lower_bound": config.num_node_lower_bound,
"num_node_upper_bound": config.num_node_upper_bound,
"node_sample_probability": config.node_sample_probability,
},
)

# Agent, policy, and trainers
agent2policy = {agent: f"ppo_{agent}.policy" for agent in learn_env.agent_idx_list}

policies = [
get_ppo_policy(
name=f"ppo_{agent}.policy",
state_dim=config.input_dim,
action_num=config.output_dim,
hidden_dim=config.hidden_dim,
num_layers=config.num_layers,
init_lr=config.init_lr,
)
for agent in learn_env.agent_idx_list
]

trainers = [
get_ppo_trainer(
name=f"ppo_{agent}",
state_dim=config.input_dim,
hidden_dim=config.hidden_dim,
num_layers=config.num_layers,
init_lr=config.init_lr,
clip_ratio=config.clip_ratio,
max_tick=config.max_tick,
batch_size=config.batch_size,
reward_discount=config.reward_discount,
graph_batch_size=config.train_graph_batch_size,
graph_num_samples=config.train_num_samples,
num_train_epochs=config.num_train_epochs,
norm_base=config.reward_normalization_base,
)
for agent in learn_env.agent_idx_list
]

device_mapping = {f"ppo_{agent}.policy": config.device for agent in learn_env.agent_idx_list}

# Build RLComponentBundle
rl_component_bundle = RLComponentBundle(
env_sampler=MISEnvSampler(
learn_env=learn_env,
test_env=test_env,
policies=policies,
agent2policy=agent2policy,
diversity_reward_coef=config.diversity_reward_coef,
reward_normalization_base=config.reward_normalization_base,
),
agent2policy=agent2policy,
policies=policies,
trainers=trainers,
device_mapping=device_mapping,
customized_callbacks=[MISPlottingCallback(log_dir=get_env("LOG_PATH", required=False, default="./"))],
)


__all__ = ["rl_component_bundle"]
39 changes: 39 additions & 0 deletions examples/mis/lwd/config.py
@@ -0,0 +1,39 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.


class Config(object):
device: str = "cuda:0"

# Configuration for graph batch size
train_graph_batch_size = 32
eval_graph_batch_size = 32

# Configuration for num_samples
train_num_samples = 2
eval_num_samples = 10

# Configuration for the MISEnv
    max_tick = 32  # Once max_tick is reached, the timeout processing sets all remaining deferred nodes to excluded
num_node_lower_bound: int = 40
num_node_upper_bound: int = 50
node_sample_probability: float = 0.15

# Configuration for the reward definition
    diversity_reward_coef = 0.1  # reward = cardinality reward + coef * diversity reward
reward_normalization_base = 20

# Configuration for the GraphBasedActorCritic
input_dim = 2
output_dim = 3
hidden_dim = 128
num_layers = 5

# Configuration for PPO update
init_lr = 1e-4
clip_ratio = 0.2
reward_discount = 1.0

# Configuration for main loop
batch_size = 16
num_train_epochs = 4
37 changes: 37 additions & 0 deletions examples/mis/lwd/config.yml
@@ -0,0 +1,37 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

# Please refer to `maro/rl/workflows/config/template.yml` for the complete template and detailed explanations.

job: mis_lwd
scenario_path: "examples/mis/lwd"
# The log dir where you want to save the training logs and model checkpoints.
log_path: "examples/mis/lwd/log/test_40_50"
main:
# Number of episodes to run. Each episode is one cycle of roll-out and training.
num_episodes: 1000
# This can be an integer or a list of integers. An integer indicates the interval at which policies are evaluated.
# A list indicates the episodes at the end of which policies are to be evaluated. Note that episode indexes are
# 1-based.
eval_schedule: [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000]
# Number of Episodes to run in evaluation.
num_eval_episodes: 5
min_n_sample: 1
logging:
stdout: INFO
file: DEBUG
rollout:
logging:
stdout: INFO
file: DEBUG
training:
mode: simple
load_path: null
load_episode: null
checkpointing:
path: null
# Interval at which trained policies / models are persisted to disk.
interval: 200
logging:
stdout: INFO
file: DEBUG
46 changes: 46 additions & 0 deletions examples/mis/lwd/env_sampler/baseline.py
@@ -0,0 +1,46 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import random
from typing import Dict, List


def _choose_by_weight(graph: Dict[int, List[int]], node2weight: Dict[int, float]) -> List[int]:
"""Choose node in the order of descending weight if not blocked..

Args:
        graph (Dict[int, List[int]]): The adjacency list of the target graph. The key is the node id of each node.
            The value is a list of that node's neighbor nodes.
node2weight (Dict[int, float]): The node to weight dictionary with node id as key and node weight as value.

Returns:
List[int]: A list of chosen node id.
"""
node_weight_list = [(node, weight) for node, weight in node2weight.items()]
# Shuffle the candidates to get random result in the case there are nodes sharing the same weight.
random.shuffle(node_weight_list)
# Sort node candidates with descending weight.
sorted_nodes = sorted(node_weight_list, key=lambda x: x[1], reverse=True)

chosen_node_id_set: set = set()
blocked_node_id_set: set = set()
# Choose node in the order of descending weight if it is not blocked yet by the chosen nodes.
for node, _ in sorted_nodes:
if node in blocked_node_id_set:
continue
chosen_node_id_set.add(node)
for neighbor_node in graph[node]:
blocked_node_id_set.add(neighbor_node)

    chosen_node_ids = list(chosen_node_id_set)
return chosen_node_ids

def uniform_mis_solver(graph: Dict[int, List[int]]) -> List[int]:
node2weight: Dict[int, float] = {node: 1 for node in graph.keys()}
chosen_node_list = _choose_by_weight(graph, node2weight)
return chosen_node_list

def greedy_mis_solver(graph: Dict[int, List[int]]) -> List[int]:
node2weight: Dict[int, float] = {node: 1 / (1 + len(neighbor_list)) for node, neighbor_list in graph.items()}
chosen_node_list = _choose_by_weight(graph, node2weight)
return chosen_node_list
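
A quick usage sketch for the two baseline solvers above; the toy adjacency list is hypothetical and only illustrates the expected input format:

```python
if __name__ == "__main__":
    # Toy graph: a path 0-1-2-3 plus an isolated vertex 4.
    toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
    print("uniform:", sorted(uniform_mis_solver(toy_graph)))
    print("greedy: ", sorted(greedy_mis_solver(toy_graph)))
```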