Add Learning What to Defer algorithm for MIS problem example #592

Open · wants to merge 6 commits into base: master
150 changes: 150 additions & 0 deletions examples/mis/lwd/README.md
@@ -0,0 +1,150 @@
# LWD: Learning What to Defer for Maximum Independent Sets

This example applies the [LwD](http://proceedings.mlr.press/v119/ahn20a.html) method (published at ICML 2020) to the Maximum Independent Set (MIS) problem.
Some of the code is adapted from the original GitHub [repository](https://github.com/sungsoo-ahn/learning_what_to_defer)
maintained by the authors.

## Algorithm Introduction

### Deferred Markov Decision Process

<div style="text-align: center;">

![MDP Illustration](./figures/fig1_mdp.png)

</div>

#### State

Each state of the MDP is represented as a *vertex-state* vector:

<div style="text-align: center;">

$s = [s_i: i \in V] \in \{0, 1, *\}^V$

</div>

where *0*, *1*, and *\** indicate that vertex *i* is *excluded*, *included*, or *deferred*
(the decision is postponed and expected to be made in a later iteration), respectively.
The MDP is *initialized* with all vertex-states deferred, i.e., $s_i = *, \forall i \in V$,
and *terminates* when (a) no deferred vertex-state is left or (b) the time limit is reached.
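
Concretely, the vertex-state vector can be stored as an integer tensor, as in the sketch below. The encoding, names, and shapes are illustrative assumptions, not the example's actual data structures:

```python
import torch

# Hypothetical integer encoding of the vertex-state vector: 0 = excluded, 1 = included, 2 = deferred (*).
EXCLUDED, INCLUDED, DEFERRED = 0, 1, 2

num_vertices = 6
state = torch.full((num_vertices,), DEFERRED, dtype=torch.long)  # initial state: every vertex deferred


def is_terminal(state: torch.Tensor, tick: int, max_tick: int) -> bool:
    """Terminate when no deferred vertex remains or the time limit is reached."""
    return bool((state != DEFERRED).all()) or tick >= max_tick
```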

#### Action

Actions assign new vertex-states for the next step and are defined only on the deferred vertices:

<div style="text-align: center;">

$a_* = [a_i: i \in V_*] \in \{0, 1, *\}^{V_*}$

</div>

where $V_* = \{i: i \in V, s_i = *\}$.

#### Transition

The transition $P_{a_*}(s, s')$ consists of two deterministic phases:

- *update phase*: applies the action $a_*$ to obtain an intermediate vertex-state $\hat{s}$,
  i.e., $\hat{s}_i = a_i$ if $i \in V_*$ and $\hat{s}_i = s_i$ otherwise.
- *clean-up phase*: modifies $\hat{s}$ to yield a valid vertex-state $s'$:

  - Whenever a pair of included vertices are adjacent to each other,
    they are both mapped back to the deferred vertex-state.
  - Any deferred vertex that neighbors an included vertex is excluded.

Here is an illustration of the transition function:

<div style="text-align: center;">

![Transition](./figures/fig2_transition.png)

</div>
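
Below is a minimal sketch of the two-phase transition, continuing the integer encoding from the state sketch above. It assumes `adj` is a dense boolean adjacency matrix and `action` is an integer tensor over all vertices (entries outside $V_*$ are ignored); this is illustrative only, not the business engine's actual implementation:

```python
def transition(state: torch.Tensor, action: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    deferred = state == DEFERRED

    # Update phase: copy the action onto the deferred vertices; keep all other vertices unchanged.
    new_state = torch.where(deferred, action, state)

    # Clean-up phase, step 1: any two adjacent included vertices are both reverted to deferred.
    included = new_state == INCLUDED
    has_included_neighbor = (adj & included.unsqueeze(0)).any(dim=1)
    new_state[included & has_included_neighbor] = DEFERRED

    # Clean-up phase, step 2: a deferred vertex adjacent to an included vertex is excluded.
    included = new_state == INCLUDED
    has_included_neighbor = (adj & included.unsqueeze(0)).any(dim=1)
    new_state[(new_state == DEFERRED) & has_included_neighbor] = EXCLUDED

    return new_state
```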

#### Reward

The *cardinality reward* is defined as:

<div style="text-align: center;">

$R(s, s') = \sum_{i \in V_* \setminus V_*'}{s_i'}$

</div>

where $V_*$ and $V_*'$ are the sets of vertices with the deferred vertex-state in $s$ and $s'$, respectively.
With this definition, the cumulative reward of the MDP corresponds to the cardinality of the returned independent set.
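
Continuing the sketch above, the per-step cardinality reward simply counts the vertices that move from deferred to included (a hedged sketch, not the repository's exact reward code):

```python
def cardinality_reward(state: torch.Tensor, next_state: torch.Tensor) -> int:
    # V_* \ V_*': vertices deferred in s but determined in s'; the reward counts the newly included ones.
    newly_determined = (state == DEFERRED) & (next_state != DEFERRED)
    return int((next_state[newly_determined] == INCLUDED).sum())
```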

### Diversification Reward

Two copies of the MDP defined on an identical graph $G$ are coupled into a new MDP.
The new MDP is thus associated with a pair of distinct vertex-state vectors $(s, \bar{s})$;
let the resulting solutions be $(x, \bar{x})$.
The deviation between the coupled solutions is rewarded directly in terms of the $l_1$-norm, i.e., $\|x-\bar{x}\|_1$.
Specifically, the deviation is decomposed into per-iteration rewards of the MDP, defined by:

<div style="text-align: center;">

$R_{div}(s, s', \bar{s}, \bar{s}') = \sum_{i \in \hat{V}}|s_i'-\bar{s}_i'|$, where $\hat{V}=(V_* \setminus V_*')\cup(\bar{V}_* \setminus \bar{V}_*')$

</div>

Here is an example of the diversity reward:

<div style="text-align: center;">

![Diversity Reward](./figures/fig3_diversity_reward.png)

</div>
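
A sketch of the per-step diversification reward for the two coupled rollouts, following the formula above. It assumes the entries of both next states restricted to $\hat{V}$ are already in $\{0, 1\}$; a complete implementation would handle the deferred symbol explicitly:

```python
def diversification_reward(
    s: torch.Tensor, s_next: torch.Tensor, s_bar: torch.Tensor, s_bar_next: torch.Tensor,
) -> int:
    # V_hat: vertices newly determined in either of the two coupled copies at this step.
    v_hat = ((s == DEFERRED) & (s_next != DEFERRED)) | ((s_bar == DEFERRED) & (s_bar_next != DEFERRED))
    # l1 deviation between the coupled next states, restricted to V_hat.
    return int((s_next[v_hat] - s_bar_next[v_hat]).abs().sum())
```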

*Entropy regularization plays a role similar to the diversification reward introduced above.
Note, however, that entropy regularization only encourages diverse trajectories of the same MDP,
which does not necessarily lead to diverse final solutions,
since many different trajectories can result in the same solution.*

### Design of the Neural Network

The policy network $\pi(a|s)$ and the value network $V(s)$ are designed to follow the
[GraphSAGE](https://proceedings.neurips.cc/paper/2017/hash/5dd9db5e033da9c6fb5ba83c7a7ebea9-Abstract.html) architecture,
which is a general inductive framework that leverages node feature information
to efficiently generate node embeddings by sampling and aggregating features from a node's local neighborhood.
Each network consists of multiple layers $h^{(n)}$ with $n = 1, ..., N$,
where the $n$-th layer with weights $W_1^{(n)}$ and $W_2^{(n)}$ performs the following transformation on the input $H$:

<div style="text-align: center;">

$h^{(n)} = \mathrm{ReLU}(HW_1^{(n)}+D^{-\frac{1}{2}}BD^{-\frac{1}{2}}HW_2^{(n)})$.

</div>

Here $B$ and $D$ correspond to the adjacency and degree matrices of the graph $G$, respectively. At the final layer,
instead of ReLU, the policy network applies a softmax function to generate actions, while the value network applies a
graph readout function with sum pooling to generate value estimates.
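
A minimal PyTorch sketch of one such layer is given below; the class name and dense normalization are illustrative assumptions, while the example's actual networks are built via `get_ppo_policy` and `get_ppo_trainer` in `examples.mis.lwd.ppo`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SageLayer(nn.Module):
    """One layer of the update above: h = ReLU(H W1 + D^{-1/2} B D^{-1/2} H W2)."""

    def __init__(self, in_dim: int, out_dim: int) -> None:
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Symmetric normalization D^{-1/2} B D^{-1/2}; clamping avoids division by zero for isolated nodes.
        adj = adj.float()
        deg = adj.sum(dim=-1).clamp(min=1.0)
        d_inv_sqrt = deg.pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)
        return F.relu(self.w1(h) + norm_adj @ self.w2(h))
```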

### Input of the Neural Network

- The networks take the subgraph induced on the deferred vertices $V_*$ as input,
  since the determined part of the graph no longer affects the future rewards of the MDP.
- Input features (a small feature-construction sketch follows this list):

- Vertex degrees;
- The current iteration-index of the MDP, normalized by the maximum number of iterations.
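
The sketch below shows one way to assemble these two features for the deferred subgraph, matching `input_dim = 2` in *config.py*; it continues the encoding constants from the state sketch above, and the names and shapes are illustrative assumptions:

```python
def build_node_features(adj: torch.Tensor, state: torch.Tensor, tick: int, max_tick: int) -> torch.Tensor:
    deferred = state == DEFERRED
    sub_adj = adj[deferred][:, deferred]                    # subgraph induced on the deferred vertices V_*
    degree = sub_adj.sum(dim=-1, keepdim=True).float()      # feature 1: vertex degree within the subgraph
    progress = torch.full_like(degree, tick / max_tick)     # feature 2: normalized iteration index
    return torch.cat([degree, progress], dim=-1)            # shape: (|V_*|, 2)
```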

### Training Algorithm

Proximal Policy Optimization (PPO) is used to train the policy in this solution.
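
For reference, the clipped surrogate objective that PPO optimizes (with `clip_ratio` as in *config.py*) has the standard form sketched below; this is a generic illustration, not the repository's exact loss code:

```python
import torch


def ppo_clip_loss(
    logp_new: torch.Tensor, logp_old: torch.Tensor, advantage: torch.Tensor, clip_ratio: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    # Maximize the clipped surrogate objective, i.e., minimize its negation.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```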

## Quick Start

Please make sure the environment is correctly set up; refer to
[MARO](https://github.com/microsoft/maro#install-maro-from-source) for installation guidance.
To try the example, simply run:

```sh
python examples/rl/run.py examples/mis/lwd/config.yml
```

The default log path is set to *examples/mis/lwd/log/test*; the recorded metrics and training curves can be found there.

To adjust the configuration of the training workflow, edit *examples/mis/lwd/config.yml*.
To adjust the problem formulation, network settings, and other detailed configurations,
edit *examples/mis/lwd/config.py*.
99 changes: 99 additions & 0 deletions examples/mis/lwd/__init__.py
@@ -0,0 +1,99 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import torch

from maro.rl.rl_component.rl_component_bundle import RLComponentBundle
from maro.rl.utils.common import get_env
from maro.simulator import Env

from examples.mis.lwd.config import Config
from examples.mis.lwd.env_sampler.mis_env_sampler import MISEnvSampler, MISPlottingCallback
from examples.mis.lwd.simulator.mis_business_engine import MISBusinessEngine
from examples.mis.lwd.ppo import get_ppo_policy, get_ppo_trainer


config = Config()

# Environments
learn_env = Env(
business_engine_cls=MISBusinessEngine,
durations=config.max_tick,
options={
"graph_batch_size": config.train_graph_batch_size,
"num_samples": config.train_num_samples,
"device": torch.device(config.device),
"num_node_lower_bound": config.num_node_lower_bound,
"num_node_upper_bound": config.num_node_upper_bound,
"node_sample_probability": config.node_sample_probability,
},
)

test_env = Env(
business_engine_cls=MISBusinessEngine,
durations=config.max_tick,
options={
"graph_batch_size": config.eval_graph_batch_size,
"num_samples": config.eval_num_samples,
"device": torch.device(config.device),
"num_node_lower_bound": config.num_node_lower_bound,
"num_node_upper_bound": config.num_node_upper_bound,
"node_sample_probability": config.node_sample_probability,
},
)

# Agent, policy, and trainers
agent2policy = {agent: f"ppo_{agent}.policy" for agent in learn_env.agent_idx_list}

policies = [
get_ppo_policy(
name=f"ppo_{agent}.policy",
state_dim=config.input_dim,
action_num=config.output_dim,
hidden_dim=config.hidden_dim,
num_layers=config.num_layers,
init_lr=config.init_lr,
)
for agent in learn_env.agent_idx_list
]

trainers = [
get_ppo_trainer(
name=f"ppo_{agent}",
state_dim=config.input_dim,
hidden_dim=config.hidden_dim,
num_layers=config.num_layers,
init_lr=config.init_lr,
clip_ratio=config.clip_ratio,
max_tick=config.max_tick,
batch_size=config.batch_size,
reward_discount=config.reward_discount,
graph_batch_size=config.train_graph_batch_size,
graph_num_samples=config.train_num_samples,
num_train_epochs=config.num_train_epochs,
norm_base=config.reward_normalization_base,
)
for agent in learn_env.agent_idx_list
]

device_mapping = {f"ppo_{agent}.policy": config.device for agent in learn_env.agent_idx_list}

# Build RLComponentBundle
rl_component_bundle = RLComponentBundle(
env_sampler=MISEnvSampler(
learn_env=learn_env,
test_env=test_env,
policies=policies,
agent2policy=agent2policy,
diversity_reward_coef=config.diversity_reward_coef,
reward_normalization_base=config.reward_normalization_base,
),
agent2policy=agent2policy,
policies=policies,
trainers=trainers,
device_mapping=device_mapping,
customized_callbacks=[MISPlottingCallback(log_dir=get_env("LOG_PATH", required=False, default="./"))],
)


__all__ = ["rl_component_bundle"]
39 changes: 39 additions & 0 deletions examples/mis/lwd/config.py
@@ -0,0 +1,39 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.


class Config(object):
device: str = "cuda:0"

# Configuration for graph batch size
train_graph_batch_size = 32
eval_graph_batch_size = 32

# Configuration for num_samples
train_num_samples = 2
eval_num_samples = 10

# Configuration for the MISEnv
    max_tick = 32  # Once max_tick is reached, the timeout processing sets all remaining deferred nodes to excluded
num_node_lower_bound: int = 40
num_node_upper_bound: int = 50
node_sample_probability: float = 0.15

# Configuration for the reward definition
    diversity_reward_coef = 0.1  # reward = cardinality reward + coef * diversity reward
reward_normalization_base = 20

# Configuration for the GraphBasedActorCritic
input_dim = 2
output_dim = 3
hidden_dim = 128
num_layers = 5

# Configuration for PPO update
init_lr = 1e-4
clip_ratio = 0.2
reward_discount = 1.0

# Configuration for main loop
batch_size = 16
num_train_epochs = 4
37 changes: 37 additions & 0 deletions examples/mis/lwd/config.yml
@@ -0,0 +1,37 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

# Please refer to `maro/rl/workflows/config/template.yml` for the complete template and detailed explanations.

job: mis_lwd
scenario_path: "examples/mis/lwd"
# The log dir where you want to save the training logs and model checkpoints.
log_path: "examples/mis/lwd/log/test_40_50"
main:
# Number of episodes to run. Each episode is one cycle of roll-out and training.
num_episodes: 1000
# This can be an integer or a list of integers. An integer indicates the interval at which policies are evaluated.
# A list indicates the episodes at the end of which policies are to be evaluated. Note that episode indexes are
# 1-based.
eval_schedule: [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000]
# Number of Episodes to run in evaluation.
num_eval_episodes: 5
min_n_sample: 1
logging:
stdout: INFO
file: DEBUG
rollout:
logging:
stdout: INFO
file: DEBUG
training:
mode: simple
load_path: null
load_episode: null
checkpointing:
path: null
# Interval at which trained policies / models are persisted to disk.
interval: 200
logging:
stdout: INFO
file: DEBUG
46 changes: 46 additions & 0 deletions examples/mis/lwd/env_sampler/baseline.py
@@ -0,0 +1,46 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

import random
from typing import Dict, List


def _choose_by_weight(graph: Dict[int, List[int]], node2weight: Dict[int, float]) -> List[int]:
"""Choose node in the order of descending weight if not blocked..

Args:
        graph (Dict[int, List[int]]): The adjacency list of the target graph. The key is the node id of each node.
            The value is a list of that node's neighbor nodes.
node2weight (Dict[int, float]): The node to weight dictionary with node id as key and node weight as value.

Returns:
List[int]: A list of chosen node id.
"""
node_weight_list = [(node, weight) for node, weight in node2weight.items()]
# Shuffle the candidates to get random result in the case there are nodes sharing the same weight.
random.shuffle(node_weight_list)
# Sort node candidates with descending weight.
sorted_nodes = sorted(node_weight_list, key=lambda x: x[1], reverse=True)

chosen_node_id_set: set = set()
blocked_node_id_set: set = set()
# Choose node in the order of descending weight if it is not blocked yet by the chosen nodes.
for node, _ in sorted_nodes:
if node in blocked_node_id_set:
continue
chosen_node_id_set.add(node)
for neighbor_node in graph[node]:
blocked_node_id_set.add(neighbor_node)

    chosen_node_ids = list(chosen_node_id_set)
return chosen_node_ids

def uniform_mis_solver(graph: Dict[int, List[int]]) -> List[int]:
node2weight: Dict[int, float] = {node: 1 for node in graph.keys()}
chosen_node_list = _choose_by_weight(graph, node2weight)
return chosen_node_list

def greedy_mis_solver(graph: Dict[int, List[int]]) -> List[int]:
node2weight: Dict[int, float] = {node: 1 / (1 + len(neighbor_list)) for node, neighbor_list in graph.items()}
chosen_node_list = _choose_by_weight(graph, node2weight)
return chosen_node_list
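
A quick usage sketch for the two baseline solvers above; the toy adjacency list is hypothetical and only illustrates the expected input format:

```python
if __name__ == "__main__":
    # Toy graph: a path 0-1-2-3 plus an isolated vertex 4.
    toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
    print("uniform:", sorted(uniform_mis_solver(toy_graph)))
    print("greedy: ", sorted(greedy_mis_solver(toy_graph)))
```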