Policy ensemble #79
-
Hi @PURANJAY14 Two questions:
-
Hi @PURANJAY14 According to your description, the simplest solution is to create a new trainer that computes the ensemble's action and uses it to step the environment, as shown below. The parts you need to fill in are marked with TODO comments in the code:
import copy

import tqdm
import torch

from skrl.trainers.torch import Trainer


ENSEMBLE_TRAINER_DEFAULT_CONFIG = {
    "timesteps": 100000,            # number of timesteps to train for
    "headless": False,              # whether to use headless mode (no rendering)
    "disable_progressbar": False,   # whether to disable the progressbar. If None, disable on non-TTY
    "close_environment_at_exit": True,   # whether to close the environment on normal program termination
}


class EnsembleTrainer(Trainer):
    def __init__(self, env, agents, agents_scope=None, cfg=None):
        _cfg = copy.deepcopy(ENSEMBLE_TRAINER_DEFAULT_CONFIG)
        _cfg.update(cfg if cfg is not None else {})
        super().__init__(env=env, agents=agents, agents_scope=agents_scope, cfg=_cfg)

        # init agents
        for agent in self.agents:
            agent.init(trainer_cfg=self.cfg)

    def train(self):
        # reset env
        states, infos = self.env.reset()

        for timestep in tqdm.tqdm(range(self.initial_timestep, self.timesteps), disable=self.disable_progressbar):
            # pre-interaction
            for agent in self.agents:
                agent.pre_interaction(timestep=timestep, timesteps=self.timesteps)

            # compute actions
            with torch.no_grad():
                actions_list = [agent.act(states, timestep=timestep, timesteps=self.timesteps)[0] for agent in self.agents]

                # TODO: compute ensemble to generate one action with shape (num_envs, action_space_size)
                actions = ...

                # TODO: recompute log_prob (for PPO)
                for agent in self.agents:
                    agent._current_log_prob = ...

            # step the environments
            next_states, rewards, terminated, truncated, infos = self.env.step(actions)

            # render scene
            if not self.headless:
                self.env.render()

            # record the environments' transitions
            with torch.no_grad():
                for agent in self.agents:
                    agent.record_transition(states=states,
                                            actions=actions,
                                            rewards=rewards,
                                            next_states=next_states,
                                            terminated=terminated,
                                            truncated=truncated,
                                            infos=infos,
                                            timestep=timestep,
                                            timesteps=self.timesteps)

            # post-interaction
            for agent in self.agents:
                agent.post_interaction(timestep=timestep, timesteps=self.timesteps)

            # reset environments
            with torch.no_grad():
                if terminated.any() or truncated.any():
                    states, infos = self.env.reset()
                else:
                    states = next_states
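For the first TODO, one simple ensemble rule (just an assumption for illustration, valid for continuous action spaces; a majority vote would be the discrete analogue) is to average the per-agent actions:

                # sketch: average the per-agent actions into one ensemble action
                # shape: (num_envs, action_space_size)
                actions = torch.mean(torch.stack(actions_list, dim=0), dim=0)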
-
But for amp_humanoid there is an error. Currently there is no on-policy update, and I am just picking the first agent's action as the ensemble action output.
-
Shouldn't this training code for ensembles (note: I have only used actions = actions_list[1]) result in an increasing reward over time, since we are essentially using a single agent of the ensemble, or do I need to explicitly recompute the log-probability? I can see no learning in the simulations using the code below.
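For reference, a sketch of the second TODO (recomputing the log-probability), assuming skrl PPO agents with Gaussian policies. Each agent stores the log-probability of its own sampled action in _current_log_prob during act(), so when the executed action comes from another agent (or from the ensemble), that stored value no longer matches the action actually taken and PPO's importance ratio is computed against the wrong behavior, which can stall learning. Re-evaluating the executed action under each agent's own policy keeps them consistent (_state_preprocessor and _current_log_prob are PPO internals; passing taken_actions to the model returns their log-probability under the current distribution):

                # sketch: recompute each agent's log-prob for the action that was actually executed
                for agent in self.agents:
                    _, log_prob, _ = agent.policy.act({"states": agent._state_preprocessor(states),
                                                       "taken_actions": actions}, role="policy")
                    agent._current_log_prob = log_prob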
-
Hi,
Can you suggest an easy implementation to create a policy ensemble using the skrl library? Say we have 3 agents pre-trained on a certain environment; I want to create a policy that picks the best action from the given 3 and updates the weights of all 3 agents.
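For what it's worth, a minimal usage sketch of the EnsembleTrainer proposed above, assuming env and three PPO agents (agent_1, agent_2, agent_3) have already been set up for the same environment as usual (models, memories, cfg, ...); the checkpoint paths are placeholders:

# hypothetical setup: load the pre-trained weights into each agent
agent_1.load("./runs/agent_1/checkpoints/best_agent.pt")
agent_2.load("./runs/agent_2/checkpoints/best_agent.pt")
agent_3.load("./runs/agent_3/checkpoints/best_agent.pt")

# train the ensemble: each step, the trainer combines the 3 actions into one,
# steps the environment once, and lets every agent record the transition and update
cfg = {"timesteps": 100000, "headless": True}
trainer = EnsembleTrainer(env=env, agents=[agent_1, agent_2, agent_3], cfg=cfg)
trainer.train()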