Policy ensemble #79
-
Hi @PURANJAY14 Two questions:
-
Hi @PURANJAY14 According to your description, the simplest solution is to create a new trainer that computes the ensemble's action and uses it to step the environment, as shown below. The parts you need to fill in are marked with TODO comments in the code:
import copy

import tqdm
import torch

from skrl.trainers.torch import Trainer


ENSEMBLE_TRAINER_DEFAULT_CONFIG = {
    "timesteps": 100000,            # number of timesteps to train for
    "headless": False,              # whether to use headless mode (no rendering)
    "disable_progressbar": False,   # whether to disable the progressbar. If None, disable on non-TTY
    "close_environment_at_exit": True,   # whether to close the environment on normal program termination
}


class EnsembleTrainer(Trainer):
    def __init__(self, env, agents, agents_scope=None, cfg=None):
        _cfg = copy.deepcopy(ENSEMBLE_TRAINER_DEFAULT_CONFIG)
        _cfg.update(cfg if cfg is not None else {})
        super().__init__(env=env, agents=agents, agents_scope=agents_scope, cfg=_cfg)

        # init agents
        for agent in self.agents:
            agent.init(trainer_cfg=self.cfg)

    def train(self):
        # reset env
        states, infos = self.env.reset()

        for timestep in tqdm.tqdm(range(self.initial_timestep, self.timesteps), disable=self.disable_progressbar):
            # pre-interaction
            for agent in self.agents:
                agent.pre_interaction(timestep=timestep, timesteps=self.timesteps)

            # compute actions
            with torch.no_grad():
                actions_list = [agent.act(states, timestep=timestep, timesteps=self.timesteps)[0] for agent in self.agents]

                # TODO: compute ensemble to generate one action with shape (num_envs, action_space_size)
                actions = ...

                # TODO: recompute log_prob (for PPO)
                for agent in self.agents:
                    agent._current_log_prob = ...

            # step the environments
            next_states, rewards, terminated, truncated, infos = self.env.step(actions)

            # render scene
            if not self.headless:
                self.env.render()

            # record the environments' transitions
            with torch.no_grad():
                for agent in self.agents:
                    agent.record_transition(states=states,
                                            actions=actions,
                                            rewards=rewards,
                                            next_states=next_states,
                                            terminated=terminated,
                                            truncated=truncated,
                                            infos=infos,
                                            timestep=timestep,
                                            timesteps=self.timesteps)

            # post-interaction
            for agent in self.agents:
                agent.post_interaction(timestep=timestep, timesteps=self.timesteps)

            # reset environments
            with torch.no_grad():
                if terminated.any() or truncated.any():
                    states, infos = self.env.reset()
                else:
                    states = next_states
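For the first TODO, one simple ensemble rule (just an assumption for illustration, valid for continuous action spaces; a majority vote would be the discrete analogue) is to average the per-agent actions:

                # sketch: average the per-agent actions into one ensemble action
                # shape: (num_envs, action_space_size)
                actions = torch.mean(torch.stack(actions_list, dim=0), dim=0)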
-
But for amp_humanoid there is an error. Currently there is no on-policy update, and I am just picking the first agent's action as the ensemble action output.
-
Shouldn't this training code for ensembles (note: I have only used actions = actions_list[1]) result in an increasing reward over time, since we are essentially using a single agent of the ensemble, or do I need to explicitly recompute the log-probability? I can see no learning in the simulations using the code below.
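For reference, a sketch of the second TODO (recomputing the log-probability), assuming skrl PPO agents with Gaussian policies. Each agent stores the log-probability of its own sampled action in _current_log_prob during act(), so when the executed action comes from another agent (or from the ensemble), that stored value no longer matches the action actually taken and PPO's importance ratio is computed against the wrong behavior, which can stall learning. Re-evaluating the executed action under each agent's own policy keeps them consistent (_state_preprocessor and _current_log_prob are PPO internals; passing taken_actions to the model returns their log-probability under the current distribution):

                # sketch: recompute each agent's log-prob for the action that was actually executed
                for agent in self.agents:
                    _, log_prob, _ = agent.policy.act({"states": agent._state_preprocessor(states),
                                                       "taken_actions": actions}, role="policy")
                    agent._current_log_prob = log_prob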
-
Hi,
Can you suggest an easy implementation to create a policy ensemble using the skrl library? Say we have 3 agents pre-trained on a certain environment; I want to create a policy that picks the best action from the given 3 and updates the weights of all 3 agents.
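For what it's worth, a minimal usage sketch of the EnsembleTrainer proposed above, assuming env and three PPO agents (agent_1, agent_2, agent_3) have already been set up for the same environment as usual (models, memories, cfg, ...); the checkpoint paths are placeholders:

# hypothetical setup: load the pre-trained weights into each agent
agent_1.load("./runs/agent_1/checkpoints/best_agent.pt")
agent_2.load("./runs/agent_2/checkpoints/best_agent.pt")
agent_3.load("./runs/agent_3/checkpoints/best_agent.pt")

# train the ensemble: each step, the trainer combines the 3 actions into one,
# steps the environment once, and lets every agent record the transition and update
cfg = {"timesteps": 100000, "headless": True}
trainer = EnsembleTrainer(env=env, agents=[agent_1, agent_2, agent_3], cfg=cfg)
trainer.train()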