
The use of linear dynamics #5

Closed
yilin-wu98 opened this issue Sep 13, 2020 · 8 comments

@yilin-wu98

Hi, I am trying to use your code for the pointmass_uwall task, and thank you for making your code public. I have a question about the linear_loss part of the VAE code. In train_vae.py, use_linear_dynamics is set to True but linearity_weight is set to 0, so during VAE training the linear_loss is never actually added to the total loss.

In the pointmass_uwall task, do you change the linearity_weight to some nonzero value in the pre-training of the VAE?

@snasiriany
Owner

Hi Yilin,

In the official implementation of the paper, we use a linearity_weight of 0.0 (so there's no linear dynamics penalty). I have noticed that using linear dynamics (a linearity_weight greater than 0.0) does help produce a smoother latent space, but it can cause "holes" in the latent space that are undesirable. For the best results, I would recommend combining both, to get the best of both worlds:

  • VAE A: uses linear dynamics. This leads to a smoother latent space. Used as the representation for training the policy/Q-function in RL.
  • VAE B: no linear dynamics. This VAE is used for sampling images in the subgoal optimization step.

During the subgoal optimization step, every time you sample images from VAE B (by decoding latents to images with VAE B's decoder), you can then encode them back into the latent space of VAE A (using VAE A's encoder). If this is confusing, let me know and I can elaborate further.
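For concreteness, the decode-with-B, re-encode-with-A step might look like the sketch below. Everything here is hypothetical: the linear maps stand in for the trained decoder/encoder means, and the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the trained networks (84x84 images, 16-dim latents):
W_dec_B = rng.normal(size=(84 * 84, 16)) * 0.01  # VAE B decoder mean: latent -> image
W_enc_A = rng.normal(size=(16, 84 * 84)) * 0.01  # VAE A encoder mean: image -> latent

def decode_with_vae_b(z):
    """Sample an image from VAE B by decoding the latent."""
    return W_dec_B @ z

def encode_with_vae_a(image):
    """Map the image into VAE A's (smoother) latent space for the policy/Q-function."""
    return W_enc_A @ image

# Subgoal optimization step: sample a candidate latent, decode with VAE B,
# then re-encode with VAE A before handing it to the Q-function.
candidate_z = rng.normal(size=(16,))
image = decode_with_vae_b(candidate_z)
subgoal_z = encode_with_vae_a(image)
```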

Alternatively, you could simply use a standard VAE (no linear dynamics) for everything, like we did in the paper.

@yilin-wu98
Author

Thanks for the explanation! I will train a standard VAE first.
Also, I noticed that you train a reprojection network after training the VAE. In the code there are two options: one is to compute encode(decode(z)) to get the reprojected z_hat, and the other is to use the reprojection_network to predict z_hat given z. Why do we need this additional reprojection_network?

@snasiriany
Owner

The reason we use encode(decode(z)) rather than z directly is that the Q function was trained on the means of image encodings. A latent z sampled randomly from a Gaussian might not correspond to the mean of a posterior distribution. To ensure it does, we decode z into an image first, then encode the image to get a posterior distribution, and finally take the mean of that posterior distribution.

The reprojection network is there to make this process faster, since decoding and encoding back can be computationally costly. The reprojection network is trained with supervised learning, with (z, mean(encode(decode(z)))) pairs.
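As a runnable sketch of that slow path and its training pairs (the linear decode/encode stand-ins and the dimensions are hypothetical, just to make the pipeline concrete):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear stand-ins for the trained VAE (64-dim images, 8-dim latents):
W_dec = rng.normal(size=(64, 8)) * 0.1
W_enc = rng.normal(size=(8, 64)) * 0.1

def decode(z):
    return W_dec @ z

def encode_mean(x):
    # Mean of the posterior q(z | x) -- what the Q function was trained on.
    return W_enc @ x

def reproject(z):
    """Slow path: decode z to an image, re-encode, keep the posterior mean."""
    return encode_mean(decode(z))

# Supervised training pairs for the (faster) reprojection network:
zs = rng.normal(size=(200, 8))
targets = np.stack([reproject(z) for z in zs])
```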

We didn't actually use the reprojection network in the paper, and I don't recall how well it works. Our standard implementation doesn't use it.

@yilin-wu98
Author

Thanks! I will follow the standard implementation.

@yilin-wu98
Author

yilin-wu98 commented Sep 18, 2020

Hi Soroush, sorry to bother you again. When I tried to run the code on my Ubuntu 18.04, CUDA 10.2 server, I found the versions incompatible with your code. I therefore created a docker image following the installation guide, building it with your Dockerfile. The pretrain-vae part works fine with the "pm" environment, but when I try to generate data for the "pnr" environment, it returns the error "Found 0 GPUs for rendering ...... Failed to initialize OpenGL". Do you know any solution to this?

Below is the entire error message:

Found 0 GPUs for rendering. Using device 5.
Device id outside of range of available devices.
Traceback (most recent call last):
File "/mounts/target/launchers/run_experiment_from_doodad.py", line 38, in
**run_experiment_kwargs
File "/home/yilin/leap/railrl/launchers/launcher_util.py", line 449, in run_experiment_here
return experiment_function(variant)
File "/home/yilin/leap/railrl/launchers/exp_launcher.py", line 51, in vae_dataset_experiment
generate_vae_dataset_fctn(vae_variant['generate_vae_dataset_kwargs'])
File "/home/yilin/leap/railrl/launchers/vae_exp_launcher_util.py", line 241, in generate_vae_dataset
env = gym.make(env_id)
File "/env/lib/python3.5/site-packages/gym/envs/registration.py", line 167, in make
return registry.make(id)
File "/env/lib/python3.5/site-packages/gym/envs/registration.py", line 119, in make
env = spec.make()
File "/env/lib/python3.5/site-packages/gym/envs/registration.py", line 83, in make
env = self._entry_point()
File "/home/yilin/multiworld/multiworld/envs/mujoco/init.py", line 1669, in create_image_84_sawyer_pnr_arena_train_env_big_v0
reward_type='vectorized_state_distance'
File "/home/yilin/multiworld/multiworld/core/image_env.py", line 73, in init
sim = self._wrapped_env.initialize_camera(init_camera)
File "/home/yilin/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 152, in initialize_camera
viewer = mujoco_py.MjRenderContextOffscreen(sim, device_id=self.device_id)
File "mujoco_py/mjrendercontext.pyx", line 43, in mujoco_py.cymj.MjRenderContext.init
File "mujoco_py/mjrendercontext.pyx", line 108, in mujoco_py.cymj.MjRenderContext._setup_opengl_context
File "mujoco_py/opengl_context.pyx", line 128, in mujoco_py.cymj.OffscreenOpenGLContext.init
RuntimeError: Failed to initialize OpenGL

Additionally, when I test render() in the docker container with env = gym.make(...), both env.render('rgb_array') and env.render() give me the error below.

import gym
env = gym.make('FetchPush-v1')
env.render('rgb_array')
GLFW error (code %d): %s 65544 b'X11: The DISPLAY environment variable is missing'
GLFW error (code %d): %s 65544 b'X11: The DISPLAY environment variable is missing'
Traceback (most recent call last):
File "", line 1, in
File "/env/lib/python3.5/site-packages/gym/core.py", line 284, in render
return self.env.render(mode)
File "/env/lib/python3.5/site-packages/gym/envs/robotics/robot_env.py", line 92, in render
self._get_viewer().render()
File "/env/lib/python3.5/site-packages/gym/envs/robotics/robot_env.py", line 103, in _get_viewer
self.viewer = mujoco_py.MjViewer(self.sim)
File "/env/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 133, in init
super().init(sim)
File "/env/lib/python3.5/site-packages/mujoco_py/mjviewer.py", line 26, in init
super().init(sim)
File "mujoco_py/mjrendercontext.pyx", line 267, in mujoco_py.cymj.MjRenderContextWindow.init
File "mujoco_py/mjrendercontext.pyx", line 43, in mujoco_py.cymj.MjRenderContext.init
File "mujoco_py/mjrendercontext.pyx", line 96, in mujoco_py.cymj.MjRenderContext._setup_opengl_context
File "mujoco_py/opengl_context.pyx", line 44, in mujoco_py.cymj.GlfwContext.init
File "mujoco_py/opengl_context.pyx", line 64, in mujoco_py.cymj.GlfwContext._init_glfw
mujoco_py.cymj.GlfwError: Failed to initialize GLFW

@snasiriany
Owner

A few things:

  • It seems to be an issue with mujoco-py and GLFW. Can you try the solution in openai/mujoco-py#187 (comment) ("Failed to initialize OpenGL") and see if it addresses the issue?
  • It seems like this GLFW issue occurs only with the docker image, and not on your local machine? If the idea I posted above doesn't solve the issue, perhaps you can generate the VAE dataset without the docker image, and once you have the dataset, load it within the docker image and try training.
  • Also, I'm concerned that it says "Found 0 GPUs for rendering." Does the GPU work at all? If you try to run a small test job (via the docker image), will it use the GPU at all?

@yilin-wu98
Author

  1. I tried the solution in openai/mujoco-py#187 ("Failed to initialize OpenGL"), but it doesn't work for me. I think the errors we encountered are different, because mine shows 'Found 0 GPUs for rendering' but his doesn't. I also tried pulling the image directly from the hub, and it gives the same error.

  2. Yes, this GLFW issue occurs only in docker, not on the local machine. I wanted to try separating data generation from VAE training, but I found that even though training the VAE doesn't require rendering, the visualization process requires functions in VAEWrappedEnv. And the same error appears in env = gym.make(...) when it initializes the camera. You can see the error message in my previous comment:
    File "/home/yilin/multiworld/multiworld/core/image_env.py", line 73, in init
    sim = self._wrapped_env.initialize_camera(init_camera)
    File "/home/yilin/multiworld/multiworld/envs/mujoco/mujoco_env.py", line 152, in initialize_camera
    viewer = mujoco_py.MjRenderContextOffscreen(sim, device_id=self.device_id)

Is there any way to workaround this?

  3. I think the GPU works, because when I call torch.cuda.device_count(), it returns 8, so PyTorch can find GPUs to use.

@snasiriany
Owner

I'm not immediately sure how to address the issue. I think there's something in the dockerfile (https://github.com/snasiriany/leap/blob/master/docker/Dockerfile) that is incompatible with your hardware.

Since the GPU works, I think the issue is strictly with the rendering software. I think the first step is investigating the error message you posted: GLFW error (code %d): %s 65544 b'X11: The DISPLAY environment variable is missing'. Maybe you could look into this and let me know what you find?

For faster development, I'd recommend running the docker image interactively with docker run -it docker_image_name. While you're in the docker image, you can (1) run short Python scripts that call the render function, (2) install dependencies, and (3) switch back and forth between these two steps until you hopefully find a solution.
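Spelled out, that loop might look like the commands below. docker_image_name is a placeholder, the --gpus all flag is an assumption about your Docker version, and the render test is the one from the earlier comment:

```shell
# Start an interactive shell in the image (--gpus all assumes Docker 19.03+;
# older setups expose GPUs via nvidia-docker instead)
docker run -it --gpus all docker_image_name /bin/bash

# Inside the container: a short render smoke test
python -c "import gym; env = gym.make('FetchPush-v1'); env.render('rgb_array')"

# Adjust dependencies inside the container, then re-run the smoke test
```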
