TPU pod initiation Error #12
bethejulia
started this conversation in
General
Replies: 1 comment
This would be very strange.
Hi,
I created a TPU Pod v2-32 successfully, but running the example from https://cloud.google.com/tpu/docs/jax-pods fails on every worker with the error shown below.
Any guidance on what is going wrong would be a great help.
(I also tried a v3-32, and it fails the same way.)
Thanks in advance,
Thoma
mbctbiofuel@cloudshell:~ (mytpu1)$ gcloud compute tpus tpu-vm ssh node-1 --zone=us-central1-a --worker=all --command="python3 example.py"
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Traceback (most recent call last):
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 435, in backends
backend = _init_backend(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 488, in _init_backend
backend = factory()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 189, in tpu_client_timer_callback
client = xla_client.make_tpu_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 173, in make_tpu_client
return make_tfrt_tpu_c_api_client()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jaxlib/xla_client.py", line 106, in make_tfrt_tpu_c_api_client
return _xla.get_c_api_client('tpu', options)
jaxlib.xla_extension.XlaRuntimeError: ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "example.py", line 5, in <module>
device_count = jax.device_count()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 564, in device_count
return int(get_backend(backend).device_count())
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 533, in get_backend
return _get_backend_uncached(platform)
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 514, in _get_backend_uncached
bs = backends()
File "/home/mbctbiofuel/.local/lib/python3.8/site-packages/jax/_src/xla_bridge.py", line 452, in backends
raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'tpu': ABORTED: The TPU is already in use by another process probably owned by another user. Run "$ sudo lsof -w /dev/accel0" to figure out which process is using the TPU. If you still get this message, run "$ sudo rm /tmp/libtpu_lockfile". (set JAX_PLATFORMS='' to automatically choose an available backend)
Command execution on worker 2 failed with exit status 1. Continuing.
Command execution on worker 1 failed with exit status 1. Continuing.
Command execution on worker 3 failed with exit status 1. Continuing.
Command execution on worker 0 failed with exit status 1. Continuing.
mbctbiofuel@cloudshell:~ (mytpu1)$ ls /dev/accel*
ls: cannot access '/dev/accel*': No such file or directory
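Note that the `ls /dev/accel*` above appears to have been run in Cloud Shell (the prompt reads `cloudshell`), not on a TPU VM worker, so it says nothing about the device on the pod. A minimal sketch of how the check suggested by the error message could be sent to every worker instead, reusing the same `gcloud ... ssh --worker=all --command=...` form as the failing command. The helper name `build_worker_command` is hypothetical, and the script only builds and prints the command rather than running it:

```python
import shlex


def build_worker_command(node, zone, remote_command, worker="all"):
    """Build the argv list for running a command on TPU VM workers.

    Hypothetical helper: it mirrors the gcloud invocation used in this post
    and does not execute anything.
    """
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", node,
        f"--zone={zone}",
        f"--worker={worker}",
        f"--command={remote_command}",
    ]


# The check quoted in the error message: which process holds the TPU device?
check = build_worker_command("node-1", "us-central1-a", "sudo lsof -w /dev/accel0")

# Print a copy-pasteable, shell-quoted command line.
print(shlex.join(check))
```

If `lsof` reports a leftover Python process on a worker, that process would presumably be the one holding the TPU. Removing `/tmp/libtpu_lockfile`, as the error message suggests, could be sent to the workers the same way.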