fail to run tensorflow example in https://volcano.sh/en/docs/tf_on_volcano/ #2653

Closed
shuaiyy opened this issue Jan 17, 2023 · 1 comment · Fixed by volcano-sh/website#293
Labels
kind/bug Categorizes issue or PR as related to a bug.

shuaiyy commented Jan 17, 2023

What happened:
I ran the demo at https://volcano.sh/en/docs/tf_on_volcano/ and the pods failed with the errors below.

In ps and worker 0:

job name = ps
task index = 0
Traceback (most recent call last):
  File "/var/tf_dist_mnist/dist_mnist.py", line 303, in <module>
    tf.app.run()
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/var/tf_dist_mnist/dist_mnist.py", line 144, in main
    cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Could not parse port for local server from ""

In worker 1:

job name = ps
task index = 1
Traceback (most recent call last):
  File "/var/tf_dist_mnist/dist_mnist.py", line 303, in <module>
    tf.app.run()
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/var/tf_dist_mnist/dist_mnist.py", line 144, in main
    cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/training/server_lib.py", line 147, in __init__
    self._server_def.SerializeToString(), status)
  File "/opt/conda/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Task 1 was not defined in job "ps"
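
Both messages are consistent with the host lists handed to the script being empty or shorter than expected: an empty address for the local task cannot yield a port, and task index 1 does not exist if the job lists only one host. The following is a minimal sketch of that kind of check in plain Python, not TensorFlow's actual implementation. The helper names `parse_hosts` and `check_task` and the host name `ps-0:2222` are hypothetical and used only for illustration.

```python
def parse_hosts(spec):
    """Split a comma-separated "host:port" list into (host, port) pairs.

    Hypothetical helper: it mimics the kind of validation that rejects an
    empty address, but it is not TensorFlow code.
    """
    parsed = []
    for entry in spec.split(","):
        host, sep, port = entry.strip().rpartition(":")
        # An empty entry has no ":" separator and therefore no port.
        if not sep or not host or not port.isdigit():
            raise ValueError("could not parse port from %r" % entry)
        parsed.append((host, int(port)))
    return parsed


def check_task(cluster, job_name, task_index):
    """Return the (host, port) for a task, or raise if it is not defined."""
    if job_name not in cluster:
        raise ValueError("job %r is not defined" % job_name)
    hosts = parse_hosts(cluster[job_name])
    if not 0 <= task_index < len(hosts):
        raise ValueError(
            "task %d is not defined in job %r" % (task_index, job_name))
    return hosts[task_index]
```

With this sketch, `check_task({"ps": ""}, "ps", 0)` fails on the port check, and `check_task({"ps": "ps-0:2222"}, "ps", 1)` fails because the job has only one task. Running a similar check on the host lists inside each pod may help confirm whether they are populated as expected.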

What you expected to happen:

The demo in the docs should run successfully.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version: installed from the master branch, images with the latest tag
  • Kubernetes version (use kubectl version): 1.18.8
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): CentOS-7
  • Kernel (e.g. uname -a): Linux emr-header-1.cluster-337861 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubectl apply -f
  • Others:

hwdef (Member) commented Jan 17, 2023

This is a documentation error; the documentation will be updated later.
