This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

examples/TfFeedForward.py does not run correctly #179

Open
easyfan327 opened this issue Dec 4, 2019 · 2 comments

Comments

@easyfan327

  1. Add TfFeedForward.py as a model.
  2. Start a new train job.
  3. The new train job is labeled STARTED but never proceeds to RUNNING.

P.S. I ran bash scripts/setup_node.sh to enable GPU support.

@easyfan327
Author

Worker logs for reference:
Traceback (most recent call last):
  File "/root/rafiki/utils/service.py", line 50, in run_worker
    start_worker(service_id, service_type, container_id)
  File "scripts/start_worker.py", line 40, in start_worker
    worker.start()
  File "/root/rafiki/worker/train.py", line 56, in start
    self._monitor.pull_job_info()
  File "/root/rafiki/worker/train.py", line 257, in pull_job_info
    self.model_class = load_model_class(model.model_file_bytes, model.model_class)
  File "/root/rafiki/model/utils.py", line 51, in load_model_class
    raise InvalidModelClassError(e)
rafiki.model.utils.InvalidModelClassError: Traceback (most recent call last):
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/local/envs/rafiki/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/local/envs/rafiki/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short
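
The last line of the traceback points at /usr/lib/x86_64-linux-gnu/libcuda.so.1 rather than at TfFeedForward.py itself. The dynamic loader typically reports "file too short" when the file exists but is empty or truncated. Below is a minimal sketch for checking that from Python inside the worker container. The helper name check_shared_library is hypothetical and not part of rafiki; it only uses the standard library.

```python
import ctypes
import os


def check_shared_library(path):
    """Report whether a shared library file exists, its size, and whether it loads.

    Hypothetical diagnostic helper, not part of rafiki or TensorFlow.
    """
    report = {
        "path": path,
        "exists": os.path.isfile(path),  # follows symlinks; False for a broken link
        "size": None,
        "loadable": False,
        "error": None,
    }
    if not report["exists"]:
        report["error"] = "file not found"
        return report

    report["size"] = os.path.getsize(path)
    try:
        # Ask the dynamic loader to open the file, as the TensorFlow import would.
        ctypes.CDLL(path)
        report["loadable"] = True
    except OSError as exc:
        # An empty or truncated file ends up here with the loader's message.
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    # Path taken from the ImportError in the worker log above.
    print(check_shared_library("/usr/lib/x86_64-linux-gnu/libcuda.so.1"))
```

If the reported size is 0 or the load fails, the problem most likely lies in how the GPU driver library is exposed to the container, not in the model file.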

@pinpom
Contributor

pinpom commented Dec 6, 2019

Hi @easyfan327, rafiki has been upgraded to version 0.2.0, so we recommend installing the latest version from nginyc/rafiki/master. Before installing it, please delete any old rafiki instances (including Docker images and containers) remaining on your machine.
When running rafiki on GPU, also remember to add 'GPU_COUNT': 1 to the budget when you create a train job (see the latest docs: https://nginyc.github.io/rafiki/docs/0.2.0/src/python/rafiki.client.html#rafiki.client.Client.create_train_job).
For example:
    client.create_train_job(
        app='fashion_mnist_app',
        task='IMAGE_CLASSIFICATION',
        train_dataset_id='70efcbf6-b576-44d0-83b7-fd93e8ee03d3',
        val_dataset_id='9c28d97a-3d08-4903-b217-1169a13e5d6a',
        budget={'MODEL_TRIAL_COUNT': 5, 'GPU_COUNT': 1},
        models=['b67f3017-8f37-45cc-a7c5-a3f8912ac72e']
    )
I had no problems running this model. For your reference, the code is attached. Please try again and let me know if there are any issues.
