This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

Random uuid name cause potential "No module named xxx" error during load parameters #159

Open
vivansxu opened this issue Aug 2, 2019 · 3 comments
Labels
bug Something isn't working

Comments


vivansxu commented Aug 2, 2019

I encountered a "No module named xxx" error when load_parameters was called on my model while launching an inference job. Here is the error trace:

2019-07-10 02:21:07,256 rafiki.utils.service INFO Starting worker "75be99ec25a6" for service of ID "614d740e-9791-4c64-aafe-dc17cf7e7866"...
2019-07-10 02:21:07,511 rafiki.worker.inference INFO Starting inference worker for service of id 614d740e-9791-4c64-aafe-dc17cf7e7866...
2019-07-10 02:21:07,519 rafiki.cache.cache INFO add_worker_of_inference_job:INFERENCE_WORKERS_b6592484-deb4-4df2-bce3-ffc82d9a125a=614d740e-9791-4c64-aafe-dc17cf7e7866
2019-07-10 02:21:09,131 rafiki.utils.service ERROR Error while running worker:
2019-07-10 02:21:09,131 rafiki.utils.service ERROR Traceback (most recent call last):
  File "/root/rafiki/utils/service.py", line 31, in run_worker
    start_worker(service_id, service_type, container_id)
  File "scripts/start_worker.py", line 24, in start_worker
    worker.start()
  File "/root/rafiki/worker/inference.py", line 41, in start
    self._model = self._load_model(trial_id)
  File "/root/rafiki/worker/inference.py", line 91, in _load_model
    model_inst.load_parameters(parameters)
  File "/root/e4568ce2-9d44-47b8-ac7f-1e8143168140.py", line 235, in load_parameters
ModuleNotFoundError: No module named '797342b4-9d38-432f-91f6-727eac25db71'

After debugging, I believe this is a bug in how Rafiki interacts with pickle. It is triggered by pickling objects of user-defined classes (classes defined in the model source code).
pickle requires the pickled object's class to be importable during pickle.loads(), using the same import path that was recorded during pickle.dumps(). However, each time a train trial or inference job is launched, the model source file is given a random UUID as its file name. As a result, the import path at load time differs from the one recorded at dump time.
This bug has not surfaced so far because the existing models in Rafiki only pickle objects of imported classes or Python "primitives", whose import paths are consistent.
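The mechanism described above can be reproduced outside Rafiki. The following is a minimal sketch, not Rafiki's actual loader: `MODEL_SOURCE`, `load_module_from_source` and `reproduce` are hypothetical names, and it assumes the model file is imported under a module name equal to its random UUID file name, as the traceback suggests.

```python
import importlib.util
import pickle
import sys
import tempfile
import uuid
from pathlib import Path

# Stand-in for a model file that defines its own class (hypothetical).
MODEL_SOURCE = (
    "class Network:\n"
    "    def __init__(self, weights):\n"
    "        self.weights = weights\n"
)


def load_module_from_source(source, mod_name, directory):
    """Write `source` to <directory>/<mod_name>.py and import it as `mod_name`."""
    path = Path(directory) / f"{mod_name}.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(mod_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[mod_name] = module
    spec.loader.exec_module(module)
    return module


def reproduce():
    """Pickle under one random module name, unpickle under another."""
    with tempfile.TemporaryDirectory() as directory:
        # "Train job": the class's __module__ becomes the random UUID.
        train_name = str(uuid.uuid4())
        train_mod = load_module_from_source(MODEL_SOURCE, train_name, directory)
        payload = pickle.dumps(train_mod.Network([1, 2, 3]))

        # Simulate a fresh process: the training-time module is gone.
        del sys.modules[train_name]

        # "Inference job": same source, but a new random module name.
        infer_name = str(uuid.uuid4())
        load_module_from_source(MODEL_SOURCE, infer_name, directory)
        try:
            pickle.loads(payload)
        except ModuleNotFoundError as exc:
            return train_name, str(exc)
        finally:
            del sys.modules[infer_name]
    return train_name, None


if __name__ == "__main__":
    name, error = reproduce()
    print(error)
```

The pickle payload stores only the module name and class name, so unpickling tries to import the training-time UUID, which no longer exists.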
Potential fixes for this bug could be:

  1. Replace the randomly generated file name with a hash of something stable (e.g. model name + trial ID), and use the same hashing scheme for both the train job and the inference job.
  2. Remember the name generated during the train job and reuse it during the inference job. (Model.load_model_class does take a third parameter, "temp_mod_name", but it is never used except in "test_model_class".)
  3. Change the way the model source file is imported. (Not sure)
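Option 1 could look roughly like the sketch below. It is only an illustration under stated assumptions: `stable_module_name`, `load_module_from_source`, `round_trip` and the `model_` prefix are hypothetical, the choice of SHA-256 is arbitrary, and it assumes both the train job and the inference job know the model name and trial ID.

```python
import hashlib
import importlib.util
import pickle
import sys
import tempfile
from pathlib import Path

# Stand-in for a model file that defines its own class (hypothetical).
MODEL_SOURCE = (
    "class Network:\n"
    "    def __init__(self, weights):\n"
    "        self.weights = weights\n"
)


def stable_module_name(model_name, trial_id):
    """Derive the same module name from the same (model name, trial ID) pair."""
    key = f"{model_name}:{trial_id}".encode("utf-8")
    return "model_" + hashlib.sha256(key).hexdigest()[:32]


def load_module_from_source(source, mod_name, directory):
    """Write `source` to <directory>/<mod_name>.py and import it as `mod_name`."""
    path = Path(directory) / f"{mod_name}.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(mod_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[mod_name] = module
    spec.loader.exec_module(module)
    return module


def round_trip():
    """Dump in a simulated train job, load in a simulated inference job."""
    name = stable_module_name("MyGan", "trial-1")
    with tempfile.TemporaryDirectory() as directory:
        train_mod = load_module_from_source(MODEL_SOURCE, name, directory)
        payload = pickle.dumps(train_mod.Network([1, 2, 3]))

        # Simulate a fresh process, then re-import under the same stable name.
        del sys.modules[name]
        infer_mod = load_module_from_source(MODEL_SOURCE, name, directory)

        restored = pickle.loads(payload)
        assert isinstance(restored, infer_mod.Network)
        del sys.modules[name]
        return restored.weights


if __name__ == "__main__":
    print(round_trip())
```

Because the module name depends only on the model name and trial ID, the import path recorded by pickle.dumps() resolves again when the inference job calls pickle.loads().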

Thank you!

Owner

nginyc commented Aug 5, 2019

Hi @vivansxu, thanks for the bug report. I got the gist of the bug. To understand it better, could you provide the implementation of your model (or a description of it), specifically the load_parameters method?

nginyc added the bug (Something isn't working) label on Aug 5, 2019
Author

vivansxu commented Aug 6, 2019

Hi @nginyc, thanks for your reply. Here are my dump_parameters() and load_parameters() methods:

import base64
import pickle
import tempfile

def dump_parameters(self):
    params = {}
    with tempfile.NamedTemporaryFile() as tmp:
        pickle.dump((self.G, self.D, self.Gs), tmp, protocol=pickle.HIGHEST_PROTOCOL)
        tmp.flush()  # push buffered bytes to disk before re-reading the file by name
        with open(tmp.name, 'rb') as f:
            h5_model_bytes = f.read()
        params['h5_model_base64'] = base64.b64encode(h5_model_bytes).decode('utf-8')
    return params

def load_parameters(self, params):
    h5_model_base64 = params.get('h5_model_base64')
    with tempfile.NamedTemporaryFile() as tmp:
        h5_model_bytes = base64.b64decode(h5_model_base64.encode('utf-8'))
        with open(tmp.name, 'wb') as f:
            f.write(h5_model_bytes)
        # Unpickling imports the module name recorded at dump time; this is where the error is raised
        unpickler = pickle.Unpickler(tmp)
        self.G, self.D, self.Gs = unpickler.load()

self.G, self.D and self.Gs are all Network objects, where Network is a class I defined in my model file.

Thank you!

Owner

nginyc commented Aug 6, 2019

OK, I see. Do you want to try making a PR to fix this? It seems like you may already be onto a fix. I would consider option 1.

Thanks for the help!
