
use mp_context=multiprocessing.get_context("spawn") in ProcessPoolExecutor will crash #126

Closed
qindazhu opened this issue Nov 13, 2020 · 15 comments
Labels
bug Something isn't working

Comments

@qindazhu

With this PR (k2-fsa/snowfall#5), I get the error below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ceph-hw/snowfall/egs/librispeech/asr/simple_v1/prepare.py", line 47, in <module>
    cut_set = CutSet.from_manifests(
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1319, in compute_and_store_features
    executor.submit(
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.8/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 645, in submit
    self._start_queue_management_thread()
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.8/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 584, in _start_queue_management_thread
    self._adjust_process_count()
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.8/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 608, in _adjust_process_count
Traceback (most recent call last):
  File "./prepare.py", line 47, in <module>
    cut_set = CutSet.from_manifests(
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in compute_and_store_features
    cut_set = CutSet.from_cuts(f.result() for f in futures)
  File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in from_cuts
    return CutSet({cut.id: cut for cut in cuts})
  File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in <dictcomp>
    return CutSet({cut.id: cut for cut in cuts})
  File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in <genexpr>
    cut_set = CutSet.from_cuts(f.result() for f in futures)
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.8/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/linuxbrew/.linuxbrew/Cellar/python@3.8/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

BTW, since we pass spawn, it starts a fresh Python interpreter process for each worker, which then prints these logs many times over:
https://github.com/k2-fsa/snowfall/blob/7201fdebd18231df4c3a6a4c198e1d0a7d7c7d22/egs/librispeech/asr/simple_v1/prepare.py#L17-L21
This is a little annoying; it would be great if you could fix that together with the error above, but it's not urgent.
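To make the failure mode concrete, here is a minimal sketch of the pattern involved (illustrative only; the function names and sizes are made up, this is not the actual lhotse/snowfall code): a ProcessPoolExecutor created with the "spawn" context, with work submitted from unguarded module-level code.

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def compute_features(i):
    return i * i  # placeholder for the real feature-extraction work

# Unguarded module-level code: when run as a script, each spawned worker
# re-executes this module during bootstrap, tries to build its own pool,
# and dies, which the parent reports as BrokenProcessPool, as in the
# traceback above.
executor = ProcessPoolExecutor(
    max_workers=4,
    mp_context=multiprocessing.get_context("spawn"),
)
futures = [executor.submit(compute_features, i) for i in range(10)]
print([f.result() for f in futures])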

@pzelasko
Collaborator

pzelasko commented Nov 13, 2020 via email

@mthrok

mthrok commented Nov 13, 2020

@pzelasko

Further investigating pytorch/audio#1021 and following up on your comment about OpenMP, I disabled OpenMP at sox compilation time (pytorch/audio#1026), and the test seems to get unstuck without using a multiprocessing context with spawn. I am still not sure whether this works on other OSes too, and I still have to talk with the team, but if it does, we might be able to fix it on the torchaudio side.

@pzelasko
Collaborator

Thanks @mthrok - let me know when the torchaudio conda/pip packages have the fix; I will then revert the "spawn" thing. Anyway, I expect that the Dask executor is immune to this issue.

@qindazhu
Author

@pzelasko, with distributed, I get the runtime error below.

File "/ceph-hw/.local/lib/python3.8/site-packages/distributed/process.py", line 33, in _call_and_set_future
    res = func(*args, **kwargs)
  File "/ceph-hw/.local/lib/python3.8/site-packages/distributed/process.py", line 203, in _start
    process.start()
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/home/linuxbrew/.linuxbrew/opt/python@3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

@pzelasko
Collaborator

Ha, it could actually be the same reason "spawn" didn't work for you... Could you wrap the script's code into a function (e.g. def main():) and add the following at the end of the script:

if __name__ == '__main__': 
    main()

That should solve these issues. Basically I think the problem is that the new Python process executes the whole script again while initializing, and the if __name__ == '__main__' idiom prevents that.
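For illustration, here is a sketch of that layout (names are illustrative, not the exact prepare.py code):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def compute_features(i):
    return i * i  # placeholder for the real feature-extraction work

def main():
    # All work lives inside main(), so a spawned worker that re-imports this
    # module during bootstrap does not re-create the pool or re-submit jobs.
    with ProcessPoolExecutor(
        max_workers=4,
        mp_context=multiprocessing.get_context("spawn"),
    ) as executor:
        futures = [executor.submit(compute_features, i) for i in range(10)]
        print([f.result() for f in futures])

if __name__ == '__main__':
    main()

Note that module-level statements (e.g. the logging lines linked above) still run when each spawned worker re-imports the script, so moving them into main() should also get rid of the duplicated logs.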

@qindazhu
Author

@pzelasko I tried this yesterday, and it runs successfully now, many thanks! (And sorry for forgetting to tell you this yesterday, as we were busy.)
However, there is an issue now: the script in k2-fsa/snowfall#11 takes much longer to prepare train-clean-100 of LibriSpeech (more than 1 hour). I wonder what we do in augmentation now, as I don't think preparation should take this long (before we added augmentation, it took only about 10-15 minutes to prepare train-clean-100).

@danpovey
Collaborator

danpovey commented Nov 16, 2020 via email

@jimbozhang
Contributor

Haowen, can you please make a PR for this fix?

I'm doing this.

@qindazhu
Author

qindazhu commented Nov 16, 2020

Thanks @jimbozhang. I'm also wondering how long it takes you to prepare train-clean-100 with the latest scripts, just to make sure it's not a local issue on my side.

@jimbozhang
Contributor

jimbozhang commented Nov 16, 2020

Thanks @jimbozhang. I'm also wondering how long it takes you to prepare train-clean-100 with the latest scripts, just to make sure it's not a local issue on my side.

I just started running it on our shared machine (IP: 10.**.*.72) about 10 minutes ago. I'll let you know when the preparation finishes.

@qindazhu
Author

OK, thanks

@pzelasko
Collaborator

Since it helped, I'm closing the issue.

@pzelasko pzelasko added the bug Something isn't working label Nov 16, 2020
@mthrok

mthrok commented Nov 16, 2020

@qindazhu

Just to clarify: with the "spawn" method, you are not facing a crash, but you are experiencing the slowdown, right?

@pzelasko
Collaborator

@mthrok they fixed the slowdown; it was a matter of setting torch's num threads and interop num threads to 1.

@qindazhu
Author

@mthrok they fixed the slowdown; it was a matter of setting torch's num threads and interop num threads to 1.

Yes, see @danpovey's experiment here: k2-fsa/snowfall#18. You can just check the latest prepare.py in snowfall.
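For reference, the kind of change being described is roughly the following (a sketch only; the latest prepare.py in snowfall is the authoritative version):

import torch

# Keep each feature-extraction worker single-threaded so the worker
# processes do not oversubscribe the CPU and slow each other down.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)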
