use mp_context=multiprocessing.get_context("spawn") in ProcessPoolExecutor will crash #126
Comments
Ugh, I think I am going to drop this parallelism method altogether. Can you try to use dask instead?

pip install dask distributed

from distributed import Client
with Client() as executor:
    ...

I use it on CLSP to distribute the jobs on the grid, but it also supports local execution, which is used in the example above. Let me know if that solves the issues.
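A minimal sketch of using dask's distributed Client as a drop-in replacement for ProcessPoolExecutor: Client exposes submit() and returns futures with a result() method, so executor-style code needs little change. The example below is illustrative only, not the actual lhotse call:

from distributed import Client

def square(x):
    return x * x

if __name__ == "__main__":
    # Client() with no arguments starts a local cluster (scheduler + workers)
    with Client() as executor:
        futures = [executor.submit(square, i) for i in range(10)]
        print([f.result() for f in futures])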
On Fri, 13 Nov 2020 at 08:00, Haowen Qiu <notifications@github.com> wrote:
… With this PR k2-fsa/snowfall#5, I will get this error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File ***@***.***/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File ***@***.***/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File ***@***.***/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File ***@***.***/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File ***@***.***/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File ***@***.***/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File ***@***.***/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/ceph-hw/snowfall/egs/librispeech/asr/simple_v1/prepare.py", line 47, in <module>
cut_set = CutSet.from_manifests(
File "/ceph-hw/lhotse/lhotse/cut.py", line 1319, in compute_and_store_features
executor.submit(
File ***@***.***/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 645, in submit
self._start_queue_management_thread()
File ***@***.***/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 584, in _start_queue_management_thread
self._adjust_process_count()
File ***@***.***/3.8.6_1/lib/python3.8/concurrent/futures/process.py", line 608, in _adjust_process_count
Traceback (most recent call last):
File "./prepare.py", line 47, in <module>
cut_set = CutSet.from_manifests(
File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in compute_and_store_features
cut_set = CutSet.from_cuts(f.result() for f in futures)
File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in from_cuts
return CutSet({cut.id: cut for cut in cuts})
File "/ceph-hw/lhotse/lhotse/cut.py", line 989, in <dictcomp>
return CutSet({cut.id: cut for cut in cuts})
File "/ceph-hw/lhotse/lhotse/cut.py", line 1328, in <genexpr>
cut_set = CutSet.from_cuts(f.result() for f in futures)
File ***@***.***/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File ***@***.***/3.8.6_1/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
***@***.***/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
BTW, as we pass spawn, it starts a fresh python interpreter process and then outputs too many duplicate logs
https://github.com/k2-fsa/snowfall/blob/7201fdebd18231df4c3a6a4c198e1d0a7d7c7d22/egs/librispeech/asr/simple_v1/prepare.py#L17-L21
which is a little bit annoying. It would be great if you could fix that together with the error above, but it's not urgent.
Further investigating pytorch/audio#1021 and following up on your comment about OpenMP, I disabled OpenMP at sox compilation time in pytorch/audio#1026, and the test seems to get unstuck without using the multiprocessing context with spawn. I am still not sure if this works on other OSes too, and I still have to talk with the team, but if this works, we might be able to fix it on the torchaudio side.
Thanks @mthrok - let me know when torchaudio conda/pip packages have the fix, I will then revert the "spawn" thing. Anyway, I expect that the Dask executor is immune to this issue.
@pzelasko, with …
Ha, it could actually be the same reason "spawn" didn't work for you... Could you wrap the script's code into a function (called e.g. main) and add:

if __name__ == '__main__':
    main()

That should solve these issues. Basically I think the problem is that the new Python process executes the whole script again while initializing, so the top-level code ends up running once more in every worker process.
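A minimal sketch of the suggested restructuring, assuming a prepare.py-style script; the functions and contents below are hypothetical and only illustrate the __main__ guard, not the actual snowfall code:

import logging
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

def work(i):
    return i * i

def main():
    # All top-level side effects (logging setup, data preparation, executor
    # creation) live here, so a spawned child can import the module safely.
    logging.basicConfig(level=logging.INFO)
    with ProcessPoolExecutor(
        max_workers=4,
        mp_context=multiprocessing.get_context("spawn"),
    ) as executor:
        futures = [executor.submit(work, i) for i in range(8)]
        logging.info("results: %s", [f.result() for f in futures])

if __name__ == "__main__":
    main()  # runs only in the parent; spawned children just import the module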
@pzelasko I tried this yesterday, it can run successfully now, many thanks! (and sorry for forgetting to tell you this yesterday as we were busy).
Haowen, can you please make a PR for this fix?
…On Mon, Nov 16, 2020 at 11:28 AM Haowen Qiu ***@***.***> wrote:
@pzelasko <https://github.com/pzelasko> I tried this yesterday, it can
run successfully now, many thanks! (and sorry for forgetting tell you this
yesterday as we were busy).
However, there is an issue now: with the scripts in k2-fsa/snowfall#11 it takes much longer to prepare train-clean-100 of LibriSpeech (more than 1 hour). I wonder what we do now in augmentation, as I was thinking it shouldn't take so long to prepare (before we added augmentation, it only took about 10-15 minutes to prepare train-clean-100).
I'm doing this.
Thanks @jimbozhang, I'm also wondering how long it takes you now to prepare train-clean-100 with the latest scripts, just to make sure it's not my local issue.
I just ran it on our shared machine (ip: 10.**.*.72) 10 minutes ago. I'll let you know when the preparation finishes.
OK, thanks
Since it helped, I'm closing the issue.
Just for clarification: with the "spawn" method, you are not facing a crash but you are experiencing the slowdown, right?
@mthrok they fixed the slowdown; it was about setting torch's num threads and interop num threads to 1.
Yes, see @danpovey's experiment here: k2-fsa/snowfall#18. You can just check the latest prepare.py in snowfall.
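For reference, a minimal sketch of the thread-count fix mentioned above; where exactly these calls go in snowfall's prepare.py may differ, so treat this as an illustration rather than the actual script:

import torch

# Call these early, before any parallel work starts, so each worker process
# uses a single intra-op and inter-op thread and does not oversubscribe the CPU.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)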