This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
This issue may be caused by a stream that is used to read files while a trial is running and is not closed when the trial job finishes. The stream handling is fixed in #885.
I ran experiments in the same environment, with both the git master and 0.6 release. The problem still happens at ~4000 trials.
The number of open files steadily increases as an experiment runs. The program is keeping the metrics file of every trial open ("lsof | grep node" shows thousands of lines like this: /experiments/{experiment ID}/trials/{trial ID}/.nni/metrics). When it reaches the number of open files limit set by the OS, the error happens again.
Hi, this error is fixed in #1189, and I've verified that the fix works in my environment. NNI used the destroy() function to close the file stream, which sometimes does not work, so I've switched to a different way of closing the stream.
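For readers hitting the same EMFILE error in their own Node.js code, here is a minimal sketch of one way to read a growing metrics file without leaving a descriptor open. It is not the change made in #1189; the helper name readNewMetrics, its offset-based interface, and the temporary demo file are all assumptions made for illustration. The idea is simply to open the file, read the bytes appended since the last read, and close the descriptor in a finally block so it is released even if the read throws.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper, not part of NNI: return the text appended to a file
// since `offset`, and always release the file descriptor before returning.
function readNewMetrics(filePath: string, offset: number): { text: string; nextOffset: number } {
    const fd: number = fs.openSync(filePath, 'r');
    try {
        const size: number = fs.fstatSync(fd).size;
        const length: number = Math.max(0, size - offset);
        const buffer: Buffer = Buffer.alloc(length);
        // Read `length` bytes starting at file position `offset`.
        const bytesRead: number = length > 0 ? fs.readSync(fd, buffer, 0, length, offset) : 0;
        return { text: buffer.toString('utf8', 0, bytesRead), nextOffset: offset + bytesRead };
    } finally {
        // Runs on success and on error, so the descriptor never outlives the call.
        fs.closeSync(fd);
    }
}

// Demo on a throwaway file (the path is illustrative, not an NNI trial directory).
const demoDir: string = fs.mkdtempSync(path.join(os.tmpdir(), 'metrics-demo-'));
const demoFile: string = path.join(demoDir, 'metrics');
fs.writeFileSync(demoFile, 'first\n');
const firstRead = readNewMetrics(demoFile, 0);
fs.appendFileSync(demoFile, 'second\n');
const secondRead = readNewMetrics(demoFile, firstRead.nextOffset);
console.log(JSON.stringify(firstRead.text), JSON.stringify(secondRead.text));
```

Because each call opens and closes its own descriptor, the number of open files stays constant no matter how many trials have run. The trade-off is an extra open/close per poll, which is usually negligible next to hitting the OS open-file limit.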
Short summary about the issue/question:
Long-running jobs result in an EMFILE error at around trial number 4000. It seems to be due to the node process leaking open files.
lsof | grep node | wc -l
lsof | grep node
There are thousands of lines similar to this:
Brief what process you are following:
How to reproduce it:
nni Environment:
Anything else we need to know: