This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Training service error: spawn /bin/sh EMFILE #867

Closed
BabakAp opened this issue Mar 18, 2019 · 4 comments


@BabakAp

BabakAp commented Mar 18, 2019

Short summary about the issue/question:
Long-running experiments fail with an EMFILE error at around trial number 4000. The cause appears to be the node process leaking open file descriptors.

lsof | grep node | wc -l

44968

lsof | grep node
There are thousands of lines similar to this:

node 15731 user 106r REG 8,1 0 4457298 /home/user/nni/experiments/TPuECjPB/trials/FTP4I/.nni/metrics
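A narrower check than `lsof | grep node | wc -l` can help confirm which files are leaking, since that pipeline counts every lsof row containing the text "node", including non-descriptor rows such as cwd, txt, and mem entries. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` is readable; the function names `openFileTargets` and `countMatching` are hypothetical and not part of NNI.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper (not NNI code): list the targets of every open
// descriptor of a process by reading the symlinks in /proc/<pid>/fd.
// Linux only; reading another user's process may require elevated rights.
function openFileTargets(pid: number | "self" = "self"): string[] {
  const fdDir = `/proc/${pid}/fd`;
  const targets: string[] = [];
  for (const fd of fs.readdirSync(fdDir)) {
    try {
      targets.push(fs.readlinkSync(path.join(fdDir, fd)));
    } catch {
      // The descriptor was closed between readdir and readlink; skip it.
    }
  }
  return targets;
}

// Count how many open descriptors point at paths ending in `suffix`,
// for example "/.nni/metrics" for the per-trial metrics files.
function countMatching(targets: string[], suffix: string): number {
  return targets.filter((target) => target.endsWith(suffix)).length;
}
```

Calling `countMatching(openFileTargets(15731), "/.nni/metrics")` at intervals during an experiment would show whether the number of open metrics files grows with the number of finished trials (15731 is the PID from the lsof line above).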

Brief what process you are following:

  1. Create experiment, running locally on an Ubuntu machine
  2. Let it run >4000 trials.

How to reproduce it:

nni Environment:

  • nni version: 0.5.2.1
  • nni mode(local|pai|remote): local
  • OS: Ubuntu 16.04.6 LTS
  • python version: 3.6.8
  • is conda or virtualenv used?: conda
  • is running in docker?: No

Anything else we need to know:

@SparkSnail
Contributor

SparkSnail commented Mar 26, 2019

This issue may be caused by the stream used to read files while a trial is running: the stream is not closed when the trial job finishes. The stream handling is fixed in #885.

@BabakAp
Author

BabakAp commented Apr 3, 2019

I ran experiments in the same environment, with both the git master and 0.6 release. The problem still happens at ~4000 trials.
The number of open files increases steadily as an experiment runs. The program keeps the metrics file of every trial open ("lsof | grep node" shows thousands of lines like this: /experiments/{experiment ID}/trials/{trial ID}/.nni/metrics). When the count reaches the open-file limit set by the OS, the error happens again.

@SparkSnail
Contributor

SparkSnail commented Jun 21, 2019

Hi, this error is fixed in #1189, and I've verified that it works in my environment. NNI used the destroy() function to close the file stream, which sometimes does not work, so I've switched to another way of closing the file stream.
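The thread does not say which mechanism #1189 uses, so the following is not that change. It is only a sketch of one general way to avoid a long-lived stream per trial: open the file, read what was appended since the last poll, and close the descriptor before returning. The helper name `readAppended` and the offset-based polling scheme are assumptions made for illustration.

```typescript
import * as fs from "fs";

// Hypothetical helper (not NNI's API): return the text appended to
// `filePath` since byte `offset`, together with the offset for the next poll.
// The descriptor is opened and closed inside the call, so no descriptor
// stays open between polls regardless of how many trials exist.
function readAppended(
  filePath: string,
  offset: number
): { text: string; nextOffset: number } {
  const fd = fs.openSync(filePath, "r");
  try {
    const size = fs.fstatSync(fd).size;
    const length = Math.max(0, size - offset);
    const buffer = Buffer.alloc(length);
    // Read only when there is new data past the last known offset.
    const bytesRead =
      length > 0 ? fs.readSync(fd, buffer, 0, length, offset) : 0;
    return {
      text: buffer.toString("utf8", 0, bytesRead),
      nextOffset: offset + bytesRead,
    };
  } finally {
    // Runs on both success and error, so the descriptor is always released.
    fs.closeSync(fd);
  }
}
```

A caller would keep one `nextOffset` number per trial instead of one open stream per trial. The sketch assumes the file is only ever appended to, and it may split a multi-byte UTF-8 character if a poll lands in the middle of a write.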

@BabakAp
Author

BabakAp commented Jul 7, 2019

I can confirm that with nni 0.9.1 the issue is not reproducible in the same environment. Thank you for the great work!
