
when the search space type is loguniform, is it a problem if the _value min is set too small? #4195

Open
DavideHe opened this issue Sep 18, 2021 · 14 comments

Comments

@DavideHe

DavideHe commented Sep 18, 2021

Describe the issue:
I have an experiment that runs for some time and then stops.
I found that after the experiment stops, ps -aux shows several "[bash] <defunct>" entries in the COMMAND column. Doesn't the parent process clean up these zombie processes?

or
In my search space I use {"_type":"loguniform","_value":[0.000001,0.001]}.
Is the _value min too small for the numerical log calculation, so that the log goes wrong? (I guess a too-small _value may be treated as 0?)

When I set _type to uniform, the experiment runs normally.
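
For reference, my rough understanding is that loguniform is usually sampled by drawing uniformly in log space and then exponentiating, so I am not sure whether a minimum of 1e-6 should really break anything. A minimal sketch of that idea (not NNI's actual code):

import math
import random

def sample_loguniform(low, high):
    # draw uniformly in log space, then exponentiate
    return math.exp(random.uniform(math.log(low), math.log(high)))

# log(1e-6) is only about -13.8, far from any underflow, so samples stay
# inside [1e-6, 1e-3] and never collapse to 0.
print([sample_loguniform(0.000001, 0.001) for _ in range(5)])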

Environment:

  • NNI version: 2.0
  • Training service (local|remote|pai|aml|etc):
  • Client OS:
  • Server OS (for remote mode only):
  • Python version: 3.6
  • PyTorch/TensorFlow version: PyTorch1.2
  • Is conda/virtualenv/ venv used?:
  • Is running in Docker?: Yes

How to reproduce it?:

@DavideHe DavideHe changed the title from "could nni let >2 tasks run on one GPU at the same time?" to "when the search space type is loguniform, is it a problem if the _value min is set too small?" on Sep 18, 2021
@DavideHe
Author

DavideHe commented Sep 22, 2021

I found this is probably because I ran the experiment in Docker; it later ran fine on a real machine.
My guess is that NNI launches a run.sh shell script for every trial, but when the trial ends, nnictl does not clean up the process inside Docker, while it does on the real machine. These zombie processes ("[bash] <defunct>") seem to be counted as still-running trials, so no further trials are started because of the "trialConcurrency" limit.

@Fiascolsy
Contributor

Would you like to share the version of the Docker image with us? I will try it in Docker ASAP.

@DavideHe
Author

Would you like to share the version of the Docker image with us? I will try it in Docker ASAP.

The Docker image cannot be shared because it is not pulled from the public Docker Hub.
But I can provide the Linux version, kernel version, and other key library info:
Linux version 4.9.99 (gcc version 6.3.0 (GCC) ) #1 SMP Mon Oct 21 14:47:51 CST 2019
Ubuntu 18.04.3 LTS
other key libs:
torch 1.2 (cuda 10.0)
torchvision 0.4.0
tensorboard 1.14.0
nni 2.0 and 2.4 (I tried both)
scipy 1.2.1

@Fiascolsy
Contributor

Thank you. Do you run the experiment with a config file like '*.yaml'? If so, please share a copy of it with me : )

@DavideHe
Author

DavideHe commented Sep 22, 2021

Thank you. Do you run the experiment with a config file like '*.yaml'? If so, please share a copy of it with me : )

config.yml, as below:

experimentName: xxxx
trialConcurrency: 2
maxExecDuration: 1d
maxTrialDuration: 1h    ## nni=2.4
maxTrialNum: 1000
#choice: local, remote, pai, kubeflow
trainingServicePlatform: local
searchSpacePath: search_space.json
# logDir: xxxxx/temp/nni-experiments/  
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: minimize
trial:
  command: python train.py
  codeDir: .
  gpuNum: 1
localConfig:
  gpuIndices: "0"
  maxTrialNumPerGpu: 2
  useActiveGpu: true
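
And search_space.json is roughly like this (a simplified sketch written in Python; apart from the loguniform range from my first post, the parameter names are the ones from my experiment and the other ranges here are only for illustration):

import json

# hypothetical, simplified search_space.json for illustration
search_space = {
    "lr": {"_type": "loguniform", "_value": [0.000001, 0.001]},
    "l2reg": {"_type": "loguniform", "_value": [0.000001, 0.001]},
    "smoothness_factor_B": {"_type": "uniform", "_value": [0.0, 1.0]},
    "smoothness_factor_F": {"_type": "uniform", "_value": [0.0, 1.0]},
}

with open("search_space.json", "w") as f:
    json.dump(search_space, f, indent=2)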

@Fiascolsy
Contributor

I ran the experiment in Docker with almost the same environment (GCC 8.4.0 and some newer dependencies, because some of the older versions were hard to find), and it worked normally. So I would suggest looking at the log at ./nni-experiments/*/log/dispatcher.log to see what happened. Anyway, I will report this to my team in case of any potential scheduling issues : )

@DavideHe
Author

I ran the experiment in Docker with almost the same environment (GCC 8.4.0 and some newer dependencies, because some of the older versions were hard to find), and it worked normally. So I would suggest looking at the log at ./nni-experiments/*/log/dispatcher.log to see what happened. Anyway, I will report this to my team in case of any potential scheduling issues : )

Thank you very much, I will check the logs as you suggest. If I find the reason, I will share it with you.

@DavideHe
Author

DavideHe commented Sep 27, 2021

I ran the experiment in Docker with almost the same environment (GCC 8.4.0 and some newer dependencies, because some of the older versions were hard to find), and it worked normally. So I would suggest looking at the log at ./nni-experiments/*/log/dispatcher.log to see what happened. Anyway, I will report this to my team in case of any potential scheduling issues : )

I thought of a workaround: schedule a crontab task running "ps -aux|grep -v grep|grep defunct |awk '{print $2}'|xargs -n1 -I {} kill -s 9 {}", so that whenever a zombie PID appears it gets killed, because I think these zombie PIDs make NNI believe those trials are still running.
But when I tried to kill the zombie PIDs it failed, because they are subprocesses of the running "nni_node" process.
I would like to know whether nni_node itself could clean up these zombie processes.
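
As far as I understand, kill -9 cannot remove a <defunct> entry anyway, because the process has already exited; only its parent (here nni_node) can clear it by calling wait() on the child. A minimal sketch of what I mean (plain Python, not NNI code):

import os
import signal
import time

pid = os.fork()
if pid == 0:
    # child exits immediately and becomes a zombie until the parent reaps it
    os._exit(0)

time.sleep(1)                        # the child now shows up as <defunct> in ps
os.kill(pid, signal.SIGKILL)         # no effect: the process is already dead

reaped, status = os.waitpid(pid, 0)  # only this, done by the parent, clears it
print("reaped", reaped, "status", status)

Maybe that is also related to why I only see it in Docker: if a parent never waits and later exits, the children are re-parented to PID 1, and a container's PID 1 often does not reap them, while init/systemd on a real machine does.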

@Fiascolsy
Contributor

Did you compare the timestamps between dispatcher.log and nnimanager.log? A normal exit should come with words like "terminated" and "task stopped".

@DavideHe
Author

A normal exit should come with words like "terminated" and "task stopped"

In dispatcher.log and nnimanager.log I did not find the words "terminated" or "task stopped". My NNI version is 2.4.
But,
in nnimanager.log I found "INFO (NNIManager) Trial job OJlVS status changed from WAITING to RUNNING", yet I did not find "Trial job OJlVS status changed from RUNNING to EARLY_STOPPED" or "Trial job OJlVS status changed from RUNNING to SUCCEEDED", which would appear in the log if the trial ran well.

@Fiascolsy
Contributor

It may have crashed while processing. A possible workaround is to use ps -a, find the PID with Python, and then kill it.
Could you show the full contents of these two logs? Since I couldn't reproduce your issue, I'd like to try to find the cause.
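
Something like this rough sketch is what I mean (adjust "train.py" to your real trial command; note it cannot remove entries that are already <defunct>, those have to be reaped by their parent):

import os
import signal
import subprocess

# list all processes with pid and full command, then kill the ones that
# look like stuck trial processes
out = subprocess.check_output(["ps", "-eo", "pid,cmd"], universal_newlines=True)
for line in out.splitlines()[1:]:
    pid_str, _, cmd = line.strip().partition(" ")
    if "train.py" in cmd:
        os.kill(int(pid_str), signal.SIGKILL)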

@DavideHe
Author

It may have crashed while processing. A possible workaround is to use ps -a, find the PID with Python, and then kill it.
Could you show the full contents of these two logs? Since I couldn't reproduce your issue, I'd like to try to find the cause.

I can show some key parts of the logs; duplicated lines are replaced with "..." because the logs are too long.
If you need more details, I will add them.

[nnimanager.log]

[2021-09-26 20:40:22] INFO (NNIDataStore) Datastore initialization done
[2021-09-26 20:40:22] INFO (RestServer) RestServer start
[2021-09-26 20:40:22] INFO (RestServer) RestServer base port is 10001
[2021-09-26 20:40:22] INFO (main) Rest server listening on: http://0.0.0.0:10001
[2021-09-26 20:40:23] INFO (NNIManager) Starting experiment: JWNRVqB5
[2021-09-26 20:40:23] INFO (NNIManager) Setup training service...
[2021-09-26 20:40:23] INFO (LocalTrainingService) Construct local machine training service.
[2021-09-26 20:40:23] INFO (NNIManager) Setup tuner...
[2021-09-26 20:40:23] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2021-09-26 20:40:23] INFO (NNIManager) Add event listeners
[2021-09-26 20:40:23] INFO (LocalTrainingService) Run local machine training service.
[2021-09-26 20:40:24] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2021-09-26 20:40:24] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.12344684162133342, "smoothne
[2021-09-26 20:40:24] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.02964895098933397, "smoothne
[2021-09-26 20:40:28] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 0,
  hyperParameters: {
    value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.12344684162133342, "smoothness_factor_F": 0.0840482602657689, "lr": 0.00856663943203427, "l2reg": 6.65679
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-09-26 20:40:28] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 1,
  hyperParameters: {
    value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.02964895098933397, "smoothness_factor_F": 0.0243258542087993, "lr": 0.31669333939952093, "l2reg": 1.18135
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-09-26 20:40:38] INFO (NNIManager) Trial job ag0xE status changed from WAITING to RUNNING
[2021-09-26 20:40:38] INFO (NNIManager) Trial job N2CvB status changed from WAITING to RUNNING
[2021-09-26 20:55:25] INFO (NNIManager) Trial job N2CvB status changed from RUNNING to SUCCEEDED
[2021-09-26 20:55:25] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.0025011091358313627, "smooth
[2021-09-26 20:55:30] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 2,
  hyperParameters: {
    value: '{"parameter_id": 2, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.0025011091358313627, "smoothness_factor_F": 0.024762317882029608, "lr": 0.007935301172050544, "l2reg": 1.
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}                                                                                                                                                                                                                 
[2021-09-26 20:55:35] INFO (NNIManager) Trial job jSUsi status changed from WAITING to RUNNING
[2021-09-26 20:55:55] INFO (NNIManager) Trial job ag0xE status changed from RUNNING to SUCCEEDED
........
[2021-09-26 21:46:00] INFO (NNIManager) Trial job LTg1n status changed from WAITING to RUNNING
[2021-09-26 21:46:43] INFO (NNIManager) NNIManager received command from dispatcher: KI, "LTg1n"
[2021-09-26 21:46:43] INFO (NNIManager) cancelTrialJob: "LTg1n"
[2021-09-26 21:46:46] INFO (NNIManager) Trial job LTg1n status changed from RUNNING to EARLY_STOPPED
[2021-09-26 21:46:46] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 28, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.2890083732287309, "smoothne
[2021-09-26 21:46:51] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 28,
  hyperParameters: {
    value: '{"parameter_id": 28, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.2890083732287309, "smoothness_factor_F": 0.06516928292602955, "lr": 0.052740274340341654, "l2reg": 0.000
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-09-26 21:46:56] INFO (NNIManager) Trial job Rt0gb status changed from WAITING to RUNNING
[2021-09-26 21:47:33] INFO (NNIManager) NNIManager received command from dispatcher: KI, "Rt0gb"
[2021-09-26 21:47:33] INFO (NNIManager) cancelTrialJob: "Rt0gb"
[2021-09-26 21:47:36] INFO (NNIManager) Trial job Rt0gb status changed from RUNNING to EARLY_STOPPED
[2021-09-26 21:47:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 29, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.38436087286203074, "smoothn
[2021-09-26 21:47:41] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 29,
  hyperParameters: {
    value: '{"parameter_id": 29, "parameter_source": "algorithm", "parameters": {"smoothness_factor_B": 0.38436087286203074, "smoothness_factor_F": 0.05798365321353835, "lr": 0.006745436289733636, "l2reg": 0.00
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2021-09-26 21:47:46] INFO (NNIManager) Trial job OJlVS status changed from WAITING to RUNNING

[dispatcher.log]

[2021-09-26 20:40:24] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2021-09-26 20:40:24] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001402 seconds
[2021-09-26 20:40:24] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials                 
[2021-09-26 20:40:24] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002151 seconds
[2021-09-26 20:40:24] INFO (hyperopt.tpe/Thread-1) TPE using 0 trials                 
[2021-09-26 20:55:25] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.037862 seconds
[2021-09-26 20:55:25] INFO (hyperopt.tpe/Thread-1) TPE using 1/1 trials with best loss 1.831326
[2021-09-26 20:55:55] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001942 seconds
[2021-09-26 20:55:55] INFO (hyperopt.tpe/Thread-1) TPE using 2/2 trials with best loss 1.831326
[2021-09-26 21:04:52] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001929 seconds
[2021-09-26 21:04:52] INFO (hyperopt.tpe/Thread-1) TPE using 2/2 trials with best loss 1.831326
[2021-09-26 21:05:17] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001477 seconds
[2021-09-26 21:05:17] INFO (hyperopt.tpe/Thread-1) TPE using 3/3 trials with best loss 1.831326
[2021-09-26 21:05:37] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001463 seconds
[2021-09-26 21:05:37] INFO (hyperopt.tpe/Thread-1) TPE using 3/3 trials with best loss 1.831326
[2021-09-26 21:14:39] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001498 seconds
[2021-09-26 21:14:39] INFO (hyperopt.tpe/Thread-1) TPE using 4/4 trials with best loss 1.792171
[2021-09-26 21:14:54] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001461 seconds
[2021-09-26 21:14:54] INFO (hyperopt.tpe/Thread-1) TPE using 5/5 trials with best loss 1.792171
[2021-09-26 21:15:29] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001596 seconds
[2021-09-26 21:15:29] INFO (hyperopt.tpe/Thread-1) TPE using 5/5 trials with best loss 1.792171
[2021-09-26 21:15:54] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002464 seconds
[2021-09-26 21:15:54] INFO (hyperopt.tpe/Thread-1) TPE using 5/5 trials with best loss 1.792171
[2021-09-26 21:16:19] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001464 seconds
[2021-09-26 21:16:19] INFO (hyperopt.tpe/Thread-1) TPE using 5/5 trials with best loss 1.792171
[2021-09-26 21:25:26] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002475 seconds                                                                                                                            
[2021-09-26 21:25:26] INFO (hyperopt.tpe/Thread-1) TPE using 6/6 trials with best loss 1.792171
[2021-09-26 21:25:51] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001419 seconds
[2021-09-26 21:25:51] INFO (hyperopt.tpe/Thread-1) TPE using 7/7 trials with best loss 1.792171
[2021-09-26 21:26:11] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001608 seconds
[2021-09-26 21:26:11] INFO (hyperopt.tpe/Thread-1) TPE using 7/7 trials with best loss 1.792171
[2021-09-26 21:26:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001412 seconds
[2021-09-26 21:26:36] INFO (hyperopt.tpe/Thread-1) TPE using 7/7 trials with best loss 1.792171
[2021-09-26 21:26:56] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001390 seconds
[2021-09-26 21:26:56] INFO (hyperopt.tpe/Thread-1) TPE using 7/7 trials with best loss 1.792171
[2021-09-26 21:27:21] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001403 seconds
[2021-09-26 21:27:21] INFO (hyperopt.tpe/Thread-1) TPE using 7/7 trials with best loss 1.792171
[2021-09-26 21:36:23] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001425 seconds
[2021-09-26 21:36:23] INFO (hyperopt.tpe/Thread-1) TPE using 8/8 trials with best loss 1.704994
[2021-09-26 21:36:53] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001403 seconds
[2021-09-26 21:36:53] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:37:08] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001398 seconds
[2021-09-26 21:37:08] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
......
[2021-09-26 21:37:58] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001409 seconds
[2021-09-26 21:37:58] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:38:43] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001761 seconds
[2021-09-26 21:38:43] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:39:34] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.002555 seconds
[2021-09-26 21:39:34] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:40:24] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001438 seconds
[2021-09-26 21:40:24] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:41:14] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001427 seconds
[2021-09-26 21:41:14] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:42:04] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001431 seconds
[2021-09-26 21:42:04] INFO (hyperopt.tpe/Thread-1) TPE using 9/9 trials with best loss 1.704994
[2021-09-26 21:45:50] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001383 seconds
[2021-09-26 21:45:50] INFO (hyperopt.tpe/Thread-1) TPE using 10/10 trials with best loss 1.704994
[2021-09-26 21:46:46] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001407 seconds
[2021-09-26 21:46:46] INFO (hyperopt.tpe/Thread-1) TPE using 10/10 trials with best loss 1.704994
[2021-09-26 21:47:36] INFO (hyperopt.tpe/Thread-1) tpe_transform took 0.001410 seconds
[2021-09-26 21:47:36] INFO (hyperopt.tpe/Thread-1) TPE using 10/10 trials with best loss 1.704994

@Fiascolsy
Contributor

[2021-09-26 21:46:43] INFO (NNIManager) NNIManager received command from dispatcher: KI, "LTg1n"
[2021-09-26 21:46:43] INFO (NNIManager) cancelTrialJob: "LTg1n"
[2021-09-26 21:46:46] INFO (NNIManager) Trial job LTg1n status changed from RUNNING to EARLY_STOPPED
In order to save on computing resources, NNI supports an early stopping policy and has an interface called **Assessor** to do this job.

Assessor receives the intermediate result from a trial and decides whether the trial should be killed using a specific algorithm. Once the trial experiment meets the early stopping conditions (which means Assessor is pessimistic about the final results), the assessor will kill the trial and the status of the trial will be `EARLY_STOPPED`.

The sentences are from docs/en_US/builtin_assessor.rst.
Given all of the above, there may be something unexpected in your Docker setup, such as versions or configs. Maybe a newer environment will help : )
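
For context, the assessor only sees the values a trial reports back. A minimal trial skeleton looks roughly like this (train_one_epoch and the epoch count stand in for whatever your train.py actually does):

import nni

def train_one_epoch(params):
    # placeholder for the real training step in train.py
    return 1.0 / (1.0 + params.get("lr", 0.01))

params = nni.get_next_parameter()            # hyperparameters chosen by TPE
val_loss = None
for epoch in range(10):
    val_loss = train_one_epoch(params)
    # Medianstop compares these intermediate results across trials and asks
    # NNIManager to cancel hopeless trials (the "KI" / cancelTrialJob lines in
    # your nnimanager.log), which then show up as EARLY_STOPPED
    nni.report_intermediate_result(val_loss)

nni.report_final_result(val_loss)            # final metric, minimized by the tuner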

@DavideHe
Author

[2021-09-26 21:46:43] INFO (NNIManager) NNIManager received command from dispatcher: KI, "LTg1n"
[2021-09-26 21:46:43] INFO (NNIManager) cancelTrialJob: "LTg1n"
[2021-09-26 21:46:46] INFO (NNIManager) Trial job LTg1n status changed from RUNNING to EARLY_STOPPED
In order to save on computing resources, NNI supports an early stopping policy and has an interface called **Assessor** to do this job.

Assessor receives the intermediate result from a trial and decides whether the trial should be killed using a specific algorithm. Once the trial experiment meets the early stopping conditions (which means Assessor is pessimistic about the final results), the assessor will kill the trial and the status of the trial will be `EARLY_STOPPED`.

The sentences are from docs/en_US/builtin_assessor.rst.
Given all of the above, there may be something unexpected in your Docker setup, such as versions or configs. Maybe a newer environment will help : )

I understand the early-stopping policy. What I meant by quoting "INFO (NNIManager) Trial job OJlVS status changed from WAITING to RUNNING" is that for trial OJlVS the nnimanager.log only shows the RUNNING state, never SUCCEEDED or EARLY_STOPPED as it would if the trial ran correctly.
"INFO (NNIManager) Trial job OJlVS status changed from WAITING to RUNNING" is the last line of nnimanager.log.
