batchSize=1 for parallel runs with mpi #1023

Closed
10 tasks done
AaronEpiney opened this issue Jul 16, 2019 · 4 comments · Fixed by #1221
Labels
defect priority_critical RAVENv2.0 Defects and Features in release of RAVEN v2.0

Comments

@AaronEpiney
Collaborator

AaronEpiney commented Jul 16, 2019


Issue Description

Setting batchSize to 1 in the parallel run settings causes an execution error: the "node" files are not created.
Input

    <batchSize>1</batchSize>
    <NumMPI>2</NumMPI>
    <internalParallel>False</internalParallel>
    <mode>
      mpi
      <runQSUB/>
      <memory>6gb</memory>
    </mode>
    <expectedTime>1:00:00</expectedTime>

Error

Exception in thread XXX++1:
Traceback (most recent call last):
  File "/home/XXX/.conda/envs/raven_libraries/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/XXX/.conda/envs/raven_libraries/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/XXX/project_TREAT/raven/framework/Runners/SharedMemoryRunner.py", line 136, in <lambda>
    self.thread = InterruptibleThread(target = lambda q, *arg : q.append(self.functionToRun(*arg)),
  File "/home/XXX/project_TREAT/raven/framework/Models/Code.py", line 479, in evaluateSample
    inputFiles = self.createNewInput(myInput, samplerType, **kwargs)
  File "/home/XXX/project_TREAT/raven/framework/Models/Code.py", line 383, in createNewInput
    newInput    = self.code.createNewInput(newInputSet,self.oriInputFiles,samplerType,**copy.deepcopy(kwargs))
  File "/home/XXX/project_TREAT/raven/framework/CodeInterfaces/RAVEN/RAVENInterface.py", line 253, in createNewInput
    raise IOError(self.printTag+' ERROR: The nodefile "'+str(nodeFileToUse)+'" does not exist!')
OSError: RAVEN INTERFACE ERROR: The nodefile "/home/XXX/./node_0" does not exist!

The following input, which differs only in setting batchSize to 2, completes without error:

    <batchSize>2</batchSize>
    <NumMPI>2</NumMPI>
    <internalParallel>False</internalParallel>
    <mode>
      mpi
      <runQSUB/>
      <memory>6gb</memory>
    </mode>
    <expectedTime>1:00:00</expectedTime>
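
As a rough way to check for the symptom before launching, the sketch below parses the failing settings and lists which node files are missing from a working directory. It is only an illustration: the wrapping `<RunInfo>` element, the helper names `read_parallel_settings` and `missing_nodefiles`, and the rule of one `node_<i>` file per batch slot are assumptions (the naming is inferred from the `node_0` path in the traceback), not RAVEN's documented behavior.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The failing settings from this issue, wrapped in an assumed <RunInfo> root
# so that the fragment parses as a single XML document.
RUN_INFO = """
<RunInfo>
  <batchSize>1</batchSize>
  <NumMPI>2</NumMPI>
  <internalParallel>False</internalParallel>
  <mode>
    mpi
    <runQSUB/>
    <memory>6gb</memory>
  </mode>
  <expectedTime>1:00:00</expectedTime>
</RunInfo>
"""


def read_parallel_settings(xml_text):
    """Hypothetical helper: extract the settings relevant to this issue."""
    root = ET.fromstring(xml_text)
    mode = root.find("mode")
    return {
        "batchSize": int(root.findtext("batchSize")),
        "NumMPI": int(root.findtext("NumMPI")),
        # The mode name is the text that precedes the first child element.
        "mode": (mode.text or "").strip(),
    }


def missing_nodefiles(work_dir, batch_size):
    """Hypothetical helper: assumes one file named node_<i> per batch slot."""
    expected = [os.path.join(work_dir, "node_%d" % i) for i in range(batch_size)]
    return [path for path in expected if not os.path.isfile(path)]


settings = read_parallel_settings(RUN_INFO)

# An empty temporary directory stands in for the real working directory,
# so every expected node file is reported as missing.
with tempfile.TemporaryDirectory() as work_dir:
    missing = missing_nodefiles(work_dir, settings["batchSize"])

for path in missing:
    print("missing nodefile:", os.path.basename(path))
```

Note that in Python 3 `IOError` is an alias of `OSError`, which is why the traceback shows `raise IOError(...)` in the source line but reports `OSError` as the exception type.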

For Change Control Board: Issue Review

This review should occur before any development is performed as a response to this issue.

  • 1. Is it tagged with a type: defect or task?
  • 2. Is it tagged with a priority: critical, normal or minor?
  • 3. If it will impact requirements or requirements tests, is it tagged with requirements?
  • 4. If it is a defect, can it cause wrong results for users? If so, an email needs to be sent to the users.
  • 5. Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)

For Change Control Board: Issue Closure

This review should occur when the issue is imminently going to be closed.

  • 1. If the issue is a defect, is the defect fixed?
  • 2. If the issue is a defect, is the defect tested for in the regression test system? (If not, explain why not.)
  • 3. If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
  • 4. If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
  • 5. If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?
@alfoa
Collaborator

alfoa commented Jul 22, 2019

Duplicate of #939. It looks like this is an old problem.

@alfoa
Collaborator

alfoa commented Jul 22, 2019

@PaulTalbot-INL It looks like the problem is in the modifyInfo of the MPI mode: when the new batch size is created and a new node file is generated, the information about the NodeParameter does not appear to be returned.
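
The kind of failure described here, where a function computes a setting but does not hand it back to its caller, can be sketched as follows. This is not RAVEN's code: `modify_info_buggy`, `modify_info_fixed`, `apply_mode`, and the placeholder value stored under `NodeParameter` are all hypothetical, and the actual cause may differ.

```python
def modify_info_buggy(run_info):
    """Hypothetical: computes a node parameter but never returns it."""
    new_info = {"batchSize": run_info["batchSize"]}
    node_parameter = "placeholder-node-parameter"  # computed, then dropped
    return new_info


def modify_info_fixed(run_info):
    """Hypothetical: same as above, but the node parameter is returned."""
    new_info = {"batchSize": run_info["batchSize"]}
    new_info["NodeParameter"] = "placeholder-node-parameter"
    return new_info


def apply_mode(run_info, modify):
    """Merge whatever the mode returns into a copy of the run info."""
    updated = dict(run_info)
    updated.update(modify(run_info))
    return updated


base_info = {"batchSize": 1, "NumMPI": 2}

buggy_result = apply_mode(base_info, modify_info_buggy)
fixed_result = apply_mode(base_info, modify_info_fixed)

# Downstream code that reads NodeParameter only finds it in the fixed case.
print("NodeParameter" in buggy_result, "NodeParameter" in fixed_result)
```

In this sketch the caller only sees what the mode returns, so anything computed inside the function and left out of the returned dictionary is lost to later steps.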

alfoa added defect priority_critical RAVENv2.0 Defects and Features in release of RAVEN v2.0 labels Jul 22, 2019
alfoa mentioned this issue Jul 22, 2019
10 tasks
alfoa added a commit that referenced this issue Apr 22, 2020
alfoa mentioned this issue Apr 22, 2020
9 tasks
@wangcj05
Collaborator

This defect causes a crash but no wrong results, so an email to the users is optional.

@wangcj05
Collaborator

The checklist is satisfied. This issue can be closed with PR #1221.

PaulTalbot-INL pushed a commit that referenced this issue May 20, 2020
* added test for issue 1023

* Closes #1023

* fixed typo in date

* addressed wangc's comments