[DEFECT] unhandled RayTaskError #1851

Closed
13 tasks done
joshua-cogliati-inl opened this issue Jun 14, 2022 · 0 comments · Fixed by #1852


joshua-cogliati-inl commented Jun 14, 2022

Thank you for the defect report

Defect Description

Raven threw this unhandled RayTaskError:

2022-06-09 14:56:31,297 WARNING worker.py:1245 -- WARNING: 198 PYTHON worker processes have been started on node: 18e6db2773e31e256e0e30fbcce8d389325158c0cc23f250709bd520 with address: 10.40.0.67. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/fred/.conda/envs/raven_libraries_lemhi/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/fred/.conda/envs/raven_libraries_lemhi/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fred/raven/lemhi/raven/ravenframework/JobHandler.py", line 464, in startLoop
    self.cleanJobQueue()
  File "/home/fred/raven/lemhi/raven/ravenframework/JobHandler.py", line 964, in cleanJobQueue
    if run is not None and run.isDone():
  File "/home/fred/raven/lemhi/raven/ravenframework/Runners/DistributedMemoryRunner.py", line 74, in isDone
    ray.get(self.thread, timeout=waitTimeOut)
  File "/home/fred/.conda/envs/raven_libraries_lemhi/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/fred/.conda/envs/raven_libraries_lemhi/lib/python3.7/site-packages/ray/worker.py", line 1713, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::evaluateSample() (pid=612599, ip=10.40.0.67)
  File "/home/fred/raven/lemhi/raven/ravenframework/Models/EnsembleModel.py", line 509, in evaluateSample
    returnValue = (Input,self._externalRun(Input, jobHandler))
  File "/home/fred/raven/lemhi/raven/ravenframework/Models/EnsembleModel.py", line 657, in _externalRun
    iterationCount, jobHandler)
  File "/home/fred/raven/lemhi/raven/ravenframework/Models/EnsembleModel.py", line 753, in __advanceModel
    f'failed! Trace:\n{"*"*72}\n{msg}\n{"*"*72}')
  File "/home/fred/raven/lemhi/raven/ravenframework/BaseClasses/MessageUser.py", line 77, in raiseAnError
    self.messageHandler.error(self,etype,msg,str(tag),verbosity,color)
  File "/home/fred/raven/lemhi/raven/ravenframework/MessageHandler.py", line 234, in error
    raise etype(message)

Steps to Reproduce

Run a complicated HERON and RAVEN input for several hours.

Expected Behavior

isDone should handle the RayTaskError itself (reporting the run as finished) and leave _collectRunnerResponse to deal with the failure.
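
A minimal sketch of that behavior, not the actual RAVEN fix. To stay runnable without a Ray cluster it uses Python's standard concurrent.futures as a stand-in: future.result(timeout=...) plays the role of ray.get(self.thread, timeout=waitTimeOut), concurrent.futures.TimeoutError plays the role of ray.exceptions.GetTimeoutError, and the re-raised task exception plays the role of ray.exceptions.RayTaskError. The names RunnerSketch, collectResponse and failingSample are hypothetical and exist only for illustration.

```python
import concurrent.futures as cf


class RunnerSketch:
    """Hypothetical stand-in for a runner that polls a remote task."""

    def __init__(self, future):
        # Plays the role of the Ray object reference held by the runner.
        self.future = future

    def isDone(self, waitTimeOut=0.01):
        """Return True once the task has finished, whether it succeeded or failed."""
        try:
            self.future.result(timeout=waitTimeOut)
        except cf.TimeoutError:
            # Still running (with Ray: ray.exceptions.GetTimeoutError).
            return False
        except Exception:
            # The task finished by raising (with Ray: ray.exceptions.RayTaskError).
            # Swallow it here so the polling thread survives; the collection
            # step below is responsible for reporting the failure.
            return True
        return True

    def collectResponse(self):
        """Return (returnCode, payload); a non-zero code marks a failed task."""
        try:
            return 0, self.future.result()
        except Exception as err:
            return 1, err


def failingSample():
    """Hypothetical task that fails the way evaluateSample did in the trace."""
    raise RuntimeError("model failed")
```

With this shape, a polling loop such as cleanJobQueue can keep calling isDone without a task failure killing its thread, and the failure is still surfaced when the result is collected.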

Screenshots and Input Files

No response

OS

Linux

OS Version

No response

Dependency Manager

CONDA

For Change Control Board: Issue Review

  • Is it tagged with a type: defect or task?
  • Is it tagged with a priority: critical, normal or minor?
  • If it will impact requirements or requirements tests, is it tagged with requirements?
  • If it is a defect, can it cause wrong results for users? If so an email needs to be sent to the users.
  • Is a rationale provided? (Such as explaining why the improvement is needed or why current code is wrong.)

For Change Control Board: Issue Closure

  • If the issue is a defect, is the defect fixed?
  • If the issue is a defect, is the defect tested for in the regression test system? (If not explain why not.)
  • If the issue can impact users, has an email to the users group been written (the email should specify if the defect impacts stable or master)?
  • If the issue is a defect, does it impact the latest release branch? If yes, is there any issue tagged with release (create if needed)?
  • If the issue is being closed without a pull request, has an explanation of why it is being closed been provided?