Add support and validation to GPU in StdBase/ReReco #10799

Merged: @amaltaro merged 3 commits into dmwm:master on Sep 14, 2021

@amaltaro (Contributor) commented Sep 10, 2021

Fixes #10388

Status

ready

Description

This PR implements GPU support within WMCore at the request level only; job-level support will be handled in a separate issue/PR.
Summary of changes:

  • New request spec parameters called 'RequiresGPU' and 'GPUParams', where:
    • RequiresGPU: one of "forbidden", "optional" or "required". It defaults to "forbidden", i.e. GPUs are not used.
    • GPUParams: a JSON-encoded dictionary, defaulting to JSON-encoded None. It must be provided if RequiresGPU=optional or RequiresGPU=required.

The mandatory parameters within GPUParams are:

  • GPUMemoryMB (renamed from GPUMemory !): integer greater than 0
  • CUDACapabilities: a list of string values. Each value must match the CUDA_VERSION_REGEX regular expression and max length.
  • CUDARuntime: a string value matching the CUDA_VERSION_REGEX constraints.

The 3 optional parameters are:

  • GPUName: a string value with fewer than 100 characters
  • CUDADriverVersion: a string value matching the CUDA_VERSION_REGEX constraints.
  • CUDARuntimeVersion: a string value matching the CUDA_VERSION_REGEX constraints.
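
The schema above can be sketched as a small validator. This is only an illustration under stated assumptions, not the WMCore Lexicon code: the names `validate_gpu_params` and `_is_cuda_version` are hypothetical, the regular expression is the one quoted in the review diff of this PR, and how a populated GPUParams combined with RequiresGPU=forbidden should be treated is not stated here, so the sketch simply accepts it.

```python
import json
import re

# Pattern and length limit as quoted in the review diff of this PR.
CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}


def _is_cuda_version(value):
    """Hypothetical helper: True for strings such as '11.2' or '11.2.1'."""
    return (isinstance(value, str)
            and len(value) <= CUDA_VERSION_REGEX["maxLength"]
            and re.match(CUDA_VERSION_REGEX["re"], value) is not None)


def validate_gpu_params(requires_gpu, gpu_params_json):
    """Hypothetical validator for the RequiresGPU / GPUParams pair."""
    assert requires_gpu in ("forbidden", "optional", "required"), "bad RequiresGPU"
    params = json.loads(gpu_params_json)
    if params is None:
        # The None default is only acceptable when GPUs are not used.
        assert requires_gpu == "forbidden", "GPUParams must be provided"
        return True
    assert isinstance(params, dict), "GPUParams must be a JSON object"

    # Mandatory parameters.
    memory = params.get("GPUMemoryMB")
    assert isinstance(memory, int) and not isinstance(memory, bool) and memory > 0
    capabilities = params.get("CUDACapabilities")
    assert isinstance(capabilities, list)
    assert all(_is_cuda_version(item) for item in capabilities)
    assert _is_cuda_version(params.get("CUDARuntime"))

    # Optional parameters are only checked when present.
    if "GPUName" in params:
        assert isinstance(params["GPUName"], str) and len(params["GPUName"]) < 100
    for key in ("CUDADriverVersion", "CUDARuntimeVersion"):
        if key in params:
            assert _is_cuda_version(params[key])
    return True
```

For instance, `validate_gpu_params("required", json.dumps(None))` fails with an AssertionError, because a request that asks for GPUs has to describe them.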

NOTE: full support in TaskChain and StepChain is going to be done in a different GH issue/pull request, but the bulk of the development is already provided in this PR.

Is it backward compatible (if not, which system it affects?)

NO (new feature!)

Related PRs

None

External dependencies / deployment changes

None

@amaltaro
Contributor Author

amaltaro commented Sep 10, 2021

copying a few names from the GH issue discussion @justinasr @mrceyhun @fwyzard @srimanob @hufnagel

This is still a work in progress, but I wanted to draw your attention to it: this is the current candidate implementation to go into the Workload Management system, hopefully next week.

The most important input/feedback that I would like to get from you is whether you see any inconsistencies or use cases that are not covered by the current schema (data type + length + regular expression). Thank you very much!

UPDATE: adding Jordan as well @jordan-martins

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: failed
    • 2 tests added
  • Python3 Unit tests: succeeded
    • 2 tests added
    • 3 changes in unstable tests
  • Python2 Pylint check: failed
    • 40 warnings and errors that must be fixed
    • 2 warnings
    • 143 comments to review
  • Python3 Pylint check: failed
    • 48 warnings and errors that must be fixed
    • 2 warnings
    • 166 comments to review
  • Pylint py3k check: failed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 30 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12430/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
    • 2 tests added
  • Python3 Unit tests: succeeded
    • 2 tests added
    • 3 changes in unstable tests
  • Python2 Pylint check: failed
    • 45 warnings and errors that must be fixed
    • 3 warnings
    • 147 comments to review
  • Python3 Pylint check: failed
    • 53 warnings and errors that must be fixed
    • 3 warnings
    • 167 comments to review
  • Pylint py3k check: failed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 33 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12431/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: failed
    • 6 tests added
    • 1 changes in unstable tests
  • Python3 Unit tests: failed
    • 6 tests added
    • 3 changes in unstable tests
  • Python2 Pylint check: failed
    • 64 warnings and errors that must be fixed
    • 15 warnings
    • 267 comments to review
  • Python3 Pylint check: failed
    • 78 warnings and errors that must be fixed
    • 15 warnings
    • 398 comments to review
  • Pylint py3k check: failed
    • 3 errors and warnings that should be fixed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 107 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12435/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
    • 6 tests added
  • Python3 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python2 Pylint check: failed
    • 64 warnings and errors that must be fixed
    • 15 warnings
    • 267 comments to review
  • Python3 Pylint check: failed
    • 78 warnings and errors that must be fixed
    • 15 warnings
    • 398 comments to review
  • Pylint py3k check: failed
    • 3 errors and warnings that should be fixed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 106 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12436/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

The baseline development to support GPUs within WMCore is provided in this PR, which also supports the new workflow arguments for the ReReco spec type. TaskChain and StepChain will be addressed in separate issues/PRs.

I'm going to run some extra tests in the next few hours and, if everything goes fine, I will request these changes to be deployed in cmsweb-testbed today, run a final validation, and deploy this to production as well on Thursday.

@amaltaro
Contributor Author

Looking at a real request JSON, I see this:

  "RequiresGPU": "forbidden",
  "GPUParams": "\"\"",

which doesn't look great with those escaped characters. Maybe we should default it to an encoded None instead (thus, 'null'). Checking...
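
A likely explanation, assuming the default at that point was a JSON-encoded empty string (as the later commit message "Change default from empty string to None" suggests), is double encoding: the parameter value is itself JSON text, and it is encoded a second time when the request document is serialized. The two-key request dictionaries below are made up for the illustration and use only the standard `json` module.

```python
import json

# Default as an empty string: the inner encoding is the two-character text "",
# so the outer encoding of the request has to escape both quote characters.
empty_default = {"RequiresGPU": "forbidden", "GPUParams": json.dumps("")}
encoded_empty = json.dumps(empty_default)
print(encoded_empty)

# Default as None: the inner encoding is the bare token null, which needs no escaping.
none_default = {"RequiresGPU": "forbidden", "GPUParams": json.dumps(None)}
encoded_none = json.dumps(none_default)
print(encoded_none)

# Either way, decoding the inner value gives back the original Python object.
print(repr(json.loads(none_default["GPUParams"])))
```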

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
    • 6 tests added
  • Python3 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python2 Pylint check: failed
    • 64 warnings and errors that must be fixed
    • 15 warnings
    • 267 comments to review
  • Python3 Pylint check: failed
    • 78 warnings and errors that must be fixed
    • 15 warnings
    • 398 comments to review
  • Pylint py3k check: failed
    • 3 errors and warnings that should be fixed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 106 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12439/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
    • 6 tests added
  • Python3 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python2 Pylint check: failed
    • 64 warnings and errors that must be fixed
    • 15 warnings
    • 268 comments to review
  • Python3 Pylint check: failed
    • 78 warnings and errors that must be fixed
    • 15 warnings
    • 399 comments to review
  • Pylint py3k check: failed
    • 3 errors and warnings that should be fixed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 106 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12440/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

I'm going to need this code merged so that I can resume working on its implementation for TaskChain #10400 and StepChain #10401.

Basic tests went fine in my VM. I'm going to squash the 5th commit with the 1st one, and from my side it's good to go.
However, Todor, please feel free to leave your review. If something needs to be addressed, I can do so in the other PR(s) to be created today touching this GPU code.

fix Lexicon logic for GPUParams and its internals

add RequiresGPU vs GPUParams validation; fix Py2 compatibility

Change default from empty string to None. StdBase and Lexicon

Call GPU setter in StdBase.setupProcessingTask
fix WMWorkload set call

update getters/setters to deal with None default instead
clean unit tests

update Lexicon unit tests for new None default

unit test for getter/setters methods for GPU settings

WMWorkload unit test fix

update unit tests for getters/setters with None

@cmsdmwmbot

Jenkins results:

  • Python2 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python3 Unit tests: succeeded
    • 6 tests added
    • 2 changes in unstable tests
  • Python2 Pylint check: failed
    • 64 warnings and errors that must be fixed
    • 15 warnings
    • 268 comments to review
  • Python3 Pylint check: failed
    • 78 warnings and errors that must be fixed
    • 15 warnings
    • 399 comments to review
  • Pylint py3k check: failed
    • 3 errors and warnings that should be fixed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 106 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12441/artifact/artifacts/PullRequestReport.html

try:
    data = json.loads(candidate)
except Exception:
    raise AssertionError("GPUParams is not a valid JSON object")

@todor-ivanov (Contributor) commented Sep 14, 2021

Sorry for the question, @amaltaro, but why should we raise an AssertionError here? It seems strange to me to catch a general exception and raise an AssertionError instead. Moreover, the message suggests that this is about a specific case related to the data structure itself.

@amaltaro (Contributor Author)

Lexicon checks either return False or raise an AssertionError in case of failures during the input data validation.
This is also the standard behaviour of the check function in this module.
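
That contract can be sketched as follows. The `check` helper here is a simplified stand-in written for this illustration (the real Lexicon function may differ in signature and details), and `gpu_parameters` and `is_valid` are hypothetical names. The point is that every validation failure, including a JSON decoding error, surfaces as an AssertionError, so callers need a single except clause.

```python
import json
import re


def check(regexp, candidate, max_length=None):
    """Simplified stand-in for a Lexicon-style check: raise AssertionError on failure."""
    if max_length is not None:
        assert len(candidate) <= max_length, "%r is too long" % candidate
    assert re.match(regexp, candidate) is not None, \
        "%r does not match %r" % (candidate, regexp)
    return True


def gpu_parameters(candidate):
    """Hypothetical validator: a decoding error is converted into AssertionError."""
    try:
        data = json.loads(candidate)
    except Exception:
        raise AssertionError("GPUParams is not a valid JSON object")
    return data


def is_valid(validator, value):
    """Caller side: one except clause covers every kind of validation failure."""
    try:
        validator(value)
    except AssertionError:
        return False
    return True
```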

return _gpuInternalParameters(data)


CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}

Contributor

If this one is going to be in the global scope, I'd say we should move it to the top of the file, similar to the others defined on lines 27-31.

@amaltaro (Contributor Author)

That could be done as well. However, looking at this module, you can see that the regular expression is usually defined closer to the function that will consume it, so I just kept the consistency.
Anyhow, this will be refactored once a decision is made on how to separate Lexicon logic from lexicon rules/regex.
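
For reference, a quick look at what the `CUDA_VERSION_REGEX` definition quoted above accepts. The helper name `matches_cuda_version` and the sample strings are made up for the illustration; only the pattern and the maximum length come from the diff.

```python
import re

CUDA_VERSION_REGEX = {"re": r"^\d+\.\d+(\.\d+)?$", "maxLength": 100}


def matches_cuda_version(value):
    """Hypothetical helper: apply both the length limit and the pattern."""
    if len(value) > CUDA_VERSION_REGEX["maxLength"]:
        return False
    return re.match(CUDA_VERSION_REGEX["re"], value) is not None


# Two or three dot-separated groups of digits are accepted; anything else is not.
samples = ["11.2", "11.2.1", "7", "11.2.1.0", "v11.2", "11."]
results = {sample: matches_cuda_version(sample) for sample in samples}
for sample, accepted in results.items():
    print(sample, accepted)
```

Only "11.2" and "11.2.1" are accepted from this sample list.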

task of this spec.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object

Contributor

What happens if it is JSON-encoded from None? I can guess, of course, but it is a little obscure.

@amaltaro (Contributor Author)

This:

In [2]: json.dumps(None)
Out[2]: 'null'

which will be the default value in the specs/workflows. Also defined in the StdBase spec file.

all underneath CMSSW type step object.
:param requiresGPU: string defining whether GPUs are needed. For TaskChains, it
could be a dictionary key'ed by the taskname.
:param gpuParams: GPU settings. A JSON encoded object, from either a None object

@todor-ivanov (Contributor) commented Sep 14, 2021

Same comment as the one given below for setGPUSettings

taskIterator = self.taskIterator()

for task in taskIterator:
    task.setTaskGPUSettings(requiresGPU, gpuParams)

Contributor

The following comment is just for my personal education (I am pretty sure you have already double-checked this).
Isn't this a repeated recursion when combined with the one in setTaskGPUSettings() on this line? https://github.com/dmwm/WMCore/pull/10799/files#diff-81efe0a8bcf6b4cb2d5ee526c24a027563fa129b414de2aca75e78f1b38acbf1R1503

@amaltaro (Contributor Author)

That's a good question! And indeed it's confusing!
The WMWorkload taskIterator method only iterates through the top-level tasks. Each workload usually (always?) has a single top-level task.
The recursion in WMTask, on the other hand, iterates through all sub-tasks. For example, a Processing task has a Merge sub-task, and then we also have LogCollect and Cleanup sub-tasks.
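
A toy model of the two traversals, to show why no task is visited twice. The `Task` and `Workload` classes below are hypothetical stand-ins, not the WMCore classes; the flat nesting of the sub-tasks is illustrative, and storing the settings on every task is a simplification.

```python
class Task:
    """Hypothetical stand-in for a task with child tasks."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.gpu_settings = None
        self.visits = 0  # counts how often the setter reached this task

    def set_task_gpu_settings(self, requires_gpu, gpu_params):
        # Apply to this task, then recurse into every sub-task.
        self.gpu_settings = (requires_gpu, gpu_params)
        self.visits += 1
        for child in self.children:
            child.set_task_gpu_settings(requires_gpu, gpu_params)


class Workload:
    """Hypothetical stand-in for a workload holding top-level tasks."""

    def __init__(self, top_level_tasks):
        self.top_level_tasks = list(top_level_tasks)

    def task_iterator(self):
        # Yields top-level tasks only; it never descends into sub-tasks.
        yield from self.top_level_tasks

    def set_gpu_settings(self, requires_gpu, gpu_params):
        for task in self.task_iterator():
            task.set_task_gpu_settings(requires_gpu, gpu_params)


sub_tasks = [Task("Merge"), Task("LogCollect"), Task("Cleanup")]
workload = Workload([Task("Processing", sub_tasks)])
workload.set_gpu_settings("required", '{"GPUMemoryMB": 8000}')
```

The outer loop hands over each top-level task once, and the recursion covers everything beneath it, so every task ends up with `visits == 1`.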

@todor-ivanov (Contributor) left a comment

Thanks for this quite big PR, @amaltaro. From what I managed to grasp of the whole picture, it looks good to me. Of course, I cannot catch possible errors better than the tests you have already done. I just left a few inline comments related to the clarity of the code while reading. None of them relates to a possible problem, so they are not blockers in any case.

@amaltaro
Contributor Author

Thank you very much for this review, Todor.
I totally agree that this is too big for having a decent code review. That's why I have separated the TaskChain and StepChain cases in their own issues, otherwise it would be even bigger and more complex.

Hopefully the manual tests and unit tests will be enough to make sure this code behaves ;)

@amaltaro amaltaro merged commit 232f3ab into dmwm:master Sep 14, 2021

@amaltaro
Contributor Author

Here is fairly good documentation for these GPU developments: https://github.com/dmwm/WMCore/wiki/GPU-Support

Successfully merging this pull request may close these issues.

Add support for GPU parameters at ReReco spec level
3 participants