Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Rucio in MSOutput Consumer thread && Handle Duplicate Rule error from Rucio client. #9772

Merged
merged 1 commit into from
Jul 2, 2020

Conversation

todor-ivanov
Copy link
Contributor

@todor-ivanov todor-ivanov commented Jun 30, 2020

Fixes #9771

Status

tested

Description

Adding Support for Rucio to MSOutput. The change is only in the Consumer Thread.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

No

External dependencies / deployment changes

No

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 24 warnings
    • 23 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10187/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todor, I have made many comments along the code. Mostly aesthetic comments though.
For the new Rucio method, given that it's mostly a GET request, I'd suggest to bring it with a unit test as well (using a real NANO data, for the moment).

self.ddm = DDM(url=self.msConfig['ddmUrl'],
logger=self.logger,
enableDataPlacement=self.msConfig['enableDataPlacement'])
if self.msConfig['defaultDataManSys'] == 'DDM':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed yesterday, I'd suggest to use the same logic used in MSTransferor. So either we enable Rucio in the service, or we use the default DM technology (DDM/PhEDEx), e.g.:
https://github.com/dmwm/deployment/blob/master/reqmgr2ms/config-transferor.py#L81

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you and as we discussed yesterday on a chat this may happen only once this PR is merged: #9759

Because we are going to be using the same flag and the Rucio client will be instantiated on the master thread for both services: MSTransferror and MSOutput. Otherwise if I try to re-implement it here again, we will get into git conflicts that need to be resolved later and we are risking to loose one of the two PRs (depending which one is merged first), as it happened the last time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I can finish that PR tomorrow, but if you want to not depend on that change, you can reuse (copy/paste) it from there and have it implemented here. Anyway is fine with me though.
Note that you will need to change the ms-output configuration with similar changes as I applied here:
https://github.com/dmwm/deployment/pull/924/files

Once you have a deployment PR, please link it from the description of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will deal with that later today.

Copy link
Contributor Author

@todor-ivanov todor-ivanov Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amaltaro are you quite far from finalizing this PR #9759? I see we are not overlapping in files changed in our PRs and I would like to keep it like that. So It would be much cleaner if we merge your PR first and then I rebase mine on top of that - this way we will avoid eventual git conflicts, that would require manual resolution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'd suggest to leave this one open and revisit it once I'm done with 9759, which is not going to be today.

self.ddm = DDM(url=self.msConfig['ddmUrl'],
logger=self.logger,
enableDataPlacement=self.msConfig['enableDataPlacement'])
elif self.msConfig['defaultDataManSys'] == 'Rucio':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this object could be created in the MSCore module, as currently implemented in this PR:
https://github.com/dmwm/WMCore/pull/9759/files#diff-24825ae3f77716fa4f3123460f4f52c8R38

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
# return the history of rules per every Did, but no shorter path exists
msg = "A duplicate rule for: \naccount: %s \ndids: %s \nrseExpression: %s\ncopies: %s\n"
ruleHistory = []
for did in dids:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This we quickly discussed over slack. We should have a separate method for dealing with the rules history.
In addition to that, can you please add a comment to what a duplicate rule is? AFAIK, it's considered as a unique key of:
rucio account + did + rse expression + scope

If this is correct, then I don't think we need to break down the list of dids and fetch the history for every single of them. We would either have a duplicate rule (for all of them), or for none of them.
@nsmith- might be able to clarify it though.

Copy link
Contributor Author

@todor-ivanov todor-ivanov Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes as discussed earlier I am working on that.

And yes, the unique constraint you mention is correct with just one minor but really important detail. We do not submit per single did but rather a whole list of dids. And this constraint looked like an Oracle one to me (one of the exceptions I received was kind of pointing in that direction), and since it seems to be a general unique constraint it is raising the DuplicateRule exception regardless of the number of the dids in the submission request violating it. Which immediately skips the rule creation for the ones that are in the same list but are not violating this requirement, and masks that quite well. Which immediately turns into a silent bug. So Yes, I am quite sure we must break the list in pieces here. I tested this several times in many different configurations.My opinion is that this think should be addressed in Rucio though, because asking the full history per every single did from our side may result in prolonged queries.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed the tuple rucio account + did (=name, scope) + rse expression is the unique key for rules

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the did is still single not plural :) and we are in the situation I explained above. Could this ever be addressed in Rucio or it would be too complicate and we just keep it as it is. The way I handle this is doing the job anyway. We are just risking a little bit of longish queries.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. If you provide multiple DIDs, it just loops over them and makes a rule for each. It may be that, given one DID in the list hits the duplicate rule exception, the others are processed as normal, but I am not sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is definitely a bug on the Rucio side. Would you mind creating a GH issue for them under: https://github.com/rucio/rucio ?

Regarding our real use case, the only scenario that I can think of where we would have some dids with rules, and some not (within the same rule uniqueness), is that a partial rule gets created for a sub-set of the dids that we provide. This can only happen in the following situations:

  • if the client makes one HTTP call for each did in the list of dids (thus, single transaction on the server side)
  • if the server breaks the request into multiple database transactions, one for each did

As Rucio relies heavily on Oracle, I'd say they probably implemented all the dids rule creation in a single transaction. Still, we should contact them to make sure that we only implemented what is needed. Please check their slack channel today during CERN time: https://github.com/rucio/rucio#getting-support

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not loop through the DIDs on the client side for safety?
You can catch the duplicate rule exception there, at the expense of more API calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nsmith- , Yes we are doing exactly that, and it works. It is simply slower doing it on the client side, because it waits for getting the full history data per did and then it (eventually) parses it. This could be much faster doing it on the server side. And also it may be more clear to the other users too, if they could have a response from the server giving the information for which of the dids a rule was created and for which not.

Copy link
Contributor Author

@todor-ivanov todor-ivanov Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amaltaro The question was forwarded to Rucio Support Team rucio/rucio#3807.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I suggest we leave it open for historical reference!

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 24 warnings
    • 24 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10194/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todor, I have made further comments along the code.

self.ddm = DDM(url=self.msConfig['ddmUrl'],
logger=self.logger,
enableDataPlacement=self.msConfig['enableDataPlacement'])
if self.msConfig['defaultDataManSys'] == 'DDM':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I can finish that PR tomorrow, but if you want to not depend on that change, you can reuse (copy/paste) it from there and have it implemented here. Anyway is fine with me though.
Note that you will need to change the ms-output configuration with similar changes as I applied here:
https://github.com/dmwm/deployment/pull/924/files

Once you have a deployment PR, please link it from the description of this PR.

src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/Unified/MSOutput.py Outdated Show resolved Hide resolved
# return the history of rules per every Did, but no shorter path exists
msg = "A duplicate rule for: \naccount: %s \ndids: %s \nrseExpression: %s\ncopies: %s\n"
ruleHistory = []
for did in dids:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todor, please correct me if I misunderstood your comment.
When a list of dids is passed to the Rucio server, Rucio does NOT deal with them in a single transaction? Meaning that some dids might succeed to have a rule created, while others not. Have you actually seen this behaviour in your tests?
If that's correct, then that's a serious problem that needs to be addressed on the server side. Otherwise clients have to implement all these extra checks in order to get their multi-dids rules going through.

The rucio slack channel might be a good place to confirm this behaviour. Can you please confirm it with the experts before we take any decision here?

@todor-ivanov
Copy link
Contributor Author

@amaltaro Thank you for your review. I think one way or another I addressed all your comments, except the one related to the UseRucio flag. I will do it later today. In the meantime you may want to take a look at my latest commit - I am keeping them separate for the moment, so you can look at the new changes. I will squash them before we merge.

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 changes in unstable tests
  • Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 24 warnings
    • 24 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10202/artifact/artifacts/PullRequestReport.html

response = self.cli.add_replication_rule([did], copies, rseExpression, **kwargs)
except DuplicateRule:
pass
ruleHistory = self.listRuleHistory(dids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this probably deserves a checkRuleExists method :-D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't try to make it too perfect though ;)

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 changes in unstable tests
  • Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 23 warnings
    • 24 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10205/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: failed
    • 1 new failures
  • Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 23 warnings
    • 24 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10213/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the feature_MSOutput_Rucio_subscr branch from 0736a9a to 6bfc5cb Compare July 1, 2020 21:17
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 changes in unstable tests
  • Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 23 warnings
    • 24 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10217/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Copy link
Contributor Author

@amaltaro could you please look at the Final version of this PR. And let me mention about the unit tests, before you started asking where are they :) Last night I spent some time trying to setup and develop those but because of MongoDB dependencies this task turned into a non trivial one. For testing any of the methods inside the MSOutput Class, I need an instance of it, but then we need the MongoDb to be up and running in the container for the unit tests - which is not the case for the moment. And the whole thing is breaking with [1]. And even more, I have no idea how would this behave when the Unit test is about to be run by Jenkins. So I did not push whatever I have so far because it will simply continue breaking.

[1]

[dmwm@a4e07b4a650f Unified_t]$ nosetests MSOutput_t.py
/home/dmwm/unittestdeploy/wmagent/1.3.6.pre4/sw/slc7_amd64_gcc630/external/py2-cryptography/2.4.2-comp2/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py:163: CryptographyDeprecationWarning: OpenSSL version 1.0.1 is no longer supported by the OpenSSL project, please upgrade. A future version of cryptography will drop support for it.
  utils.CryptographyDeprecationWarning
E
======================================================================
ERROR: Test the getCampaignConfig method
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dmwm/wmcore_unittest/WMCore/test/python/WMCore_t/MicroService_t/Unified_t/MSOutput_t.py", line 47, in setUp
    self.ms = MSOutput(self.msConfig, 'MSOutputConsumer')
  File "/home/dmwm/wmcore_unittest/WMCore/src/python/WMCore/MicroService/Unified/MSOutput.py", line 134, in __init__
    self.msOutDB = MongoDB(**msOutDBConfig).msOutDB
  File "/home/dmwm/wmcore_unittest/WMCore/src/python/WMCore/Database/MongoDB.py", line 47, in __init__
    raise ex
ServerSelectionTimeoutError: localhost:8230: [Errno 111] Connection refused
-------------------- >> begin captured logging << --------------------
root: DEBUG: Reading mocked SiteDB data from the file /home/dmwm/wmcore_unittest/WMCore/test/python/../data/Mock/SiteDBMockData.json 
root: INFO: Configuration including default values:
...

root: DEBUG: Initializing CRIC with url: https://cms-cric.cern.ch/
root: ERROR: Could not connect to MongoDB server: mongodb://localhost
localhost:8230: [Errno 111] Connection refusedGiving up Now.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 30.127s

FAILED (errors=1)

@amaltaro
Copy link
Contributor

amaltaro commented Jul 2, 2020

Pylint check/report has a few bold issues that need to be sorted.
Todor, I don't really see a need to override the NotImplementedError builtin exception as well. Why can't you simply

raise NotImplementedError("implement bulk requests in docUpdater blah blah")

?

That's an unneeded complexity in my opinion. I haven't gone through all your comments yet btw, but either we get it done ASAP (within 50min, at most), or Imran won't make another CMSWEB deployment.

…Rule error from Rucio client.

Dealing with empty destination lists for RelVal && Skip marking the workflow as 'done' in DRY-RUN mode.

New RelVal rseExpressions && separate method for replication rule history.

Adding 'UnsupportedError' and 'NotImplementedError' exceptions.

Fetch rule history only for duplicate dids

pylint warnings
@todor-ivanov todor-ivanov force-pushed the feature_MSOutput_Rucio_subscr branch from 6bfc5cb to b48812f Compare July 2, 2020 08:35
@todor-ivanov
Copy link
Contributor Author

I just removed my version of this Exception.

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 tests no longer failing
  • Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 23 warnings
    • 20 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10223/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Todor. These changes look good to me. However, you just got yourself 3 issues to be created:

  • create unit tests for MSOutput
  • create unit tests for MSOutputTemplate
  • integrate mongodb in our unit test setup (including docker image)

Can you please get those issues created? Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Support for Rucio in MSOutput
4 participants