-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Rucio in MSOutput Consumer thread && Handle Duplicate Rule error from Rucio client. #9772
Add support for Rucio in MSOutput Consumer thread && Handle Duplicate Rule error from Rucio client. #9772
Conversation
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I have made many comments along the code. Mostly aesthetic comments though.
For the new Rucio method, given that it's mostly a GET request, I'd suggest to bring it with a unit test as well (using a real NANO data, for the moment).
self.ddm = DDM(url=self.msConfig['ddmUrl'], | ||
logger=self.logger, | ||
enableDataPlacement=self.msConfig['enableDataPlacement']) | ||
if self.msConfig['defaultDataManSys'] == 'DDM': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed yesterday, I'd suggest to use the same logic used in MSTransferor. So either we enable Rucio in the service, or we use the default DM technology (DDM/PhEDEx), e.g.:
https://github.com/dmwm/deployment/blob/master/reqmgr2ms/config-transferor.py#L81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you and as we discussed yesterday on a chat this may happen only once this PR is merged: #9759
Because we are going to be using the same flag and the Rucio client will be instantiated on the master thread for both services: MSTransferror and MSOutput. Otherwise if I try to re-implement it here again, we will get into git conflicts that need to be resolved later and we are risking to loose one of the two PRs (depending which one is merged first), as it happened the last time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I can finish that PR tomorrow, but if you want to not depend on that change, you can reuse (copy/paste) it from there and have it implemented here. Anyway is fine with me though.
Note that you will need to change the ms-output configuration with similar changes as I applied here:
https://github.com/dmwm/deployment/pull/924/files
Once you have a deployment PR, please link it from the description of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will deal with that later today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro are you quite far from finalizing this PR #9759? I see we are not overlapping in files changed in our PRs and I would like to keep it like that. So It would be much cleaner if we merge your PR first and then I rebase mine on top of that - this way we will avoid eventual git conflicts, that would require manual resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd suggest to leave this one open and revisit it once I'm done with 9759, which is not going to be today.
self.ddm = DDM(url=self.msConfig['ddmUrl'], | ||
logger=self.logger, | ||
enableDataPlacement=self.msConfig['enableDataPlacement']) | ||
elif self.msConfig['defaultDataManSys'] == 'Rucio': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think this object could be created in the MSCore module, as currently implemented in this PR:
https://github.com/dmwm/WMCore/pull/9759/files#diff-24825ae3f77716fa4f3123460f4f52c8R38
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
# return the history of rules per every Did, but no shorter path exists | ||
msg = "A duplicate rule for: \naccount: %s \ndids: %s \nrseExpression: %s\ncopies: %s\n" | ||
ruleHistory = [] | ||
for did in dids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This we quickly discussed over slack. We should have a separate method for dealing with the rules history.
In addition to that, can you please add a comment to what a duplicate rule is? AFAIK, it's considered as a unique key of:
rucio account + did + rse expression + scope
If this is correct, then I don't think we need to break down the list of dids and fetch the history for every single of them. We would either have a duplicate rule (for all of them), or for none of them.
@nsmith- might be able to clarify it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes as discussed earlier I am working on that.
And yes, the unique constraint you mention is correct with just one minor but really important detail. We do not submit per single did
but rather a whole list of dids
. And this constraint looked like an Oracle one to me (one of the exceptions I received was kind of pointing in that direction), and since it seems to be a general unique
constraint it is raising the DuplicateRule
exception regardless of the number of the dids
in the submission request violating it. Which immediately skips the rule creation for the ones that are in the same list but are not violating this requirement, and masks that quite well. Which immediately turns into a silent bug. So Yes, I am quite sure we must break the list in pieces here. I tested this several times in many different configurations.My opinion is that this think should be addressed in Rucio though, because asking the full history per every single did from our side may result in prolonged queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed the tuple rucio account + did (=name, scope) + rse expression
is the unique key for rules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the did
is still single not plural :) and we are in the situation I explained above. Could this ever be addressed in Rucio or it would be too complicate and we just keep it as it is. The way I handle this is doing the job anyway. We are just risking a little bit of longish queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see. If you provide multiple DIDs, it just loops over them and makes a rule for each. It may be that, given one DID in the list hits the duplicate rule exception, the others are processed as normal, but I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. This is definitely a bug on the Rucio side. Would you mind creating a GH issue for them under: https://github.com/rucio/rucio ?
Regarding our real use case, the only scenario that I can think of where we would have some dids with rules, and some not (within the same rule uniqueness), is that a partial rule gets created for a sub-set of the dids that we provide. This can only happen in the following situations:
- if the client makes one HTTP call for each did in the list of dids (thus, single transaction on the server side)
- if the server breaks the request into multiple database transactions, one for each did
As Rucio relies heavily on Oracle, I'd say they probably implemented all the dids rule creation in a single transaction. Still, we should contact them to make sure that we only implemented what is needed. Please check their slack channel today during CERN time: https://github.com/rucio/rucio#getting-support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not loop through the DIDs on the client side for safety?
You can catch the duplicate rule exception there, at the expense of more API calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nsmith- , Yes we are doing exactly that, and it works. It is simply slower doing it on the client side, because it waits for getting the full history data per did
and then it (eventually) parses it. This could be much faster doing it on the server side. And also it may be more clear to the other users too, if they could have a response from the server giving the information for which of the dids
a rule was created and for which not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro The question was forwarded to Rucio Support Team rucio/rucio#3807.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I suggest we leave it open for historical reference!
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I have made further comments along the code.
self.ddm = DDM(url=self.msConfig['ddmUrl'], | ||
logger=self.logger, | ||
enableDataPlacement=self.msConfig['enableDataPlacement']) | ||
if self.msConfig['defaultDataManSys'] == 'DDM': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I can finish that PR tomorrow, but if you want to not depend on that change, you can reuse (copy/paste) it from there and have it implemented here. Anyway is fine with me though.
Note that you will need to change the ms-output configuration with similar changes as I applied here:
https://github.com/dmwm/deployment/pull/924/files
Once you have a deployment PR, please link it from the description of this PR.
# return the history of rules per every Did, but no shorter path exists | ||
msg = "A duplicate rule for: \naccount: %s \ndids: %s \nrseExpression: %s\ncopies: %s\n" | ||
ruleHistory = [] | ||
for did in dids: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, please correct me if I misunderstood your comment.
When a list of dids is passed to the Rucio server, Rucio does NOT deal with them in a single transaction? Meaning that some dids might succeed to have a rule created, while others not. Have you actually seen this behaviour in your tests?
If that's correct, then that's a serious problem that needs to be addressed on the server side. Otherwise clients have to implement all these extra checks in order to get their multi-dids rules going through.
The rucio slack channel might be a good place to confirm this behaviour. Can you please confirm it with the experts before we take any decision here?
@amaltaro Thank you for your review. I think one way or another I addressed all your comments, except the one related to the |
Jenkins results:
|
response = self.cli.add_replication_rule([did], copies, rseExpression, **kwargs) | ||
except DuplicateRule: | ||
pass | ||
ruleHistory = self.listRuleHistory(dids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this probably deserves a checkRuleExists
method :-D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't try to make it too perfect though ;)
Jenkins results:
|
Jenkins results:
|
0736a9a
to
6bfc5cb
Compare
Jenkins results:
|
@amaltaro could you please look at the Final version of this PR. And let me mention about the unit tests, before you started asking where are they :) Last night I spent some time trying to setup and develop those but because of MongoDB dependencies this task turned into a non trivial one. For testing any of the methods inside the [1]
|
Pylint check/report has a few bold issues that need to be sorted.
? That's an unneeded complexity in my opinion. I haven't gone through all your comments yet btw, but either we get it done ASAP (within 50min, at most), or Imran won't make another CMSWEB deployment. |
…Rule error from Rucio client. Dealing with empty destination lists for RelVal && Skip marking the workflow as 'done' in DRY-RUN mode. New RelVal rseExpressions && separate method for replication rule history. Adding 'UnsupportedError' and 'NotImplementedError' exceptions. Fetch rule history only for duplicate dids pylint warnings
6bfc5cb
to
b48812f
Compare
I just removed my version of this Exception. |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Todor. These changes look good to me. However, you just got yourself 3 issues to be created:
- create unit tests for MSOutput
- create unit tests for MSOutputTemplate
- integrate mongodb in our unit test setup (including docker image)
Can you please get those issues created? Thanks
Fixes #9771
Status
tested
Description
Adding Support for Rucio to MSOutput. The change is only in the Consumer Thread.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
No
External dependencies / deployment changes
No