MSTransferor created several rules for the same DID with slightly altered RSE expressions that overlap #10278
One last comment: most of the sites impacted by this issue are sites where wmcore_transferor has no remaining quota to continue with the input data placement: https://cmsweb.cern.ch/ms-transferor/data/status?detail=true
The link [1] provides the over-replicated datasets organized by RSE. To simplify the query, this list includes only datasets with more than 2 rules. Some RSEs have 400+ datasets with multiple rules (up to 19 rules per dataset).
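The over-replication check above can be sketched without access to Rucio itself. This is a minimal illustration, not the actual query used: the rule records below are hypothetical, shaped like the output of a Rucio rule listing (the dataset name is real, the rule counts are made up).

```python
from collections import Counter

# Hypothetical rule records; in reality these would come from a Rucio
# rule listing for the wmcore_transferor account.
rules = [
    {"name": "/EGamma/Run2018D-v1/RAW", "rse_expression": "T1_US_FNAL_Disk"},
    {"name": "/EGamma/Run2018D-v1/RAW", "rse_expression": "T2_US_MIT|T2_US_UCSD"},
    {"name": "/EGamma/Run2018D-v1/RAW", "rse_expression": "T2_UK_London_IC"},
    {"name": "/SingleMuon/Run2018D-v1/RAW", "rse_expression": "T1_UK_RAL_Disk"},
]

# Count rules per DID and keep only the over-replicated ones (> 2 rules)
counts = Counter(rule["name"] for rule in rules)
over_replicated = {did: n for did, n in counts.items() if n > 2}
print(over_replicated)  # {'/EGamma/Run2018D-v1/RAW': 3}
```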
Hi @flgomezc @FernandoGarzon, thank you for creating the issue and explaining what you observed. Could you please explain what 'stuck' means here:
I can imagine (but need to be sure) this was a rule which was created, but lack of space or another limitation prevented the actual data transfer from happening. I am asking because this may be a sign of one of two things: either the service is over-committing, or it is not calculating/refreshing the site quota properly. P.S. BTW, the section
I may be very confused, but my understanding of what MSTransferor is trying to achieve with these long RSE expressions is to say: "Dear Rucio, please place this data at some of this list of sites. We don't care where; you pick." We had discussed with the Rucio development team the desire not to have WMCore/MSTransferor checking quotas and deciding where to place data, but to let Rucio handle that. Also, I don't think there's any reason we have to keep the data at only a small number of those sites. So, let me ask specifically:
I'm confused as well, and this appears to be an incorrect interpretation of the RSE expressions. What MSTransferor is doing is requesting that one copy of the DID be placed at one of a choice of many RSEs. The issue that probably should be reported here is not that the RSE expression is large, but rather that MSTransferor created several rules for the same DID with slightly altered RSE expressions that have some overlap.
Ah, yes! That does sound like a problem. If MSTransferor is relying on Rucio to complain about duplicated rules, but producing slightly different RSE expressions each time the rule is created, I can see how that would be a problem. On the other hand, you can imagine that MSTransferor might actually have different requests that need to place the data at different sites. Consider the following scenario:

Request A: site whitelist X, Y, Z ==> RSE expression: X|Y|Z
Request B: site whitelist V, W, X ==> RSE expression: V|W|X

In this case, it's not obvious to me that a single rule could satisfy both. For example, if tasks for B cannot run on Y and Z, then relying on A's rule would effectively block that request from using 2/3 of its desired resources. If, on the other hand, it were possible to update rule A to add V and W to the RSE expression (e.g. making it "V|W|X|Y|Z"), then it's possible that some data needed by B would be placed at Y or Z and thus not be accessible to B for processing. @todor-ivanov: Can you investigate why MSTransferor is making multiple rules for the same dataset with different RSE expressions? Is this just a bug (i.e. sloppy code) or is there a reason to do this (e.g. requests with different site whitelists)?
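The overlap described above can be made concrete. This is a minimal sketch, treating a union-only RSE expression (sites joined with "|") as a set of sites; real Rucio RSE expressions support more operators, so this models only the simple case, and the site names are the placeholders from the scenario, not real RSEs.

```python
def sites(rse_expression):
    """Split a union-only RSE expression like 'X|Y|Z' into a set of sites."""
    return set(rse_expression.split("|"))

# Placeholder expressions for the two requests in the scenario above
rule_a = "X|Y|Z"  # Request A's whitelist
rule_b = "V|W|X"  # Request B's whitelist

overlap = sites(rule_a) & sites(rule_b)
print(sorted(overlap))  # ['X'] -- the site both rules could pin data to
```

Two rules whose expressions share even one site can each independently satisfy their lock with the same replica, which is how semi-duplicates accumulate without Rucio flagging them as identical.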
I agree that there is a valid use case for such semi-duplicate rules. But I believe @flgomezc found cases of 19 such rules on one DID, which does seem a bit excessive. That said, we discussed this in Slack and I believe @amaltaro identified the issue and found that it was already fixed since the incident. We should confirm that here though.
Agreed. Who is it that can confirm? @amaltaro? @todor-ivanov? |
For the sake of my own education, I'm glad you guys have commented on this issue. I'd like to clarify some of the concerns expressed in this thread. @todor-ivanov: 'stuck' meant that lack of space prevented the data transfer from happening. However, I'm not 100% sure that's actually the case here. @nsmith-: I changed the name of the issue for clarity.
Hi Kevin @klannon,
I am searching now for the exact channel in which the issue was previously communicated.
Hi @klannon, I found several threads related to the story, all scattered among different channels (Slack, GitHub, etc.), but long story short: the strange behavior of MSTransferor was triggered because it failed to send mail alerts for some workflows. Citing Alan @amaltaro here:
And the full investigation Alan did, together with the solution he mentioned, are in [1] and [2]. [1] [2]
Very good. How do we confirm that? Do you need to get the creation date of the problematic rules? @FernandoGarzon, can you provide that? |
Yes, the rule creation date would be quite identifying. I would also add: it would not be a bad idea to confirm whether there are any similar rules created after Feb 1st.
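Checking for rules created after Feb 1st amounts to a simple filter over the rules' creation timestamps. A sketch with hypothetical records: the first rule ID is the one quoted later in this issue, the second is made up, and the created_at field mirrors the datetime a Rucio rule listing returns.

```python
from datetime import datetime

# Hypothetical rule records with their creation timestamps
rules = [
    {"id": "534f256ab28a44ada9c06b8023c2f040",   # real ID from this issue
     "created_at": datetime(2021, 1, 21, 15, 6, 45)},
    {"id": "ffffffffffffffffffffffffffffffff",   # made-up ID for illustration
     "created_at": datetime(2021, 2, 10, 9, 0, 0)},
]

# Keep only rules created on or after the Feb 1st cutoff
cutoff = datetime(2021, 2, 1)
recent = [r["id"] for r in rules if r["created_at"] >= cutoff]
print(recent)  # ['ffffffffffffffffffffffffffffffff']
```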
Well, we are making new rules all the time; are these rules whose RSE expressions are semi-duplicates of each other?
Sorry for not pointing this out. The 7 other rules have semi-duplicated rse_expression values:
This dataset has 24 rules and 7 replicas
I have the feeling this concerns the same issue already discussed and fixed in production. @todor-ivanov, since you are already on top of this investigation, can you please close that loop so that we can decide whether it's really a problem we need to address, or just another observation of something that has already been resolved (thus, not an issue)? Thanks
Hi @amaltaro, I tried to find those rules in the MSTransferor logs, but as one can guess, we do not keep logs that far back in time; we keep them for the last 3 days only. Given that the creation times of all the rules @flgomezc reports are around the time the service was re-deployed, with no violations later in the month, I would say we may close this issue. If we see another storm of those rules, we can reopen it or create a new one.
Impact of the bug
In the last couple of weeks, the Transfer Team detected an overload on Bristol_Disk: a large amount of data was placed there, and the site ran out of space. When we investigated the problem, we found that there were three containers whose rules were targeting Bristol, all of which were created by the wmcore_transferor account. Those rules were stuck due to the exceeded quota at Bristol. The three datasets are: /EGamma/Run2018D-v1/RAW, /SingleElectron/Run2016E-v2/RAW, /SingleMuon/Run2018D-v1/RAW.
Then we checked all the replication rules created by that account and matched the results against the names of those 3 datasets. We found results like this:
```
(u'534f256ab28a44ada9c06b8023c2f040', u'wmcore_transferor', datetime.datetime(2021, 1, 21, 15, 6, 45), u'/EGamma/Run2018D-v1/RAW#638532d4-3f8d-40a4-a603-ab199ce59df5', u'T1_ES_PIC_Disk|T2_CH_CSCS|T2_UA_KIPT|T1_FR_CCIN2P3_Disk|T2_US_Purdue|T2_TW_NCHC|T2_UK_SGrid_RALPP|T2_FR_GRIF_LLR|T2_DE_RWTH|T2_FR_IPHC|T2_IT_Legnaro|T2_US_Caltech|T1_DE_KIT_Disk|T2_UK_London_Brunel|T2_RU_JINR|T1_UK_RAL_Disk|T1_US_FNAL_Disk|T2_EE_Estonia|T2_IT_Rome|T2_US_Florida|T2_FR_GRIF_IRFU|T1_IT_CNAF_Disk|T1_RU_JINR_Disk|T2_UK_London_IC|T2_IT_Bari|T2_US_Nebraska|T2_US_UCSD|T2_ES_CIEMAT|T2_RU_IHEP|T2_US_Wisconsin|T2_HU_Budapest|T2_CN_Beijing|T2_US_MIT|T2_BE_IIHE|T2_KR_KISTI|T2_PT_NCG_Lisbon|T2_PL_Swierk|T2_US_Vanderbilt', u'OK')
```
This means that wmcore_transferor placed these datasets (in Rucio jargon) on many RSEs.
We also detected cases where, for one dataset, there are several rules created by wmcore_transferor that place the dataset multiple times on the same disk, which caused misleading results in this chart: https://ncsmith.web.cern.ch/ncsmith/phedex2rucio/rucio_summary_relative.pdf. That is, for dataset A, rules 1, 2 and 3 all place that dataset at Bristol (for example) at the same time.
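Detecting the "same dataset pinned several times at the same site" case amounts to counting (dataset, site) pairs across rules. A minimal sketch with made-up rule records, again treating RSE expressions as simple unions of site names (real expressions can be more complex):

```python
from collections import Counter

# Hypothetical rules; several of them could pin dataset A at Bristol_Disk
rules = [
    {"name": "A", "rse_expression": "Bristol_Disk|T2_US_MIT"},
    {"name": "A", "rse_expression": "Bristol_Disk"},
    {"name": "A", "rse_expression": "Bristol_Disk|T1_UK_RAL_Disk"},
]

# Count how many rules could place each dataset at each site
pair_counts = Counter(
    (rule["name"], site)
    for rule in rules
    for site in rule["rse_expression"].split("|")
)
duplicated = {pair: n for pair, n in pair_counts.items() if n > 1}
print(duplicated)  # {('A', 'Bristol_Disk'): 3}
```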
We believe this is happening not only for the 3 datasets I posted earlier, but for many more.
Describe the bug
It seems to us that wmcore_transferor is creating rules with very broad RSE expressions, causing one dataset to be located at many sites. We also see wmcore_transferor creating several rules for the same dataset at the same time, causing a single dataset to be placed at the same site more than once. Some disks are filling up because of this.
How to reproduce it
Steps to reproduce the behavior: we run this script to get the rule IDs and their RSE expressions:

```python
from rucio.client import Client

client = Client()

# All replication rules owned by the wmcore_transferor account
rules = list(client.list_replication_rules({"account": "wmcore_transferor"}))

for rule in rules:
    if "/EGamma/Run2018D-v1/RAW" in rule["name"]:
        print(rule["id"], rule["name"], rule["rse_expression"], rule["state"])
```
Expected behavior
We would expect wmcore_transferor to create rules with RSE expressions of no more than 3 RSEs, and one rule per dataset per account.
Additional context and error message
More evidence will be provided on this issue.