
MSTransferor created several rules for the same DID with slightly altered RSE expressions that overlap #10278

Closed
FernandoGarzon opened this issue Feb 11, 2021 · 18 comments


@FernandoGarzon

Impact of the bug
Inform which systems get affected by this bug. Which agent(s)? Which central service(s)?

In the last couple of weeks, the Transfer Team detected an overload on Bristol_Disk. By this I mean that a large amount of data was placed there and made the site run out of space. When we investigated the problem, we found that there were three containers whose rules were targeting Bristol, all of which were created by the wmcore_transferor account. Those rules were stuck due to the exceeded quota at Bristol. The three datasets are: /EGamma/Run2018D-v1/RAW, /SingleElectron/Run2016E-v2/RAW, /SingleMuon/Run2018D-v1/RAW.

Then, we checked all the replication rules created by that account and matched the results against the names of those 3 datasets. We found results of this kind:

(u'534f256ab28a44ada9c06b8023c2f040', u'wmcore_transferor', datetime.datetime(2021, 1, 21, 15, 6, 45), u'/EGamma/Run2018D-v1/RAW#638532d4-3f8d-40a4-a603-ab199ce59df5', u'T1_ES_PIC_Disk|T2_CH_CSCS|T2_UA_KIPT|T1_FR_CCIN2P3_Disk|T2_US_Purdue|T2_TW_NCHC|T2_UK_SGrid_RALPP|T2_FR_GRIF_LLR|T2_DE_RWTH|T2_FR_IPHC|T2_IT_Legnaro|T2_US_Caltech|T1_DE_KIT_Disk|T2_UK_London_Brunel|T2_RU_JINR|T1_UK_RAL_Disk|T1_US_FNAL_Disk|T2_EE_Estonia|T2_IT_Rome|T2_US_Florida|T2_FR_GRIF_IRFU|T1_IT_CNAF_Disk|T1_RU_JINR_Disk|T2_UK_London_IC|T2_IT_Bari|T2_US_Nebraska|T2_US_UCSD|T2_ES_CIEMAT|T2_RU_IHEP|T2_US_Wisconsin|T2_HU_Budapest|T2_CN_Beijing|T2_US_MIT|T2_BE_IIHE|T2_KR_KISTI|T2_PT_NCG_Lisbon|T2_PL_Swierk|T2_US_Vanderbilt', u'OK')
This means that wmcore_transferor placed these datasets (in Rucio jargon) on many RSEs.

We also detected cases where, for one dataset, there are several rules created by wmcore_transferor that place the dataset on the same disk multiple times, which led to incorrect interpretations of this chart: https://ncsmith.web.cern.ch/ncsmith/phedex2rucio/rucio_summary_relative.pdf. That means that for dataset A there are rule 1, rule 2 and rule 3 all placing that dataset on Bristol (for example) at the same time.

We believe this is happening not only for the 3 datasets I posted earlier, but for many more.

Describe the bug
A clear and concise description of what the bug is.

It seems to us that wmcore_transferor is creating rules with very broad RSE expressions, causing one dataset to be located at many sites. We also see that wmcore_transferor is creating several rules for the same dataset at the same time, causing a single dataset to be placed at the same site more than once. Some disks are getting full because of this.

How to reproduce it
Steps to reproduce the behavior: we are running this script to get the rule IDs and their RSE expressions:

```python
from rucio.client import Client

client = Client()
stuck = list(client.list_replication_rules({"account": "wmcore_transferor"}))
for i in stuck:
    # print(i)
    if "/EGamma/Run2018D-v1/RAW" in i['name']:
        print(i['id'], i['name'], i['rse_expression'], i['state'])
```
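
Building on that, here is a minimal sketch (our own check, not MSTransferor code) of how one could list the DIDs covered by more than one wmcore_transferor rule at the same time; it only relies on the same list_replication_rules call used above:

```python
from collections import defaultdict

from rucio.client import Client

# Group this account's rules by DID to spot datasets covered by several
# rules at once (this grouping is our own check, not part of MSTransferor).
client = Client()
rules_per_did = defaultdict(list)
for rule in client.list_replication_rules({"account": "wmcore_transferor"}):
    rules_per_did[rule['name']].append(rule)

for name, rules in sorted(rules_per_did.items()):
    if len(rules) > 1:
        print(name, len(rules), [r['rse_expression'] for r in rules])
```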

Expected behavior
We would expect wmcore_transferor to create rules with RSE expressions of no more than 3 RSEs, and only one rule per dataset per account.

Additional context and error message
Add any other context about the problem here, like error message and/or traceback. You might want to use triple back ticks to properly format it.

More evidence will be provided on this issue.

@jhonatanamado
Contributor

One last comment: most of the sites impacted by this issue are sites where wmcore_transferor has no quota left to continue with the input data placement. https://cmsweb.cern.ch/ms-transferor/data/status?detail=true

@flgomezc

The link [1] provides the over-replicated datasets organized by RSE. The list only includes datasets with more than 2 rules, to simplify the query. Some RSEs have 400+ datasets with multiple rules (up to 19 rules per dataset).
24 RSEs have been checked so far.

[1] https://cernbox.cern.ch/index.php/s/m2iIEdlDNulchuk

@todor-ivanov
Contributor

todor-ivanov commented Feb 11, 2021

Hi @flgomezc @FernandoGarzon, Thank you for creating the issue and explaining what you observed.

Could you please clarify the meaning of 'stuck' here:

Those rules were stuck due to the exceeded quota at Bristol

I can imagine (but need to be sure) this was a rule which was created, but lack of space or some other limitation prevented the actual data transfer from happening. I am asking because this may be a sign of two things: either the service is overcommitting, or it is not calculating/refreshing the site quota properly.

p.s. BTW, the section Impact of the bug refers only to the name of the subsystem or component to which this bug is related; in this case it should be MSTransferor, while the whole long explanation should go in the next section, Describe the bug. :) Nevertheless, thank you for the elaborate description of the problem.

@klannon

klannon commented Feb 11, 2021

I may be very confused, but my understanding of what MSTransferor is trying to achieve with these long RSE expressions is to say "Dear Rucio, please place this data at one of this list of sites. We don't care where. You pick." We had discussed with the Rucio development team the desire not to have WMCore/MSTransferor checking quotas and deciding where to place data, but to let Rucio handle that. Also, I don't think there's any reason that we have to try to keep the data at only a small number of those sites. So, let me specifically ask:

  1. Why does MSTransferor need to limit its list of RSEs to just three? What problem does that solve?
  2. How is MSTransferor supposed to decide which three to pick?
  3. Are you saying that by putting so many sites into the RSE list, Rucio is being triggered to make multiple replicas of individual data across multiple sites?
  4. Is the problem really being caused by not limiting the RSE expression to three RSEs, or is there some other problem that's really the root cause?

@nsmith-

nsmith- commented Feb 11, 2021

I'm confused as well, and this appears to be an incorrect interpretation of the RSE expressions. What MSTransferor is doing is requesting one copy of the DID be placed at one of a choice of many RSEs. The issue that probably should be reported here is not that the RSE expression is large, but rather that MSTransferor created several rules for the same DID with slightly altered RSE expressions that have some overlap.

@klannon

klannon commented Feb 11, 2021

Ah, yes! That does sound like a problem. If MSTransferor is relying on Rucio to complain about duplicated rules, but producing slightly different RSE expressions each time the rule is created, I can see how that would be a problem. On the other hand, you can imagine that MSTransferor might actually have different requests that need to place the data at different sites. Consider the following scenario:

Request A: Site whitelist: X, Y, Z ==> RSE expression: X|Y|Z
Request B: Site whitelist: V, W, X ==> RSE expression: V|W|X

In this case, it's not obvious to me that a single rule could satisfy both. For example, if tasks for B cannot run at Y and Z, then relying on A's rule would effectively block that request from using 2/3 of the desired resources. If, on the other hand, it were possible to update Rule A to add V and W to the RSE expression (e.g. making it "V|W|X|Y|Z"), then it's possible that some data needed by B would be placed at Y or Z and thus not be accessible to B for processing.
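
To illustrate the point (using only the placeholder site names from the scenario above, nothing real):

```python
# Hypothetical illustration of the scenario above, with placeholder sites.
whitelist_a = {"X", "Y", "Z"}   # sites where request A can run
whitelist_b = {"V", "W", "X"}   # sites where request B can run

# A single merged rule "V|W|X|Y|Z" lets Rucio satisfy it at any of these sites:
merged_rule_rses = whitelist_a | whitelist_b

# ...so the copy may land somewhere request B cannot use at all:
unusable_for_b = merged_rule_rses - whitelist_b
print(sorted(unusable_for_b))   # ['Y', 'Z']
```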

@todor-ivanov: Can you investigate why MSTransferor is making multiple rules for the same dataset with different RSE expressions? Is this just a bug (i.e. code being sloppy) or is there a reason to do this (e.g. requests with different site whitelists)?

@nsmith-

nsmith- commented Feb 11, 2021

I agree that there is a valid use case for such semi-duplicate rules. But I believe @flgomezc found cases of 19 such rules on one DID, which does seem a bit excessive. That said, we discussed this in slack and I believe @amaltaro identified the issue and found that it was already fixed since the incident. We should confirm here though.

@klannon

klannon commented Feb 11, 2021

Agreed. Who is it that can confirm? @amaltaro? @todor-ivanov?

@FernandoGarzon FernandoGarzon changed the title Wmcore_transferor account is creating rules with large RSEs expressions MSTransferor created several rules for the same DID with slightly altered RSE expressions that overlap Feb 11, 2021
@FernandoGarzon
Author

For the sake of my own education, I'm glad you guys have commented on this issue. I'd like to clarify some of your concerns expressed in this thread.

@todor-ivanov 'stuck' meant that lack of space prevented the data transfer from happening. However, I'm not 100% sure if that's actually the case here.

@nsmith- I changed the name of the issue for more clarity.

@todor-ivanov
Contributor

todor-ivanov commented Feb 11, 2021

Hi Kevin @klannon,

Who is it that can confirm? @amaltaro? @todor-ivanov?

I am now searching for the exact channel where the issue was previously communicated.

@todor-ivanov
Contributor

Hi @klannon, I found several threads related to this story, all scattered among different channels (Slack, GitHub, etc.), but long story short: the strange behavior of MSTransferor was triggered because it failed to send mail alerts for some workflows. Citing Alan @amaltaro here:

The reason for the repeated rules though, is that the service failed to send the email alert notification (new bug from the kubernetes environment/setup).
...
We have fixed this issue and it was deployed to production last Monday, Feb 1st.

The full investigation Alan did, together with the solution he was talking about, are in [1] and [2].
What we need to confirm, though, is whether the rules Fernando mentions here are remnants from the pre-fix deployment era.

[1] #10234
[2] #10244

@klannon

klannon commented Feb 11, 2021

Very good. How do we confirm that? Do you need to get the creation date of the problematic rules? @FernandoGarzon, can you provide that?

@todor-ivanov
Contributor

Yes, the rule creation date would be quite telling. I would also add that it would not be a bad idea to confirm whether any similar rules were created after Feb 1st.
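
For reference, a minimal sketch of such a check (assuming the rule dictionaries returned by list_replication_rules carry a created_at datetime, as the tuple quoted at the top of this issue suggests):

```python
from datetime import datetime

from rucio.client import Client

# Sketch: list wmcore_transferor rules created after the fix deployment date.
# Assumes each rule dict exposes a 'created_at' datetime field.
client = Client()
cutoff = datetime(2021, 2, 1)
recent = [r for r in client.list_replication_rules({"account": "wmcore_transferor"})
          if r['created_at'] > cutoff]
for r in sorted(recent, key=lambda rule: rule['created_at']):
    print(r['created_at'], r['id'], r['name'], r['rse_expression'])
```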

@flgomezc

Yes, I confirm that 20 rules were created after Feb 1st; the list is in [1] below.

There was a crazy two-day period in which more than 8000 rules were created:
https://cernbox.cern.ch/index.php/apps/gallery/preview/overreplicated_datasets/over-replicated-rules.png

There are also remnant rules from last year still alive (plotted in log scale to show both days with 1-2 rules per day and days with more than 100 rules per day):
https://cernbox.cern.ch/index.php/apps/gallery/preview/overreplicated_datasets/historic-over-replicated-rules.png

[1] 20 rules created since 2021-02-01:
0dbe88657c5f47f891ca153f0cb30664
659b204aaf894e9388f526ad2c8af44f
25b8e4d1f01f488b85e0e21f7655e4a2
888692100e3949d3b6f10310555186b2
7fc8a0581b2b4a87a345297acce70124
bae2b95b2a654ee2b171a99f13041325
a7e9249681d64191a5798e4edabc7cf2
9de3d038a7d34779b1504ae13134cec7
857508bd80b24c94bf9a8ba056445334
522ee20d88e843de8bd800b98dd819b3
f62dfd58380646dfa9f64f8f605c1f66
07568d6e7bab45d58278f9e4f60ebe53
44d08a6967384593a57bf3d2df5b9a40
25fba3baaaac448f89abfc9f17a2f92e
8965df30b9b14809b6aef3d21fac2e27
8ea3090e045a45fb8f4309e7ac6483f8
df826252a81346a081eba6e8d35d50af
58a15d8a3d494daabaf7245379fef1fa
02fcb5055cb6485c8bef9149ffbdbec8
b9129e60340045a4ba707205f46ea926

@nsmith-

nsmith- commented Feb 12, 2021

Well, we are making new rules all the time; are these rules whose RSE expressions are semi-duplicates of each other?
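
One way to check this mechanically (a sketch only; it treats the RSE expression as a plain '|'-separated list of sites, which holds for the expressions quoted in this thread but not for arbitrary Rucio expressions):

```python
from itertools import combinations

# Given the rules already grouped per DID (see the earlier sketch), flag
# pairs of rules whose RSE expressions share sites. Splitting on '|' is a
# simplification that works for the expressions shown in this issue.
def overlapping_pairs(rules):
    pairs = []
    for a, b in combinations(rules, 2):
        rses_a = set(a['rse_expression'].split('|'))
        rses_b = set(b['rse_expression'].split('|'))
        common = rses_a & rses_b
        if common:
            pairs.append((a['id'], b['id'], sorted(common)))
    return pairs
```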

@flgomezc

Sorry for leaving this out.
13 of these rules have a single rse_expression, 'T1_US_FNAL_Disk', all of them for a miniAOD container.

The other 7 rules have semi-duplicated rse_expressions:

2021-01-27 21:24:39 a00f8f1f48624040997da55adc23822f	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:29:27 25b8e4d1f01f488b85e0e21f7655e4a2	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-02-07 14:32:39 659b204aaf894e9388f526ad2c8af44f	 T1_ES_PIC_Disk|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T1_RU_JINR_Disk|T2_BE_UCL|T1_US_FNAL_Disk
2021-01-27 21:24:43 51a843e785b144d0be2a244686482cd2	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:29:39 f62dfd58380646dfa9f64f8f605c1f66	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-27 21:24:45 027c138bdfea4ee29bd989528cd8c186	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:29:54 857508bd80b24c94bf9a8ba056445334	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-27 21:24:44 d9e1e1998c1641919c2e364a3be7e121	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:41:49 a7e9249681d64191a5798e4edabc7cf2	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-27 21:24:46 8cafa8b8aafe41eca7f71de9ffc9bc3a	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:41:56 df826252a81346a081eba6e8d35d50af	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk

This dataset has 24 rules and 7 replicas
/SingleMuon/Run2018D-v1/RAW#6f5ce442-ece0-4f9b-8591-3e8219f8b224

2021-01-21 18:21:37 1d49ecd168f84f18919816a3b888d372	 T1_ES_PIC_Disk|T2_IT_Rome|T2_DE_DESY|T1_US_FNAL_Disk|T2_US_MIT|T2_BE_IIHE|T2_US_Caltech|T1_DE_KIT_Disk|T2_RU_INR|T2_US_Purdue|T2_PL_Swierk|T1_RU_JINR_Disk|T2_FI_HIP|T2_IT_Bari|T2_US_Nebraska|T1_UK_RAL_Disk|T2_US_Vanderbilt
2021-01-21 18:32:04 a0ebf6bca1c94c0fbe2353d846eafbe1	 T1_ES_PIC_Disk|T2_US_MIT|T2_BE_IIHE|T2_US_Caltech|T1_DE_KIT_Disk|T2_US_Purdue|T1_RU_JINR_Disk|T2_US_Vanderbilt|T2_US_Nebraska|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-21 18:42:38 7c9b0ad16f2e45429462f4e610b387ec	 T2_US_MIT|T2_BE_IIHE|T2_US_Caltech|T1_DE_KIT_Disk|T2_US_Purdue|T1_RU_JINR_Disk|T2_US_Vanderbilt|T2_US_Nebraska|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-21 18:53:02 bf27e1e2e3cc400fb41d9e067f8e042c	 T2_US_MIT|T2_US_Caltech|T1_DE_KIT_Disk|T2_US_Purdue|T2_US_Vanderbilt|T2_US_Nebraska|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-01-21 19:03:33 3af1a33d66fa4edfb4acfe355ff413d6	 T2_US_Nebraska|T1_DE_KIT_Disk|T2_US_Purdue|T1_UK_RAL_Disk|T2_US_MIT
2021-01-21 19:14:02 1fad79fb01d34e58b175b9a9dc21a1ca	 T1_DE_KIT_Disk|T2_US_Nebraska
2021-01-21 19:24:33 c542f814c04f426998b7475bacc18a20	 T1_DE_KIT_Disk
2021-01-22 05:26:38 77af3463212c424586a053d5de13d789	 T1_DE_KIT_Disk|T2_CH_CERN
2021-01-22 05:37:04 328437a3cd314d6da8f2402313e0713b	 T2_CH_CERN
2021-01-22 15:02:03 49cbb74e7f66454b8c1cb853ea37c571	 T2_DE_RWTH|T1_US_FNAL_Disk
2021-01-29 22:25:06 64156a2b94354f948b706d9b10f532de	 T1_RU_JINR_Disk|T1_FR_CCIN2P3_Disk|T2_DE_DESY|T2_CH_CERN
2021-02-04 16:35:24 0dbe88657c5f47f891ca153f0cb30664	 T1_ES_PIC_Disk|T2_CH_CSCS|T2_UA_KIPT|T1_FR_CCIN2P3_Disk|T2_US_Purdue|T2_TW_NCHC|T2_FI_HIP|T2_UK_SGrid_RALPP|T2_FR_GRIF_LLR|T2_DE_RWTH|T2_FR_IPHC|T2_IT_Legnaro|T2_US_Caltech|T1_DE_KIT_Disk|T2_UK_London_Brunel|T2_RU_JINR|T1_UK_RAL_Disk|T1_US_FNAL_Disk|T2_EE_Estonia|T2_IT_Rome|T2_US_Florida|T2_FR_GRIF_IRFU|T1_IT_CNAF_Disk|T1_RU_JINR_Disk|T2_UK_London_IC|T2_IT_Bari|T2_US_Nebraska|T2_FR_CCIN2P3|T2_US_UCSD|T2_ES_CIEMAT|T2_RU_IHEP|T2_US_Wisconsin|T2_HU_Budapest|T2_CN_Beijing|T2_US_MIT|T2_BE_IIHE|T2_KR_KISTI|T2_RU_INR|T2_CH_CERN|T2_PT_NCG_Lisbon|T2_PL_Swierk|T2_US_Vanderbilt|T2_BR_SPRACE
2021-01-27 21:24:39 a00f8f1f48624040997da55adc23822f	 T2_IN_TIFR|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T2_CH_CERN|T1_RU_JINR_Disk|T2_BE_UCL
2021-02-02 14:29:27 25b8e4d1f01f488b85e0e21f7655e4a2	 T2_IN_TIFR|T2_DE_DESY|T2_IT_Legnaro|T1_FR_CCIN2P3_Disk|T2_KR_KISTI|T2_BE_UCL|T1_RU_JINR_Disk|T2_UK_London_IC|T1_UK_RAL_Disk|T1_US_FNAL_Disk
2021-02-07 14:32:39 659b204aaf894e9388f526ad2c8af44f	 T1_ES_PIC_Disk|T2_DE_DESY|T1_FR_CCIN2P3_Disk|T1_RU_JINR_Disk|T2_BE_UCL|T1_US_FNAL_Disk

@amaltaro
Contributor

I have the feeling this concerns the same issue already discussed and fixed in production.

@todor-ivanov since you are already on top of this investigation, can you please close that loop such that we can decide whether it's really a problem that we need to address, or just another observation from something that has already been resolved (thus, not an issue)? Thanks

@todor-ivanov
Contributor

Hi @amaltaro, I tried to find those rules in the MSTransferor logs, but as one can guess, we do not keep logs that far back in time; we keep them for the last 3 days only. Given that the creation times of all the rules @flgomezc reports are around the time the service was re-deployed, and there were no violations later during the month, I would say we may close this issue; if we see another storm of such rules, we can reopen it or create a new one.
