Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deal with duplicate DID exception when injecting multiple file replicas into Rucio #10026

Closed
amaltaro opened this issue Oct 27, 2020 · 16 comments

Comments

@amaltaro
Copy link
Contributor

amaltaro commented Oct 27, 2020

Impact of the bug
WMAgent

Describe the bug
We don't seem to have hit this exception - yet - in production, but there will certainly be a moment where we will try to add a list of file replicas (add_replicas Rucio client API) into Rucio, and some of those will already exist in the database, thus hitting a DataIdentifierAlreadyExists exception.
With the current code, we would simply try the same call (or same files + freshly created ones after the previous cycle) again and again, which would likely keep failing on the server side.

How to reproduce it
Inject a file replica against a block, then try to inject the same replica again (or the same replica plus a new file)

Expected behavior
Duplicated file replica exception should not be considered if we tried to add a single file replica, the component should deal with it as if it had a successful replica injection.

However, if there are multiple file replicas in the input, our wrapper API createReplicas should signal (or raise an exception) back to the caller and let the caller break that list of file replicas into single replicas, making a new createReplicas call for every replica file.

Additional context and error message
Initial handling (single file replica) in: #10024

@todor-ivanov
Copy link
Contributor

Hi @amaltaro, I am not 100% sure I fully understand your concern, mostly because the exception DataIdentifierAlreadyExists you are mentioning is a different one from the one - DuplicateRule we've already seen in a similar case [1]. But besides the exception difference, IIUC the rest of the case you describe here is exactly the same. And if I am not wrong ... then this has been already brought to the Rucio developers' attention [2].

[1]
#9772 (comment)

[2]
rucio/rucio#3807

@amaltaro amaltaro added Rucio and removed Rucio labels Jan 4, 2021
@amaltaro amaltaro added Rucio and removed Rucio labels Jan 6, 2021
@amaltaro amaltaro removed this from the November_2020 milestone Mar 23, 2022
@amaltaro
Copy link
Contributor Author

amaltaro commented Mar 23, 2022

I had a chat with Eric and Yuyi today, and indeed this code looks suspicious and could potentially be causing issues with data produced in central production.

This add_replicas Rucio python API:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L464

actually performs two actions:

  • create a DID in the database
  • create a replica in the database (a file at a given location)

so, it could be that our datasets are ending without a replica, we are not sure.

The suggested approach for such exceptions would be to:

  1. iterate over every single file and
  2. check whether the DID exists in Rucio
  3. check whether a replica exists in Rucio

if a file passes these 2 checks, then indeed we can consider this data insertion operation complete.

BTW, here is a pointer to the code executed on the server side: https://github.com/rucio/rucio/blob/5e92ee1e6d323d7e2f9eeb6e97ecb4f6fc4c3abe/lib/rucio/core/replica.py#L1457

@yuyiguo @ericvaandering will provide us with some guidance on which Python APIs we could use for this error handling/check.

PS.: and here is a GH issue reporting this (possible) problem in October 2020 :(

@amaltaro
Copy link
Contributor Author

amaltaro commented Jun 7, 2022

And here is a fresh JIRA ticket reporting many inconsistencies between DBS and Rucio, apparently with files (or maybe only replicas!) missing in Rucio: https://its.cern.ch/jira/browse/CMSCOMPPR-25873

If we think that the Rucio client add_replicas API might be causing a problem here, maybe we should change our code to perform two different actions separately:

  1. first, create a file DID
  2. then, create a replica for the file DID above.

I'm not sure whether it would solve the problem though. And, it also adds extra complication on our side with additional bookkeeping and error handling.

@todor-ivanov please discuss this with the Rucio experts to see what is the best option to solve this.

PS.: the JIRA above reports problems with single file blocks as well. So likely not related to write operations in bulk performed by WMAgent.

@ericvaandering
Copy link
Member

OK, I don't know if it will help but I'm parsing the logs and looking for anything other than an exit 201 from POST /replicas from WMCore.

So far nothing.

@todor-ivanov
Copy link
Contributor

Thanks @ericvaandering I am also trying to find some records in the wmagent logs related to any of those files. I'll post my findings here as well.

@ericvaandering
Copy link
Member

So I don't know WHAT happened here, but:

server-rucio-server-fbb969f46-t78zd rucio-server [2022-06-08 17:41:40]      2001:4860::2:feca       10.100.126.128  131.225.188.200 YqDfVApIA8YS89GtvSrOJQAAAP4     401    1280     106     5258    "POST /replicas HTTP/1.1"       "wma_prod-/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Research Alliance/OU=Fermilab/CN=wmagent.fnal.gov-unknown-a9c8afa879954d2da2451d1e7a632feb"      "wmcore-client/1.25.5"  wmcoreD::--start
server-rucio-server-fbb969f46-mnv5g rucio-server [2022-06-08 17:57:27]  2001:4860::4:57e0       10.100.20.128   137.138.150.248 YqDjB-YhRZR@I9VzycApUwAAABg     401     1259   106      3833    "POST /replicas HTTP/1.1"       "tier0_prod-/DC=ch/DC=cern/OU=computers/CN=tier0/vocms001.cern.ch-unknown-c6d5361f1bde4759ba5dd8422a921e7b"     "wmcore-client/1.25.5"  wmcoreD::--start

@todor-ivanov
Copy link
Contributor

I have no idea what has generated the above two records.
In the meantime, while checking the Rucio injector logs I do find two types of errors we experience lately:[1] && [2]. Strangely enough I do not see those files in the jira ticket listing the file mismatches.

[1]

2022-04-23 04:19:07,578:139769485022976:ERROR:Rucio:Failed to add replicas for: [{'name': '/store/data/Run2018A/EGamma/AOD/15Feb2022_UL2018-v1/28210000/B213CA40-C61E-314E-A443-323017D18A8A.root', 'scope': 'cms', 'bytes': 6262512647, 'state': 'A', 'adler32': 'e2145bf5'}, {'name': '/store/data/Run2018A/EGamma/AOD/15Feb2022_UL2018-v1/28210000/D1F43378-DADA-D147-8950-150A551F6744.root', 'scope': 'cms', 'bytes': 6343962651, 'state': 'A', 'adler32': '3a1886f6'}] and block: /EGamma/Run2018A-15Feb2022_UL2018-v1/AOD#48a7f226-b463-46e5-8386-5738b7656684. Error: An unknown exception occurred.
Details: no error information passed (http status code: 500)

[2]

2022-05-18 10:23:08,942:140444646516480:ERROR:RucioInjectorPoller:Failed to update file status in the database, reason: (cx_Oracle.DatabaseError) ORA-25408: can not safely replay call
[SQL: UPDATE dbsbuffer_file SET in_phedex = :status WHERE lfn = :lfn]
[parameters: [{'lfn': '/store/mc/RunIISummer20UL16RECOAPV/BdToPsi2sKpi_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/2830000/64CBE7B2-F05A-7547-AFF6-55D
EEC6BAA89.root', 'status': 1}, {'lfn': '/store/mc/RunIISummer20UL16RECOAPV/BdToPsi2sKpi_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/2830000/19F5E539-C
83C-0E48-8730-F26BAC838B28.root', 'status': 1}]]
(Background on this error at: http://sqlalche.me/e/4xp6)
2022-05-18 10:23:08,948:140444646516480:ERROR:RucioInjectorPoller:Caught unexpected exception in RucioInjector. Details:
<@========== WMException Start ==========@>
Exception Class: RucioInjectorException
Message: Failed to update file status in the database, reason: (cx_Oracle.DatabaseError) ORA-25408: can not safely replay call
[SQL: UPDATE dbsbuffer_file SET in_phedex = :status WHERE lfn = :lfn]
[parameters: [{'lfn': '/store/mc/RunIISummer20UL16RECOAPV/BdToPsi2sKpi_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/2830000/64CBE7B2-F05A-7547-AFF6-55D
EEC6BAA89.root', 'status': 1}, {'lfn': '/store/mc/RunIISummer20UL16RECOAPV/BdToPsi2sKpi_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/2830000/19F5E539-C
83C-0E48-8730-F26BAC838B28.root', 'status': 1}]]
(Background on this error at: http://sqlalche.me/e/4xp6)
        ClassName : None
        ModuleName : WMComponent.RucioInjector.RucioInjectorPoller
        MethodName : _updateLFNState
        ClassInstance : None
        FileName : /data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMComponent/RucioInjector/RucioInjectorPoller.py
        LineNumber : 308
        ErrorNr : 0

Traceback: 
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMComponent/RucioInjector/RucioInjectorPoller.py", line 298, in _updateLF
NState
    self.setStatus.execute(listLfns, 1)

  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMComponent/DBS3Buffer/MySQL/DBSBufferFiles/SetPhEDExStatus.py", line 25,
 in execute
    self.dbi.processData(self.sql, bindVars, conn = conn,

  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 159, in processData
    result.extend(self.executemanybinds(sqlstmt[0], subBinds,

  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 113, in executemanybinds
    result = connection.execute(s, b)

  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.3.3-comp/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 982, in execute
    return self._execute_text(object_, multiparams, params)
...

@todor-ivanov
Copy link
Contributor

And here is one ERROR which might be more related to the duplicate DID exception.

2022-05-31 11:43:32,158:140444646516480:ERROR:Rucio:Failed to add replicas for: [{'name': '/store/mc/RunIISummer20UL16RECOAPV/LambdaBTopKPsi2s_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/2830000/139D23C2-7D60-9743-83A0-180BC27A4948.root', 'scope': 'cms', 'bytes': 2445827759, 'state': 'A', 'adler32': '11e77647'}] and block: /LambdaBTopKPsi2s_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/RunIISummer20UL16RECOAPV-106X_mcRun2_asymptotic_preVFP_v8-v3/AODSIM#ef65006a-bf3f-4dfe-bc26-4957dceb0f43. Error: An unknown exception occurred.
Details: ('(cx_Oracle.IntegrityError) ORA-00001: unique constraint (CMS_RUCIO_PROD.DIDS_PK) violated',)

@todor-ivanov
Copy link
Contributor

Repeating the debugging information I've posted in the Jira ticket: https://its.cern.ch/jira/browse/CMSCOMPPR-25873 here as well:

Today I took one of the two workflows Hasan Ozturk was pointing as the latest ones suffering the issue and did the following debugging:

cmsgwms-submit7.fnal.gov
cmsgwms-submit8.fnal.gov
vocms0282.cern.ch
vocms0283.cern.ch 
  • Checking with dasgoclient, all of the files listed by Hasan belong to only two blocks:
/HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
/HighEGJet/Run2017G-09Aug2019_UL2017-v2/AOD#af7390e5-776b-49b9-a810-e368448384b7
  • By parsing the logs of all of the agents I've found, both of those blocks have been produced by agent vocms0283
  • Further more took one of the blocks and parsed the logs searching for the following steps:
    • opening the block
    • inserting into Rucio
    • file insertions into the block
    • closing the block
 cmst1@vocms0283:/data/srv/wmagent/current $ grep -rin /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105 install/wmagentpy3/RucioInjector/ComponentLog
354599:2022-06-03 10:37:22,923:140444646516480:INFO:RucioInjectorPoller:Block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105 inserted into Rucio
354622:2022-06-03 10:39:25,502:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
354633:2022-06-03 10:40:14,308:140444646516480:INFO:RucioInjectorPoller:Block rule created for block: /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105, at: T2_US_Nebraska, with rule id: 666e71aa0e1344a5b13a1c5a66f7ebd9
354708:2022-06-03 11:11:15,937:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
354846:2022-06-03 11:51:15,901:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
354868:2022-06-03 11:56:59,654:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
354976:2022-06-03 12:31:02,459:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355054:2022-06-03 12:54:10,751:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355142:2022-06-03 13:18:28,715:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355264:2022-06-03 13:57:17,740:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355378:2022-06-03 14:35:43,214:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355443:2022-06-03 14:53:51,129:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355508:2022-06-03 15:11:27,012:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355716:2022-06-03 16:10:48,692:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
355784:2022-06-03 16:33:53,366:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356041:2022-06-03 17:39:51,321:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356153:2022-06-03 17:52:37,681:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356290:2022-06-03 18:35:07,894:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356374:2022-06-03 18:58:55,960:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356403:2022-06-03 19:09:27,742:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356427:2022-06-03 19:15:23,533:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356507:2022-06-03 19:38:55,326:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356575:2022-06-03 19:54:09,903:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356648:2022-06-03 20:17:39,122:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 3 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356791:2022-06-03 20:58:39,406:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
356847:2022-06-03 21:16:09,073:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357048:2022-06-03 22:19:26,911:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357108:2022-06-03 22:37:29,622:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357166:2022-06-03 22:58:57,280:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357316:2022-06-03 23:40:18,098:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357399:2022-06-04 00:03:56,876:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357447:2022-06-04 00:16:19,235:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357592:2022-06-04 01:00:58,789:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357648:2022-06-04 01:17:44,846:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357726:2022-06-04 01:36:36,447:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357821:2022-06-04 02:01:28,830:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357892:2022-06-04 02:20:18,870:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
357944:2022-06-04 02:41:21,894:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
358112:2022-06-04 03:18:44,061:140444646516480:WARNING:Rucio:Dids: [{'scope': 'cms', 'name': '/HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105'}] already attached to: /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD
358113:2022-06-04 03:18:44,061:140444646516480:INFO:RucioInjectorPoller:Block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105 inserted into Rucio
358147:2022-06-04 03:20:58,205:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 3 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
358236:2022-06-04 03:44:26,378:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
358328:2022-06-04 04:04:15,027:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
358414:2022-06-04 04:23:48,662:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 2 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
358499:2022-06-04 04:43:34,505:140444646516480:INFO:RucioInjectorPoller:Successfully inserted 1 files on block /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
369537:2022-06-05 10:39:00,076:140444646516480:INFO:RucioInjectorPoller:Closing block: /HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105
  • Counting all successful insertions from the logs above I've got the number expected by DBS: 56. So according to the agent the whole block has been produced and uploaded to Rucio properly.
$ dasgoclient -query "file block=/HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105" |wc -l
56
  • Checking all the missing files at Rucio, gives exactly the result Hasan was telling about - Rucio knows about the files but basically only about their names
 $ for file  in `cat filemismatch.list`; do echo $file:; rucio list-file-replicas cms:$file ;done 
/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/F642CFC6-1ECC-9747-B4E2-A1DFAA8D59E0.root:
+---------+--------+------------+-----------+----------------+
| SCOPE   | NAME   | FILESIZE   | ADLER32   | RSE: REPLICA   |
|---------+--------+------------+-----------+----------------|
+---------+--------+------------+-----------+----------------+
/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/ECA4B03B-C808-1D40-8316-06CFA35B61C8.root:
+---------+--------+------------+-----------+----------------+
| SCOPE   | NAME   | FILESIZE   | ADLER32   | RSE: REPLICA   |
|---------+--------+------------+-----------+----------------|
+---------+--------+------------+-----------+----------------+
/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/D626B272-5233-4646-8B46-84C438E6CC84.root:
+---------+--------+------------+-----------+----------------+
| SCOPE   | NAME   | FILESIZE   | ADLER32   | RSE: REPLICA   |
|---------+--------+------------+-----------+----------------|
+---------+--------+------------+-----------+----------------+
/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/2256BCD5-F101-184F-9A0E-908CED2F2327.root:
+---------+--------+------------+-----------+----------------+
| SCOPE   | NAME   | FILESIZE   | ADLER32   | RSE: REPLICA   |
|---------+--------+------------+-----------+----------------|
+---------+--------+------------+-----------+----------------+
/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/C1CED714-7175-734B-B218-5C5CB8836084.root:
+---------+--------+------------+-----------+----------------+
| SCOPE   | NAME   | FILESIZE   | ADLER32   | RSE: REPLICA   |
|---------+--------+------------+-----------+----------------|
+---------+--------+------------+-----------+----------------+
  • In contrast to a randomly taken file from the same block:
$ rucio list-file-replicas cms:/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/1C9F45DC-A744-DC43-8392-E0CC08A90E24.root
+---------+--------------------------------------------------------------------------------------------------------------+------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SCOPE   | NAME                                                                                                         | FILESIZE   | ADLER32   | RSE: REPLICA                                                                                                                                                                                      |
|---------+--------------------------------------------------------------------------------------------------------------+------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| cms     | /store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/1C9F45DC-A744-DC43-8392-E0CC08A90E24.root | 2.450 GB   | 470975b1  | T2_US_Nebraska: davs://xrootd-local.unl.edu:1094/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/1C9F45DC-A744-DC43-8392-E0CC08A90E24.root                                      |
| cms     | /store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/1C9F45DC-A744-DC43-8392-E0CC08A90E24.root | 2.450 GB   | 470975b1  | T2_UK_London_Brunel: gsiftp://dc2-grid-64.brunel.ac.uk:2811/dpm/brunel.ac.uk/home/cms/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/1C9F45DC-A744-DC43-8392-E0CC08A90E24.root |
+---------+--------------------------------------------------------------------------------------------------------------+------------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 

So we need to understand:

  • Were those files actually successfuly inserted into Rucio at the first place.
  • If so, what has happened with those files in between the moment the request for inserting them into the block has been done and now
  • if not, what was the error we should have been expecting from Rucio and probably we have not handled properly

@todor-ivanov
Copy link
Contributor

So to answer my own previous two questions:

  • Rucio does know about the files and their names because at some point they have been present in the system, but currently the block length in Rucio is 51 files, which is exactly 5 files short from the number expected in DBS (and I did check, the missing files are indeed the ones reported):
In [17]: len(list(rcl.list_content('cms', '/HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105')))
Out[17]: 51
  • Checking the content history tells us exactly when the files has been deleted:
 In [18]: list(rcl.list_content_history('cms', '/HighEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD#ac7dc12c-bcde-4576-90de-276bcf1a2105'))
Out[18]: 
[{'scope': 'cms',
  'name': '/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/2256BCD5-F101-184F-9A0E-908CED2F2327.root',
  'type': 'FILE',
  'bytes': 4106007445,
  'adler32': '7750106d',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 4),
  'created_at': datetime.datetime(2022, 6, 3, 20, 58, 57),
  'updated_at': datetime.datetime(2022, 6, 3, 20, 59, 27)},
 {'scope': 'cms',
  'name': '/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/D626B272-5233-4646-8B46-84C438E6CC84.root',
  'type': 'FILE',
  'bytes': 4241754550,
  'adler32': '3f70a19c',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 6),
  'created_at': datetime.datetime(2022, 6, 4, 1, 20, 58),
  'updated_at': datetime.datetime(2022, 6, 4, 1, 21, 2)},
 {'scope': 'cms',
  'name': '/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/F642CFC6-1ECC-9747-B4E2-A1DFAA8D59E0.root',
  'type': 'FILE',
  'bytes': 2297303340,
  'adler32': '21e29b2e',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 6),
  'created_at': datetime.datetime(2022, 6, 3, 22, 16, 19),
  'updated_at': datetime.datetime(2022, 6, 3, 22, 16, 39)},
 {'scope': 'cms',
  'name': '/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/C1CED714-7175-734B-B218-5C5CB8836084.root',
  'type': 'FILE',
  'bytes': 2624073373,
  'adler32': '5ae5c49c',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 7),
  'created_at': datetime.datetime(2022, 6, 3, 20, 37, 29),
  'updated_at': datetime.datetime(2022, 6, 3, 20, 37, 53)},
 {'scope': 'cms',
  'name': '/store/data/Run2017G/HighEGJet/MINIAOD/09Aug2019_UL2017-v2/2830000/ECA4B03B-C808-1D40-8316-06CFA35B61C8.root',
  'type': 'FILE',
  'bytes': 3321797899,
  'adler32': 'e4456340',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 6),
  'created_at': datetime.datetime(2022, 6, 4, 0, 20, 18),
  'updated_at': datetime.datetime(2022, 6, 4, 0, 20, 30)}]
  • Which for all of them seems to be happening simultaneously 2 or 3 days after their creation, at:
    'deleted_at': datetime.datetime(2022, 6, 6, 6, 55, 6) 

If someone from from Data Management could check the reason for these deletions, this would be of great help. Thanks!

Could that be because of a missing rule protecting them or because the insertion has not completed, or because they were not properly associated with a higher level data container...? The latest seems to be a negative answer, though, since at some point the block/(dataset in Rucio terms) did know about those files?

@ericvaandering
Copy link
Member

I looked into the undertaker, which I think is the thing deleting files. It's only been running since June 7 and only removed one DID:
Removing did cms:/store/mc/RunIISummer20UL17MiniAODv2/ZpToMuMu_M85_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mc2017_realistic_v9-v2/2530000/A9E566C4-79B9-4941-9642-7B0F23469747.root

We have our Rucio meeting now and I will try to understand under what circumstances DIDs get deleted.

@todor-ivanov
Copy link
Contributor

Thanks @ericvaandering

@ericvaandering
Copy link
Member

The undertaker is supposed to delete DIDs which have an expiration date. Any chance files or datasets are being created with those? DIDs should never have expiration dates in CMS unless they are set afterwards to invalidate things. Only rules should have expiration dates.

@todor-ivanov
Copy link
Contributor

Hi @ericvaandering we did double check with @amaltaro and the answer is: No, we do not set did replicas with expiration date, neither for file nor for dataset. And actually if we were having those set for the dataset(blocks in CMS terms) this expiration data would have been affecting also the rest 51 files in the same dataset. There is something different with those 5 that has been deleted, and I cannot figure out what.

@amaltaro
Copy link
Contributor Author

amaltaro commented Jul 7, 2022

@todor-ivanov @ericvaandering have we concluded that there is nothing in WMCore to be fixed on this regard and that actually files get inserted into Rucio, but they (files/replicas) somehow get deleted by something in the Rucio consistency monitor eco-system?

@todor-ivanov
Copy link
Contributor

Hi @amaltaro I do not think there is anything we can do here from the WMCore side. Closing this issue now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants