
Troubleshooting


WMAgent troubleshooting

Updating the DN mapping on the Rucio server

Whenever we change the DN used by the agents and/or by central services, we need to update the Rucio account mapping on the Rucio server side (very likely on both Rucio Int and Rucio Prod). The following instructions show how to add a new DN identity under a given Rucio account ID. Note that this needs to be executed either with the same Rucio account, or with an admin Rucio account.

  1. Depending on where you are running these steps, you might have to allow openssl to use older versions:
export CRYPTOGRAPHY_ALLOW_OPENSSL_102=1
  2. Now open a Python interpreter and import the following libraries (either on a WMAgent box or inside a WM container):
import os
from rucio.client import Client
from pprint import pprint
  3. Set up the certificate credentials to be used and instantiate the Rucio client object (NOTE: the command below needs to be updated accordingly):
creds = {"client_cert": os.getenv("X509_USER_CERT", "Unknown"),
         "client_key": os.getenv("X509_USER_KEY", "Unknown")}

client = Client(rucio_host='http://cms-rucio-int.cern.ch',
                auth_host='https://cms-rucio-auth-int.cern.ch',
                account="wma_test",
                ca_cert=False,
                auth_type="x509",
                timeout=30,
                creds=creds)
  4. Add the new DN to the Rucio account (NOTE: the command below needs to be updated accordingly):
client.add_identity("wma_test", "/DC=org/ NEW DN BLAH BLAH", "X509", "cms-service-production-admins@cern.ch")
  5. You can check which DNs are currently mapped to a given Rucio account with this command (see also the sanity-check sketch after this list):
pprint(list(client.list_identities("wma_test")))
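
Before adding the new identity, it is worth double-checking that the client has authenticated with the expected account. A minimal sketch, reusing the client object from step 3 (whoami() is part of the standard Rucio client API):

# Sanity check: confirm which Rucio account the current credentials are mapped to.
# Reuses the `client` object instantiated in step 3 above.
pprint(client.whoami())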

WorkQueue is not acquiring a request: the request is in acquired status but not moving to running-open status

  1. Check the WorkQueueManager component log.
  2. If the error message shows a JSON parse error, the workqueue view is corrupted, e.g.:
    ERROR:WorkQueueManagerWMBSFileFeeder:Error in wmbs inject loop: unterminated array starting at position 0:
    
    1. Shut down the agent and the CouchDB server.
    2. Remove the view from /data1/database/.workqueue_design.
    3. Start the CouchDB server and rebuild the view by triggering one of the view queries:
      curl http://localhost:5984/workqueue/_design/WorkQueue/_view/availableByPriority
      
    4. Once the view rebuild has finished (the sketch below shows a way to check), start the agent.

Note, however, that if this issue is happening to all the agents, then it is probably an issue with the central workqueue.
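
To tell whether the view rebuild from step 3 has finished, one can poll CouchDB's standard _active_tasks endpoint until no indexer task is listed. A minimal sketch, assuming CouchDB listens locally on port 5984 without authentication (as in the curl command above):

# Poll the local CouchDB _active_tasks endpoint until no view indexer task is left.
# Assumes CouchDB listens on localhost:5984 without authentication, as in the curl command above.
import json
import time
from urllib.request import urlopen

while True:
    with urlopen("http://localhost:5984/_active_tasks") as resp:
        tasks = json.loads(resp.read().decode("utf-8"))
    indexers = [t for t in tasks if t.get("type") == "indexer"]
    if not indexers:
        print("No indexer tasks left - view rebuild is done")
        break
    for task in indexers:
        print("indexing %s: %s%% done" % (task.get("design_document"), task.get("progress")))
    time.sleep(30)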

Datasets were produced with None for either AcquisitionEra or ProcessingString

  1. In these cases, the blocks and files will keep bugging DBS3Upload because they do not pass the Lexicon validation in DBS. What we need to do is basically close those blocks and mark them (and their files) as injected in the dbsbuffer tables:
  • Shut down PhEDExInjector and DBS3Upload.
  • Gather a list of workflow names and their bad output dataset names:
    SELECT DISTINCT dbsbuffer_workflow.name,
       dbsbuffer_dataset.path
       FROM dbsbuffer_dataset
       INNER JOIN dbsbuffer_dataset_subscription ON
         dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
       INNER JOIN dbsbuffer_block ON
         dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
       INNER JOIN dbsbuffer_file ON
         dbsbuffer_file.block_id = dbsbuffer_block.id
       INNER JOIN dbsbuffer_workflow ON
         dbsbuffer_workflow.id = dbsbuffer_file.workflow
       WHERE dbsbuffer_block.blockname LIKE '/%/None-%'
       AND dbsbuffer_block.status != 'Closed';
  • Find the blocks and files that need manual intervention (just to keep a record):
    SELECT blockname FROM dbsbuffer_block WHERE blockname LIKE '/%/None-%' AND status!='Closed';
    SELECT lfn FROM dbsbuffer_file WHERE lfn LIKE '/store/%/None/%' AND status!='InDBS';
  • Close and mark them as injected
    UPDATE dbsbuffer_block SET status='Closed' WHERE blockname LIKE '/%/None-%' AND status!='Closed';
    UPDATE dbsbuffer_file SET status='InDBS', in_phedex='1' WHERE lfn LIKE '/store/%/None/%' AND status!='InDBS';
  • Post an elog entry to the workflow team.
  2. If there are children of these None samples (usually correctly named), they will also fail DBS injection because their parent information is missing. Those will also have to be marked as injected; we had better perform the following procedure for them (a parameterized sketch follows this list):
  • Find the exact block name in the DBS3Upload logs

  • Get a list of files that belong to that block (replace BLOCKNAME in the query below):

    SELECT lfn FROM dbsbuffer_file 
      WHERE block_id=(SELECT id FROM dbsbuffer_block WHERE blockname='BLOCKNAME' AND status!='Closed')
      AND status!='InDBS';
    
  • Then we mark all its files as injected in DBS and PhEDEx (replace BLOCKNAME in the query below):

    UPDATE dbsbuffer_file SET status='InDBS', in_phedex='1'
      WHERE block_id=(SELECT id FROM dbsbuffer_block WHERE blockname='BLOCKNAME' AND status!='Closed')
      AND status!='InDBS';
    
  • Finally, we mark the same block as Closed (replace BLOCKNAME in the query below):

    UPDATE dbsbuffer_block SET status='Closed' WHERE blockname='BLOCKNAME' AND status!='Closed';
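
Since the same BLOCKNAME has to be substituted in several statements, it may be less error-prone to run them from a short script with bound parameters. A minimal sketch, assuming a MySQL-backed agent and the pymysql library; the connection values and the block name are placeholders to be adapted:

# Close one block and mark its files as injected, binding the block name as a query parameter.
# Assumes a MySQL-backed WMAgent and the pymysql library; connection values are placeholders.
import pymysql

BLOCKNAME = "/Primary/AcqEra-ProcStr-v1/TIER#some-uuid"  # hypothetical example, use the real block name

conn = pymysql.connect(host="localhost", user="wmagent", password="***", database="wmagent")
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE dbsbuffer_file SET status='InDBS', in_phedex='1' "
            "WHERE block_id=(SELECT id FROM dbsbuffer_block WHERE blockname=%s AND status!='Closed') "
            "AND status!='InDBS'", (BLOCKNAME,))
        cur.execute(
            "UPDATE dbsbuffer_block SET status='Closed' WHERE blockname=%s AND status!='Closed'",
            (BLOCKNAME,))
    conn.commit()
finally:
    conn.close()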
    

JobAccountant crashes with a duplicate entry error (this seems to be caused by a UUID collision)

1. Use Alan's script to find the FWJR which contains the duplicate file:
 https://github.com/amaltaro/ProductionTools/blob/master/removeDupJobAccountant.py
cmst1@vocms0304: curl https://raw.githubusercontent.com/amaltaro/ProductionTools/master/removeDupJobAccountant.py > ~/removeDupJobAccountant.py
cmst1@vocms0304: source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
cmst1@vocms0304: cd /data/srv/wmagent/current

cmst1@vocms0304:/data/srv/wmagent/current $ python ~/removeDupJobAccountant.py 
Found 300 pickle files to open
with a total of 200 output files
Retrieved 764442 lfns from wmbs_file_details

Found 1 duplicate files: ['/store/unmerged/RunIISummer15GSBackfill/InclusiveBtoJpsitoMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8-evtgen/GEN-SIM/BACKFILL-v4/90114/B205B040-DA76-E611-96E1-10983627C3C1.root']
The bad pkl files are:
/data/srv/wmagent/v1.0.19.patch1/install/wmagent/JobCreator/JobCache/vlimant_BPH-RunIISummer15GS-Backfill-00030_00212_v0__160907_112626_808/Production/JobCollection_74138_0/job_750147/Report.0.pkl
Remove them, restart the component and be happy!

2. Removing that file will fail the job:
rm /data/srv/wmagent/v1.0.19.patch1/install/wmagent/JobCreator/JobCache/vlimant_BPH-RunIISummer15GS-Backfill-00030_00212_v0__160907_112626_808/Production/JobCollection_74138_0/job_750147/Report.0.pkl
3. Restart JobAccountant.
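
Before deleting a Report.0.pkl, one can double-check that it really references the duplicate LFN reported by the script. Since the LFN is stored verbatim inside the pickle, a plain byte search is enough; a minimal sketch, where the path and LFN are placeholders to be taken from the script output above:

# Confirm that a Report pickle really references the duplicate LFN before deleting it.
# badLfn and pklPath are placeholders: take them from the removeDupJobAccountant.py output above.
badLfn = "/store/unmerged/.../B205B040-DA76-E611-96E1-10983627C3C1.root"
pklPath = ".../JobCollection_74138_0/job_750147/Report.0.pkl"

with open(pklPath, "rb") as pklFile:
    content = pklFile.read()

# Pickled strings keep their characters verbatim, so a raw byte search is sufficient.
print(badLfn.encode("utf-8") in content)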

Error message example

ERROR:BaseWorkerThread:Error in event loop (2): <WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller instance at 0x26d4ab8> AccountantWorkerException Message: Error while adding files to WMBS! (IntegrityError) (1062, "Duplicate entry '/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14' for key 'lfn'") 'INSERT INTO wmbs_file_details (lfn, filesize, events,\n
first_event, merged)\n VALUES (%s, %s, %s, %s,\n %s)' [('/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14SingleElectronPt35Extended__UPG2023SHNoTaper_140615_171930_4391/SingleElectronPt35Extended_Extended2023SHCalNoTaper_GenSimFull/SingleElectronPt35Extended_Extended2023SHCalNoTaper_GenSimFullMergeFEVTDEBUGoutput/DigiFull_Extended2023SHCalNoTaper/DigiFull_Extended2023SHCalNoTaperMergeFEVTDEBUGHLToutput/RecoFull_Extended2023SHCalNoTaper/RecoFull_Extended2023SHCalNoTaperMergeDQMoutput/RecoFull_Extended2023SHCalNoTaperMergeDQMoutputEndOfRunDQMHarvestMerged/0000/1/0e6169c8-f542-11e3-b6db-003048c9c3fe-EndOfRun-Harvest-1-1-logArchive.tar.gz', 0, 0, 0, 0), ('/store/unmerged/CMSSW_6_2_0_SLHC14/RelValSingleElectronPt1000/GEN-SIM-RECO/DES23_62_V1_UPG2023SHNoTaper-v1/00000/C4C0AC11-46F5-E311-8999-0025905A60A6.root', 36213656, 50, 0, 0), ('/store/unmerged/CMSSW_6_2_0_SLHC14/RelValSingleElectronPt1000/DQM/DES23_62_V1_UPG2023SHNoTaper-v1/00000/C836D311-46F5-E311-8999-0025905A60A6.root', 6334400, 50, 0, 0), ('/store/unmerged/logs/prod/2014/6/16/boudoul_RVCMSSW_6_2_0_SLHC14SingleElectronPt1000__UPG2023SHNoTaper_140615_172119_1332/SingleElectronPt1000_Extended2023SHCalNoTaper_GenSimFull/SingleElectronPt1000_Extended2023SHCalNoTaper_GenSimFullMergeFEVTDEBUGoutput/DigiFull_Extended2023SHCalNoTaper/DigiFull_Extended2023SHCalNoTaperMergeFEVTDEBUGHLToutput/RecoFull_Extended2023SHCalNoTaper/0000/0/9be962e8-f545-11e3-b6db-003048c9c3fe-7-0-logArchive.tar.gz', 0, 0, 0, 0)] ModuleName : WMComponent.JobAccountant.AccountantWorker MethodName : handleWMBSFiles ClassInstance : None FileName : /data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMComponent/JobAccountant/AccountantWorker.py ClassName : None LineNumber : 822 ErrorNr : 0

Traceback: File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 794, in handleWMBSFiles transaction = self.existingTransaction())

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/WMBS/MySQL/Files/Add.py", line 40, in execute conn = conn, transaction = transaction)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 167, in processData returnCursor=returnCursor))

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/MySQLCore.py", line 140, in executemanybinds returnCursor)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/cms/wmagent/0.9.91/lib/python2.6/site-packages/WMCore/Database/DBCore.py", line 114, in executemanybinds result = connection.execute(s, b)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 824, in execute return Connection.executors[c](self, object, multiparams, params)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 888, in _execute_text return self.__execute_context(context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 894, in __execute_context self._cursor_executemany(context.cursor, context.statement, context.parameters, context=context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 960, in _cursor_executemany self._handle_dbapi_exception(e, statement, parameters, cursor, context)

File "/data/srv/wmagent/v0.9.91/sw/slc5_amd64_gcc461/external/py2-sqlalchemy/0.5.2-comp7/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)

Jobs from a TaskChain workflow are running at a different site which is missing from the SiteWhiteList in the workflow description

Sometimes jobs from some tasks of a TaskChain workflow may end up at a site which is not listed in the SiteWhiteList. The reasons for this may be many; one in particular is described below:

  • The output file/block from a previous task (which is the input for the current task) has been staged to a site considered in close proximity to the running one, even though that site is not in the list of possible running sites. In this case the following task will be redirected, and the consecutive jobs will be executed at the new site where the data has been staged (regardless of the fact that it is missing from the SiteWhiteList).
    • This may lead to situations where the whole TaskChain gets stuck because of only a few jobs that do not progress and are unable to negotiate resources at the new destination.
    • Typical example - the connection between T1_IT_CNAF and T2_IT_Legnaro: a workflow targeting T1_IT_CNAF and requiring ARM resources may stage a block from an initial task at T2_IT_Legnaro and never get resources to run there, since T2_IT_Legnaro has no ARM nodes:
    • The workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021
    • RucioInjector logs:
...
2023-03-20 16:03:14,162:140567001872128:INFO:RucioInjectorPoller:Block rule created for block: /RelValTTbar_14TeV/CMSSW_13_0_0-130X_mcRun4_realistic_v2_CNAFARM_2026D95noPU-v1/GEN-SIM-DIGI-RAW#70672412-b5f0-4394-992b-5dd70b428b43, at: T2_IT_Legnaro, with rule id: 8b75502c8e9a472190f7df7b722b59c2
...
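
To inspect the rule quoted in the log above, one can reuse a Rucio client configured as in the first section of this page, pointed at the appropriate Rucio instance. A minimal sketch, assuming `client` has been instantiated as shown there; the rule ID is the one from the RucioInjector log:

# Inspect the block-level rule mentioned in the RucioInjector log above.
# Assumes a Rucio `client` instantiated as in the "Updating the DN mapping" section,
# pointing at the production Rucio instance in this case.
from pprint import pprint

ruleId = "8b75502c8e9a472190f7df7b722b59c2"  # rule id taken from the log snippet above
pprint(client.get_replication_rule(ruleId))  # shows the RSE expression, state, locks, etc.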

The whole procedure for debugging such a case:

  • Log in to the agent and fetch a job from the TaskChain sitting idle in condor:
$ condor_q -constraint 'JobStatus==1 && WMAgent_RequestName=="pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021"'


-- Schedd: vocms0258.cern.ch : <188.185.24.42:4080?... @ 03/22/23 18:30:51
 ID       OWNER            SUBMITTED     RUN_TIME ST PRI    SIZE CMD
10895.0   cmst1           3/19 22:56   0+00:00:00 I  500000  0.0 submit_py3.sh pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021-Sandbox.tar.bz2 296076 0
10895.1   cmst1           3/19 22:56   0+00:00:00 I  500000  0.0 submit_py3.sh pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021-Sandbox.tar.bz2 296077 0
10902.2   cmst1           3/19 23:30   0+00:00:00 I  500000  0.0 submit_py3.sh pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021-Sandbox.tar.bz2 296083 0

Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2497 jobs; 0 completed, 0 removed, 2250 idle, 247 running, 0 held, 0 suspended
  • Get the running directory for the idle job:
$ condor_q 10895.0 -af iwd 
/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/GenSimFull/GenSimFullMergeFEVTDEBUGoutput/DigiTrigger_2026D95/DigiTrigger_2026D95MergeFEVTDEBUGHLToutput/RecoGlobal_2026D95/JobCollection_50600_0/job_296076
  • Inspect the WMBS job description by opening the job.pkl file:
$ source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
$ python3
>>> import pickle
>>> from pprint import pprint

>>> fName = "/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/GenSimFull/GenSimFullMergeFEVTDEBUGoutput/DigiTrigger_2026D95/DigiTrigger_2026D95MergeFEVTDEBUGHLToutput/RecoGlobal_2026D95/JobCollection_50600_0/job_296076/job.pkl"

>>> with open(fName, 'rb') as pFile:
...     pObj = pickle.load(pFile)


>>> pprint(pObj) 
{'agentName': 'vocms0258.cern.ch',
 'agentNumber': 0,
 'allowOpportunistic': False,
 'attachments': {},
 'cache_dir': '/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/GenSimFull/GenSimFullMergeFEVTDEBUGoutput/DigiTrigger_2026D95/DigiTrigger_2026D95MergeFEVTDEBUGHLToutput/RecoGlobal_2026D95/JobCollection_50600_0/job_296076',
 'couch_record': None,
 'counter': 2,
 'custom': {},
 'estimatedDiskUsage': 1000000,
 'estimatedJobTime': 3600,
 'estimatedMemoryUsage': 8000.0,
 'fwjr': None,
 'fwjr_path': None,
 'gpuRequirements': {},
 'id': 296076,
 'inputDataset': None,
 'inputDatasetLocations': ['T2_IT_Legnaro'],
 'inputPileup': [],
 'input_files': [{'avgEvtsPerLumi': 100,
                  'checksums': {},
                  'events': 1000,
                  'first_event': 0,
                  'id': 580403,
                  'last_event': 0,
                  'lfn': '/store/relval/CMSSW_13_0_0/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/130X_mcRun4_realistic_v2_CNAFARM_2026D95noPU-v1/00000/618eb243-1176-4588-baa0-0efb175ce0ec.root',
                  'locations': set(),
                  'lowestRun': <WMCore.DataStructs.Run.Run object at 0x7f6d8da0ed30>,
                  'lumiCount': 10,
                  'merged': True,
                  'minrun': 1,
                  'newlocations': {'T2_IT_Legnaro'},
                  'parents': set(),
                  'runs': [<WMCore.DataStructs.Run.Run object at 0x7f6d8da0ed30>],
                  'size': 5645788248}],
 'jobType': 'Processing',
 'jobgroup': 50600,
 'location': None,
 'mask': {'FirstEvent': None,
          'FirstLumi': None,
          'FirstRun': None,
          'LastEvent': None,
          'LastLumi': None,
          'LastRun': None,
          'inclusivemask': True,
          'runAndLumis': {1: [[32, 34], [36, 42]]}},
 'name': '10063c6d-2d75-4e5e-9eb8-2f36a406d339-0',
 'numberOfCores': 4,
 'outcome': 'failure',
 'owner': 'pdmvserv',
 'ownerDN': '/DC=ch/DC=cern/OU=Organic '
            'Units/OU=Users/CN=pdmvserv/CN=860734/CN=Robot: PdmV Service',
 'ownerGroup': 'unknown',
 'ownerRole': 'unknown',
 'possiblePSN': {'T2_IT_Legnaro'},
 'requestType': 'TaskChain',
 'requiresGPU': 'forbidden',
 'retry_count': 0,
 'sandbox': '/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/WorkQueueManager/cache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021-Sandbox.tar.bz2',
 'scramArch': ['el8_aarch64_gcc11'],
 'spec': '/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/WorkQueueManager/cache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/WMSandbox/WMWorkload.pkl',
 'state': 'new',
 'state_time': 1679262876,
 'swVersion': ['CMSSW_13_0_0'],
 'task': '/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021/GenSimFull/GenSimFullMergeFEVTDEBUGoutput/DigiTrigger_2026D95/DigiTrigger_2026D95MergeFEVTDEBUGHLToutput/RecoGlobal_2026D95',
 'taskType': 'relval',
 'workflow': 'pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021'}

From the job dump above it becomes immediately obvious that 'inputDatasetLocations' is ['T2_IT_Legnaro'], and hence 'possiblePSN' is {'T2_IT_Legnaro'}.
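
When many jobs are affected, inspecting pickles one by one gets tedious. A minimal sketch that scans all job.pkl files of a workflow under the JobCreator cache and prints where each job can run; the JobCache path is the one from the example above and would need to be adapted, and the WMAgent environment must have been sourced since the pickles contain WMCore objects:

# Scan all job.pkl files of one workflow in the JobCreator cache and report where each job can run.
# Assumes the WMAgent init.sh has been sourced (the pickles contain WMCore objects) and that
# jobCacheDir points to the workflow's JobCache directory (path taken from the example above).
import glob
import os
import pickle

jobCacheDir = "/data/srv/wmagent/v2.1.6.1/install/wmagentpy3/JobCreator/JobCache/pdmvserv_RVCMSSW_13_0_0TTbar_14TeV__CNAFARM_2026D95noPU_230313_103242_2021"

for pklPath in glob.glob(os.path.join(jobCacheDir, "**", "job.pkl"), recursive=True):
    with open(pklPath, "rb") as pFile:
        job = pickle.load(pFile)
    print("job %s  possiblePSN=%s  inputDatasetLocations=%s"
          % (job["id"], sorted(job["possiblePSN"]), job["inputDatasetLocations"]))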

Unpickling a failed job PSet file from log reports

Sometimes, like in the cases described in #9043:issuecomment-1625662166, one may need to look at the job PSet file. In this case the information we are about to search for is the lumi list on which those failed jobs were working. The procedure is as follows:

  • First, if possible, find the CMSSW release and arch for these jobs. For Production workflows that should come with the request description in ReqMgr. For RAW data that may not be trivial.
  • Second, set up a CMSSW area:
ssh lxplus.cern.ch
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc7_amd64_gcc11
cmsrel CMSSW_12_3_0
cd CMSSW_12_3_0/src/
cmsenv
  • Then one must fetch the log archive:
cp -v /eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco/vocms0500.cern.ch-22748-3-log.tar.gz .
tar -xzvf vocms0500.cern.ch-22748-3-log.tar.gz
  • Open a Python console, unpickle the PSet.pkl file and write it to a .py file:
ipython3

In [1]: import pickle

In [2]: with open('job/WMTaskSpace/cmsRun1/PSet.pkl', 'rb') as psetPkl:
   ...:     pset = pickle.load(psetPkl)

In [3]: with open('t0PSet.py', 'w') as f:
   ...:     f.write(pset.dumpPython())
  • And finally read the newly written file:
cat t0PSet.py 

import FWCore.ParameterSet.Config as cms
from HeterogeneousCore.CUDACore.SwitchProducerCUDA import SwitchProducerCUDA

process = cms.Process("RECO")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/8b5bc0d2-2a06-485c-91c3-046bc5a4c995.root'),
    lumisToProcess = cms.untracked.VLuminosityBlockRange("349840:575-349840:575"),
    overrideCatalog = cms.untracked.string('trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos')
)
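
Since the information of interest here is the lumi list, one can also read it straight from the unpickled process object instead of dumping the whole configuration. A minimal sketch, run in the same CMSSW environment and Python session as above; the attribute names follow the standard cms.Source parameters visible in the dump:

# Read the lumi mask directly from the unpickled PSet, without dumping the full configuration.
# Run in the same CMSSW environment and Python session where `pset` was loaded above.
print(pset.source.fileNames)        # input file(s) of the failed job
print(pset.source.lumisToProcess)   # lumi range(s) the job was working on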
