
Subscriptions not finished in a drained agent (0 jobs) #9568

Closed
amaltaro opened this issue Mar 19, 2020 · 2 comments

amaltaro (Contributor) commented Mar 19, 2020

Impact of the bug
WMAgent on vocms0283 (old 1.2.8 agent)

Describe the bug
The agent is apparently completely drained, with no jobs in condor and only cleanout jobs in WMBS. Even so, this workflow
pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532
has 26 GQE/LQE in the Running state (since the end of January).

The draining script tells me that there are 2 subscriptions with files available:

*** SUBSCRIPTIONS: found 2 files available in WMBS (waiting for job creation):
[{'count(*)': 5, 'subscription': 5996}, {'count(*)': 2389, 'subscription': 5995}]

and 1 with files in the acquired table:

*** SUBSCRIPTIONS: found 1 files acquired in WMBS (waiting for jobs to finish):
[{'count(*)': 5, 'subscription': 5994}]
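For the record, those counts can be reproduced directly in the database; a minimal sketch, assuming the standard WMBS file-status tables (wmbs_sub_files_available and wmbs_sub_files_acquired):

-- count files still waiting for job creation, per subscription
SELECT subscription, COUNT(*) FROM wmbs_sub_files_available GROUP BY subscription;
-- count files acquired by jobs that have not finished yet, per subscription
SELECT subscription, COUNT(*) FROM wmbs_sub_files_acquired GROUP BY subscription;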

How to reproduce it
no clue

Expected behavior
Subscriptions should never get stuck :) We can use this issue to keep track of the whole debugging and figure out what the root cause was.

Additional context and error message
none

amaltaro (Contributor, Author) commented:

So, first, let's look into the files sitting stuck in the WMBS acquired table.

The first useful piece of information is a map of the subscriptions to workflow tasks and filesets. For that, we can use this query, filtering on subscriptions not yet finished:

SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected As wfInj, wmbs_fileset.name AS fsetName,
  wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType 
  FROM wmbs_subscription
    INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
    INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
    INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
  WHERE wmbs_subscription.finished = 0 ORDER BY subId;

response (one record per subscription):

subId=5994  subType=Merge  wfInj=1  fsetOpen=0  subFin=0
  wfName:   /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau
  fsetName: /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/unmerged-SKIMStreamMuTauUSER

subId=5995  subType=LogCollect  wfInj=1  fsetOpen=1  subFin=0
  wfName:   /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/DataProcessingSKIMStreamMuTauMergeLogCollect
  fsetName: /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/merged-logArchive

subId=5996  subType=Cleanup  wfInj=1  fsetOpen=0  subFin=0
  wfName:   /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingCleanupUnmergedSKIMStreamMuTau
  fsetName: /pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/unmerged-SKIMStreamMuTauUSER

Given that we know the agent is drained and that the only files in the acquired table are the stuck ones, we can execute the following query to list the file ids, LFNs and their PNN (using the workflow name found in the previous query):

SELECT wmbs_sub_files_acquired.fileid, wmbs_sub_files_acquired.subscription, wmbs_file_details.lfn AS lfn, wpnn.pnn AS pnn
  FROM wmbs_sub_files_acquired
  INNER JOIN wmbs_subscription ON wmbs_sub_files_acquired.subscription = wmbs_subscription.id
  INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
  INNER JOIN wmbs_file_location ON wmbs_sub_files_acquired.fileid = wmbs_file_location.fileid
  INNER JOIN wmbs_file_details ON wmbs_file_details.id = wmbs_sub_files_acquired.fileid
  INNER JOIN wmbs_pnns wpnn ON wmbs_file_location.pnn = wpnn.id
  WHERE wmbs_workflow.name='pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532' ORDER BY pnn;

response (fileid, subscription, lfn, pnn):
3456843	5994	/store/unmerged/Run2018D/SingleMuon/USER/MuTau-12Nov2019_UL2018-v3/130005/BC5FFDAE-27A7-7744-B2B0-15CA1AAB4DCF.root	T1_US_FNAL_Disk
3456827	5994	/store/unmerged/Run2018D/SingleMuon/USER/MuTau-12Nov2019_UL2018-v3/130005/D0A464D5-E556-8546-A75D-C2141A390005.root	T1_US_FNAL_Disk
3337968	5994	/store/unmerged/Run2018D/SingleMuon/USER/MuTau-12Nov2019_UL2018-v3/130005/98E5D3E9-34FE-E843-B197-46BEF04E9A50.root	T1_US_FNAL_Disk
3412119	5994	/store/unmerged/Run2018D/SingleMuon/USER/MuTau-12Nov2019_UL2018-v3/130005/A6DAF490-5EEE-FD44-87F0-B58C09B4C57C.root	T1_US_FNAL_Disk
3456859	5994	/store/unmerged/Run2018D/SingleMuon/USER/MuTau-12Nov2019_UL2018-v3/130005/B33DF2B8-E646-B14E-89A4-0FB1A07EBFC9.root	T1_US_FNAL_Disk

With these LFNs in hand, we can, for instance, check the ACDC collection to see whether they have already been uploaded to be recovered, under:
https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532%22&include_docs=true&reduce=false
and they have not!

Again, given that the agent is drained and only those files are stuck in the WMBS acquired table, we can find which job(s) processed them and what their status is:

SELECT wsfa.fileid, wja.job, wj.jobgroup, wj.state, wj.outcome, wj.state_time, wj.cache_dir
  FROM wmbs_sub_files_acquired wsfa
  INNER JOIN wmbs_job_assoc wja ON wsfa.fileid = wja.fileid
  INNER JOIN wmbs_job wj ON wj.id = wja.job;

response (fileid, job, jobgroup, state, outcome, state_time, cache_dir):
3337968	1271395	146408	4	0	1579894326	/data/srv/wmagent/v1.2.8/install/wmagent/JobCreator/JobCache/pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/JobCollection_146408_0/job_1271395
3412119	1271395	146408	4	0	1579894326	/data/srv/wmagent/v1.2.8/install/wmagent/JobCreator/JobCache/pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/JobCollection_146408_0/job_1271395
3456827	1271395	146408	4	0	1579894326	/data/srv/wmagent/v1.2.8/install/wmagent/JobCreator/JobCache/pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/JobCollection_146408_0/job_1271395
3456843	1271395	146408	4	0	1579894326	/data/srv/wmagent/v1.2.8/install/wmagent/JobCreator/JobCache/pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/JobCollection_146408_0/job_1271395
3456859	1271395	146408	4	0	1579894326	/data/srv/wmagent/v1.2.8/install/wmagent/JobCreator/JobCache/pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532/DataProcessing/DataProcessingMergeSKIMStreamMuTau/JobCollection_146408_0/job_1271395

All those files have been processed by the same job, WMBS id 1271395, which is in the cleanout state (state=4) with outcome=0 (so, it failed).
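As a sanity check, the numeric state codes can be translated back to names via the wmbs_job_state lookup table; a minimal sketch, which just confirms that state=4 is indeed cleanout (as assumed above) and tells us which id the complete state maps to before running the UPDATE below:

-- map wmbs_job.state ids to human-readable state names
SELECT id, name FROM wmbs_job_state ORDER BY id;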

Given that those files have not been uploaded to the ACDC server to be recovered, I think it makes sense to mark that job as complete in wmbs_job, such that the components can go through it again and upload that failure to the ACDC server.

UPDATE wmbs_job SET state=(select id from wmbs_job_state where name='complete') WHERE id=1271395;
commit;

and voilà: the ErrorHandler has processed that job and it has now been uploaded to the ACDC server:

2020-03-19 15:56:42,429:140606120216320:INFO:ErrorHandlerPoller:Agent is in speed drain mode. Not retrying any jobs.
2020-03-19 15:56:42,435:140606120216320:INFO:ErrorHandlerPoller:Found 0 failed jobs in state createfailed
2020-03-19 15:56:42,437:140606120216320:INFO:ErrorHandlerPoller:Found 0 failed jobs in state submitfailed
2020-03-19 15:56:43,037:140606120216320:INFO:ErrorHandlerPoller:Found 1 failed jobs in state jobfailed
2020-03-19 15:56:43,052:140606120216320:INFO:ErrorHandlerPoller:About to process 1 failures
2020-03-19 15:56:43,052:140606120216320:INFO:ErrorHandlerPoller:Processing retries for 1 failed jobs of type jobfailed
2020-03-19 15:56:43,412:140606120216320:INFO:ErrorHandlerPoller:Found 1 jobs done with all retries
2020-03-19 15:56:43,416:140606120216320:INFO:ErrorHandlerPoller:About to process 1 retry done jobs
2020-03-19 15:56:43,945:140606120216320:INFO:ErrorHandlerPoller:Starting to build ACDC with 1 jobs
2020-03-19 15:56:43,945:140606120216320:INFO:ErrorHandlerPoller:This operation will take some time...

This might release the other files that were sitting in the WMBS available table. Let's see...

amaltaro (Contributor, Author) commented:

We do not have per-job logging in the components, so it's going to be very hard (likely impossible) to figure out why the input files of that job never made it to the WMBS complete/failed tables.

Anyhow, marking that cleanout job as complete made it go through the job finalization cycle in the agent (including moving those files from the WMBS acquired table to the failed table). That also triggered the remaining final jobs for the subscriptions, followed by subscription and LQE completion.
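To double-check that the finalization cycle really moved the file associations, a quick sketch (assuming the standard wmbs_sub_files_acquired and wmbs_sub_files_failed tables) is to count the entries for that subscription; the acquired count should now be zero:

-- acquired entries for subscription 5994 should be gone...
SELECT COUNT(*) FROM wmbs_sub_files_acquired WHERE subscription = 5994;
-- ...and the same files should now appear in the failed table
SELECT COUNT(*) FROM wmbs_sub_files_failed WHERE subscription = 5994;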

The workflow is now in completed status and the agent (vocms0283) is completely drained.
I'm closing this ticket.
