Subscriptions not finished in a drained agent (0 jobs) #9568
Comments
So, first, let's look into the files sitting stuck in the WMBS acquired table. The first useful piece of information is a map of the subscriptions to their workflow tasks and filesets. For that, we can use this query, filtering on subscriptions that are not finished:
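The actual query was not preserved in this thread; below is only a sketch of the idea on a simplified in-memory stand-in for WMBS. All table and column names (`wmbs_subscription`, `wmbs_workflow`, `wmbs_fileset`, `finished`, etc.) are assumptions and may not match the real schema exactly:

```python
import sqlite3

# Simplified, illustrative subset of the WMBS schema (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_workflow     (id INTEGER PRIMARY KEY, name TEXT, task TEXT);
CREATE TABLE wmbs_fileset      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, fileset INTEGER,
                                workflow INTEGER, finished INTEGER);
INSERT INTO wmbs_workflow     VALUES (1, 'stuck_workflow', '/stuck_workflow/Task1');
INSERT INTO wmbs_fileset      VALUES (10, '/stuck_workflow/Task1/input');
INSERT INTO wmbs_subscription VALUES (100, 10, 1, 0);  -- finished = 0
""")

# Map every unfinished subscription to its workflow name, task and fileset.
rows = conn.execute("""
    SELECT s.id, w.name, w.task, f.name
      FROM wmbs_subscription s
      JOIN wmbs_workflow w ON w.id = s.workflow
      JOIN wmbs_fileset  f ON f.id = s.fileset
     WHERE s.finished = 0
""").fetchall()
print(rows)
```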
Given that we know the agent is drained and the only files in the acquired table are actually stuck, we can execute the following query to list the file ids, LFNs and their PNN (using the workflow name found in the previous query):
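Again, the original query is not reproduced here; a toy sketch of the same lookup, joining the acquired files of a workflow's subscriptions to their LFN and location (all names are assumptions about the WMBS schema, including the `pnn` column):

```python
import sqlite3

# Illustrative only: table/column names are assumptions, not a verified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_workflow           (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_subscription       (id INTEGER PRIMARY KEY, workflow INTEGER);
CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_file_details       (id INTEGER PRIMARY KEY, lfn TEXT);
CREATE TABLE wmbs_file_location      (fileid INTEGER, pnn TEXT);
INSERT INTO wmbs_workflow           VALUES (1, 'stuck_workflow');
INSERT INTO wmbs_subscription       VALUES (100, 1);
INSERT INTO wmbs_sub_files_acquired VALUES (100, 7);
INSERT INTO wmbs_file_details       VALUES (7, '/store/data/toy/file.root');
INSERT INTO wmbs_file_location      VALUES (7, 'T1_XX_Site_Disk');
""")

# File id, LFN and PNN of every file acquired under the workflow's subscriptions.
rows = conn.execute("""
    SELECT fd.id, fd.lfn, fl.pnn
      FROM wmbs_sub_files_acquired sfa
      JOIN wmbs_subscription s   ON s.id  = sfa.subscription
      JOIN wmbs_workflow w       ON w.id  = s.workflow
      JOIN wmbs_file_details fd  ON fd.id = sfa.fileid
      JOIN wmbs_file_location fl ON fl.fileid = fd.id
     WHERE w.name = ?
""", ("stuck_workflow",)).fetchall()
print(rows)
```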
With these LFNs in hand, we could, for instance, check the ACDC collection to see whether they have been uploaded to be recovered, under:
Again, given that the agent is drained and only those files are stuck in the acquired WMBS table, we can find which job(s) processed them and what their status is:
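The file-to-job lookup can be sketched like this, assuming a job-association table linking job ids to file ids and a separate job-state lookup table (names are assumptions about the WMBS schema):

```python
import sqlite3

# Assumed shape of the job tables; the real WMBS schema may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_job_state (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_job       (id INTEGER PRIMARY KEY, state INTEGER);
CREATE TABLE wmbs_job_assoc (job INTEGER, fileid INTEGER);
INSERT INTO wmbs_job_state VALUES (1, 'executing'), (2, 'cleanout');
INSERT INTO wmbs_job       VALUES (555, 1);
INSERT INTO wmbs_job_assoc VALUES (555, 7), (555, 8);
""")

# Which job(s) are associated with the stuck file ids, and in what state?
stuck_fileids = (7, 8)
rows = conn.execute("""
    SELECT DISTINCT j.id, js.name
      FROM wmbs_job_assoc ja
      JOIN wmbs_job j        ON j.id  = ja.job
      JOIN wmbs_job_state js ON js.id = j.state
     WHERE ja.fileid IN (?, ?)
""", stuck_fileids).fetchall()
print(rows)
```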
And all those files have been processed by the same WMBS job id. Provided that those files have not been uploaded to the ACDC server to be recovered, I think it makes sense to mark that job as
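The target state name is elided above; assuming it was a failed state such as `jobfailed` (which would be consistent with the ErrorHandler then picking the job up), the effect can be sketched on the toy schema. In the real agent this change goes through WMCore's job state machine, not raw SQL, so this is only an illustration:

```python
import sqlite3

# Toy schema; names and the 'jobfailed' state are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_job_state (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmbs_job       (id INTEGER PRIMARY KEY, state INTEGER);
INSERT INTO wmbs_job_state VALUES (1, 'executing'), (2, 'jobfailed');
INSERT INTO wmbs_job       VALUES (555, 1);
""")

# Move the stuck job into a failed state so the ErrorHandler can act on it.
conn.execute("""
    UPDATE wmbs_job
       SET state = (SELECT id FROM wmbs_job_state WHERE name = 'jobfailed')
     WHERE id = ?
""", (555,))

state = conn.execute("""
    SELECT js.name
      FROM wmbs_job j
      JOIN wmbs_job_state js ON js.id = j.state
     WHERE j.id = 555
""").fetchone()[0]
print(state)
```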
And voilà: the ErrorHandler has processed that job and it's now uploaded to the ACDC server:
This might release the other files that were sitting in the WMBS available table. Let's see...
We do not have per-job logging in the components, so it's going to be very hard (likely impossible) to figure out why the input files of that job didn't make it to the WMBS complete/failed table. Anyhow, marking that. The workflow is now in
Impact of the bug
WMAgent on vocms0283 (old 1.2.8 agent)
Describe the bug
The agent is apparently completely drained, with no jobs in Condor and only cleanout jobs in WMBS. Even so, this workflow, pgunnell_Run2018D-v1-SingleMuon-12Nov2019_UL2018_1064p1_200113_212951_2532, has 26 GQE/LQE in the Running state (since the end of January). The draining script tells me that there are 2 subscriptions with files available:
and 1 with files in the acquired table:
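The script's output is not reproduced here, but what it reports — per unfinished subscription, how many files sit in the available vs acquired tables — can be sketched on a toy schema (table and column names are assumptions about WMBS):

```python
import sqlite3

# Toy reproduction of the draining-status check; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_subscription        (id INTEGER PRIMARY KEY, finished INTEGER);
CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_sub_files_acquired  (subscription INTEGER, fileid INTEGER);
INSERT INTO wmbs_subscription        VALUES (100, 0), (101, 0), (102, 0);
INSERT INTO wmbs_sub_files_available VALUES (100, 1), (101, 2);
INSERT INTO wmbs_sub_files_acquired  VALUES (102, 3);
""")

# Per unfinished subscription: count of available and acquired files.
rows = conn.execute("""
    SELECT s.id,
           (SELECT COUNT(*) FROM wmbs_sub_files_available a
             WHERE a.subscription = s.id) AS n_available,
           (SELECT COUNT(*) FROM wmbs_sub_files_acquired q
             WHERE q.subscription = s.id) AS n_acquired
      FROM wmbs_subscription s
     WHERE s.finished = 0
     ORDER BY s.id
""").fetchall()
print(rows)
```

Subscriptions 100 and 101 mimic the two with files available; 102 mimics the one with a file stuck in acquired.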
How to reproduce it
no clue
Expected behavior
Subscriptions should never get stuck :) We can use this issue to keep track of the whole debugging and figure out what the root cause was.
Additional context and error message
none