Dear AUDITOR/HTCondor Collector developers,
we are in the process of setting up one AUDITOR instance per CondorCE, with a shared central DB behind all CondorCE+AUDITOR instances.
We tried to parse all historic LRMS HTCondor jobs on a CondorCE, i.e., all jobs present in a schedd's history. However, only a fraction of the jobs got processed.
To re-run over the full history, we removed the HTCondor state DB so as to force a re-parsing of all job records, from the newest to the oldest.
Subsequently, the htcondor-collector fails reproducibly while trying to insert a record [1] - so we assume that the local HTCondor state DB is not truly "stateless" but depends on the state of the upstream shared AUDITOR DB.
Is there a way to re-parse all job events on a CondorCE and update upstream records?
Since we plan to run the htcondor-collector as a service unit, is there a way to gracefully skip broken job records, i.e., to avoid the collector failing repeatedly on one broken record and instead have it pass over the record (while keeping track of such error cases), so that job accounting does not stall completely?
Installed versions are as in [2], on a RHEL 9.4 installation running kernel 5.14.0-362.24.1.el9_3.x86_64.
Cheers and thanks,
Thomas
[1]
[root@grid-htc-ce04 auditor]# auditor-htcondor-collector -c /etc/auditor/htcondor-collector.yaml -l DEBUG -n grid-htc-ce04.desy.de
2024-06-10 12:26:42,611 - auditor.collectors.htcondor - INFO - Using AUDITOR client at localhost:8000.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO - Starting collector run.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO - Collecting jobs for schedd 'grid-htc-ce04.desy.de'.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG - Using job id (691382, 0).
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG - Querying HTCondor history for 'grid-htc-ce04.desy.de' starting from job (691382, 0).
2024-06-10 12:26:42,614 - auditor.collectors.htcondor - DEBUG - Running command: 'condor_history -backwards -wide -name grid-htc-ce04.desy.de -since 691382.0 -af:, MachineAttrApelSpecs0.HEPscore23 ProcId ClusterId RemoteUserCpu+RemoteSysCpu RemoteWallClockTime MemoryProvisioned x509UserProxyFirstFQAN MaxHosts x509UserProxySubject CpusProvisioned AcctGroup GlobalJobId EnteredCurrentStatus MachineAttrApelSpecs0.HEPSPEC Owner MinHosts LastMatchTime -constraint "JobStatus == 3 || JobStatus == 4"'
2024-06-10 12:26:42,673 - auditor.collectors.htcondor - DEBUG - Generating record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,674 - auditor.collectors.htcondor - WARNING - Could not find meta value for 'voms' for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG - Got amount 1 (<class 'int'>) for component {'name': 'Cores', 'key': 'CpusProvisioned', 'scores': [{'name': 'HEPSPEC', 'key': 'MachineAttrApelSpecs0.HEPSPEC'}, {'name': 'HEPscore23', 'key': 'MachineAttrApelSpecs0.HEPscore23'}]}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG - Got amount 2048 (<class 'int'>) for component {'name': 'Memory', 'key': 'MemoryProvisioned'}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG - Got amount 110.0 (<class 'float'>) for component {'name': 'CPUTime', 'key': 'RemoteUserCpu+RemoteSysCpu'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG - Got amount 240.0 (<class 'float'>) for component {'name': 'Wallclocktime', 'key': 'RemoteWallClockTime'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG - Got amount 1 (<class 'int'>) for component {'name': 'MinHosts', 'key': 'MinHosts'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG - Got amount 1 (<class 'int'>) for component {'name': 'MaxHosts', 'key': 'MaxHosts'}.
2024-06-10 12:26:42,677 - auditor.collectors.htcondor - DEBUG - Generated record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
Traceback (most recent call last):
  File "/usr/local/bin/auditor-htcondor-collector", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/main.py", line 17, in main
    asyncio.run(collector.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 75, in run
    await self._collect(schedd_name, job_id=id)
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 107, in _collect
    await self.client.add(record)
RuntimeError: RECORD_EXISTS
[2]
"org.opencontainers.image.revision": "ed96f63e3ed4408f20337aef0fc0bd027c67960e",
"org.opencontainers.image.source": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.title": "AUDITOR",
"org.opencontainers.image.url": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.version": "edge",
auditor_apel_plugin 0.5.0
auditor-htcondor-collector 0.5.0
python-auditor 0.5.0