htcondor collector crashes with RECORD_EXISTS RuntimeError #846

thdesy · 2024-06-10T10:41:06Z

Dear AUDITOR/HTCondor Collector developers,

We are in the process of setting up one AUDITOR instance per CondorCE, with a central DB shared between all CondorCE+AUDITOR instances.

We tried to parse all historic LRMS HTCondor jobs on a CondorCE, i.e., all jobs present in a schedd's history. However, only a fraction of the jobs got processed.
To re-run over the whole history, we removed the HTCondor state DB in order to force a re-parsing of all job records, from the newest to the oldest.
Since then, the htcondor-collector fails reproducibly while trying to insert a record [1]. We therefore assume that the collector's state is not held in the local HTCondor state DB alone, but also depends on the contents of the upstream shared AUDITOR DB.

Is there a way to re-parse all job events on a CondorCE and update upstream records?

Since we plan to run the htcondor collector as a service unit, is there a way to skip broken job records gracefully? That is, instead of failing repeatedly at the same broken record, the collector would move on (while keeping track of such error cases), so that job accounting does not stall completely.
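
To illustrate the behaviour asked about, here is a minimal sketch of a tolerant insert loop. It is not the collector's actual code: `FakeClient`, `add_tolerant`, and the dictionary-based record layout are hypothetical stand-ins, and only the `RuntimeError: RECORD_EXISTS` raised by `client.add(record)` is taken from the traceback in [1].

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for the AUDITOR client.

    It only mimics the one behaviour seen in the traceback:
    add() raises RuntimeError("RECORD_EXISTS") for a known record id.
    """

    def __init__(self):
        self.records = {}

    async def add(self, record):
        if record["record_id"] in self.records:
            raise RuntimeError("RECORD_EXISTS")
        self.records[record["record_id"]] = record


async def add_tolerant(client, records):
    """Add records one by one and keep going when a single record fails."""
    added, skipped, failed = [], [], []
    for record in records:
        record_id = record["record_id"]
        try:
            await client.add(record)
            added.append(record_id)
        except RuntimeError as err:
            if "RECORD_EXISTS" in str(err):
                # Record is already upstream: note it and move on.
                skipped.append(record_id)
            else:
                # Any other failure is tracked for later inspection.
                failed.append((record_id, str(err)))
    return added, skipped, failed


def demo():
    """Run the loop against the fake client with one duplicate record."""
    client = FakeClient()
    records = [{"record_id": "a"}, {"record_id": "b"}, {"record_id": "a"}]
    return asyncio.run(add_tolerant(client, records))
```

Calling `demo()` returns three lists: the record ids that were added, those skipped because they already exist, and those that failed for other reasons. Whether an existing record should be skipped or updated upstream is exactly the open question above.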

The installed versions are listed in [2]; the host is a RHEL 9.4 installation running kernel 5.14.0-362.24.1.el9_3.x86_64.

Cheers and thanks,
Thomas

[1]

[root@grid-htc-ce04 auditor]# auditor-htcondor-collector -c  /etc/auditor/htcondor-collector.yaml  -l  DEBUG -n grid-htc-ce04.desy.de
2024-06-10 12:26:42,611 - auditor.collectors.htcondor - INFO     - Using AUDITOR client at localhost:8000.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO     - Starting collector run.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO     - Collecting jobs for schedd 'grid-htc-ce04.desy.de'.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG    - Using job id (691382, 0).
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG    - Querying HTCondor history for 'grid-htc-ce04.desy.de' starting from job (691382, 0).
2024-06-10 12:26:42,614 - auditor.collectors.htcondor - DEBUG    - Running command: 'condor_history -backwards -wide -name grid-htc-ce04.desy.de -since 691382.0 -af:, MachineAttrApelSpecs0.HEPscore23 ProcId ClusterId RemoteUserCpu+RemoteSysCpu RemoteWallClockTime MemoryProvisioned x509UserProxyFirstFQAN MaxHosts x509UserProxySubject CpusProvisioned AcctGroup GlobalJobId EnteredCurrentStatus MachineAttrApelSpecs0.HEPSPEC Owner MinHosts LastMatchTime -constraint "JobStatus == 3 || JobStatus == 4"'
2024-06-10 12:26:42,673 - auditor.collectors.htcondor - DEBUG    - Generating record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,674 - auditor.collectors.htcondor - WARNING  - Could not find meta value for 'voms' for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'Cores', 'key': 'CpusProvisioned', 'scores': [{'name': 'HEPSPEC', 'key': 'MachineAttrApelSpecs0.HEPSPEC'}, {'name': 'HEPscore23', 'key': 'MachineAttrApelSpecs0.HEPscore23'}]}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 2048 (<class 'int'>) for component {'name': 'Memory', 'key': 'MemoryProvisioned'}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 110.0 (<class 'float'>) for component {'name': 'CPUTime', 'key': 'RemoteUserCpu+RemoteSysCpu'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 240.0 (<class 'float'>) for component {'name': 'Wallclocktime', 'key': 'RemoteWallClockTime'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'MinHosts', 'key': 'MinHosts'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'MaxHosts', 'key': 'MaxHosts'}.
2024-06-10 12:26:42,677 - auditor.collectors.htcondor - DEBUG    - Generated record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
Traceback (most recent call last):
  File "/usr/local/bin/auditor-htcondor-collector", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/main.py", line 17, in main
    asyncio.run(collector.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 75, in run
    await self._collect(schedd_name, job_id=id)
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 107, in _collect
    await self.client.add(record)
RuntimeError: RECORD_EXISTS

[2]
"org.opencontainers.image.revision": "ed96f63e3ed4408f20337aef0fc0bd027c67960e",
"org.opencontainers.image.source": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.title": "AUDITOR",
"org.opencontainers.image.url": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.version": "edge",

auditor_apel_plugin 0.5.0
auditor-htcondor-collector 0.5.0
python-auditor 0.5.0
