htcondor collector crashes with RECORD_EXISTS RuntimeError #846

Open
thdesy opened this issue Jun 10, 2024 · 0 comments

Dear AUDITOR/HTCondor Collector developers,

We are in the process of setting up one AUDITOR instance per CondorCE, with a central DB shared between all CondorCE+AUDITOR instances.

We tried to parse all historic LRMS HTCondor jobs on a CondorCE, i.e., all jobs present in a schedd's history. However, only a fraction of the jobs got processed.
To re-run over the full history, we removed the HTCondor state DB to force a re-parsing of all job records, from the newest to the oldest.
Since then, the htcondor-collector fails reproducibly while trying to insert a record [1]. We therefore assume that the local HTCondor state DB is not "stateless" as such, but depends on the state of the upstream shared AUDITOR DB.

Is there a way to re-parse all job events on a CondorCE and update upstream records?

Since we plan to run the htcondor collector as a service unit, is there a way to gracefully skip broken job records? That is, instead of failing constantly at the same broken record, the collector would move on (and keep track of such error cases), so that job accounting does not stall completely.
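To illustrate the kind of behaviour I have in mind, here is a minimal sketch. It does not use the real collector code: `FakeClient`, `add_tolerant` and the `skipped` list are hypothetical names, and the only thing taken from the traceback in [1] is that `client.add(...)` raises `RuntimeError: RECORD_EXISTS`. Whether matching on the error message is the right way to detect this case in the actual client is an assumption.

```python
import asyncio
import logging

logger = logging.getLogger("collector_sketch")


class FakeClient:
    """Hypothetical stand-in for the AUDITOR client.

    It only mimics the error seen in the traceback: adding a record
    twice raises RuntimeError("RECORD_EXISTS").
    """

    def __init__(self):
        self.stored = set()

    async def add(self, record_id):
        if record_id in self.stored:
            raise RuntimeError("RECORD_EXISTS")
        self.stored.add(record_id)


async def add_tolerant(client, record_id, skipped):
    """Try to add a record; log and remember duplicates instead of crashing."""
    try:
        await client.add(record_id)
        return True
    except RuntimeError as exc:
        if "RECORD_EXISTS" not in str(exc):
            raise  # any other failure keeps the current (fatal) behaviour
        logger.warning("Record %s already exists upstream, skipping.", record_id)
        skipped.append(record_id)
        return False


async def demo():
    client = FakeClient()
    skipped = []
    # The third record repeats the first one and is skipped, not fatal.
    results = [await add_tolerant(client, rid, skipped) for rid in ["a", "b", "a"]]
    return results, skipped


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With something along these lines, a run over the full history would continue past records that are already in the shared DB, and the `skipped` list could be written out for later inspection.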

Installed versions are listed in [2]; the host is a RHEL 9.4 installation running kernel 5.14.0-362.24.1.el9_3.x86_64.

Cheers and thanks,
Thomas

[1]

[root@grid-htc-ce04 auditor]# auditor-htcondor-collector -c  /etc/auditor/htcondor-collector.yaml  -l  DEBUG -n grid-htc-ce04.desy.de
2024-06-10 12:26:42,611 - auditor.collectors.htcondor - INFO     - Using AUDITOR client at localhost:8000.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO     - Starting collector run.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - INFO     - Collecting jobs for schedd 'grid-htc-ce04.desy.de'.
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG    - Using job id (691382, 0).
2024-06-10 12:26:42,613 - auditor.collectors.htcondor - DEBUG    - Querying HTCondor history for 'grid-htc-ce04.desy.de' starting from job (691382, 0).
2024-06-10 12:26:42,614 - auditor.collectors.htcondor - DEBUG    - Running command: 'condor_history -backwards -wide -name grid-htc-ce04.desy.de -since 691382.0 -af:, MachineAttrApelSpecs0.HEPscore23 ProcId ClusterId RemoteUserCpu+RemoteSysCpu RemoteWallClockTime MemoryProvisioned x509UserProxyFirstFQAN MaxHosts x509UserProxySubject CpusProvisioned AcctGroup GlobalJobId EnteredCurrentStatus MachineAttrApelSpecs0.HEPSPEC Owner MinHosts LastMatchTime -constraint "JobStatus == 3 || JobStatus == 4"'
2024-06-10 12:26:42,673 - auditor.collectors.htcondor - DEBUG    - Generating record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,674 - auditor.collectors.htcondor - WARNING  - Could not find meta value for 'voms' for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'Cores', 'key': 'CpusProvisioned', 'scores': [{'name': 'HEPSPEC', 'key': 'MachineAttrApelSpecs0.HEPSPEC'}, {'name': 'HEPscore23', 'key': 'MachineAttrApelSpecs0.HEPscore23'}]}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 2048 (<class 'int'>) for component {'name': 'Memory', 'key': 'MemoryProvisioned'}.
2024-06-10 12:26:42,675 - auditor.collectors.htcondor - DEBUG    - Got amount 110.0 (<class 'float'>) for component {'name': 'CPUTime', 'key': 'RemoteUserCpu+RemoteSysCpu'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 240.0 (<class 'float'>) for component {'name': 'Wallclocktime', 'key': 'RemoteWallClockTime'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'MinHosts', 'key': 'MinHosts'}.
2024-06-10 12:26:42,676 - auditor.collectors.htcondor - DEBUG    - Got amount 1 (<class 'int'>) for component {'name': 'MaxHosts', 'key': 'MaxHosts'}.
2024-06-10 12:26:42,677 - auditor.collectors.htcondor - DEBUG    - Generated record for job 'grid-htc-ce04.desy.de#691626.0#1718014383'.
Traceback (most recent call last):
  File "/usr/local/bin/auditor-htcondor-collector", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/main.py", line 17, in main
    asyncio.run(collector.run())
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 75, in run
    await self._collect(schedd_name, job_id=id)
  File "/usr/local/lib/python3.9/site-packages/auditor_htcondor_collector/collector.py", line 107, in _collect
    await self.client.add(record)
RuntimeError: RECORD_EXISTS

[2]
"org.opencontainers.image.revision": "ed96f63e3ed4408f20337aef0fc0bd027c67960e",
"org.opencontainers.image.source": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.title": "AUDITOR",
"org.opencontainers.image.url": "https://github.com/ALU-Schumacher/AUDITOR",
"org.opencontainers.image.version": "edge",

auditor_apel_plugin 0.5.0
auditor-htcondor-collector 0.5.0
python-auditor 0.5.0
