
Gigantic scores of DB12 recorded by DIRAC Pilots (from 50 up to 2250) #7760

Open
IgorPelevanyuk opened this issue Aug 22, 2024 · 8 comments

@IgorPelevanyuk
Contributor

Hello,

Recently I have noticed unrealistically large values of the DB12 benchmark. Normally it should be somewhere between 0 and 45, but during this summer we had pilots whose jobs recorded values up to 2250.

So far it has been observed only on one computing resource, running AlmaLinux with the Intel Gold 6242R CPU model.
Here is a zoomed-in view from our local monitoring. Every dot is a job, the colour is the CPU model, X is the DB12 value, Y is the job walltime.
[scatter plot from local monitoring: DB12 value vs. job walltime, coloured by CPU model]
I have checked the databases:

mysql> use JobDB;
mysql> select JobID, CONVERT(Value USING utf8) from JobParameters where Name="CPUNormalizationFactor" and CONVERT(Value USING utf8)>45;
+---------+---------------------------+
| JobID   | CONVERT(Value USING utf8) |
+---------+---------------------------+
| 4298295 | 60.7                      |
| 4298419 | 45.8                      |
| 4298478 | 48.8                      |
| 4298494 | 48.8                      |
+---------+---------------------------+
4 rows in set (0.48 sec)

Unfortunately, the jobs with the craziest DB12 values have already been cleaned from DIRAC, and only recent jobs with wrong DB12 values remain. But I was able to grep the DIRAC Pilot logs on the computing resource and found this:

2024-06-28T08:38:56.134271Z INFO [ConfigureCPURequirements] Command ConfigureCPURequirements instantiated from pilotCommands
2024-06-28T08:38:56.135707Z INFO [ConfigureCPURequirements] Executing command dirac-wms-cpu-normalization -U  -R pilot.cfg --cfg pilot.cfg -d
Computing benchmark measurements on 1 processor(s)...
Applying a correction on the CPU power: 1.0
Estimated CPU power is 2272.7 HS06
Configuration file dumped 'pilot.cfg'

2024-06-28T08:39:17.640561Z INFO [ConfigureCPURequirements] Current normalized CPU as determined by 'dirac-wms-cpu-normalization' is 2272.700000

Has anybody seen this before?

@IgorPelevanyuk
Contributor Author

I should also note that the worker nodes are virtualized with KVM.
I did some research on this problem. The benchmark runs from the CVMFS version of DIRAC, and sometimes, when CVMFS itself is doing something as a process on the worker node, the benchmark processes start to consume more than 100% of CPU in htop.
I created a simple script to run 40 dirac-wms-cpu-normalization processes in the background (a code sketch is at the end of this comment) and executed it several times in a row. The first screenshot shows the previous normalization results (left) and the htop monitoring of the current script run.
[screenshot: previous normalization results (left) and htop view during the script run]

The second screenshot shows the results of the 40 dirac-wms-cpu-normalization runs obtained after the test.
[screenshot: results of the 40 dirac-wms-cpu-normalization runs after the test]

As you can see, the normal benchmark is in the 35–37 range. When "something" happens, it may grow up to 55, and even higher in the longer tests performed by real DIRAC pilots.
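
For reproducibility, a minimal sketch of such a stress test is below. It is a reconstruction, not the exact script: it simply launches 40 dirac-wms-cpu-normalization processes in parallel and prints their output; whether additional options are needed (the pilot log in the first comment shows the flags the pilot itself uses) depends on the site setup.

```python
import subprocess

# Launch 40 dirac-wms-cpu-normalization processes in parallel (background) and
# collect their output once they all finish.
procs = [
    subprocess.Popen(
        ["dirac-wms-cpu-normalization"],  # flags omitted; see the pilot log above
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for _ in range(40)
]

for i, p in enumerate(procs, start=1):
    out, _ = p.communicate()
    print(f"--- run {i} ---\n{out}")
```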

@IgorPelevanyuk
Contributor Author

I also created an instrumented version of the DB12 benchmark to investigate, and so far it looks like the os.times() function returns wrong values.

The norm is calculated as:

$$\mathrm{NORM} = \frac{250}{(u + s + cu + cs)_{\text{end}} - (u + s + cu + cs)_{\text{start}}}$$

To record the times, the os.times() function is used: the start values are taken at the beginning of the benchmark workload iterations and the end values right after the workload. In the listings below, CPU is exactly the denominator above: (end u+s+cu+cs) − (start u+s+cu+cs); a minimal code sketch of this measurement is included at the end of this comment. When the estimations run in parallel (40 at once, as in the test above), I was able to record really strange values:

7 CPU: 9.65     WALL: 12.94     NORM: 25.91     ST: u: 7.14 s: 0.01 cu:0.0 cs:0.0 E: u: 16.79 s: 0.01 cu:0.0 cs:0.0
10 CPU: 9.19    WALL: 13.21     NORM: 27.20     ST: u: 7.18 s: 0.01 cu:0.0 cs:0.0 E: u: 16.37 s: 0.01 cu:0.0 cs:0.0
20 CPU: 7.13    WALL: 13.63     NORM: 35.06     ST: u: 7.11 s: 0.01 cu:0.0 cs:0.0 E: u: 14.1 s: 0.15 cu:0.0 cs:0.0
 4 CPU: 6.66    WALL: 12.87     NORM: 37.54     ST: u: 7.26 s: 0.01 cu:0.0 cs:0.0 E: u: 13.92 s: 0.01 cu:0.0 cs:0.0
 3 CPU: 8.71    WALL: 13.29     NORM: 28.70     ST: u: 7.12 s: 0.0 cu:0.0 cs:0.0 E: u: 15.83 s: 0.0 cu:0.0 cs:0.0
 1 CPU: 6.79    WALL: 13.04     NORM: 36.82     ST: u: 7.22 s: 0.01 cu:0.0 cs:0.0 E: u: 14.01 s: 0.01 cu:0.0 cs:0.0
31 CPU: 4.75    WALL: 13.82     NORM: 52.63     ST: u: 7.15 s: 0.0 cu:0.0 cs:0.0 E: u: 11.9 s: 0.0 cu:0.0 cs:0.0
17 CPU: 8.74    WALL: 12.91     NORM: 28.60     ST: u: 6.63 s: 0.0 cu:0.0 cs:0.0 E: u: 15.37 s: 0.0 cu:0.0 cs:0.0
19 CPU: 6.92    WALL: 13.51     NORM: 36.13     ST: u: 7.51 s: 0.01 cu:0.0 cs:0.0 E: u: 14.42 s: 0.02 cu:0.0 cs:0.0
39 CPU: 3.92    WALL: 13.40     NORM: 63.78     ST: u: 7.24 s: 0.01 cu:0.0 cs:0.0 E: u: 11.16 s: 0.01 cu:0.0 cs:0.0
22 CPU: 4.09    WALL: 14.45     NORM: 61.12     ST: u: 7.14 s: 0.0 cu:0.0 cs:0.0 E: u: 11.23 s: 0.0 cu:0.0 cs:0.0
24 CPU: 7.44    WALL: 12.40     NORM: 33.60     ST: u: 7.08 s: 0.49 cu:0.0 cs:0.0 E: u: 14.51 s: 0.5 cu:0.0 cs:0.0
23 CPU: 2.45    WALL: 12.56     NORM: 102.04    ST: u: 7.31 s: 0.01 cu:0.0 cs:0.0 E: u: 9.76 s: 0.01 cu:0.0 cs:0.0
35 CPU: 2.93    WALL: 12.25     NORM: 85.32     ST: u: 7.56 s: 0.03 cu:0.0 cs:0.0 E: u: 10.49 s: 0.03 cu:0.0 cs:0.0
32 CPU: 12.03   WALL: 12.41     NORM: 20.78     ST: u: 7.38 s: 0.01 cu:0.0 cs:0.0 E: u: 19.41 s: 0.01 cu:0.0 cs:0.0
21 CPU: 6.72    WALL: 12.76     NORM: 37.20     ST: u: 6.89 s: 0.0 cu:0.0 cs:0.0 E: u: 13.61 s: 0.0 cu:0.0 cs:0.0
 5 CPU: 2.70    WALL: 13.76     NORM: 92.59     ST: u: 7.52 s: 0.01 cu:0.0 cs:0.0 E: u: 10.22 s: 0.01 cu:0.0 cs:0.0
26 CPU: 6.89    WALL: 12.13     NORM: 36.28     ST: u: 7.36 s: 0.01 cu:0.0 cs:0.0 E: u: 14.25 s: 0.01 cu:0.0 cs:0.0
12 CPU: 2.45    WALL: 15.09     NORM: 102.04    ST: u: 6.73 s: 0.01 cu:0.0 cs:0.0 E: u: 9.18 s: 0.01 cu:0.0 cs:0.0
38 CPU: 1.69    WALL: 12.61     NORM: 147.93    ST: u: 7.18 s: 0.0 cu:0.0 cs:0.0 E: u: 8.87 s: 0.0 cu:0.0 cs:0.0
11 CPU: 6.73    WALL: 12.72     NORM: 37.15     ST: u: 7.53 s: 0.0 cu:0.0 cs:0.0 E: u: 14.25 s: 0.01 cu:0.0 cs:0.0
33 CPU: 6.84    WALL: 12.10     NORM: 36.55     ST: u: 6.38 s: 0.0 cu:0.0 cs:0.0 E: u: 13.22 s: 0.0 cu:0.0 cs:0.0
36 CPU: 6.74    WALL: 12.25     NORM: 37.09     ST: u: 7.42 s: 0.01 cu:0.0 cs:0.0 E: u: 14.16 s: 0.01 cu:0.0 cs:0.0
37 CPU: 6.66    WALL: 12.16     NORM: 37.54     ST: u: 7.44 s: 0.01 cu:0.0 cs:0.0 E: u: 14.1 s: 0.01 cu:0.0 cs:0.0
 6 CPU: 6.53    WALL: 12.13     NORM: 38.28     ST: u: 7.37 s: 0.0 cu:0.0 cs:0.0 E: u: 13.9 s: 0.0 cu:0.0 cs:0.0
30 CPU: 6.62    WALL: 11.91     NORM: 37.76     ST: u: 6.87 s: 0.01 cu:0.0 cs:0.0 E: u: 13.49 s: 0.01 cu:0.0 cs:0.0
34 CPU: 12.15   WALL: 13.94     NORM: 20.58     ST: u: 6.9 s: 0.0 cu:0.0 cs:0.0 E: u: 19.04 s: 0.01 cu:0.0 cs:0.0
25 CPU: 6.76    WALL: 12.28     NORM: 36.98     ST: u: 6.85 s: 0.0 cu:0.0 cs:0.0 E: u: 13.61 s: 0.0 cu:0.0 cs:0.0
13 CPU: 3.49    WALL: 6.64      NORM: 71.63     ST: u: 12.65 s: 0.01 cu:0.0 cs:0.0 E: u: 16.14 s: 0.01 cu:0.0 cs:0.0
 9 CPU: 10.24   WALL: 17.11     NORM: 24.41     ST: u: 7.19 s: 0.01 cu:0.0 cs:0.0 E: u: 17.43 s: 0.01 cu:0.0 cs:0.0
40 CPU: 12.53   WALL: 16.63     NORM: 19.95     ST: u: 7.47 s: 0.0 cu:0.0 cs:0.0 E: u: 19.99 s: 0.01 cu:0.0 cs:0.0
28 CPU: 3.93    WALL: 6.94      NORM: 63.61     ST: u: 10.16 s: 0.38 cu:0.0 cs:0.0 E: u: 14.09 s: 0.38 cu:0.0 cs:0.0
18 CPU: 4.12    WALL: 6.76      NORM: 60.68     ST: u: 6.86 s: 0.0 cu:0.0 cs:0.0 E: u: 10.98 s: 0.0 cu:0.0 cs:0.0
15 CPU: 2.18    WALL: 7.31      NORM: 114.68    ST: u: 7.37 s: 0.0 cu:0.0 cs:0.0 E: u: 9.55 s: 0.0 cu:0.0 cs:0.0
16 CPU: 9.85    WALL: 17.35     NORM: 25.38     ST: u: 7.2 s: 0.01 cu:0.0 cs:0.0 E: u: 17.05 s: 0.01 cu:0.0 cs:0.0
 8 CPU: 5.07    WALL: 6.71      NORM: 49.31     ST: u: 9.48 s: 0.01 cu:0.0 cs:0.0 E: u: 14.55 s: 0.01 cu:0.0 cs:0.0
14 CPU: 5.66    WALL: 8.05      NORM: 44.17     ST: u: 6.68 s: 0.01 cu:0.0 cs:0.0 E: u: 12.34 s: 0.01 cu:0.0 cs:0.0
 2 CPU: 6.11    WALL: 6.82      NORM: 40.92     ST: u: 7.66 s: 0.01 cu:0.0 cs:0.0 E: u: 13.77 s: 0.01 cu:0.0 cs:0.0
29 CPU: 6.15    WALL: 6.37      NORM: 40.65     ST: u: 12.18 s: 0.01 cu:0.0 cs:0.0 E: u: 18.33 s: 0.01 cu:0.0 cs:0.0
27 CPU: 5.55    WALL: 6.31      NORM: 45.05     ST: u: 11.84 s: 0.01 cu:0.0 cs:0.0 E: u: 17.39 s: 0.01 cu:0.0 cs:0.0

And this is the listing when the estimations are done sequentially:

  1 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.24 s: 0.0 cu:0.0 cs:0.0 E: u: 12.53 s: 0.0 cu:0.0 cs:0.0
 2 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.26 s: 0.0 cu:0.0 cs:0.0 E: u: 12.55 s: 0.0 cu:0.0 cs:0.0
 3 CPU: 6.42    WALL: 6.44      NORM: 38.94     ST: u: 6.21 s: 0.0 cu:0.0 cs:0.0 E: u: 12.63 s: 0.0 cu:0.0 cs:0.0
 4 CPU: 6.21    WALL: 6.24      NORM: 40.26     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.41 s: 0.0 cu:0.0 cs:0.0
 5 CPU: 6.21    WALL: 6.23      NORM: 40.26     ST: u: 6.18 s: 0.0 cu:0.0 cs:0.0 E: u: 12.39 s: 0.0 cu:0.0 cs:0.0
 6 CPU: 6.23    WALL: 6.25      NORM: 40.13     ST: u: 6.43 s: 0.0 cu:0.0 cs:0.0 E: u: 12.66 s: 0.0 cu:0.0 cs:0.0
 7 CPU: 6.24    WALL: 6.27      NORM: 40.06     ST: u: 6.25 s: 0.0 cu:0.0 cs:0.0 E: u: 12.49 s: 0.0 cu:0.0 cs:0.0
 8 CPU: 6.29    WALL: 6.31      NORM: 39.75     ST: u: 6.3 s: 0.0 cu:0.0 cs:0.0 E: u: 12.59 s: 0.0 cu:0.0 cs:0.0
 9 CPU: 6.24    WALL: 6.26      NORM: 40.06     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.44 s: 0.0 cu:0.0 cs:0.0
10 CPU: 6.23    WALL: 6.25      NORM: 40.13     ST: u: 6.2 s: 0.0 cu:0.0 cs:0.0 E: u: 12.43 s: 0.0 cu:0.0 cs:0.0
11 CPU: 6.24    WALL: 6.27      NORM: 40.06     ST: u: 6.26 s: 0.0 cu:0.0 cs:0.0 E: u: 12.5 s: 0.0 cu:0.0 cs:0.0
12 CPU: 6.25    WALL: 6.26      NORM: 40.00     ST: u: 6.22 s: 0.0 cu:0.0 cs:0.0 E: u: 12.47 s: 0.0 cu:0.0 cs:0.0

Normally CPU is around 6.25, but for the biggest benchmarks in the first listing the CPU value can be about 2 (e.g. 250 / 2.45 ≈ 102), which would mean the benchmark workload executed roughly three times faster than it should, which is impossible.
I believe this situation occurs only when there are more processes than cores: as shown on the screenshot in my previous post, the CVMFS process started running alongside the benchmark tests.

This is the first time I have observed this in a DIRAC environment, and it happens only on one particular resource. It is counter-intuitive, because the benchmark results are used to calculate normalized CPU time, yet it looks like if more than one job runs on a single core, the benchmark can come out higher!

I will try to run more tests on another server that is neither KVM-virtualized nor running AlmaLinux.

BTW: DIRAC version v8.0.36
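
For completeness, here is a minimal sketch of how the CPU, WALL and NORM columns in the listings above are obtained. It is not the actual DIRACbenchmark code: the workload body and iteration count are illustrative; what matters is that CPU time comes from os.times() and NORM is derived from CPU time only.

```python
import os
import random
import time

def db12_like(iterations=1):
    """Measure CPU, WALL and NORM the way the listings above do."""
    start_cpu = os.times()   # (user, system, child user, child system, elapsed)
    start_wall = time.time()

    for _ in range(iterations):
        total = 0.0
        for _ in range(10_000_000):  # illustrative synthetic workload
            total += random.random() * random.random()

    end_cpu = os.times()
    cpu = sum(end_cpu[:4]) - sum(start_cpu[:4])  # (end u+s+cu+cs) - (start u+s+cu+cs)
    wall = time.time() - start_wall
    norm = 250.0 * iterations / cpu              # the NORM column above
    return cpu, wall, norm

if __name__ == "__main__":
    cpu, wall, norm = db12_like()
    print(f"CPU: {cpu:.2f}\tWALL: {wall:.2f}\tNORM: {norm:.2f}")
```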

@fstagni
Contributor

fstagni commented Aug 23, 2024

Thank you @IgorPelevanyuk for the notification and the extensive tests. At a minimum, they show that there are some fixes to do. Please report back your findings and your suggestions.

EDIT: we do not have much development workforce at the moment, so you are of course welcome to take care of whatever needs to be done.

@IgorPelevanyuk
Contributor Author

Hello, @fstagni!

So far, I think the main culprit is Proxmox VE on Debian. No similar issues were found on the same cluster with libvirt on Fedora. We will try to run more tests (in two weeks) to understand why Proxmox behaves so strangely.

It is hard to say anything about a solution yet. The issue appears only in a very specific virtual environment, and I hope it can be fixed by some setting in Proxmox.

@fstagni
Contributor

fstagni commented Sep 13, 2024

Just for info, are you also using cgroups on this cluster?

@IgorPelevanyuk
Contributor Author

Slurm on this particular resource is configured to use cgroups, but the other Slurm resources use them as well.

@IgorPelevanyuk
Contributor Author

After several more tests the issue still persists.
I can propose only one thing: move away from os.times(), which in this particular case is inaccurate.

Right now, an advantage (or not?) of DIRACBenchmark is that you can run it even on a 100% loaded cluster, and because the benchmark does not rely on walltime for its estimate, you still get a rather accurate result.

Example: on an idle server you may get the DB12 result in 30 seconds. Then you start some processes that overload the server so that 100% of all cores are busy, and run the DB12 benchmark again. Now the benchmark takes longer, because it is regularly interrupted by the other processes on the server, so the run lasts, say, 60 seconds; but in the end you get the same DB12 result. In this example I am neglecting the fact that the frequency of an idle processor is usually much higher.

If we wanted to use walltime in Python, there is the time.perf_counter() function. In that case we would get benchmark results more closely related to the real speed of the server, but it would no longer be possible to get a reasonably accurate result on a server that is already 100% loaded.

Honestly, I do not fully understand why using cputime is better for benchmarking than walltime. The only advantage is the possibility of running the benchmark in parallel with other processes on the same CPU core.

I think that in our DIRAC installation we may try to tweak the DIRACBenchmark script in CVMFS to use walltime and see how it works.

What could be done for DIRAC in general is to write to JobParametersDB not only the DB12 calculated with cputime, but also a DB12 calculated with walltime, calling it DB12WT or something like that (see the sketch below). That would make it possible to do an extensive analysis across different DIRAC installations and computing resources, and then decide whether moving from one approach to the other is justified.
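
As an illustration of that proposal, here is a minimal sketch of the dual measurement. The name DB12WT, the workload and the scaling factor are illustrative, not an existing DIRAC API; the point is only that cputime and walltime can be sampled around the same workload and reported as two separate job parameters.

```python
import os
import random
import time

def benchmark_both(iterations=1):
    """Return a cputime-based DB12 and a walltime-based DB12WT for the same workload."""
    start_cpu = sum(os.times()[:4])   # user + system + child user + child system
    start_wall = time.perf_counter()

    for _ in range(iterations):
        total = 0.0
        for _ in range(10_000_000):   # illustrative synthetic workload
            total += random.random() * random.random()

    cpu = sum(os.times()[:4]) - start_cpu
    wall = time.perf_counter() - start_wall

    return {
        "DB12": 250.0 * iterations / cpu,     # current cputime-based estimate
        "DB12WT": 250.0 * iterations / wall,  # proposed walltime-based estimate
    }

print(benchmark_both())
```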

@fstagni
Contributor

fstagni commented Nov 18, 2024

Hi Igor, as discussed in Thursday's meeting: we can insert an alternative CPU power in the job parameters; please start with this commit: 2de8586

On the DB12 side, you can define a new method or a new flag; I leave that up to you.
