Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Profiler takes 60 minutes to process and generate output on integration tests eventlog #1382

Closed
Tracked by #367
tgravescs opened this issue Oct 10, 2024 · 2 comments · Fixed by #1391
Closed
Tracked by #367
Assignees
Labels
bug Something isn't working core_tools Scope the core module (scala)

Comments

@tgravescs
Copy link
Collaborator

Describe the bug
The ci/cd integration tests can take up to 60 minutes to run the profiler on a single eventlog. The eventlog is generated from the spark-rapids integration tests. We should investigate why it takes so long for that eventlog.

See our CI/CD logs on how to get that file.

@tgravescs tgravescs added ? - Needs Triage bug Something isn't working labels Oct 10, 2024
@amahussein amahussein added the core_tools Scope the core module (scala) label Oct 14, 2024
@amahussein amahussein self-assigned this Oct 17, 2024
@amahussein
Copy link
Collaborator

amahussein commented Oct 18, 2024

I executed the profiler locally. This is relativelly large eventlog compared to other ones we have in our testing enviornments.
The problem is triggered by GC optimization issue.
The allocation rate of the core tools is considered high compared to the amount of processing. This is a very corner case when it comes to GC heuristics. This is well known problem in GC heap heuristics ("automatically decide the beast heap size")
By default, we assume this JDK8 and we want to use G1GC to improve performance. However, G1GC by default assigns 5% of the heap size to the new generation. Thgerefore, with a Xmx=100G, the new generation is as little as 12G.
During rumtime, the heap won't exceed the limit of ~16GB. Everytime the Heap reaches its local maxima, it frees enough memory thinking that there is no need to increase the heap size. The entire runtime, the JVM is using only 16GB out of 100GB causing as many GCs as would be triggered if we set the max heap size to 16GB.

I explored some G1GC tuning, but I don't think this is going to be optimum.
For example:


-Xms32G -Xmx64G -XX:+UseG1GC
# took 18 minutes -> heap can grow bigger reaching up to 20 GB and it performed faster than 100GB max heap

-Xms10G -Xmx100G -XX:+UseG1GC
# took 28 minutes

# Adjusted the new object sizes to grow to 60%
-Xms10G -Xmx64g -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=60
# took 31 minutes
# heap did not exceed 16 GB

-Xms32G -Xmx64g -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=60
# took 31 minutes

-Xms16G -Xmx24g -XX:+UseG1GC
# took 21 minutes -> again faster than -Xmx 64g

Then I experimented without G1GC. My conclusion that the Xms could be a factor in the performance to work around the corner case.

-Xmx64g
# 31 minutes -> heap was constantly around 8GB

-Xms16G -Xmx64g
# 16 minutes -> achieved the same execution time as 100GB

-Xms32G -Xmx64g
# 16 minutes -> almost no improvement. from xms 16GB

-Xms16G -Xmx24g
# 21 minutes -> better than just setting Xmxx without Xms

@amahussein
Copy link
Collaborator

amahussein commented Oct 22, 2024

This issue needs two parts:

  • user-tools wrapper should set the -Xms to be 0.5 of the -XmX
  • Fix the internal CI/CD scripts to set both Xmx and Xms

amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Oct 23, 2024
Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>

Fixes NVIDIA#1382

Upon investigation, it was revealed that the min heap size could impact
the runtime significantly.
This code change aims at setting the xms java argument to 50% of the max
heap size.

- pass xms to the java cmd
- update the runtime report to list jvm info along with jvm arguments
  related to heap:
  - `runtime.jvm.*`
  - `runtime.jvm.arg*`
amahussein added a commit that referenced this issue Oct 23, 2024
* user-tools should add xms argument to java cmd

Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>

Fixes #1382

Upon investigation, it was revealed that the min heap size could impact
the runtime significantly.
This code change aims at setting the xms java argument to 50% of the max
heap size.

- pass xms to the java cmd
- update the runtime report to list jvm info along with jvm arguments
  related to heap:
  - `runtime.jvm.*`
  - `runtime.jvm.arg*`

* remove debugging code

Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>

---------

Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core_tools Scope the core module (scala)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants