CPU load increased 10x with Quarkus 2.1.* (native image) #19359
Do you reproduce the issue with a JVM run? In this case, you could easily get a profile using async-profiler (see https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md for information about it). Another question: are you building both with the same version of GraalVM? /cc @johnaohara maybe you can help on this one?
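For reference, attaching async-profiler to a JVM-mode run is a one-liner; a minimal sketch, where the PID and output path are placeholders:

```shell
# Attach async-profiler to the running Quarkus JVM for 60 seconds
# and write a CPU flamegraph; <pid> is the Quarkus process id.
./profiler.sh -e cpu -d 60 -f /tmp/flamegraph.html <pid>
```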
@gsmet sure, I'll take a look at this one. @holledauer as Guillaume mentioned, if you can reproduce in JVM mode, a flamegraph from async-profiler would be really useful. If that is not possible, a reproducer based on your application would also be a great help. Thanks
Both were built with GraalVM 21.2.0 Java 11 CE.
In JVM mode I can't see a difference in the CPU load.
Are you able to reproduce it? I think it should be reproducible with any Quarkus app because there is nothing special in my service (it uses Kotlin, it is built with Gradle, and it uses resteasy-reactive).
@holledauer I am not able to reproduce this with a simple app. What extensions do you have installed?
I added a minimal app that reproduces the CPU issue. It is built with
Thanks for sharing, the application is very similar to the app that I tested locally. Are you deploying the application to a vanilla K8s environment? Do you know what container runtime is used in the cluster you are using? Thanks
AWS EKS with docker-shim and dockerd.
Requests: cpu 10m, mem 30Mi. Limit: mem 30Mi.
Still the same with 2.2.0.
I have the same issue. Still the same with 2.2.2.
I tried recreating this issue running on OpenShift 4.7, and I cannot recreate it. I tried with GraalVM CE 21.1 and 21.2. Please see below; I suspect the issue depends on the environment, and at present I do not have access to an AWS EKS instance. One option to investigate this issue would be to try recording CPU samples with JFR for the different Quarkus versions: https://www.graalvm.org/reference-manual/native-image/JFR/
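A rough sketch of that JFR workflow, based on the GraalVM 21.x documentation linked above (exact flags may differ between releases):

```shell
# Build the native image with VM inspection enabled so JFR is available
native-image -H:+AllowVMInspection -jar application.jar

# Run with a JFR recording; open recording.jfr in JDK Mission Control afterwards
./application -XX:+FlightRecorder -XX:StartFlightRecording="filename=recording.jfr"
```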
Are you also running on AWS EKS, or a different environment?
Hello @holledauer,
Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using maybe
I am running on a self-managed Kubernetes cluster (on the AWS platform).
How can I create a thread dump with a native image built with GraalVM CE?
Native images will still dump the stack traces if you send a SIGQUIT signal to the process.
That only works with the Enterprise Edition (see docs).
Stack traces work in CE; heap dumps require EE. I think the way that doc is worded is misleading.
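A minimal sketch of triggering such a dump, assuming the native process id is known:

```shell
# SIGQUIT (signal 3) makes the native executable print its thread
# stack traces to stdout
kill -QUIT <pid>
```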
Thread dumps are attached.
Hi, is there any update on this? It seems we face the same issue with Quarkus 2.1+ running as a native image on AWS as a Docker container.
From the thread dumps attached so far, I don't see anything obvious that looks like an issue.
I could not see anything obvious from the stack traces either. I have obtained some AWS credits to try to recreate this issue in the AWS environment.
Any updates on this?
Any update on this?
If someone could come up with a small reproducer (a small app reproducing the issue and ideally information about the AWS environment you are using), that would be helpful. Right now, we are shooting in the dark.
Hmmm, I just saw that @holledauer already provided some information. @u6f6o would you be able to do the same? That might be interesting to have information about your setup too to see if we can find a common ground.
I have attempted to recreate this issue with the supplied sample application. I have tried to recreate it on bare metal, OCP 4.7, AWS ECS, and minikube 1.23. With these environments, I am unable to recreate the increase in CPU usage that people are seeing. The graph below is running on AWS ECS. The native image versions that I tested were: A) Quarkus 2.0.2.Final with GraalVM 21.1-java11, and a second build with GraalVM 21.2-java11. The docker images were built with a standard native Dockerfile; a sketch of the build commands and Dockerfile follows below.
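The exact build commands and Dockerfile were not preserved in this thread; assuming the standard Quarkus Gradle container-build workflow, they would have looked roughly like this (the builder-image tags are my assumption):

```shell
# Native build pinning GraalVM via the builder image; swap the tag to
# 21.2-java11 for the second set of images
./gradlew build -Dquarkus.package.type=native \
    -Dquarkus.native.container-build=true \
    -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-native-image:21.1-java11
```

```dockerfile
# Essentially the Dockerfile.native that Quarkus generates for Gradle projects
FROM quay.io/quarkus/quarkus-micro-image:1.0
WORKDIR /work/
COPY build/*-runner /work/application
RUN chmod 775 /work
EXPOSE 8080
CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]
```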
If you would like to test the images that I used in your environment, I have pushed the pre-built images to: https://gallery.ecr.aws/k3j0e3f7/johara-quarkus. Please could someone who is experiencing this issue try these images and report back?
I did some further testing in our k8s cluster with the native images. The images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus did not result in any CPU spikes.

As a next step, I created a small sample project that offers a single RESTEasy hello world controller. I tested this service with different Quarkus versions: 2.0.2, 2.4.2 and 2.7.5. I experienced the same behaviour: starting with versions > 2.0.2, the CPU load was much higher compared to earlier versions.

Because there were no CPU spikes with the vanilla images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus, I tried to get rid of all Quarkus dependencies that are not absolutely mandatory for my sample project, and surprisingly the CPU spikes were gone, even for the latest Quarkus version. Probably one of the dependencies is causing the CPU spikes. As a next step, I'd try to start from the vanilla version again, add one dependency at a time, and check whether this has an impact on the CPU consumption or not.
@u6f6o thanks for the detective work!
@u6f6o I would start with the Micrometer extensions if I were you.
And then the
@u6f6o thanks for the update, pinning it down to an extension (or set of extensions) would be a massive help, thanks
There is a good chance that I found the culprit. I took vanilla Quarkus and added the Micrometer extension.
Yeah, that was my natural intuition. Now we need to figure out what's going on :).
2.0.2-Final
vs.
Pinning
We decided to pin the micrometer version to 1.7.0 in the build.gradle file:
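The actual snippet was lost in this export; a minimal sketch of such a pin in the Groovy DSL, assuming all Micrometer modules should be forced to the same version:

```groovy
configurations.all {
    resolutionStrategy.eachDependency { details ->
        // Force every io.micrometer module back to 1.7.0, the last version
        // reported in this thread to be free of the CPU spikes
        if (details.requested.group == 'io.micrometer') {
            details.useVersion '1.7.0'
        }
    }
}
```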
So far we did not see any issues with this approach (running in production for two days) and the CPU spikes are gone as well. Use at your own risk though :-)
@u6f6o even if you don't see a difference in JVM mode, it would be interesting to get a flamegraph of what your app is doing. If you end up doing that, the ideal would be one flamegraph without your workaround and one with your Micrometer version pinning workaround, so both code paths are as close as possible; you can send the obtained flamegraphs to me at gsmet at redhat dot com. Thanks!
@gsmet: I suppose you are interested in the CPU flame graphs, right? We have kubectl-flame in place, which under the hood uses async-profiler. There is one thing though: I can only use -itimer for the flame graphs. Is this a problem? And which versions would you be interested in?
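For context, itimer is async-profiler's CPU sampling mode based on setitimer, which works without perf_events access (often unavailable inside containers); invoked directly it would look like:

```shell
# itimer-based CPU sampling; no perf_events access required in the container
./profiler.sh -e itimer -d 60 -f /tmp/itimer-flamegraph.html <pid>
```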
CPU and allocations would be great if you can get them.
I did some more testing in the past few days. CPU and allocation graphs did not show anything obvious, but I found out that the CPU spikes are related to a change in Micrometer's reflection-config.json. This file was not present before 1.7.2 (or maybe 1.7.1), and building my application with the latest Micrometer version (1.8.x) but without this file also solved the CPU spikes for me. The issue only seems to show up in our k8s cluster (on AWS) though. Running the same images in a local Docker did not cause any CPU spikes. 😩
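For illustration only (this is the general shape of a GraalVM reflection config entry, not the exact contents of Micrometer's file), registering the MXBean interface is presumably what made the previously inert file descriptor metrics start working in native images:

```json
[
  {
    "name": "com.sun.management.UnixOperatingSystemMXBean",
    "allDeclaredMethods": true
  }
]
```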
Another workaround that allows you to use the latest Quarkus version (2.8.3.Final in my case) is to disable the file descriptor metrics that cause the CPU spikes:
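The exact setting used was not captured here; one way to express this in a Quarkus 2.x app, sketched as a CDI-produced Micrometer MeterFilter (the class and method names are mine), is to deny the process.files.* gauges:

```java
import javax.enterprise.inject.Produces;
import javax.inject.Singleton;

import io.micrometer.core.instrument.config.MeterFilter;

public class MetricsConfiguration {

    // Drops process.files.open and process.files.max, the gauges backed by
    // the UnixOperatingSystemMXBean calls implicated above
    @Produces
    @Singleton
    public MeterFilter denyFileDescriptorMetrics() {
        return MeterFilter.denyNameStartsWith("process.files");
    }
}
```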
IMO this is the nicer approach as it allows us to use the latest Micrometer libs.
@ebullient @maxandersen I think we need to do something about this. Executive summary: on AWS, in native mode, some metrics that were not enabled previously (because of missing native config) are now enabled and they are causing CPU spikes.
@jonatan-ivanov and @shakuzen ... here. I've opened the above issue to reference this one.
Hello all, I'm one of the Micrometer maintainers. Thank you for all the great investigation into this. For file descriptor metrics, we're using the JMX OperatingSystemMXBean.

I suspect there is something about the implementation there and the file descriptor configuration or operating system when running on AWS that is causing it to use a significant amount of CPU. @u6f6o could you try compiling your app with a 22.2 dev build of GraalVM to see if the issue goes away without workarounds? Unfortunately, I'm not sure there is much we can do in Quarkus or Micrometer to address this, since we don't control the implementation of that MXBean.
@shakuzen: I tried to build our service with the dev build you mentioned and the CPU spikes vanished:
@u6f6o thank you for quickly trying that out. I think that hopefully confirms this will be fixed in the 22.2 GraalVM release, which is scheduled for July. If it is important to fix this without the above-mentioned workaround prior to the 22.2 release, I would recommend opening an issue in the GraalVM repo. This should be reproducible without Quarkus or Micrometer with something like the following, but I haven't tried deploying it to AWS.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.sun.management.UnixOperatingSystemMXBean;

public class FileDescriptorReproducer {
    public static void main(String[] args) {
        OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        if (operatingSystemMXBean instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOsBean = (UnixOperatingSystemMXBean) operatingSystemMXBean;
            // Poll the same MXBean methods Micrometer's file descriptor binder uses
            scheduler.scheduleAtFixedRate(() -> {
                System.out.println("File descriptor metrics; open: " + unixOsBean.getOpenFileDescriptorCount()
                        + ", max: " + unixOsBean.getMaxFileDescriptorCount());
            }, 0, 30, TimeUnit.SECONDS);
        } else {
            System.err.println("Cannot get file descriptor metrics on non-unix-like systems.");
        }
    }
}
```
I am personally fine with the second workaround for the time being, as our k8s setup also publishes these metrics on pod and instance level. Thus, leaving these out while still being able to use the latest Micrometer libs is good enough in our case.
I'll close this, as the issue is fixed with GraalVM 22.2, and there are several noted workarounds.
Describe the bug
When upgrading to any Quarkus 2.1.* version, the CPU usage increases 10x (just the base load, without any requests).
Expected behavior
There shouldn't be an increase in CPU usage.
Actual behavior
In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final).
How to Reproduce?
No response
Output of `uname -a` or `ver`
No response
Output of `java -version`
GraalVM 21.2.0 Java 11 CE
GraalVM version (if different from Java)
GraalVM 21.2.0 Java 11 CE
Quarkus version or git rev
2.1.2.Final
Build tool (i.e. output of `mvnw --version` or `gradlew --version`)
Gradle 6.6.1
Additional information
No response