
CPU load increased 10x with Quarkus 2.1.* (native image) #19359

Closed
holledauer opened this issue Aug 12, 2021 · 52 comments
Labels
kind/bug Something isn't working

Comments

@holledauer

Describe the bug

When upgrading to any Quarkus 2.1.* version, the CPU usage increases 10x (just the base load, without any requests).

Expected behavior

There shouldn't be an increase in CPU usage.

Actual behavior

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)
image

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

GraalVM 21.2.0 Java 11 CE

GraalVM version (if different from Java)

GraalVM 21.2.0 Java 11 CE

Quarkus version or git rev

2.1.2.Final

Build tool (ie. output of mvnw --version or gradlew --version)

Gradle 6.6.1

Additional information

No response

@holledauer holledauer added the kind/bug Something isn't working label Aug 12, 2021
@gsmet
Member

gsmet commented Aug 12, 2021

Do you reproduce the issue with a JVM run? In that case, you could easily get a profile using async-profiler (see https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md for information about it).
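For reference, a rough sketch of such a profiling run (assuming the app runs in JVM mode as java -jar .../quarkus-run.jar and that async-profiler has been downloaded locally; see the TROUBLESHOOTING doc above for the authoritative steps):

# find the JVM-mode Quarkus process and capture a 60s CPU flamegraph with async-profiler
PID=$(jps -l | grep quarkus-run.jar | awk '{print $1}')
./profiler.sh -e cpu -d 60 -f /tmp/quarkus-cpu.html "$PID"

The resulting flamegraph file can then be attached to this issue.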

Another question: are you building both with the same version of GraalVM?

/cc @johnaohara maybe you can help on this one?

@johnaohara
Member

@gsmet sure, I'll take a look at this one.

@holledauer as Guillaume mentioned, if you can reproduce in JVM mode, a flamegraph from async-profiler would be really useful. If that is not possible, a reproducer based on your application would also be a great help.

Thanks

@holledauer
Author

Both were built with GraalVM 21.2.0 Java 11 CE

@holledauer
Author

In JVM mode I can't see a difference in the CPU load.

@holledauer
Author

Are you able to reproduce it? I think it should be reproducible with any Quarkus app because there is nothing special in my service (it uses Kotlin, it is built with Gradle, it uses resteasy-reactive).

@johnaohara
Member

@holledauer I am not able to reproduce with a simple app:

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7dd58131eec5 quarkus-native-2.1.2.Final 0.00% 12.71MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69
c599c9d0e6a3 quarkus-native-2.0.2.Final 0.00% 12MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69

What extensions do you have installed?
E.g. from the console: Installed features: [cdi, kotlin, resteasy-reactive, smallrye-context-propagation]

@holledauer
Author

I added a minimal app that reproduces the CPU issue.

It is built with ./gradlew build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

quarkuscpu.zip

@holledauer
Author

Here you can see what happens when I deploy this small sample app on k8s (compared to the full-blown app with Quarkus 2.0)
image

@johnaohara
Member

Thanks for sharing; the application is very similar to the app that I tested locally. Are you deploying the application to a vanilla K8s environment? Do you know what container runtime is used in the cluster you are using? Thanks

@holledauer
Author

holledauer commented Aug 13, 2021

AWS EKS with docker-shim and dockerd

NAME   STATUS  ROLES   AGE  VERSION                    INTERNAL-IP   EXTERNAL-IP  OS-IMAGE        KERNEL-VERSION               CONTAINER-RUNTIME  ZONE           NODEGROUP-IMAGE
NodeA  Ready   <none>  8d   v1.21.2-13+d2965f0db10712  10.21.22.131  1.2.3.4      Amazon Linux 2  5.4.129-63.229.amzn2.x86_64  docker://19.3.13   eu-central-1a  ami-06f3aa48664a0203d

requests: cpu 10m, mem 30Mi. limit: mem 30Mi

@holledauer
Author

Still the same with 2.2.0.

@MikkelVestergaard

I have the same issue. Still the same with 2.2.2.
I also tested other versions, and 2.0.2 was the latest version where I did not see the issue. I also have the issue with 2.0.3, so it must be something that changed between 2.0.2 and 2.0.3.

@johnaohara
Member

I tried recreating this issue running on OpenShift 4.7, and I cannot reproduce it. I tried with GraalVM CE 21.1 and 21.2. Please see below:

(screenshots: cpu-native-2.0.2.Final, cpu-native-2.1.2.Final, cpu-native-2.2.2.Final, cpu-native-2.2.2.Final-grl-21.2)

I suspect the issue depends on the environment; at present I do not have access to an AWS EKS instance.

One option to investigate this issue would be to try recording CPU samples with JFR for the different Quarkus versions: https://www.graalvm.org/reference-manual/native-image/JFR/
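A rough sketch of what that could look like with GraalVM 21.x (flag names are taken from the GraalVM JFR docs linked above, and quarkus.native.additional-build-args is how extra flags are passed to the native-image build, so treat this as an untested outline):

# build the native image with VM inspection enabled (required for JFR support in GraalVM 21.x)
./gradlew build -x test -Dquarkus.package.type=native \
    -Dquarkus.native.additional-build-args=-H:+AllowVMInspection

# run the binary with a JFR recording, then inspect recording.jfr (e.g. in JDK Mission Control)
./application -XX:+FlightRecorder -XX:StartFlightRecording=filename=recording.jfr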

@johnaohara
Member

@MikkelVestergaard

I have the same issue. Still the same with 2.2.2.

Are you also running on AWS EKS, or a different environment?

@jaikiran
Member

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using jstack, for example) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds, and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are being fired/processed. You can then attach those to this issue.
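If you can capture these from a JVM-mode run of the same versions, the dumps can be scripted, for example (a sketch; it assumes jps and jstack from the JDK are on the PATH and that the app was started with java -jar .../quarkus-run.jar):

# take 4 thread dumps, 5 seconds apart, of the running Quarkus JVM
PID=$(jps -l | grep quarkus-run.jar | awk '{print $1}')
for i in 1 2 3 4; do
    jstack "$PID" > "dump_${i}.txt"
    sleep 5
done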

@MikkelVestergaard

Are you also running on AWS EKS, or a different environment?

I am running on a self-managed Kubernetes cluster (on AWS platform).

@holledauer
Author

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using jstack, for example) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds, and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are being fired/processed. You can then attach those to this issue.

How can I create a thread dump with a native image built with GraalVM CE?

@johnaohara
Member

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process.
$ kill -3 {PID} from within the container will print the stack trace to the container console
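On Kubernetes that could look roughly like this (a sketch; the pod name is a placeholder, and PID 1 assumes the native binary is the container's entrypoint process):

kubectl exec <pod-name> -- kill -3 1      # send SIGQUIT to the native process
kubectl logs <pod-name> > threads.txt     # the stack traces end up in the container log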

@holledauer
Author

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process.
$ kill -3 {PID} from within the container will print the stack trace to the container console

That only works with the Enterprise Edition (see docs)

@johnaohara
Member

Stack traces work in CE, heap dumps require EE. I think the way that doc is worded is misleading.

@holledauer
Author

Thread dumps are attached
dump_2_0_2.txt
dump_2_1_2.txt

@AndreiYu

Hi, is there any update on this? It seems we face the same issue with Quarkus 2.1+ running as a native image in AWS as a Docker container.

@jaikiran
Member

From the thread dumps attached so far, I don't see anything obvious that looks like an issue.

@johnaohara
Member

I could not see anything obvious from the stack traces either. I have obtained some AWS credits to try to recreate this issue in the AWS environment.

@holledauer
Author

Any updates on this?

@MikkelVestergaard

Any update on this?
I tried upgrading to 2.6.3 (running on an AWS EKS cluster) but the issue still exists.
I also upgraded from Java 11 to Java 17, but that did not solve the issue either.

@gsmet
Member

gsmet commented Jan 28, 2022

@johnaohara ^

@u6f6o

u6f6o commented Feb 24, 2022

Any update on this?

We ran into the same issue as well. Currently we have the option to either stick with 2.0.2.Final in native mode or use JVM mode instead. Our services run in k8s on AWS.

The image below shows JVM mode vs. native mode (2.7.1)
image

@gsmet
Member

gsmet commented Feb 24, 2022

If someone could come up with a small reproducer (a small app reproducing the issue and ideally information about the AWS environment you are using), that would be helpful.

Right now, we are shooting in the dark.

@gsmet
Member

gsmet commented Feb 24, 2022

Hmmm, I just saw that @holledauer already provided some information.

@u6f6o would you be able to do the same? It would be interesting to have information about your setup too, to see if we can find common ground.

@johnaohara
Member

I have attempted to recreate this issue with the supplied sample application. I have tried on bare metal, OCP 4.7, AWS ECS and minikube 1.23. In these environments, I am unable to reproduce the increase in CPU usage that people are seeing.

The graph below is running on AWS ECS;

Screenshot from 2022-02-24 19-37-27

The different native image versions that I tested were:

A) quarkus 2.0.2.Final; graalvm 21.1-java11
B) quarkus 2.0.3.Final; graalvm 21.1-java11
C) quarkus 2.1.2.Final; graalvm 21.1-java11
D) quarkus 2.1.2.Final; graalvm 21.2-java11

After building the images with:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

and for GraalVM 21.2-java11:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-native-image:21.2-java11

The Docker images were built with this Dockerfile:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.5-230
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && chown 1001:root /work
COPY --chown=1001:root build/quarkuscpu-1.0-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

If you would like to test the images in your environment that I used, I have pushed the pre-built images to: https://gallery.ecr.aws/k3j0e3f7/johara-quarkus
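For anyone who wants to try the comparison locally first, a sketch of how that could be run (the tag is a placeholder, not an actual tag from that repository, and the pull URI assumes the usual public.ecr.aws mapping of the gallery URL):

docker pull public.ecr.aws/k3j0e3f7/johara-quarkus:<tag>
docker run -d --name quarkus-native-test -p 8080:8080 public.ecr.aws/k3j0e3f7/johara-quarkus:<tag>
docker stats quarkus-native-test    # watch the idle CPU % column, as in the table earlier in this thread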

Could someone who is experiencing this issue please:

  1. Share the Dockerfile that you are using to build the Docker image.
  2. Test the images that I have published (https://gallery.ecr.aws/k3j0e3f7/johara-quarkus) in your environment to see if you see a CPU increase with those images.
  3. When you see the increase in CPU usage, is there also an increase in memory usage? I have not seen this behaviour myself, but am wondering if the container limits are increasing GC activity.

@u6f6o

u6f6o commented Mar 21, 2022

I did some further testing in our k8s cluster with the native images. The images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus did not result in any CPU spikes.

As a next step, I created a small sample project that offers a single RESTEasy hello-world controller. I tested this service with different Quarkus versions: 2.0.2, 2.4.2 and 2.7.5. I experienced the same behaviour: starting with versions >2.0.2, the CPU load was much higher compared to earlier versions.

Since there were no CPU spikes with the vanilla images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus, I tried to get rid of all Quarkus dependencies that are not absolutely mandatory for my sample project, and surprisingly the CPU spikes were gone even for the latest Quarkus version:

image

image

Probably one of the dependencies is causing the CPU spikes. As a next step, I'd start from the vanilla version again, add one dependency at a time, and check whether this has an impact on the CPU consumption or not.

@gsmet
Member

gsmet commented Mar 21, 2022

@u6f6o thanks for the detective work!

@gsmet
Member

gsmet commented Mar 21, 2022

@u6f6o I would start with the Micrometer extensions if I were you.

@gsmet
Member

gsmet commented Mar 21, 2022

And then the quarkus-logging-ecs extension.

@johnaohara
Member

@u6f6o thanks for the update, pinning it down to an extension (or extensions) would be a massive help, thanks

@u6f6o

u6f6o commented Mar 21, 2022

There is a good chance that I found the culprit. I took vanilla Quarkus and added quarkus-micrometer, but it had no negative impact. Afterwards I added the io.micrometer:micrometer-registry-prometheus dependency and the CPU spikes appeared again. After that, I added all dependencies except io.micrometer:micrometer-registry-prometheus and the CPU usage did not spike.

image

@gsmet
Member

gsmet commented Mar 21, 2022

Yeah, that was my natural intuition. Now we need to figure out what's going on :).

@u6f6o

u6f6o commented Mar 21, 2022

2.0.2-Final

+--- io.micrometer:micrometer-registry-prometheus -> 1.7.0
|    +--- io.micrometer:micrometer-core:1.7.0 (*)
|    \--- io.prometheus:simpleclient_common:0.10.0
|         \--- io.prometheus:simpleclient:0.10.0

vs.
2.7.5-Final


+--- io.micrometer:micrometer-registry-prometheus -> 1.8.3
|    +--- io.micrometer:micrometer-core:1.8.3 (*)
|    \--- io.prometheus:simpleclient_common:0.12.0
|         \--- io.prometheus:simpleclient:0.12.0
|              +--- io.prometheus:simpleclient_tracer_otel:0.12.0
|              |    \--- io.prometheus:simpleclient_tracer_common:0.12.0
|              \--- io.prometheus:simpleclient_tracer_otel_agent:0.12.0
|                   \--- io.prometheus:simpleclient_tracer_common:0.12.0

Pinning io.micrometer:micrometer-registry-prometheus and io.micrometer:micrometer-core to 1.7.0 instead of 1.8.3 in Quarkus 2.7.5.Final solved the CPU spikes as well. I did not check, though, whether all metrics and the Quarkus integration are fully working with these versions.

@u6f6o

u6f6o commented Mar 31, 2022

We decided to pin the Micrometer version to 1.7.0 in the build.gradle file:

configurations.all {
    resolutionStrategy {
        eachDependency { DependencyResolveDetails details ->
            // force every io.micrometer artifact back to the last known-good version
            if (details.requested.group == 'io.micrometer') {
                details.useVersion "1.7.0"
            }
        }
    }
}

So far we have not seen any issues with this approach (running in production for two days) and the CPU spikes are gone as well. Use at your own risk though :-)

@gsmet
Member

gsmet commented Mar 31, 2022

@u6f6o even if you don't see a difference in JVM mode, I think it would be interesting to get a flamegraph of what your app is doing.
We have some interesting guidance here for getting flamegraphs: https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md .

If you end up doing that, you can send the obtained flamegraphs to me at gsmet at redhat dot com (the ideal would be one without your workaround and one with your Micrometer version pinning workaround, so both codebases are as close as possible).

Thanks!

@u6f6o

u6f6o commented Mar 31, 2022

@gsmet: I suppose you are interested in the CPU flame graphs, right? We have kubectl-flame in place, which under the hood uses async-profiler. There is one thing though: I can only use -e itimer for the flame graphs:

If changing the configuration is not possible, you may fall back to -e itimer profiling mode. It is similar to cpu mode, but does not require perf_events support. As a drawback, there will be no kernel stack traces.

Is this a problem? Which versions would you be interested in:

  • quarkus 2.7.5 with micrometer 1.7.0
  • quarkus 2.7.5 with micrometer 1.7.2 (I believe this is the first one that shows CPU spikes in native mode)

?

@gsmet
Member

gsmet commented Mar 31, 2022

CPU and allocations would be great if you can get them.
Ideally, the working and non-working setups should use versions as close as possible, so your proposal looks good if you observe the issue with Quarkus 2.7.5 and Micrometer 1.7.2.

@u6f6o

u6f6o commented Apr 8, 2022

I did some more testing in the past few days. CPU allocation graphs did not show anything obvious, but I found out that the CPU spikes are related to a change in Micrometer's reflection-config.json. This file was not present before 1.7.2 (or maybe 1.7.1), and building my application with the latest Micrometer version (1.8.x) but without this file also solved the CPU spikes for me.

The issue only seems to show up in our k8s cluster (on AWS) though. Running the same images in local Docker did not cause any CPU spikes. 😩

@u6f6o

u6f6o commented May 11, 2022

Another workaround that allows you to use the latest Quarkus version (2.8.3.Final in my case) is to disable the file descriptor metrics that cause the CPU spikes:

import io.micrometer.core.instrument.Meter
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.core.instrument.config.MeterFilterReply
import javax.enterprise.inject.Produces
import javax.inject.Singleton

@Singleton
class MicrometerConfig {

    @Produces
    @Singleton
    fun disableFileDescriptorMetrics() = object : MeterFilter {
        override fun accept(id: Meter.Id): MeterFilterReply =
            if (id.name.startsWith("process.files")) {
                MeterFilterReply.DENY
            } else {
                super.accept(id)
            }
    }
}

IMO this is the nicer approach, as it allows us to use the latest Micrometer libs.
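A quick way to verify the filter is effective (a sketch; it assumes the default Quarkus Prometheus endpoint at /q/metrics):

curl -s http://localhost:8080/q/metrics | grep process_files
# with the filter in place, the process_files_open_files / process_files_max_files
# gauges should no longer appear in the output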

@gsmet
Member

gsmet commented May 11, 2022

@ebullient @maxandersen I think we need to do something about this.

Executive summary: on AWS, in native, some metrics that were not enabled previously (because of missing native config) are now enabled and they are causing CPU spikes.

@ebullient
Member

@jonatan-ivanov and @shakuzen ... here.

I've opened the above issue to reference this one.

@shakuzen

shakuzen commented May 12, 2022

Hello all, I'm one of the Micrometer maintainers. Thank you for all the great investigation into this. For file descriptor metrics, we're using the JMX OperatingSystemMXBean, which in native images up to GraalVM 22.1 results in PosixSubstrateOperatingSystemMXBean being used at runtime:
https://github.com/oracle/graal/blob/da6a62d607552fc958ccb63f4ca1d81e1817cadc/substratevm/src/com.oracle.svm.core.posix/src/com/oracle/svm/core/posix/PosixSubstrateOperatingSystemMXBean.java#L66-L85

I suspect there is something about the implementation there and the file descriptor configuration or operating system when running on AWS that is causing it to use a significant amount of CPU.
If you are not seeing the issue when running on the JVM, though, I think there's a good chance the issue will be resolved with GraalVM 22.2, where the JVM's OperatingSystemMXBean implementations will be used instead of substrate-specific substitutions - see oracle/graal#4383.

@u6f6o could you try compiling your app with a 22.2 dev build of GraalVM to see if the issue goes away without workarounds?

Unfortunately, I'm not sure there is much we can do in Quarkus or Micrometer to address this, since we don't control the implementation of OperatingSystemMXBean that will be available at runtime.

@u6f6o

u6f6o commented May 12, 2022

@shakuzen: I tried to build our service with the dev build you mentioned and the CPU spikes vanished:
image

@shakuzen

@u6f6o thank you for quickly trying that out. I think that hopefully confirms this will be fixed in the 22.2 GraalVM release, which is scheduled for July. If it is important to fix this without the above-mentioned workaround prior to the 22.2 release, I would recommend opening an issue in the GraalVM repo. This should be reproducible without Quarkus or Micrometer with something like the following, but I haven't tried deploying it to AWS.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.sun.management.UnixOperatingSystemMXBean;

// example wrapper class; the name is arbitrary
public class FdMetricsCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        if (operatingSystemMXBean instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOsBean = (UnixOperatingSystemMXBean) operatingSystemMXBean;
            scheduler.scheduleAtFixedRate(() -> {
                System.out.println("File descriptor metrics; open: " + unixOsBean.getOpenFileDescriptorCount() + ", max: " + unixOsBean.getMaxFileDescriptorCount());
            }, 0, 30, TimeUnit.SECONDS);
        } else {
            System.err.println("Cannot get file descriptor metrics on non-unix-like systems.");
        }
    }
}
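To run that reproducer standalone as a native image, something like the following should work (the class and image names follow the arbitrary wrapper class above; this uses the plain GraalVM native-image tool, no Quarkus involved):

javac FdMetricsCheck.java
native-image FdMetricsCheck fdmetricscheck
./fdmetricscheck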

@u6f6o

u6f6o commented May 13, 2022

I am personally fine with the second workaround for the time being, as our k8s setup also publishes these metrics at pod and instance level. Thus leaving them out while still being able to use the latest Micrometer libs is good enough in our case.

@ebullient
Member

I'll close this, as the issue is fixed with GraalVM 22.2, and there are several noted workarounds.
