
CPU load increased 10x with Quarkus 2.1.* (native image) #19359

Closed
holledauer opened this issue Aug 12, 2021 · 52 comments
Labels
kind/bug Something isn't working

Comments

@holledauer

Describe the bug

When upgrading to any Quarkus 2.1.* version, the CPU usage increases 10x (just the base load, without any requests).

Expected behavior

There shouldn't be an increase in CPU usage.

Actual behavior

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)
image

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

GraalVM 21.2.0 Java 11 CE

GraalVM version (if different from Java)

GraalVM 21.2.0 Java 11 CE

Quarkus version or git rev

2.1.2.Final

Build tool (ie. output of mvnw --version or gradlew --version)

Gradle 6.6.1

Additional information

No response

@holledauer holledauer added the kind/bug Something isn't working label Aug 12, 2021
@gsmet
Member

gsmet commented Aug 12, 2021

Do you reproduce the issue with a JVM run? In that case, you could easily get a profile using async-profiler (see https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md for information about it).
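For reference, a rough sketch of such a profiling run (assuming the app runs in JVM mode as java -jar .../quarkus-run.jar and that async-profiler has been downloaded locally; see the TROUBLESHOOTING doc above for the authoritative steps):

# find the JVM-mode Quarkus process and capture a 60s CPU flamegraph with async-profiler
PID=$(jps -l | grep quarkus-run.jar | awk '{print $1}')
./profiler.sh -e cpu -d 60 -f /tmp/quarkus-cpu.html "$PID"

The resulting flamegraph file can then be attached to this issue.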

Another question: are you building both with the same version of GraalVM?

/cc @johnaohara maybe you can help on this one?

@johnaohara
Member

@gsmet sure, I'll take a look at this one.

@holledauer as Guillaume mentioned, if you can reproduce in JVM mode, a flamegraph from async-profiler would be really useful. If that is not possible, a reproducer based on your application would also be a great help.

Thanks

@holledauer
Author

Both were built with GraalVM 21.2.0 Java 11 CE

@holledauer
Author

In JVM mode I can't see a difference in the CPU load.

@holledauer
Author

Are you able to reproduce it? I think it should be reproducible with any Quarkus app because there is nothing special in my service (it uses Kotlin, it is built with Gradle, it uses resteasy-reactive).

@johnaohara
Member

@holledauer I am not able to reproduce with a simple app:

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7dd58131eec5 quarkus-native-2.1.2.Final 0.00% 12.71MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69
c599c9d0e6a3 quarkus-native-2.0.2.Final 0.00% 12MiB / 62.81GiB 0.02% 0B / 0B 0B / 0B 69

What extensions do you have installed?
E.g. from the console: Installed features: [cdi, kotlin, resteasy-reactive, smallrye-context-propagation]

@holledauer
Author

I added a minimal app that reproduces the CPU issue.

It is built with ./gradlew build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

quarkuscpu.zip

@holledauer
Author

Here you can see what happens when I deploy this small sample app on k8s (compared to the full-blown app with Quarkus 2.0)
image

@johnaohara
Member

Thanks for sharing; the application is very similar to the app that I tested locally. Are you deploying the application to a vanilla K8s environment? Do you know what container runtime is used in the cluster you are using? Thanks

@holledauer
Author

holledauer commented Aug 13, 2021

AWS EKS with docker-shim and dockerd

NAME   STATUS  ROLES   AGE  VERSION                    INTERNAL-IP   EXTERNAL-IP  OS-IMAGE        KERNEL-VERSION               CONTAINER-RUNTIME  ZONE           NODEGROUP-IMAGE
NodeA  Ready   <none>  8d   v1.21.2-13+d2965f0db10712  10.21.22.131  1.2.3.4      Amazon Linux 2  5.4.129-63.229.amzn2.x86_64  docker://19.3.13   eu-central-1a  ami-06f3aa48664a0203d

requests: cpu 10m, mem 30Mi. limit: mem 30Mi

@holledauer
Author

Still the same with 2.2.0.

@MikkelVestergaard

I have the same issue. Still the same with 2.2.2.
I also tested other versions, and 2.0.2 was the latest version where I did not see the issue. I also have the issue with 2.0.3, so it must be something that changed between 2.0.2 and 2.0.3.

@johnaohara
Member

I tried recreating this issue running on OpenShift 4.7, and I cannot reproduce it. I tried with GraalVM CE 21.1 and 21.2. Please see below:

(screenshots: cpu-native-2.0.2.Final, cpu-native-2.1.2.Final, cpu-native-2.2.2.Final, cpu-native-2.2.2.Final-grl-21.2)

I suspect the issue depends on the environment; at present I do not have access to an AWS EKS instance.

One option to investigate this issue would be to try recording CPU samples with JFR for the different Quarkus versions: https://www.graalvm.org/reference-manual/native-image/JFR/
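A rough sketch of what that could look like with GraalVM 21.x (flag names are taken from the GraalVM JFR docs linked above, and quarkus.native.additional-build-args is how extra flags are passed to the native-image build, so treat this as an untested outline):

# build the native image with VM inspection enabled (required for JFR support in GraalVM 21.x)
./gradlew build -x test -Dquarkus.package.type=native \
    -Dquarkus.native.additional-build-args=-H:+AllowVMInspection

# run the binary with a JFR recording, then inspect recording.jfr (e.g. in JDK Mission Control)
./application -XX:+FlightRecorder -XX:StartFlightRecording=filename=recording.jfr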

@johnaohara
Member

@MikkelVestergaard

I have the same issue. Still the same with 2.2.2.

Are you also running on AWS EKS, or a different environment?

@jaikiran
Member

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using jstack, for example) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds, and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are being fired/processed. You can then attach those to this issue.
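If you can capture these from a JVM-mode run of the same versions, the dumps can be scripted, for example (a sketch; it assumes jps and jstack from the JDK are on the PATH and that the app was started with java -jar .../quarkus-run.jar):

# take 4 thread dumps, 5 seconds apart, of the running Quarkus JVM
PID=$(jps -l | grep quarkus-run.jar | awk '{print $1}')
for i in 1 2 3 4; do
    jstack "$PID" > "dump_${i}.txt"
    sleep 5
done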

@MikkelVestergaard

Are you also running on AWS EKS, or a different environment?

I am running on a self-managed Kubernetes cluster (on AWS platform).

@holledauer
Author

Hello @holledauer,

In the image you can see the CPU load for Quarkus 2.0.2.Final and 2.1.2.Final (the low CPU usage is 2.0.2.Final)

Given that you have access to this setup and are able to test both 2.0.2.Final and 2.1.2.Final to reproduce this, perhaps you could generate 3-4 thread dumps (using jstack, for example) against each of these instances? Try to generate each thread dump with a gap of around 5 seconds, and do this against both the 2.0.2.Final and 2.1.2.Final versions. Maybe that will give some hint about what's going on during that period, when no requests are being fired/processed. You can then attach those to this issue.

How can I create a thread dump with a native image built with GraalVM CE?

@johnaohara
Member

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process.
$ kill -3 {PID} from within the container will print the stack trace to the container console
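On Kubernetes that could look roughly like this (a sketch; the pod name is a placeholder, and PID 1 assumes the native binary is the container's entrypoint process):

kubectl exec <pod-name> -- kill -3 1      # send SIGQUIT to the native process
kubectl logs <pod-name> > threads.txt     # the stack traces end up in the container log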

@holledauer
Author

How can I create a thread dump with a native image built with GraalVM CE?

Native images will still dump the stack traces if you send a SIGQUIT signal to the process.
$ kill -3 {PID} from within the container will print the stack trace to the container console

That only works with the Enterprise Edition (see docs)

@johnaohara
Member

Stack traces work in CE, heap dumps require EE. I think the way that doc is worded is misleading.

@holledauer
Author

Thread dumps are attached
dump_2_0_2.txt
dump_2_1_2.txt

@AndreiYu

Hi, is there any update on this? It seems we face the same issue with Quarkus 2.1+ running as a native image in AWS as a Docker container.

@jaikiran
Member

From the thread dumps attached so far, I don't see anything obvious that looks like an issue.

@johnaohara
Member

I could not see anything obvious from the stack traces either. I have obtained some AWS credits to try to recreate this issue in the AWS environment.

@holledauer
Author

Any updates on this?

@MikkelVestergaard

Any update on this?
I tried upgrading to 2.6.3 (running on an AWS EKS cluster) but the issue still exists.
I also upgraded from Java 11 to Java 17, but that did not solve the issue either.

@gsmet
Member

gsmet commented Jan 28, 2022

@johnaohara ^

@u6f6o

u6f6o commented Feb 24, 2022

Any update on this?

We ran into the same issue as well. Currently we have the option to either stick with 2.0.2.Final in native mode or use JVM mode instead. Our services run in k8s on AWS.

The image below shows JVM mode vs. native mode (2.7.1)
image

@gsmet
Member

gsmet commented Feb 24, 2022

If someone could come up with a small reproducer (a small app reproducing the issue and ideally information about the AWS environment you are using), that would be helpful.

Right now, we are shooting in the dark.

@gsmet
Member

gsmet commented Feb 24, 2022

Hmmm, I just saw that @holledauer already provided some information.

@u6f6o would you be able to do the same? It would be interesting to have information about your setup too, to see if we can find common ground.

@johnaohara
Member

I have attempted to recreate this issue with the supplied sample application. I have tried on bare metal, OCP 4.7, AWS ECS and minikube 1.23. In these environments, I am unable to reproduce the increase in CPU usage that people are seeing.

The graph below is running on AWS ECS;

Screenshot from 2022-02-24 19-37-27

The different native image versions that I tested were:

A) quarkus 2.0.2.Final; graalvm 21.1-java11
B) quarkus 2.0.3.Final; graalvm 21.1-java11
C) quarkus 2.1.2.Final; graalvm 21.1-java11
D) quarkus 2.1.2.Final; graalvm 21.2-java11

After building the images with:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true

and for GraalVM 21.2-java11:

./gradlew clean build -x test -Dquarkus.package.type=native -Dquarkus.container-image.build=true -Dquarkus.native.builder-image=quay.io/quarkus/ubi-quarkus-native-image:21.2-java11

The Docker images were built with this Dockerfile:

FROM registry.access.redhat.com/ubi8/ubi-minimal:8.5-230
WORKDIR /work/
RUN chown 1001 /work \
    && chmod "g+rwX" /work \
    && chown 1001:root /work
COPY --chown=1001:root build/quarkuscpu-1.0-runner /work/application

EXPOSE 8080
USER 1001

CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]

If you would like to test the images in your environment that I used, I have pushed the pre-built images to: https://gallery.ecr.aws/k3j0e3f7/johara-quarkus
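For anyone who wants to try the comparison locally first, a sketch of how that could be run (the tag is a placeholder, not an actual tag from that repository, and the pull URI assumes the usual public.ecr.aws mapping of the gallery URL):

docker pull public.ecr.aws/k3j0e3f7/johara-quarkus:<tag>
docker run -d --name quarkus-native-test -p 8080:8080 public.ecr.aws/k3j0e3f7/johara-quarkus:<tag>
docker stats quarkus-native-test    # watch the idle CPU % column, as in the table earlier in this thread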

Could someone who is experiencing this issue please:

  1. Share the Dockerfile that you are using to build the Docker image.
  2. Test the images that I have published (https://gallery.ecr.aws/k3j0e3f7/johara-quarkus) in your environment to see if you see a CPU increase with those images.
  3. When you see the increase in CPU usage, is there also an increase in memory usage? I have not seen this behaviour myself, but am wondering if the container limits are increasing GC activity.

@u6f6o

u6f6o commented Mar 21, 2022

I did some further testing in our k8s cluster with the native images. The images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus did not result in any CPU spikes.

As a next step, I created a small sample project that offers a single RESTEasy hello-world controller. I tested this service with different Quarkus versions: 2.0.2, 2.4.2 and 2.7.5. I experienced the same behaviour: starting with versions >2.0.2, the CPU load was much higher compared to earlier versions.

Since there were no CPU spikes with the vanilla images from https://gallery.ecr.aws/k3j0e3f7/johara-quarkus, I tried to get rid of all Quarkus dependencies that are not absolutely mandatory for my sample project, and surprisingly the CPU spikes were gone even for the latest Quarkus version:

image

image

Probably one of the dependencies is causing the CPU spikes. As a next step, I'd start from the vanilla version again, add one dependency at a time, and check whether this has an impact on the CPU consumption or not.

@gsmet
Member

gsmet commented Mar 21, 2022

@u6f6o thanks for the detective work!

@gsmet
Member

gsmet commented Mar 21, 2022

@u6f6o I would start with the Micrometer extensions if I were you.

@gsmet
Member

gsmet commented Mar 21, 2022

And then the quarkus-logging-ecs extension.

@johnaohara
Member

@u6f6o thanks for the update, pinning it down to an extension (or extensions) would be a massive help, thanks

@u6f6o

u6f6o commented Mar 21, 2022

There is a good chance that I found the culprit. I took vanilla Quarkus and added quarkus-micrometer, but it had no negative impact. Afterwards I added the io.micrometer:micrometer-registry-prometheus dependency and the CPU spikes appeared again. After that, I added all dependencies except io.micrometer:micrometer-registry-prometheus and the CPU usage did not spike.

image

@gsmet
Member

gsmet commented Mar 21, 2022

Yeah, that was my natural intuition. Now we need to figure out what's going on :).

@u6f6o

u6f6o commented Mar 21, 2022

2.0.2-Final

+--- io.micrometer:micrometer-registry-prometheus -> 1.7.0
|    +--- io.micrometer:micrometer-core:1.7.0 (*)
|    \--- io.prometheus:simpleclient_common:0.10.0
|         \--- io.prometheus:simpleclient:0.10.0

vs.
2.7.5-Final


+--- io.micrometer:micrometer-registry-prometheus -> 1.8.3
|    +--- io.micrometer:micrometer-core:1.8.3 (*)
|    \--- io.prometheus:simpleclient_common:0.12.0
|         \--- io.prometheus:simpleclient:0.12.0
|              +--- io.prometheus:simpleclient_tracer_otel:0.12.0
|              |    \--- io.prometheus:simpleclient_tracer_common:0.12.0
|              \--- io.prometheus:simpleclient_tracer_otel_agent:0.12.0
|                   \--- io.prometheus:simpleclient_tracer_common:0.12.0

Pinning io.micrometer:micrometer-registry-prometheus and io.micrometer:micrometer-core to 1.7.0 instead of 1.8.3 in Quarkus 2.7.5.Final solved the CPU spikes as well. I did not check, though, whether all metrics and the Quarkus integration are fully working with these versions.

@u6f6o

u6f6o commented Mar 31, 2022

We decided to pin the Micrometer version to 1.7.0 in the build.gradle file:

configurations.all {
    resolutionStrategy {
        eachDependency { DependencyResolveDetails details ->
            // force every io.micrometer artifact back to the last known-good version
            if (details.requested.group == 'io.micrometer') {
                details.useVersion "1.7.0"
            }
        }
    }
}

So far we have not seen any issues with this approach (running in production for two days) and the CPU spikes are gone as well. Use at your own risk though :-)

@gsmet
Member

gsmet commented Mar 31, 2022

@u6f6o even if you don't see a difference in JVM mode, I think it would be interesting to get a flamegraph of what your app is doing.
We have some interesting guidance here for getting flamegraphs: https://github.com/quarkusio/quarkus/blob/main/TROUBLESHOOTING.md .

If you end up doing that, you can send the obtained flamegraphs to me at gsmet at redhat dot com (the ideal would be one without your workaround and one with your Micrometer version pinning workaround, so both codebases are as close as possible).

Thanks!

@u6f6o

u6f6o commented Mar 31, 2022

@gsmet: I suppose you are interested in the CPU flame graphs, right? We have kubectl-flame in place, which under the hood uses async-profiler. There is one thing though: I can only use -e itimer for the flame graphs:

If changing the configuration is not possible, you may fall back to -e itimer profiling mode. It is similar to cpu mode, but does not require perf_events support. As a drawback, there will be no kernel stack traces.

Is this a problem? Which versions would you be interested in:

  • quarkus 2.7.5 with micrometer 1.7.0
  • quarkus 2.7.5 with micrometer 1.7.2 (I believe this is the first one that shows CPU spikes in native mode)

?

@gsmet
Member

gsmet commented Mar 31, 2022

CPU and allocations would be great if you can get them.
Ideally, the working and non-working setups should use versions as close as possible, so your proposal looks good if you observe the issue with Quarkus 2.7.5 and Micrometer 1.7.2.

@u6f6o

u6f6o commented Apr 8, 2022

I did some more testing in the past few days. CPU allocation graphs did not show anything obvious, but I found out that the CPU spikes are related to a change in Micrometer's reflection-config.json. This file was not present before 1.7.2 (or maybe 1.7.1), and building my application with the latest Micrometer version (1.8.x) but without this file also solved the CPU spikes for me.

The issue only seems to show up in our k8s cluster (on AWS) though. Running the same images in local Docker did not cause any CPU spikes. 😩

@u6f6o

u6f6o commented May 11, 2022

Another workaround that allows you to use the latest Quarkus version (2.8.3.Final in my case) is to disable the file descriptor metrics that cause the CPU spikes:

import io.micrometer.core.instrument.Meter
import io.micrometer.core.instrument.config.MeterFilter
import io.micrometer.core.instrument.config.MeterFilterReply
import javax.enterprise.inject.Produces
import javax.inject.Singleton

@Singleton
class MicrometerConfig {

    @Produces
    @Singleton
    fun disableFileDescriptorMetrics() = object : MeterFilter {
        override fun accept(id: Meter.Id): MeterFilterReply =
            if (id.name.startsWith("process.files")) {
                MeterFilterReply.DENY
            } else {
                super.accept(id)
            }
    }
}

IMO this is the nicer approach, as it allows us to use the latest Micrometer libs.
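A quick way to verify the filter is effective (a sketch; it assumes the default Quarkus Prometheus endpoint at /q/metrics):

curl -s http://localhost:8080/q/metrics | grep process_files
# with the filter in place, the process_files_open_files / process_files_max_files
# gauges should no longer appear in the output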

@gsmet
Member

gsmet commented May 11, 2022

@ebullient @maxandersen I think we need to do something about this.

Executive summary: on AWS, in native, some metrics that were not enabled previously (because of missing native config) are now enabled and they are causing CPU spikes.

@ebullient
Member

@jonatan-ivanov and @shakuzen ... here.

I've opened the above issue to reference this one.

@shakuzen

shakuzen commented May 12, 2022

Hello all, I'm one of the Micrometer maintainers. Thank you for all the great investigation into this. For file descriptor metrics, we're using the JMX OperatingSystemMXBean, which in native images up to GraalVM 22.1 results in PosixSubstrateOperatingSystemMXBean being used at runtime:
https://github.com/oracle/graal/blob/da6a62d607552fc958ccb63f4ca1d81e1817cadc/substratevm/src/com.oracle.svm.core.posix/src/com/oracle/svm/core/posix/PosixSubstrateOperatingSystemMXBean.java#L66-L85

I suspect there is something about the implementation there and the file descriptor configuration or operating system when running on AWS that is causing it to use a significant amount of CPU.
If you are not seeing the issue when running on the JVM, though, I think there's a good chance the issue will be resolved with GraalVM 22.2, where the JVM's OperatingSystemMXBean implementations will be used instead of substrate-specific substitutions - see oracle/graal#4383.

@u6f6o could you try compiling your app with a 22.2 dev build of GraalVM to see if the issue goes away without workarounds?

Unfortunately, I'm not sure there is much we can do in Quarkus or Micrometer to address this, since we don't control the implementation of OperatingSystemMXBean that will be available at runtime.

@u6f6o

u6f6o commented May 12, 2022

@shakuzen: I tried to build our service with the dev build you mentioned and the CPU spikes vanished:
image

@shakuzen

@u6f6o thank you for quickly trying that out. I think that hopefully confirms this will be fixed in the 22.2 GraalVM release, which is scheduled for July. If it is important to fix this without the above-mentioned workaround prior to the 22.2 release, I would recommend opening an issue in the GraalVM repo. This should be reproducible without Quarkus or Micrometer with something like the following, but I haven't tried deploying it to AWS.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import com.sun.management.UnixOperatingSystemMXBean;

// example wrapper class; the name is arbitrary
public class FdMetricsCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        if (operatingSystemMXBean instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOsBean = (UnixOperatingSystemMXBean) operatingSystemMXBean;
            scheduler.scheduleAtFixedRate(() -> {
                System.out.println("File descriptor metrics; open: " + unixOsBean.getOpenFileDescriptorCount() + ", max: " + unixOsBean.getMaxFileDescriptorCount());
            }, 0, 30, TimeUnit.SECONDS);
        } else {
            System.err.println("Cannot get file descriptor metrics on non-unix-like systems.");
        }
    }
}
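To run that reproducer standalone as a native image, something like the following should work (the class and image names follow the arbitrary wrapper class above; this uses the plain GraalVM native-image tool, no Quarkus involved):

javac FdMetricsCheck.java
native-image FdMetricsCheck fdmetricscheck
./fdmetricscheck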

@u6f6o

u6f6o commented May 13, 2022

I am personally fine with the second workaround for the time being, as our k8s setup also publishes these metrics at pod and instance level. Thus leaving them out while still being able to use the latest Micrometer libs is good enough in our case.

@ebullient
Member

I'll close this, as the issue is fixed with GraalVM 22.2, and there are several noted workarounds.
