High CPU usage after update #1000

wodCZ · 2021-06-21T15:50:26Z

Version Information

Version: branch 21.6.1

Steps to Reproduce

We were running on a version roughly 2-3 months old, on master, I believe Sentry showed 21.6.dev0?
I've checked out 21.6.1 and ran ./install.sh. There was an error with a docker network having an active endpoint, so I restarted docker, then install script finished and sentry started.
The CPU usage went up noticeably. We're running on aws, t3a.large instance (2vcpu, 8gb ram). The server began to burn through the CPU credits it saved up.
At first I thought there was some new processing added and maybe Sentry is chewing through the old issues, but I couldn't really find any process that would confirm this.
Using Netdata I noticed a regular spikes in CPU usage for redis, memcached and postgres containers. Looking through commits, feat: Add healthchecks for redis, memcached and postgres #975 matched the description. I commented out the added lines in my docker-compose and the issue went away.

Expected Result

Relatively low cpu usage even though the instance isn't a massive beast. We have ~890 events per 24hrs.

Actual Result

The CPU was constantly spiking,

Logs

What you saw along the way, e.g.:

Latest install logs

▶ Parsing command line ...

▶ Setting up error handling ...

▶ Checking minimum requirements ...
WARN: Recommended minimum CPU cores available to Docker is 4, found 2

▶ Creating volumes for persistent storage ...
Created sentry-clickhouse.
Created sentry-data.
Created sentry-kafka.
Created sentry-postgres.
Created sentry-redis.
Created sentry-symbolicator.
Created sentry-zookeeper.

▶ Ensuring files from examples ...
../sentry/sentry.conf.py already exists, skipped creation.
../sentry/config.yml already exists, skipped creation.
../symbolicator/config.yml already exists, skipped creation.
../sentry/requirements.txt already exists, skipped creation.

▶ Generating secret key ...

▶ Replacing TSDB ...

▶ Fetching and updating Docker images ...
Some service image(s) must be built from source by running:
    docker-compose build snuba-transactions-cleanup snuba-cleanup symbolicator-cleanup sentry-cleanup
nightly: Pulling from getsentry/sentry
Digest: sha256:46a338a8e784066e9e5791da3083f992d3cd30513947399d59416d0e74083abb
Status: Image is up to date for getsentry/sentry:nightly
docker.io/getsentry/sentry:nightly

▶ Building and tagging Docker images ...

smtp uses an image, skipping
memcached uses an image, skipping
redis uses an image, skipping
postgres uses an image, skipping
zookeeper uses an image, skipping
kafka uses an image, skipping
clickhouse uses an image, skipping
geoipupdate uses an image, skipping
snuba-api uses an image, skipping
snuba-consumer uses an image, skipping
snuba-outcomes-consumer uses an image, skipping
snuba-sessions-consumer uses an image, skipping
snuba-transactions-consumer uses an image, skipping
snuba-replacer uses an image, skipping
snuba-subscription-consumer-events uses an image, skipping
snuba-subscription-consumer-transactions uses an image, skipping
symbolicator uses an image, skipping
web uses an image, skipping
cron uses an image, skipping
worker uses an image, skipping
ingest-consumer uses an image, skipping
post-process-forwarder uses an image, skipping
subscription-consumer-events uses an image, skipping
subscription-consumer-transactions uses an image, skipping
relay uses an image, skipping
nginx uses an image, skipping
Building snuba-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> d88d98e289f8
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 358117da8fd4
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> f62a846a565e
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> e845cbc5b471

Successfully built e845cbc5b471
Successfully tagged snuba-cleanup-onpremise-local:latest
Building snuba-transactions-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> d88d98e289f8
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 358117da8fd4
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> f62a846a565e
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> e845cbc5b471

Successfully built e845cbc5b471
Successfully tagged snuba-cleanup-onpremise-local:latest
Building symbolicator-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> 3d7425d7b53f
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 2f627c3d4a61
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> 7ea338ba7ecc
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> 28fc8186aef1

Successfully built 28fc8186aef1
Successfully tagged symbolicator-cleanup-onpremise-local:latest
Building sentry-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> 54a53ba83510
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> f5abb0673180
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> 5d11e01c10f4
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> 7e01dd084207

Successfully built 7e01dd084207
Successfully tagged sentry-cleanup-onpremise-local:latest

Docker images built.

▶ Turning things off ...
Removing network onpremise_default
Network onpremise_default not found.
Removing network sentry_onpremise_default
Network sentry_onpremise_default not found.

▶ Setting up Zookeeper ...
Creating network "sentry_onpremise_default" with the default driver
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done

▶ Downloading and installing wal2json ...
Unable to find image 'curlimages/curl:latest' locally
latest: Pulling from curlimages/curl
339de151aab4: Pulling fs layer
9576f9a419e9: Pulling fs layer
5b67e0e9bce9: Pulling fs layer
7c0640a6c519: Pulling fs layer
1d8f7d50664a: Pulling fs layer
3ce74da83aeb: Pulling fs layer
7094fabf5e16: Pulling fs layer
2f4d0b649c1f: Pulling fs layer
110e7f874674: Pulling fs layer
7c0640a6c519: Waiting
1d8f7d50664a: Waiting
3ce74da83aeb: Waiting
7094fabf5e16: Waiting
2f4d0b649c1f: Waiting
110e7f874674: Waiting
9576f9a419e9: Verifying Checksum
9576f9a419e9: Download complete
5b67e0e9bce9: Verifying Checksum
5b67e0e9bce9: Download complete
339de151aab4: Verifying Checksum
339de151aab4: Pull complete
1d8f7d50664a: Verifying Checksum
1d8f7d50664a: Download complete
7c0640a6c519: Verifying Checksum
7c0640a6c519: Download complete
3ce74da83aeb: Verifying Checksum
3ce74da83aeb: Download complete
9576f9a419e9: Pull complete
5b67e0e9bce9: Pull complete
7c0640a6c519: Pull complete
7094fabf5e16: Verifying Checksum
7094fabf5e16: Download complete
2f4d0b649c1f: Verifying Checksum
2f4d0b649c1f: Download complete
110e7f874674: Verifying Checksum
110e7f874674: Download complete
1d8f7d50664a: Pull complete
3ce74da83aeb: Pull complete
7094fabf5e16: Pull complete
2f4d0b649c1f: Pull complete
110e7f874674: Pull complete
Digest: sha256:6e0a786e3e5181df00eaf3a0a1749c18a6bb20b01c9bd192ea72176ce8a1c94b
Status: Downloaded newer image for curlimages/curl:latest
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5416  100  5416    0     0  35922      0 --:--:-- --:--:-- --:--:-- 36106
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   638  100   638    0     0   3656      0 --:--:-- --:--:-- --:--:--  3666
100  101k  100  101k    0     0   290k      0 --:--:-- --:--:-- --:--:--  290k

▶ Bootstrapping and migrating Snuba ...
Creating sentry_onpremise_redis_1 ...
Creating sentry_onpremise_clickhouse_1 ...
Creating sentry_onpremise_zookeeper_1  ...
Creating sentry_onpremise_clickhouse_1 ... done
Creating sentry_onpremise_redis_1      ... done
Creating sentry_onpremise_zookeeper_1  ... done
Creating sentry_onpremise_kafka_1      ...
Creating sentry_onpremise_kafka_1      ... done
Creating sentry_onpremise_snuba-api_run ...
Creating sentry_onpremise_snuba-api_run ... done
2021-06-21 10:29:44,125 Attempting to connect to Kafka (attempt 0)...
2021-06-21 10:29:46,145 Attempting to connect to Kafka (attempt 1)...
2021-06-21 10:29:48,149 Attempting to connect to Kafka (attempt 2)...
2021-06-21 10:29:50,152 Attempting to connect to Kafka (attempt 3)...
2021-06-21 10:29:51,143 Connected to Kafka on attempt 3
2021-06-21 10:29:51,144 Creating Kafka topics...
2021-06-21 10:29:51,936 Topic event-replacements-legacy created
2021-06-21 10:29:51,936 Topic ingest-metrics created
2021-06-21 10:29:51,937 Topic snuba-queries created
Starting sentry_onpremise_zookeeper_1 ...
Starting sentry_onpremise_clickhouse_1 ...
Starting sentry_onpremise_redis_1      ...
Starting sentry_onpremise_redis_1      ... done
Starting sentry_onpremise_clickhouse_1 ... done
Starting sentry_onpremise_zookeeper_1  ... done
Starting sentry_onpremise_kafka_1      ...
Starting sentry_onpremise_kafka_1      ... done
Creating sentry_onpremise_snuba-api_run ...
Creating sentry_onpremise_snuba-api_run ... done
Starting migration from 2021-06-21
Migrated 2021-06-21. (1 of 13 partitions done)
Migrated 2021-06-14. (2 of 13 partitions done)
Migrated 2021-06-07. (3 of 13 partitions done)
Migrated 2021-05-31. (4 of 13 partitions done)
Migrated 2021-05-24. (5 of 13 partitions done)
Migrated 2021-05-17. (6 of 13 partitions done)
Migrated 2021-05-10. (7 of 13 partitions done)
Migrated 2021-05-03. (8 of 13 partitions done)
Migrated 2021-04-26. (9 of 13 partitions done)
Migrated 2021-04-19. (10 of 13 partitions done)
Migrated 2021-04-12. (11 of 13 partitions done)
Migrated 2021-04-05. (12 of 13 partitions done)
Migrated 2021-03-29. (13 of 13 partitions done)
Done. Optimizing.
Finished running migrations

▶ Creating additional Kafka topics ...

▶ Ensuring proper PostgreSQL version ...

▶ Setting up / migrating database ...
Creating sentry_onpremise_postgres_1 ...
Starting sentry_onpremise_redis_1    ...
Starting sentry_onpremise_zookeeper_1 ...
Creating sentry_onpremise_symbolicator_1 ...
Starting sentry_onpremise_zookeeper_1    ... done
Starting sentry_onpremise_redis_1        ... done
Starting sentry_onpremise_clickhouse_1   ...
Starting sentry_onpremise_clickhouse_1   ... done
Creating sentry_onpremise_memcached_1    ...
Starting sentry_onpremise_kafka_1        ...
Starting sentry_onpremise_kafka_1        ... done
Creating sentry_onpremise_smtp_1         ...
Creating sentry_onpremise_snuba-api_1    ...
Creating sentry_onpremise_snuba-consumer_1 ...
Creating sentry_onpremise_snuba-replacer_1 ...
Creating sentry_onpremise_snuba-subscription-consumer-events_1 ...
Creating sentry_onpremise_snuba-sessions-consumer_1            ...
Creating sentry_onpremise_snuba-outcomes-consumer_1            ...
Creating sentry_onpremise_snuba-subscription-consumer-transactions_1 ...
Creating sentry_onpremise_snuba-transactions-consumer_1              ...
Creating sentry_onpremise_postgres_1                                 ... done
Creating sentry_onpremise_snuba-api_1                                ... done
Creating sentry_onpremise_symbolicator_1                             ... done
Creating sentry_onpremise_smtp_1                                     ... done
Creating sentry_onpremise_snuba-consumer_1                           ... done
Creating sentry_onpremise_memcached_1                                ... done
Creating sentry_onpremise_snuba-replacer_1                           ... done
Creating sentry_onpremise_snuba-subscription-consumer-events_1       ... done
Creating sentry_onpremise_snuba-outcomes-consumer_1                  ... done
Creating sentry_onpremise_snuba-subscription-consumer-transactions_1 ... done
Creating sentry_onpremise_snuba-transactions-consumer_1              ... done
Creating sentry_onpremise_snuba-sessions-consumer_1                  ... done
Creating sentry_onpremise_web_run                                    ...
Creating sentry_onpremise_web_run                                    ... done
10:31:11 [INFO] sentry.plugins.github: apps-not-configured
* Unknown config option found: 'github-app.extended-permissions'
* Unknown config option found: 'slack.legacy-app'
Running migrations for default
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, jira_ac, nodestore, sentry, sessions, sites, social_auth
Running migrations:
User Options: 100% |#############################################| Time: 0:00:00
 OK
User Options: 100% |#############################################| Time: 0:00:00
 OK
  Applying sentry.0186_add_externalactor... OK
Saved Searchs: 100% |############################################| Time: 0:00:00
 OK
  Applying sentry.0188_remove_externalteam_externaluser_fk_constraints... OK
  Applying sentry.0189_remove_externaluser_externalteam_models... OK
  Applying sentry.0190_drop_external_user_table... OK
  Applying sentry.0191_make_externalactor_integration_id_not_null... OK
  Applying sentry.0192_remove_fileblobowner_org_fk... OK
  Applying sentry.0193_grouprelease_indexes... OK
  Applying sentry.0194_add_custom_scm_provider... OK
  Applying sentry.0195_add_team_key_transactions... OK
  Applying sentry.0196_add_restricted_member_limit... OK
  Applying sentry.0197_add_scim_enabled_boolean... OK
  Applying sentry.0198_add_project_transaction_threshold... OK
  Applying sentry.0199_release_semver... OK
  Applying sentry.0200_release_indices... OK
  Applying sentry.0201_semver_package... OK
  Applying sentry.0202_org_slug_upper_idx... OK
  Applying sentry.0203_groupedmessage_status_index... OK
  Applying sentry.0204_use_project_team_for_team_key_transactions... OK
Releases: 100% |#################################################| Time: 0:00:00
 OK
  Applying sentry.0206_organization_require_email_verification_flag... OK
  Applying sentry.0207_release_adoption... OK
  Applying sentry.0208_add_team_scope... OK
  Applying sentry.0209_avatar_remove_file_fk... OK
Organization Options: 100% |#####################################| Time: 0:00:00
 OK
Creating missing DSNs
Correcting Group.num_comments counter

▶ Migrating file storage ...

▶ Generating Relay credentials ...
../relay/config.yml already exists, skipped creation.

▶ Setting up GeoIP integration ...
Setting up IP address geolocation ...
IP address geolocation database already exists.
IP address geolocation is not configured for updates.
See https://develop.sentry.dev/self-hosted/geolocation/ for instructions.
Error setting up IP address geolocation.


-----------------------------------------------------------------

You're all done! Run the following command to get Sentry running:

  docker-compose up -d

-----------------------------------------------------------------

Nothing really relevant in docker compose logs I believe, can provide if requested though.

Resolution

The healthchecks introduced in #975 seem to be CPU expensive, even though the commands don't look terribly expensive. Maybe reducing the interval to 30s would work for smaller instances?

There seems to be something else that's eating the CPU noticeably more than before the update, I'll have a closer look later if CPU credits keep decreasing.

The text was updated successfully, but these errors were encountered:

BYK · 2021-06-21T18:10:07Z

There seems to be something else that's eating the CPU noticeably more than before the update, I'll have a closer look later if CPU credits keep decreasing.

It might be more performance or native processing? I'll explore health checks, thanks for reporting!

BYK · 2021-06-21T18:17:35Z

@wodCZ

Maybe reducing the interval to 30s would work for smaller instances?

I looked around a bit and this seems like the most viable way. Would you like to tweak this and then submit a patch yourself (so we can give you credit through commits)?

wodCZ · 2021-06-22T11:06:40Z

@BYK thank you for a swift reply 🙏

I looked around a bit and this seems like the most viable way. Would you like to tweak this and then submit a patch yourself (so we can give you credit through commits)?

I'll play with the intervals on our instance and I'll get back with whether the 30s interval actually solves the CPU credit drainage 👌 The patch should be easy then, definitely can do 🙂

wodCZ · 2021-06-25T08:46:30Z

The 30s interval seems to be working fine. The average CPU usage went up from hour-average ~12% to ~17%.
From AWS' documentation on burstable credits, the baseline of t3a.large is 30%, so the instance shouldn't be running short on credits.
Even t3a.medium seems to be sufficient for running sentry with light workloads, with the healthchecks enabled at 30s.

There are some opened issues related to Docker Healthchecks' cpu usage:

Possibly healthchecks can be updated to run more often once these get resolved.

PR coming soon.

The 2s interval caused constantly high CPU usage. 30s interval with 3 retries is the Docker default and doesn't hurt the system that much. Fixes getsentry#1000

BYK · 2021-06-25T11:37:03Z

Possibly healthchecks can be updated to run more often once these get resolved.

We may wanna reference these issues in the code as a note to any future users/maintainers.

) The 2s interval caused constantly high CPU usage. 30s interval with 3 retries is the Docker default and doesn't hurt the system that much. Fixes #1000

github-actions bot added the Status: Unrouted label Jun 21, 2021

BYK added the Status: Backlog label Jun 21, 2021

BYK self-assigned this Jun 21, 2021

github-actions bot removed the Status: Unrouted label Jun 21, 2021

wodCZ mentioned this issue Jun 25, 2021

fix: raise healthcheck interval for redis, memcached and postgres #1007

Merged

BYK added the Status: In Progress label Jun 25, 2021

github-actions bot removed the Status: Backlog label Jun 25, 2021

BYK closed this as completed in #1007 Jun 28, 2021

BYK pushed a commit that referenced this issue Jun 28, 2021

fix: raise healthcheck interval for redis, memcached and postgres (#1007

e25e36b

) The 2s interval caused constantly high CPU usage. 30s interval with 3 retries is the Docker default and doesn't hurt the system that much. Fixes #1000

github-actions bot locked and limited conversation to collaborators Jul 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High CPU usage after update #1000

High CPU usage after update #1000

wodCZ commented Jun 21, 2021

BYK commented Jun 21, 2021

BYK commented Jun 21, 2021

wodCZ commented Jun 22, 2021

wodCZ commented Jun 25, 2021

BYK commented Jun 25, 2021

High CPU usage after update #1000

High CPU usage after update #1000

Comments

wodCZ commented Jun 21, 2021

Version Information

Steps to Reproduce

Expected Result

Actual Result

Logs

Resolution

BYK commented Jun 21, 2021

BYK commented Jun 21, 2021

wodCZ commented Jun 22, 2021

wodCZ commented Jun 25, 2021

BYK commented Jun 25, 2021