Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU usage after update #1000

Closed
wodCZ opened this issue Jun 21, 2021 · 5 comments · Fixed by #1007
Closed

High CPU usage after update #1000

wodCZ opened this issue Jun 21, 2021 · 5 comments · Fixed by #1007
Assignees

Comments

@wodCZ
Copy link
Contributor

wodCZ commented Jun 21, 2021

Version Information

Version: branch 21.6.1

Steps to Reproduce

  1. We were running on a version roughly 2-3 months old, on master, I believe Sentry showed 21.6.dev0?
  2. I've checked out 21.6.1 and ran ./install.sh. There was an error with a docker network having an active endpoint, so I restarted docker, then install script finished and sentry started.
  3. The CPU usage went up noticeably. We're running on aws, t3a.large instance (2vcpu, 8gb ram). The server began to burn through the CPU credits it saved up.
    CleanShot 2021-06-21 at 17 32 52
    CleanShot 2021-06-21 at 17 33 20
    CleanShot 2021-06-21 at 17 33 44
  4. At first I thought there was some new processing added and maybe Sentry is chewing through the old issues, but I couldn't really find any process that would confirm this.
  5. Using Netdata I noticed a regular spikes in CPU usage for redis, memcached and postgres containers. Looking through commits, feat: Add healthchecks for redis, memcached and postgres #975 matched the description. I commented out the added lines in my docker-compose and the issue went away.

Expected Result

Relatively low cpu usage even though the instance isn't a massive beast. We have ~890 events per 24hrs.

CleanShot 2021-06-21 at 17 42 45

Actual Result

The CPU was constantly spiking,

CleanShot 2021-06-21 at 17 44 03

Logs

What you saw along the way, e.g.:

Latest install logs
▶ Parsing command line ...

▶ Setting up error handling ...

▶ Checking minimum requirements ...
WARN: Recommended minimum CPU cores available to Docker is 4, found 2

▶ Creating volumes for persistent storage ...
Created sentry-clickhouse.
Created sentry-data.
Created sentry-kafka.
Created sentry-postgres.
Created sentry-redis.
Created sentry-symbolicator.
Created sentry-zookeeper.

▶ Ensuring files from examples ...
../sentry/sentry.conf.py already exists, skipped creation.
../sentry/config.yml already exists, skipped creation.
../symbolicator/config.yml already exists, skipped creation.
../sentry/requirements.txt already exists, skipped creation.

▶ Generating secret key ...

▶ Replacing TSDB ...

▶ Fetching and updating Docker images ...
Some service image(s) must be built from source by running:
    docker-compose build snuba-transactions-cleanup snuba-cleanup symbolicator-cleanup sentry-cleanup
nightly: Pulling from getsentry/sentry
Digest: sha256:46a338a8e784066e9e5791da3083f992d3cd30513947399d59416d0e74083abb
Status: Image is up to date for getsentry/sentry:nightly
docker.io/getsentry/sentry:nightly

▶ Building and tagging Docker images ...

smtp uses an image, skipping
memcached uses an image, skipping
redis uses an image, skipping
postgres uses an image, skipping
zookeeper uses an image, skipping
kafka uses an image, skipping
clickhouse uses an image, skipping
geoipupdate uses an image, skipping
snuba-api uses an image, skipping
snuba-consumer uses an image, skipping
snuba-outcomes-consumer uses an image, skipping
snuba-sessions-consumer uses an image, skipping
snuba-transactions-consumer uses an image, skipping
snuba-replacer uses an image, skipping
snuba-subscription-consumer-events uses an image, skipping
snuba-subscription-consumer-transactions uses an image, skipping
symbolicator uses an image, skipping
web uses an image, skipping
cron uses an image, skipping
worker uses an image, skipping
ingest-consumer uses an image, skipping
post-process-forwarder uses an image, skipping
subscription-consumer-events uses an image, skipping
subscription-consumer-transactions uses an image, skipping
relay uses an image, skipping
nginx uses an image, skipping
Building snuba-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> d88d98e289f8
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 358117da8fd4
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> f62a846a565e
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> e845cbc5b471

Successfully built e845cbc5b471
Successfully tagged snuba-cleanup-onpremise-local:latest
Building snuba-transactions-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> d88d98e289f8
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 358117da8fd4
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> f62a846a565e
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> e845cbc5b471

Successfully built e845cbc5b471
Successfully tagged snuba-cleanup-onpremise-local:latest
Building symbolicator-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> 3d7425d7b53f
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> 2f627c3d4a61
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> 7ea338ba7ecc
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> 28fc8186aef1

Successfully built 28fc8186aef1
Successfully tagged symbolicator-cleanup-onpremise-local:latest
Building sentry-cleanup
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> 54a53ba83510
Step 3/5 : RUN apt-get update && apt-get install -y --no-install-recommends cron &&     rm -r /var/lib/apt/lists/*
 ---> Using cache
 ---> f5abb0673180
Step 4/5 : COPY entrypoint.sh /entrypoint.sh
 ---> Using cache
 ---> 5d11e01c10f4
Step 5/5 : ENTRYPOINT ["/entrypoint.sh"]
 ---> Using cache
 ---> 7e01dd084207

Successfully built 7e01dd084207
Successfully tagged sentry-cleanup-onpremise-local:latest

Docker images built.

▶ Turning things off ...
Removing network onpremise_default
Network onpremise_default not found.
Removing network sentry_onpremise_default
Network sentry_onpremise_default not found.

▶ Setting up Zookeeper ...
Creating network "sentry_onpremise_default" with the default driver
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done
Creating sentry_onpremise_zookeeper_run ...
Creating sentry_onpremise_zookeeper_run ... done

▶ Downloading and installing wal2json ...
Unable to find image 'curlimages/curl:latest' locally
latest: Pulling from curlimages/curl
339de151aab4: Pulling fs layer
9576f9a419e9: Pulling fs layer
5b67e0e9bce9: Pulling fs layer
7c0640a6c519: Pulling fs layer
1d8f7d50664a: Pulling fs layer
3ce74da83aeb: Pulling fs layer
7094fabf5e16: Pulling fs layer
2f4d0b649c1f: Pulling fs layer
110e7f874674: Pulling fs layer
7c0640a6c519: Waiting
1d8f7d50664a: Waiting
3ce74da83aeb: Waiting
7094fabf5e16: Waiting
2f4d0b649c1f: Waiting
110e7f874674: Waiting
9576f9a419e9: Verifying Checksum
9576f9a419e9: Download complete
5b67e0e9bce9: Verifying Checksum
5b67e0e9bce9: Download complete
339de151aab4: Verifying Checksum
339de151aab4: Pull complete
1d8f7d50664a: Verifying Checksum
1d8f7d50664a: Download complete
7c0640a6c519: Verifying Checksum
7c0640a6c519: Download complete
3ce74da83aeb: Verifying Checksum
3ce74da83aeb: Download complete
9576f9a419e9: Pull complete
5b67e0e9bce9: Pull complete
7c0640a6c519: Pull complete
7094fabf5e16: Verifying Checksum
7094fabf5e16: Download complete
2f4d0b649c1f: Verifying Checksum
2f4d0b649c1f: Download complete
110e7f874674: Verifying Checksum
110e7f874674: Download complete
1d8f7d50664a: Pull complete
3ce74da83aeb: Pull complete
7094fabf5e16: Pull complete
2f4d0b649c1f: Pull complete
110e7f874674: Pull complete
Digest: sha256:6e0a786e3e5181df00eaf3a0a1749c18a6bb20b01c9bd192ea72176ce8a1c94b
Status: Downloaded newer image for curlimages/curl:latest
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5416  100  5416    0     0  35922      0 --:--:-- --:--:-- --:--:-- 36106
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   638  100   638    0     0   3656      0 --:--:-- --:--:-- --:--:--  3666
100  101k  100  101k    0     0   290k      0 --:--:-- --:--:-- --:--:--  290k

▶ Bootstrapping and migrating Snuba ...
Creating sentry_onpremise_redis_1 ...
Creating sentry_onpremise_clickhouse_1 ...
Creating sentry_onpremise_zookeeper_1  ...
Creating sentry_onpremise_clickhouse_1 ... done
Creating sentry_onpremise_redis_1      ... done
Creating sentry_onpremise_zookeeper_1  ... done
Creating sentry_onpremise_kafka_1      ...
Creating sentry_onpremise_kafka_1      ... done
Creating sentry_onpremise_snuba-api_run ...
Creating sentry_onpremise_snuba-api_run ... done
2021-06-21 10:29:44,125 Attempting to connect to Kafka (attempt 0)...
2021-06-21 10:29:46,145 Attempting to connect to Kafka (attempt 1)...
2021-06-21 10:29:48,149 Attempting to connect to Kafka (attempt 2)...
2021-06-21 10:29:50,152 Attempting to connect to Kafka (attempt 3)...
2021-06-21 10:29:51,143 Connected to Kafka on attempt 3
2021-06-21 10:29:51,144 Creating Kafka topics...
2021-06-21 10:29:51,936 Topic event-replacements-legacy created
2021-06-21 10:29:51,936 Topic ingest-metrics created
2021-06-21 10:29:51,937 Topic snuba-queries created
Starting sentry_onpremise_zookeeper_1 ...
Starting sentry_onpremise_clickhouse_1 ...
Starting sentry_onpremise_redis_1      ...
Starting sentry_onpremise_redis_1      ... done
Starting sentry_onpremise_clickhouse_1 ... done
Starting sentry_onpremise_zookeeper_1  ... done
Starting sentry_onpremise_kafka_1      ...
Starting sentry_onpremise_kafka_1      ... done
Creating sentry_onpremise_snuba-api_run ...
Creating sentry_onpremise_snuba-api_run ... done
Starting migration from 2021-06-21
Migrated 2021-06-21. (1 of 13 partitions done)
Migrated 2021-06-14. (2 of 13 partitions done)
Migrated 2021-06-07. (3 of 13 partitions done)
Migrated 2021-05-31. (4 of 13 partitions done)
Migrated 2021-05-24. (5 of 13 partitions done)
Migrated 2021-05-17. (6 of 13 partitions done)
Migrated 2021-05-10. (7 of 13 partitions done)
Migrated 2021-05-03. (8 of 13 partitions done)
Migrated 2021-04-26. (9 of 13 partitions done)
Migrated 2021-04-19. (10 of 13 partitions done)
Migrated 2021-04-12. (11 of 13 partitions done)
Migrated 2021-04-05. (12 of 13 partitions done)
Migrated 2021-03-29. (13 of 13 partitions done)
Done. Optimizing.
Finished running migrations

▶ Creating additional Kafka topics ...

▶ Ensuring proper PostgreSQL version ...

▶ Setting up / migrating database ...
Creating sentry_onpremise_postgres_1 ...
Starting sentry_onpremise_redis_1    ...
Starting sentry_onpremise_zookeeper_1 ...
Creating sentry_onpremise_symbolicator_1 ...
Starting sentry_onpremise_zookeeper_1    ... done
Starting sentry_onpremise_redis_1        ... done
Starting sentry_onpremise_clickhouse_1   ...
Starting sentry_onpremise_clickhouse_1   ... done
Creating sentry_onpremise_memcached_1    ...
Starting sentry_onpremise_kafka_1        ...
Starting sentry_onpremise_kafka_1        ... done
Creating sentry_onpremise_smtp_1         ...
Creating sentry_onpremise_snuba-api_1    ...
Creating sentry_onpremise_snuba-consumer_1 ...
Creating sentry_onpremise_snuba-replacer_1 ...
Creating sentry_onpremise_snuba-subscription-consumer-events_1 ...
Creating sentry_onpremise_snuba-sessions-consumer_1            ...
Creating sentry_onpremise_snuba-outcomes-consumer_1            ...
Creating sentry_onpremise_snuba-subscription-consumer-transactions_1 ...
Creating sentry_onpremise_snuba-transactions-consumer_1              ...
Creating sentry_onpremise_postgres_1                                 ... done
Creating sentry_onpremise_snuba-api_1                                ... done
Creating sentry_onpremise_symbolicator_1                             ... done
Creating sentry_onpremise_smtp_1                                     ... done
Creating sentry_onpremise_snuba-consumer_1                           ... done
Creating sentry_onpremise_memcached_1                                ... done
Creating sentry_onpremise_snuba-replacer_1                           ... done
Creating sentry_onpremise_snuba-subscription-consumer-events_1       ... done
Creating sentry_onpremise_snuba-outcomes-consumer_1                  ... done
Creating sentry_onpremise_snuba-subscription-consumer-transactions_1 ... done
Creating sentry_onpremise_snuba-transactions-consumer_1              ... done
Creating sentry_onpremise_snuba-sessions-consumer_1                  ... done
Creating sentry_onpremise_web_run                                    ...
Creating sentry_onpremise_web_run                                    ... done
10:31:11 [INFO] sentry.plugins.github: apps-not-configured
* Unknown config option found: 'github-app.extended-permissions'
* Unknown config option found: 'slack.legacy-app'
Running migrations for default
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, jira_ac, nodestore, sentry, sessions, sites, social_auth
Running migrations:
User Options: 100% |#############################################| Time: 0:00:00
 OK
User Options: 100% |#############################################| Time: 0:00:00
 OK
  Applying sentry.0186_add_externalactor... OK
Saved Searchs: 100% |############################################| Time: 0:00:00
 OK
  Applying sentry.0188_remove_externalteam_externaluser_fk_constraints... OK
  Applying sentry.0189_remove_externaluser_externalteam_models... OK
  Applying sentry.0190_drop_external_user_table... OK
  Applying sentry.0191_make_externalactor_integration_id_not_null... OK
  Applying sentry.0192_remove_fileblobowner_org_fk... OK
  Applying sentry.0193_grouprelease_indexes... OK
  Applying sentry.0194_add_custom_scm_provider... OK
  Applying sentry.0195_add_team_key_transactions... OK
  Applying sentry.0196_add_restricted_member_limit... OK
  Applying sentry.0197_add_scim_enabled_boolean... OK
  Applying sentry.0198_add_project_transaction_threshold... OK
  Applying sentry.0199_release_semver... OK
  Applying sentry.0200_release_indices... OK
  Applying sentry.0201_semver_package... OK
  Applying sentry.0202_org_slug_upper_idx... OK
  Applying sentry.0203_groupedmessage_status_index... OK
  Applying sentry.0204_use_project_team_for_team_key_transactions... OK
Releases: 100% |#################################################| Time: 0:00:00
 OK
  Applying sentry.0206_organization_require_email_verification_flag... OK
  Applying sentry.0207_release_adoption... OK
  Applying sentry.0208_add_team_scope... OK
  Applying sentry.0209_avatar_remove_file_fk... OK
Organization Options: 100% |#####################################| Time: 0:00:00
 OK
Creating missing DSNs
Correcting Group.num_comments counter

▶ Migrating file storage ...

▶ Generating Relay credentials ...
../relay/config.yml already exists, skipped creation.

▶ Setting up GeoIP integration ...
Setting up IP address geolocation ...
IP address geolocation database already exists.
IP address geolocation is not configured for updates.
See https://develop.sentry.dev/self-hosted/geolocation/ for instructions.
Error setting up IP address geolocation.


-----------------------------------------------------------------

You're all done! Run the following command to get Sentry running:

  docker-compose up -d

-----------------------------------------------------------------

Nothing really relevant in docker compose logs I believe, can provide if requested though.

Resolution

The healthchecks introduced in #975 seem to be CPU expensive, even though the commands don't look terribly expensive. Maybe reducing the interval to 30s would work for smaller instances?

There seems to be something else that's eating the CPU noticeably more than before the update, I'll have a closer look later if CPU credits keep decreasing.

@BYK
Copy link
Member

BYK commented Jun 21, 2021

There seems to be something else that's eating the CPU noticeably more than before the update, I'll have a closer look later if CPU credits keep decreasing.

It might be more performance or native processing? I'll explore health checks, thanks for reporting!

@BYK BYK self-assigned this Jun 21, 2021
@BYK
Copy link
Member

BYK commented Jun 21, 2021

@wodCZ

Maybe reducing the interval to 30s would work for smaller instances?

I looked around a bit and this seems like the most viable way. Would you like to tweak this and then submit a patch yourself (so we can give you credit through commits)?

@wodCZ
Copy link
Contributor Author

wodCZ commented Jun 22, 2021

@BYK thank you for a swift reply 🙏

I looked around a bit and this seems like the most viable way. Would you like to tweak this and then submit a patch yourself (so we can give you credit through commits)?

I'll play with the intervals on our instance and I'll get back with whether the 30s interval actually solves the CPU credit drainage 👌 The patch should be easy then, definitely can do 🙂

@wodCZ
Copy link
Contributor Author

wodCZ commented Jun 25, 2021

The 30s interval seems to be working fine. The average CPU usage went up from hour-average ~12% to ~17%.
From AWS' documentation on burstable credits, the baseline of t3a.large is 30%, so the instance shouldn't be running short on credits.
Even t3a.medium seems to be sufficient for running sentry with light workloads, with the healthchecks enabled at 30s.

There are some opened issues related to Docker Healthchecks' cpu usage:

Possibly healthchecks can be updated to run more often once these get resolved.

CleanShot 2021-06-25 at 10 32 31

PR coming soon.

wodCZ added a commit to wodCZ/onpremise that referenced this issue Jun 25, 2021
The 2s interval caused constantly high CPU usage. 30s interval with 3 retries is the Docker default and doesn't hurt the system that much.
Fixes getsentry#1000
@BYK
Copy link
Member

BYK commented Jun 25, 2021

Possibly healthchecks can be updated to run more often once these get resolved.

We may wanna reference these issues in the code as a note to any future users/maintainers.

@BYK BYK closed this as completed in #1007 Jun 28, 2021
BYK pushed a commit that referenced this issue Jun 28, 2021
)

The 2s interval caused constantly high CPU usage. 30s interval with 3 retries is the Docker default and doesn't hurt the system that much.
Fixes #1000
@github-actions github-actions bot locked and limited conversation to collaborators Jul 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants