schema-reader: Shutdown service if corrupt entries in _schemas topic
#936
Merged
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 4 times, most recently from d4cfcbb to 5040bf1 (August 22, 2024 14:02)
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 3 times, most recently from 92d127e to 9086eeb (August 27, 2024 08:20)
Previously, when we encountered errors within the `_schemas` topic, we would skip the problematic schema and continue loading messages. This is not ideal, as it might leave the application with corrupt schema data, and the side effects could be grave. What we do now is kill the service, log the errors and allow a graceful shutdown. We will follow up on this work by adding metrics for such cases.
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch from 9086eeb to 1cdd0f1 (August 27, 2024 13:20)
This was referenced Aug 29, 2024
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 2 times, most recently from d2f22c9 to 2e716a4 (August 30, 2024 14:44)
nosahama commented Aug 30, 2024
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 5 times, most recently from c6588b7 to 751c662 (September 3, 2024 16:22)
We'd like the shutdown logic on corrupt schemas to be guarded by a feature flag so that we do not surprise customers. The flag is disabled by default.
nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch from 751c662 to 2d9f1e8 (September 4, 2024 12:45)
jjaakola-aiven approved these changes Sep 4, 2024
LGTM
About this change - What it does

These breaking changes are guarded by environment variables, which users can toggle. The default values are shown below:

The logic below only applies when `KARAPACE_KAFKA_SCHEMA_READER_STRICT_MODE` is set to `true`; everything else remains the same, thus no test changes.

Previously, when we encountered errors within the `_schemas` topic, we would skip the problematic schema and continue loading messages. This is not ideal, as it might leave the application with corrupt schema data, and the side effects could be grave.

What we do now is kill the service, log the errors and allow a graceful shutdown. We will follow up on this work by adding metrics for such cases.

This will also stop the service after a backup-v1 restore if there are any corrupt schemas present in the backup log file.
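
To make the difference between the two modes concrete, here is a minimal sketch of the behaviour described above. It is not the code from this PR: only the environment variable name comes from the description, while `CorruptSchemaRecordError`, `strict_mode_enabled` and `load_records` are hypothetical names, and JSON parsing stands in for whatever validation the real schema reader performs on `_schemas` records.

```python
import json
import logging
import os

LOG = logging.getLogger(__name__)

# Name taken from the PR description; everything else here is illustrative.
STRICT_MODE_ENV = "KARAPACE_KAFKA_SCHEMA_READER_STRICT_MODE"


class CorruptSchemaRecordError(Exception):
    """Hypothetical error raised when a record cannot be parsed in strict mode."""


def strict_mode_enabled(environ=None):
    """Return True only if the flag is explicitly set to "true" (disabled by default)."""
    env = os.environ if environ is None else environ
    return env.get(STRICT_MODE_ENV, "false").strip().lower() == "true"


def load_records(raw_records, strict_mode):
    """Parse records, either skipping corrupt ones or aborting on the first one."""
    loaded = []
    for position, raw in enumerate(raw_records):
        try:
            loaded.append(json.loads(raw))
        except (ValueError, TypeError) as exc:
            if strict_mode:
                # Strict mode: log and propagate so the caller can shut down gracefully.
                LOG.error("Corrupt record at position %d, aborting", position)
                raise CorruptSchemaRecordError(
                    f"corrupt record at position {position}"
                ) from exc
            # Previous behaviour: skip the problematic record and keep loading.
            LOG.warning("Skipping corrupt record at position %d", position)
    return loaded
```

In this sketch the caller is expected to catch `CorruptSchemaRecordError` at the top level and trigger the service's normal shutdown path, rather than exiting from inside the loop.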
We can see below that the shutdown is graceful:

Adding `restart: always` to docker compose shows the behaviour: the service never proceeds past that stage. I think, based on the restart behaviour for systemd, we might need to rely on alerts and then intervene; otherwise the service would be left in a crash loop. We need to verify whether there are SLOs or metrics set up somewhere to track at least service uptime, which will be affected by this.