Skip recalculating the rate in MaxReplicationLagModule when it can't be done #12620

Merged 4 commits on Apr 14, 2023

Contributor

@ejortegau ejortegau commented Mar 13, 2023

Description

This PR defends against lag records with nil stats which can lead to segfaults.
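
As a rough illustration of the kind of guard described above, here is a minimal sketch. The types `tabletStats` and `lagRecord` and the function `recalculateRate` are simplified, hypothetical stand-ins and not the actual Vitess definitions; the only idea taken from this PR is that recalculation is skipped when the stats pointer is nil.

```go
package main

import "fmt"

// tabletStats is a hypothetical stand-in for the stats attached to a health record.
type tabletStats struct {
	ReplicationLagSeconds uint32
}

// lagRecord is a hypothetical, simplified lag record; Stats may be nil.
type lagRecord struct {
	Stats *tabletStats
}

// recalculateRate skips the recalculation when stats are missing instead of
// dereferencing a nil pointer. It returns the lag it read and whether it ran.
func recalculateRate(r lagRecord) (lag uint32, ok bool) {
	if r.Stats == nil {
		// Nothing to compute from; skip rather than crash.
		return 0, false
	}
	return r.Stats.ReplicationLagSeconds, true
}

func main() {
	// A record with nil stats is skipped.
	lag, ok := recalculateRate(lagRecord{})
	fmt.Println(lag, ok)

	// A record with stats is processed normally.
	lag, ok = recalculateRate(lagRecord{Stats: &tabletStats{ReplicationLagSeconds: 3}})
	fmt.Println(lag, ok)
}
```

Returning a boolean alongside the value lets the caller tell "skipped" apart from "lag of zero", which a bare zero return would not.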

Related Issue(s)

#12619

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

N/A

…be done

This defends against lag records with nil stats which can lead to segfaults.
See vitessio#12619

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
@ejortegau ejortegau requested a review from deepthi as a code owner March 13, 2023 17:20
Contributor

vitess-bot bot commented Mar 13, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added the labels NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work) and NeedsWebsiteDocsUpdate (What it says) on Mar 13, 2023
@shlomi-noach shlomi-noach added the labels Type: Bug and Component: TabletManager, and removed the labels NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate, on Mar 14, 2023
Member

@deepthi deepthi left a comment

Ideally we should have a unit test to repro this. If that's a pain we can merge without one.

Contributor

@mattlord mattlord left a comment

@ejortegau can you please add a new unit test here: https://github.com/vitessio/vitess/blob/main/go/vt/throttler/max_replication_lag_module_test.go

This new unit test can ensure that we don't crash or otherwise fail if any of the relevant structs are nil — at least lagRecordNow.stats (guessing there's others too, but maybe not). It really doesn't have to be much. I'll then merge this.

Thank you for yet another contribution! ❤️

go/vt/throttler/max_replication_lag_module.go (outdated review thread, resolved)
Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
@ejortegau
Contributor Author

@ejortegau can you please add a new unit test here: https://github.com/vitessio/vitess/blob/main/go/vt/throttler/max_replication_lag_module_test.go

This new unit test can ensure that we don't crash or otherwise fail if any of the relevant structs are nil — at least lagRecordNow.stats (guessing there's others too, but maybe not). It really doesn't have to be much. I'll then merge this.

Thank you for yet another contribution! ❤️

Added a (very simple) one.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
@ejortegau
Contributor Author

Hi, @mattlord

I don't know if you saw, but I added unit testing as requested. Is there anything else that is missing before this can be merged? Thanks!

expectPanic bool
}{
{
name: "Zero lag",
Member

nit: Isn't this zero time, not zero lag?

Contributor Author

@ejortegau ejortegau Apr 12, 2023

I called it Zero lag because it tests the case here, triggering a panic when lagRecordNow.isZero() is true.
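
For readers unfamiliar with the pattern, a table-driven case with an `expectPanic` field can be checked roughly as follows. This is a self-contained sketch under stated assumptions: `record`, `recordAt`, `process`, and `panics` are hypothetical names, and the panic message is invented. Only the `expectPanic` field, the "Zero lag" case name, and the idea of a zero-record check come from the test discussed above.

```go
package main

import (
	"fmt"
	"time"
)

// record is a hypothetical stand-in for a replication lag record.
type record struct {
	time time.Time
}

// isZero reports whether the record carries the zero time.
func (r record) isZero() bool { return r.time.IsZero() }

// recordAt builds a record at the given Unix second (illustration helper).
func recordAt(sec int64) record { return record{time: time.Unix(sec, 0)} }

// process is a hypothetical function that panics when given a zero record.
func process(r record) {
	if r.isZero() {
		panic("zero lag record") // invented message, for illustration only
	}
}

// panics runs f and reports whether it panicked.
func panics(f func()) (didPanic bool) {
	defer func() {
		if recover() != nil {
			didPanic = true
		}
	}()
	f()
	return false
}

func main() {
	cases := []struct {
		name        string
		rec         record
		expectPanic bool
	}{
		{name: "Zero lag", rec: record{}, expectPanic: true},
		{name: "Non-zero time", rec: recordAt(1), expectPanic: false},
	}
	for _, tc := range cases {
		got := panics(func() { process(tc.rec) })
		fmt.Printf("%s: panicked=%v, expected=%v\n", tc.name, got, tc.expectPanic)
	}
}
```

In a real `_test.go` file the same check is usually written with `require.Panics` and `require.NotPanics` from testify instead of a hand-rolled `recover` helper.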

time: time.Time{},
TabletHealth: discovery.TabletHealth{Stats: nil},
},
expectPanic: true,
Member

Maybe not in this PR, but we should revisit all the panics in the throttler code base. Better to return a well-known error and exit cleanly.

Contributor Author

Agreed, I was very surprised when I saw that this code just panics.

Contributor

@deepthi / @ejortegau: panics are cleaned up in this PR: #12901 👍. Reviews appreciated!
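
The suggestion in this thread, returning a well-known error rather than panicking, could look roughly like the sketch below. It is not the change made in #12901. The names `errZeroLagRecord`, `lagRecord`, and `rateFor` are hypothetical, and the rate formula is a placeholder chosen only so the example produces a number.

```go
package main

import (
	"errors"
	"fmt"
)

// errZeroLagRecord is a hypothetical sentinel error that callers can match
// with errors.Is.
var errZeroLagRecord = errors.New("throttler: lag record is zero")

// lagRecord is a hypothetical, simplified record; valid=false means "no data".
type lagRecord struct {
	lagSeconds uint32
	valid      bool
}

// rateFor returns an error for an unusable record instead of panicking.
func rateFor(r lagRecord) (int64, error) {
	if !r.valid {
		return 0, fmt.Errorf("cannot recalculate rate: %w", errZeroLagRecord)
	}
	// Placeholder policy, not Vitess logic: scale a base rate down with lag.
	return 100 / (int64(r.lagSeconds) + 1), nil
}

func main() {
	// The caller decides how to react to the well-known error.
	if _, err := rateFor(lagRecord{}); errors.Is(err, errZeroLagRecord) {
		fmt.Println("skipping:", err)
	}

	rate, err := rateFor(lagRecord{lagSeconds: 1, valid: true})
	fmt.Println("rate:", rate, "err:", err)
}
```

Wrapping the sentinel with `%w` keeps it matchable via `errors.Is` while still letting each call site add context to the message.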

Member

deepthi commented Apr 12, 2023

I believe the test failures are because we upgraded the Go version. A merge from main should fix them.

@ejortegau
Contributor Author

I believe the test failures are because we upgraded the Go version. A merge from main should fix them.

I just synced my fork and merged main - letting tests re-run.

@deepthi deepthi merged commit 8c68b59 into vitessio:main Apr 14, 2023
dbussink pushed a commit to dbussink/vitess that referenced this pull request Apr 17, 2023
…be done (vitessio#12620)

* Skip recalculating the rate in MaxReplicationLagModule when it can't be done

This defends against lag records with nil stats which can lead to segfaults.
See vitessio#12619

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Address PR comments.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Make linter happy

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

---------

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request May 27, 2024
timvaillancourt added a commit to slackhq/vitess that referenced this pull request May 28, 2024
* Skip recalculating the rate in MaxReplicationLagModule when it can't be done (vitessio#12620)

* Skip recalculating the rate in MaxReplicationLagModule when it can't be done

This defends against lag records with nil stats which can lead to segfaults.
See vitessio#12619

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Address PR comments.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Make linter happy

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

---------

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* Throttled transactions return MySQL error code 1041 ER_OUT_OF_RESOURCES (vitessio#12949)

This error code seems better suited to represent the fact that transactions are
being throttled by the server due to some form of resource contention than the
current code 1203 ER_TOO_MANY_USER_CONNECTIONS.

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

* MaxReplicationLagModule.recalculateRate no longer fills the log (vitessio#14875)

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>

---------

Signed-off-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>
Co-authored-by: Eduardo J. Ortega U <5791035+ejortegau@users.noreply.github.com>