tests: reproduce a panic #18588

clement2026 · 2024-09-14T09:45:43Z

This PR is just for demo and discussion, please don't merge it.

@ahrtr Let's continue our discussion in #18459 (comment)

Please forgive me for not explaining the problem clearly. The panic issue I mentioned is not a bug in the main branch; it's actually a side effect of PR #18459.

The panic shows up because PR #18459 separates raft log compaction from snapshotting.

I've made minor changes to server.go to separate raft log compaction from snapshotting. Now the raft log is compacted in each apply loop, retaining only `SnapshotCatchUpEntries` entries.
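To make the retention rule above concrete, here is a minimal sketch of how a compaction index could be derived from the applied index. `compactIndex` is a hypothetical helper written for this discussion, not the code in server.go, and the value 5000 is only an example setting for `SnapshotCatchUpEntries`.

```go
package main

import "fmt"

// compactIndex returns the index up to which the raft log could be
// compacted so that the most recent catchUp entries remain available
// to slow followers. Hypothetical helper for illustration only.
func compactIndex(applied, catchUp uint64) (uint64, bool) {
	if applied <= catchUp {
		// Fewer entries than the retention window: keep the whole log.
		return 0, false
	}
	return applied - catchUp, true
}

func main() {
	const catchUp = 5000 // example value for SnapshotCatchUpEntries (assumption)

	for _, applied := range []uint64{100, 5000, 5001, 12000} {
		if idx, ok := compactIndex(applied, catchUp); ok {
			fmt.Printf("applied=%d: compact up to index %d\n", applied, idx)
		} else {
			fmt.Printf("applied=%d: keep the whole log\n", applied)
		}
	}
}
```

The guard against `applied <= catchUp` avoids an unsigned underflow while the log is still shorter than the retention window.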

Two test cases are added to demonstrate when a panic should or shouldn't occur. Please take a look at the files changed.

Signed-off-by: Clement <gh.2lgqz@aleeas.com>

k8s-ci-robot · 2024-09-14T09:45:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clement2026
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2024-09-14T09:45:52Z

Hi @clement2026. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov-commenter · 2024-09-14T09:53:54Z

Codecov Report

Attention: Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.

Project coverage is 66.47%. Comparing base (981061a) to head (75a1cbd).

❗ Current head 75a1cbd differs from pull request most recent head ba4bc88

Please upload reports for the commit ba4bc88 to get more accurate results.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/etcdserver/server.go | 71.42% | 2 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| server/etcdserver/server.go | `74.60% <71.42%> (-6.81%)` ⬇️ |

... and 87 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #18588      +/-   ##
==========================================
- Coverage   68.73%   66.47%   -2.26%     
==========================================
  Files         420      420              
  Lines       35474    35477       +3     
==========================================
- Hits        24382    23583     -799     
- Misses       9666    10437     +771     
- Partials     1426     1457      +31

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 981061a...ba4bc88. Read the comment docs.

ahrtr · 2024-09-16T12:50:25Z

The panic is because you changed the production code in this PR.

Previously, etcd performed compaction only after each snapshot. Now you perform compaction each time entries are applied. (Have you evaluated the change in CPU usage?) You changed the behaviour, but I do not see a KEP or a design doc, let alone careful review by the community. Please draft a Google Doc or KEP summarizing the motivation and design, and request review from the community.
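As a rough way to reason about the CPU question, the toy simulation below counts how often compaction would be triggered under each policy. It is a sketch under stated assumptions, not etcd code: `countCompactions`, the batch size, and the snapshot threshold are all made up for illustration, and it treats every apply loop iteration as one compaction attempt.

```go
package main

import "fmt"

// countCompactions simulates applying `total` entries in batches of `batch`.
// It returns how many compactions happen when compaction follows a snapshot
// (taken once more than snapshotCount entries have been applied since the
// last one) versus when compaction is attempted in every apply loop.
// Hypothetical model for illustration only.
func countCompactions(total, batch, snapshotCount uint64) (afterSnapshot, everyApply uint64) {
	var applied, lastSnap uint64
	for applied < total {
		applied += batch
		if applied > total {
			applied = total
		}
		everyApply++ // one compaction attempt per apply loop iteration
		if applied-lastSnap > snapshotCount {
			lastSnap = applied
			afterSnapshot++ // compaction piggybacks on the snapshot
		}
	}
	return afterSnapshot, everyApply
}

func main() {
	// All three numbers are arbitrary example values.
	snap, apply := countCompactions(100000, 100, 10000)
	fmt.Printf("compactions after snapshots: %d\n", snap)
	fmt.Printf("compactions per apply loop:  %d\n", apply)
}
```

The model only counts calls; it says nothing about the cost of each call, which would have to be measured on a real cluster.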

I understand that it's related to #17098, and I also see that you raised some related small PRs. Each of the PRs might seem harmless, but let's get the high-level design clarified before moving on.

serathius · 2024-09-17T08:11:23Z

Thanks for raising this concern. I understand the importance of KEPs and community review, especially for changes to critical apply code.

As you noted, this PR is part of a series of performance improvements outlined in issue #17098. This issue clearly states the motivation for these changes, and includes data gathered by @clement2026. These PRs aim to address the specific problem of etcd not having a snapshot file at startup, as detailed in #18459.

Creating a snapshot at startup offers several advantages:

Minimal Disruption: It doesn't require altering the maybeSendSnapshot logic, preserving backward compatibility and avoiding potential edge cases.
Leverages Existing Mechanisms: It reuses etcd's existing snapshotting mechanism, ensuring safety and reliability.
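One plausible reading of why a startup snapshot helps can be sketched with a toy model of a leader catching up a follower. This is not raft's or etcd's implementation: `leaderState` and `catchUpAction` are hypothetical names, and the three outcomes are a simplification made for this discussion.

```go
package main

import "fmt"

// leaderState is a toy stand-in for what a leader knows about its own log.
type leaderState struct {
	firstIndex  uint64 // oldest entry still in the log after compaction
	hasSnapshot bool   // whether any snapshot exists to send instead
}

// catchUpAction reports how the leader could bring a follower up to date,
// given the next index the follower needs. Hypothetical model only.
func catchUpAction(l leaderState, followerNext uint64) string {
	switch {
	case followerNext >= l.firstIndex:
		return "append" // the needed entries are still in the log
	case l.hasSnapshot:
		return "snapshot" // entries were compacted, fall back to a snapshot
	default:
		return "stuck" // entries compacted and no snapshot to send
	}
}

func main() {
	compacted := leaderState{firstIndex: 7000, hasSnapshot: false}
	withSnap := leaderState{firstIndex: 7000, hasSnapshot: true}

	fmt.Println(catchUpAction(compacted, 9000)) // follower is within the log
	fmt.Println(catchUpAction(compacted, 10))   // follower is behind, no snapshot
	fmt.Println(catchUpAction(withSnap, 10))    // follower is behind, snapshot exists
}
```

In this model, compacting the log without ever having taken a snapshot is the only way to reach the "stuck" case, which is why having a snapshot from the start removes it.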

You're right that individual PRs might seem harmless in isolation. However, they are interconnected and build upon each other to achieve the desired performance improvements. A comprehensive overview can be found in #17098.

To address your request for a more formalized design document, I'm happy to draft a Google Doc or KEP summarizing the motivation, design decisions, and implementation details of this performance improvement initiative. I'll share it with the community for review and feedback.

clement2026 · 2024-09-18T12:55:56Z

Thanks @serathius for clarifying the ongoing work! If there's anything I can help with, just let me know.

serathius · 2024-09-19T08:10:34Z

@clement2026 Please reach out to me on the Kubernetes Slack.

I'm a little busy this week, but I want to unblock your work as early as possible.

clement2026 · 2024-09-23T13:24:08Z

After chatting with serathius, I've realized this PR isn’t needed anymore. I'll keep working on #17098 in a new PR.
Closing this.

tests: reproduce raft panic

8947d9a

Signed-off-by: Clement <gh.2lgqz@aleeas.com>

k8s-ci-robot added do-not-merge/work-in-progress area/testing labels Sep 14, 2024

k8s-ci-robot added the needs-ok-to-test label Sep 14, 2024

k8s-ci-robot added the size/L label Sep 14, 2024

clement2026 changed the title ~~reproduce raft panic~~ tests: reproduce raft panic Sep 14, 2024

clement2026 changed the title ~~tests: reproduce raft panic~~ tests: reproduce a panic Sep 14, 2024

tests: reproduce raft panic

ba4bc88

Signed-off-by: Clement <gh.2lgqz@aleeas.com>

ahrtr mentioned this pull request Sep 16, 2024

Reduce memory usage of etcd member catchup mechanism #17098

Open

clement2026 closed this Sep 23, 2024

clement2026 deleted the archive-reproduce-raft-panic branch September 24, 2024 09:21

clement2026 restored the archive-reproduce-raft-panic branch September 24, 2024 09:21

clement2026 deleted the archive-reproduce-raft-panic branch September 24, 2024 09:32

ahrtr mentioned this pull request Nov 3, 2024

Run a separate in memory snapshot to reduce number of entries stored in raft memory storage #18825

Open
