etcdserver: call refreshRangePermCache on Recover() in AuthStore. #14574
Conversation
At least an e2e test case needs to be added to verify the case. The current e2e test framework doesn't support dynamically adding a new member. Please wait for #14404 to get merged first. Afterwards, you can add an e2e test case for this PR.
Please add an e2e test case following the same approach as https://github.com/etcd-io/etcd/blob/main/tests/e2e/etcd_grpcproxy_test.go
Will take a look next week.
On Sun, Oct 16, 2022 at 01:11, Benjamin Wang wrote:
Please add an e2e test case following the same way as
https://github.com/etcd-io/etcd/blob/main/tests/e2e/etcd_grpcproxy_test.go
Status update: working on the e2e test; can't really reproduce the issue so far.
Updated the PR with an e2e test.
Codecov Report
@@ Coverage Diff @@
## main #14574 +/- ##
=======================================
Coverage 75.47% 75.47%
=======================================
Files 457 457
Lines 37274 37276 +2
=======================================
+ Hits 28133 28135 +2
+ Misses 7373 7367 -6
- Partials 1768 1774 +6
Force-pushed from 782774a to d5de5e7
@ahrtr rebased the PR, could you take a look, please?
Please rebase this PR again.
done
@veshij could you also squash the test-related commits into a single commit?
Force-pushed from 0438192 to 392c288
done
Force-pushed from 494ba7d to 4e5e74f
Signed-off-by: Oleg Guba <oleg@dropbox.com>
t.Logf("not exactly 2 hashkv responses returned: %d", len(hashKvs))
return false
It should be an error or fatal error, and fail the test immediately.
In case a node returns an error (which sometimes happens on the first request), the e2e HashKV implementation returns a single entry without an explicit error.
Unfortunately I'm not ready to troubleshoot that further as part of this PR; we've already spent quite a bit of time finding issues with the e2e test implementations.
Maybe the return code is not checked around here:
Lines 57 to 75 in e5790d2
func NewExpectWithEnv(name string, args []string, env []string, serverProcessConfigName string) (ep *ExpectProcess, err error) {
	ep = &ExpectProcess{
		cfg: expectConfig{
			name: serverProcessConfigName,
			cmd:  name,
			args: args,
			env:  env,
		},
	}
	ep.cmd = commandFromConfig(ep.cfg)
	if ep.fpty, err = pty.Start(ep.cmd); err != nil {
		return nil, err
	}
	ep.wg.Add(1)
	go ep.read()
	return ep, nil
}
You already considered the case above, lines 71 & 72.
Lines 71...72 will be triggered if HashKV() returns an error, but it doesn't return one; it just returns a truncated response.
It seems that we might need to enhance (*ExpectProcess) ExpectFunc. When the process exits with an error code (!= 0), we need to return an error.
Please feel free to investigate & enhance it in a separate PR. Let's update this PR per https://github.com/etcd-io/etcd/pull/14574/files#r1007400980
I expect any method of EtcdctlV3, including HashKV, to return an error when the etcdctl command exits with an error code.
I'm not sure I'll have time to work on the ExpectProcess fixes; I don't know how many tests will fail if we change the behavior, or what else will need to be updated.
I'll be happy to rebase this PR and update the test to explicitly fail once the e2e environment is fixed; I'll leave it as is for now.
Tracked it in #14638
I might take a look a bit later this week, can't promise.
Signed-off-by: Oleg Guba <oleg@dropbox.com>
yep, will do this week.
Thanks a lot for fixing the issue @veshij!
@ahrtr I started looking into backporting to 3.4/3.5. Is there a plan to backport the e2e changes? I can backport the fix itself without touching the tests, though.
Yes, we need to backport the e2e test first. @biosvs Do you have the bandwidth to backport #14589 to 3.5 and probably 3.4?
…d-io#14574 Signed-off-by: Oleg Guba <oleg@dropbox.com> Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
@ahrtr When I worked on backporting my tests, it was impossible to backport changes to the testing framework itself due to the huge number of changes in the e2e framework between 3.5 and the main branch. It's definitely easier to use raw calls, as I did in #14397. These calls could be collected into some "library" package, I guess, but I'll only have time to try it in a week (i.e. after next week).
Thanks for the feedback. Let me take care of it.
// start second process
if err := epc.StartNewProc(ctx, t, rootUserClientOpts); err != nil {
	t.Fatalf("could not start second etcd process (%v)", err)
}
This won't trigger transferring a db snapshot to the new member and restoring the auth store from the backend. Confirmed by local testing.
By default SnapshotCatchUpEntries is 5K.
Lines 53 to 59 in 0dfd726
// SnapshotCatchUpEntries is the number of entries for a slow follower
// to catch-up after compacting the raft storage entries.
// We expect the follower has a millisecond level latency with the leader.
// The max throughput is around 10K. Keep a 5K entries is enough for helping
// follower to catch up.
// WARNING: only change this for tests. Always use "DefaultSnapshotCatchUpEntries"
SnapshotCatchUpEntries uint64
E2E tests don't have the ability to tune this hard-coded value, while integration tests can. For example:
etcd/tests/integration/v3_alarm_test.go
Lines 260 to 269 in 0dfd726
func TestV3CorruptAlarmWithLeaseCorrupted(t *testing.T) {
	integration.BeforeTest(t)
	lg := zaptest.NewLogger(t)
	clus := integration.NewCluster(t, &integration.ClusterConfig{
		CorruptCheckTime:           time.Second,
		Size:                       3,
		SnapshotCount:              10,
		SnapshotCatchUpEntries:     5,
		DisableStrictReconfigCheck: true,
	})
So I doubt this test really protects against the issue happening again, although the fix makes sense to me. @veshij @ahrtr
Good catch! The reason we can successfully & manually reproduce & verify this is that the snapi is usually < 5K: it always compacts all entries starting from index 2.
etcd/server/etcdserver/server.go
Lines 2099 to 2102 in 0dfd726
compacti := uint64(1)
if snapi > s.Cfg.SnapshotCatchUpEntries {
	compacti = snapi - s.Cfg.SnapshotCatchUpEntries
}
Two points:
- This might not be correct; we shouldn't compact the log at all if snapi < 5K. Of course, it isn't a big deal.
- In an e2e or integration test case, the snapi should usually be < 5K, so the test case should still be valid. But I agree that we should correct it. So let's write more than 5K K/V pairs to make it correct and stable.
@chaochn47 @veshij Please feel free to deliver a PR to address it for both main and release-3.5. Let's fix point 2 first, and probably point 1 separately.
Just to confirm: we only need to increase the number of keys written to 5K (>= SnapshotCatchUpEntries)?
YES
The reason why we can successfully & manually reproduce & verify this is that the snapi is usually < 5K, it always compacts all entries starting from index 2.
I am a little bit confused by this statement. How does this test reproduce the scenario where the db snapshot is transferred to the new member? @ahrtr
It looks like the raft log won't be compacted, except for the very first entry.
Lines 227 to 248 in 0dfd726
// Compact discards all log entries prior to compactIndex.
// It is the application's responsibility to not attempt to compact an index
// greater than raftLog.applied.
func (ms *MemoryStorage) Compact(compactIndex uint64) error {
	ms.Lock()
	defer ms.Unlock()
	offset := ms.ents[0].Index
	if compactIndex <= offset {
		return ErrCompacted
	}
	if compactIndex > ms.lastIndex() {
		getLogger().Panicf("compact %d is out of bound lastindex(%d)", compactIndex, ms.lastIndex())
	}
	i := compactIndex - offset
	ents := make([]pb.Entry, 1, 1+uint64(len(ms.ents))-i)
	ents[0].Index = ms.ents[i].Index
	ents[0].Term = ms.ents[i].Term
	ents = append(ents, ms.ents[i+1:]...)
	ms.ents = ents
	return nil
}
When a new member bootstraps, the leader will replicate all the entries to the new member instead of transferring a snapshot, right?
I obviously don't fully understand the replication behavior here, but what I noticed while writing the test is that I wasn't able to trigger the issue without SnapshotCount set to some low value. Based on the logs, it looked like the raft log was applied instead of the snapshot.
Correction to my previous typo: all entries starting from index 2 will be retained. Since the first entry is compacted, when a new member is added the leader will still send a snapshot to the new member.
So let's add the following two cases:
- keep the current case;
- write more than 5K K/V pairs, e.g. 5100, and keep all others unchanged.
This won't trigger transfering db snapshot to new member and restore auth store from backend. Confirmed by local testing.
I am curious how you confirmed that.
Thanks for the explanation!! Just verified again that the new member will always receive a snapshot from the leader whenever the configured snapshot count is less than the number of KV writes.
/home/chaochn/go/src/go.etcd.io/etcd/bin/etcd (TestAuthCluster-test-0) (23533):
{
"level": "info",
"ts": "2022-11-04T06:05:50.932Z",
"caller": "etcdserver/snapshot_merge.go:66",
"msg": "sent database snapshot to writer",
"bytes": 20480,
"size": "20 kB"
}
/home/chaochn/go/src/go.etcd.io/etcd/bin/etcd (TestAuthCluster-test-1) (23773):
{
"level": "info",
"ts": "2022-11-04T06:05:50.932Z",
"caller": "rafthttp/http.go:257",
"msg": "receiving database snapshot",
"local-member-id": "4e6825c690b0cb86",
"remote-snapshot-sender-id": "ca50e9357181d758",
"incoming-snapshot-index": 48,
"incoming-snapshot-message-size-bytes": 7723,
"incoming-snapshot-message-size": "7.7 kB"
}
I must have been confused between an existing member restarting and a new member joining the cluster.
The former requires at least 5K KV writes to trigger a db snapshot.
Sorry for the confusion; it all makes sense to me now!
…d-io#14574 Signed-off-by: Oleg Guba <oleg@dropbox.com> Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
When a new node joins the cluster and applies a snapshot, the auth cache is not updated.
On an auth-enabled cluster, this leads to all requests failing with "user doesn't exist".
It also leads to data inconsistencies in the cluster, since updates can't be applied from the raft log.
See discussion in #14571.
Fixes: #14571