Incorrect hash when resuming scheduled compaction after etcd restarts #15919

Closed
kennydo opened this issue May 18, 2023 · 13 comments · Fixed by #15985

Comments

@kennydo

kennydo commented May 18, 2023

What happened?

I tried out using the --experimental-compact-hash-check-enabled flag on my etcd cluster today, and I think I might have encountered a race condition when a server starts up/resumes a scheduled compaction that causes the hash value to be incorrect.

About my setup: I have a 5-member cluster running as k8s pods. They already have the --experimental-initial-corrupt-check flag enabled.

I was adding the --experimental-compact-hash-check-enabled flag (and also upgrading from 3.5.6 to 3.5.9) across the 5-member cluster. For 3 of the 4 follower members, I had already stopped the etcd pod, updated their container image URLs to 3.5.9, added the new hash check flag, and then restarted the etcd pod. I then stopped the 4th follower (which I'll call FOO), edited its pod config, and brought it back up. I then stopped the leader and did the same update/restart. This triggered a leader election, and one of the former followers (which I'll call BAR) became the new leader.

The new leader then performed the periodic compact hash check, as configured. But the strange thing is that, even though all of my pods were up and running (and had passed the initial corruption check), the very first periodic check failed!

Due to the liveness probe problem mentioned in #13340, I had to deal with my pods getting restarted by kube because of the alarm. But when I disarmed the CORRUPT alarm so my etcd pods could stay up and running, they all passed the initial corrupt data check again and logged the same hashes in the subsequent "storing new hash" logs. That's why I suspect very unlucky timing and a race condition caused FOO to calculate the wrong hash value and report it when BAR asked for its hash during that periodic compaction hash check.

What did you expect to happen?

I expected the periodic compaction hash check to succeed and my cluster to be healthy because:

  1. the initial data check passed on all members of the cluster
  2. the hash/compact-revision that FOO logged in the "storing new hash" lines both before and after the problematic hash matched the hashes in the rest of the cluster

How can we reproduce it (as minimally and precisely as possible)?

I'm not sure how to reproduce it. Maybe create a multi-member cluster with enough data that compaction takes a non-trivial amount of time, trigger a compaction, and then stop one of the members right in the middle of the compaction, so that the compaction fails and has to resume when the member comes back up?

Anything else we need to know?

I saw #15548, but that mentions the cluster ID changing, and in this scenario where I'm just restarting etcd with different flags, I don't think the cluster ID would change.

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.9
Git SHA: bdbbde998
Go Version: go1.19.9
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.9
API version: 3.5

Etcd configuration (command line flags or environment variables)

(from a k8s pod manifest)
  - command:
    - etcd
    - --advertise-client-urls=https://XXX:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-compact-hash-check-enabled=true
    - --initial-advertise-peer-urls=https://XXX:2380
    - --initial-cluster-token=XXX
    - --initial-cluster=XXX
    - --initial-cluster-state=existing
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://XXX:2379
    - --listen-metrics-urls=http://127.0.0.1:2381,http://XXX:2381
    - --listen-peer-urls=https://XXX:2380
    - --name=XXX
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=1h
    - --quota-backend-bytes=15032385536
    image: registry.k8s.io/etcd:3.5.9-0

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# Lost due to pods being restarted, but you can see the endpoints and IDs below

# This endpoint status table was captured while the cluster was in the alarmed state
$ etcdctl --endpoints $endpoints endpoint status -w table
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+
|      ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |     ERRORS     |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+
|  XXX:2379 | 367806663aa1996c |   3.5.9 |  4.9 GB |     false |      false |        30 |  718184797 |          718184797 | alarm:CORRUPT  |
| XXX:2379 | 6069c22b85ee10b0 |   3.5.9 |  4.8 GB |      true |      false |        30 |  718184798 |          718184798 | alarm:CORRUPT  |
| XXX:2379 | 917d2067aeb00315 |   3.5.9 |  4.8 GB |     false |      false |        30 |  718184802 |          718184802 | alarm:CORRUPT  |
| XXX:2379 | b2529a8db451f246 |   3.5.9 |  4.8 GB |     false |      false |        30 |  718184804 |          718184804 | alarm:CORRUPT  |
|  XXX:2379 | e56f12f0d98830bb |   3.5.9 |  4.9 GB |     false |      false |        30 |  718184805 |          718184805 | alarm:CORRUPT  |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+----------------+

Relevant log output

(not literal logs, but reconstructed from my investigation)

* 15:34:55 - all nodes log "storing new hash". revision=702700874 compact-revision=702656236 hash=483395635
* ... I update the other etcd nodes one by one
* 15:39:54 - FOO starts shutting down
* 15:39:54 - every node but FOO logs "storing new hash" of hash=2533962327 for compact-revision=702700874 at revision=702739864.
* 15:39:54 - FOO logs "Failed compaction" (because the etcd pod was terminating)
* 15:40:24 - Updated etcd pod on FOO starts running
* 15:40:24 - FOO logs "resume scheduled compaction"
* 15:40:35 - FOO logs "storing new hash" of hash=2915350535 for compact-revision=702700874 at revision=702739864. (But every other node for this compact-revision had hash=2533962327)!
* 15:42:06 - BAR became leader
* 15:42:18 - BAR started compact hash check
* 15:42:18 - BAR (leader) says "failed compaction hash check". For revision 702739864, leader had 2533962327, but follower FOO had hash 2915350535. BAR logged "alarm raised"
@ahrtr
Member

ahrtr commented May 24, 2023

Thanks for raising this issue. It seems that the member automatically recovered from the "corrupted" status after you disarmed the CORRUPT alarm?

I think we can intentionally inject a failpoint during the compaction operation to try to reproduce this issue.

@kennydo
Author

kennydo commented May 24, 2023

I'm not sure whether the resolution was directly triggered by disarming the CORRUPT alarm, or whether disarming it unblocked something (e.g., now that the cluster was in a healthy state again, one of the automatic compaction requests from the kube-apiserver could go through and update the value stored for the latest compacted hash). But whatever happened after I disarmed it happened automatically from my perspective. The cluster went back to serving write requests just fine.

@CaojiamingAlan
Contributor

CaojiamingAlan commented May 26, 2023

It seems that I've found the bug. When resuming the compaction after recovery, the hasher does not hash every key and value again, but effectively starts from where it was interrupted (since the older values have already been deleted), thus producing a different hash value.

The result of the compaction won't be different, but the hash values diverge: the compaction only keeps the latest version of each key, while the hasher consumes all of the history it walks over.

Please assign this to me @ahrtr , I will provide tests to reproduce this issue and fix it.

Relevant code:

h := newKVHasher(prevCompactRev, compactMainRev, keep)
last := make([]byte, 8+1+8)
for {
	var rev revision

	start := time.Now()
	tx := s.b.BatchTx()
	tx.LockOutsideApply()
	keys, values := tx.UnsafeRange(schema.Key, last, end, int64(batchNum))
	for i := range keys {
		rev = bytesToRev(keys[i])
		if _, ok := keep[rev]; !ok {
			tx.UnsafeDelete(schema.Key, keys[i])
			keyCompactions++
		}
		h.WriteKeyValue(keys[i], values[i])
	}

Here the stale value of a key is deleted (the tx.UnsafeDelete call) in the same loop that feeds the hasher. If this loop is interrupted, a resumed compaction can no longer read the already-deleted values, so its hash diverges.
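
To illustrate the divergence, here is a small standalone sketch (made-up keys and a plain CRC32C hasher, not etcd's actual hashing code): a member that completes the compaction in one pass hashes every revision in the range, while a member whose interrupted pass already deleted some stale revisions re-scans the bucket on resume and never feeds those deleted revisions to the hasher, so it reports a different hash for the same compact-revision/revision pair.

package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// hashStore hashes every key/value currently present in the store, in key
// order, loosely mirroring how the compaction loop feeds its hasher.
func hashStore(store map[string]string) uint32 {
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(store[k]))
	}
	return h.Sum32()
}

func main() {
	// Four revisions of the same key; only "rev4" survives the compaction.
	full := map[string]string{"rev1": "a", "rev2": "b", "rev3": "c", "rev4": "d"}
	// A member that compacts in one uninterrupted pass hashes all four revisions.
	fmt.Println("uninterrupted pass:", hashStore(full))

	// An interrupted pass already deleted rev1 and rev2 before the member
	// crashed. The resumed pass re-scans from the beginning, but those
	// revisions are gone, so they never reach the hasher and the hash differs.
	resumed := map[string]string{"rev3": "c", "rev4": "d"}
	fmt.Println("resumed pass:      ", hashStore(resumed))
}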

@ahrtr
Member

ahrtr commented May 26, 2023

starts from where it was interrupted (since the older values have already been deleted), thus producing a different hash value.

Good catch! It seems that the most feasible way is to just ignore the hash if the previous compaction was somehow interrupted, e.g. the first revision != prevCompactRev.

@CaojiamingAlan
Contributor

e.g. the first revision != prevCompactRev.

@ahrtr There's a minor issue with that: during the very first compaction, prevCompactRev = -1.

From the official documentation of etcd:

When the store is created, the initial revision is 1.

So, is it guaranteed that the first revision that CAN BE COMPACTED is 2, and the revision is always continuous? It agrees with my observation. If so, then when prevCompactRev = -1 and the first revision in the store is > 2, we can conclude that the hash is incomplete. Is that correct?

If it is not the case, my idea about this is to do a query and persist the first revision in the store when scheduling the compaction. When actually starting the compaction, we check whether the first revision in the store is still the same as the persisted one. If they are the same, we can trust that the hash is complete; otherwise we drop the hash value. This approach does not depend on how etcd deals with revisions, as long as they increase monotonically.
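
A rough standalone sketch of that idea (everything here, including the store type and method names, is a hypothetical placeholder rather than an etcd API): persist the first revision at scheduling time, and drop the hash if the first revision has moved by the time the compaction actually runs.

package main

import "fmt"

// store is a toy, in-memory stand-in for the key bucket, used only to
// illustrate the "persist the first revision at scheduling time" idea.
type store struct {
	revs              []int64 // revisions currently present, ascending
	scheduledFirstRev int64   // persisted when the compaction is scheduled
}

func (s *store) firstRev() int64 { return s.revs[0] }

// scheduleCompaction records the first revision seen at scheduling time.
func (s *store) scheduleCompaction() { s.scheduledFirstRev = s.firstRev() }

// hashTrustworthy reports whether the hash computed by the compaction can be
// trusted: if the first revision moved since scheduling, a previous attempt
// already deleted part of the range, so the hash would be incomplete.
func (s *store) hashTrustworthy() bool {
	return s.firstRev() == s.scheduledFirstRev
}

func main() {
	s := &store{revs: []int64{2, 3, 4, 5}}
	s.scheduleCompaction()

	// Uninterrupted case: nothing was deleted in between, keep the hash.
	fmt.Println("hash valid:", s.hashTrustworthy()) // true

	// Interrupted case: a crashed attempt already removed revisions 2 and 3,
	// so the resumed compaction must drop its hash.
	s.revs = []int64{4, 5}
	fmt.Println("hash valid:", s.hashTrustworthy()) // false
}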

CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue May 29, 2023
…-io#15919.

If there is a gap between prevCompactRev and the smallest revision, then the compaction is continued from the middle. In this case, we cannot get the correct compaction hash, thus we drop it.

Signed-off-by: caojiamingalan <alan.c.19971111@gmail.com>
@ahrtr
Member

ahrtr commented May 30, 2023

and the revision is always continuous?

Yes, it is.

my idea about this is to do a query and persist the first revision in the store when scheduling the compaction.

I think we just need to add a new field FirstRealRevision to record the real first compacted revision. When a leader compares its local hash with any peer's hash, it should check both FirstRealRevision and CompactRevision at corrupt.go#L302. If either of them doesn't match, then we should skip the hash check. WDYT? @serathius

BTW, I propose to rename CompactRevision to PreviousCompactRevision or FromCompactRevision, and Revision to CurrentCompactRevision or ToCompactRevision as well; see below.

type KeyValueHash struct {
	Hash                    uint32
	FirstRealRevision       int64 // New added field to record the real first compacted revision
	PreviousCompactRevision int64 // Renamed from `CompactRevision`
	CurrentCompactRevision  int64 // Renamed from `Revision`
}
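
For illustration only (this is not the actual corrupt.go logic, and the function name is made up), the leader-side check built on the struct above could look roughly like this:

// shouldCompareHashes is a hypothetical helper: the leader only compares hash
// values when both members hashed exactly the same revision range, so a
// resumed or partial compaction cannot raise a false CORRUPT alarm.
func shouldCompareHashes(leader, peer KeyValueHash) bool {
	return leader.CurrentCompactRevision == peer.CurrentCompactRevision &&
		leader.PreviousCompactRevision == peer.PreviousCompactRevision &&
		leader.FirstRealRevision == peer.FirstRealRevision
}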

@serathius
Member

@CaojiamingAlan Thanks for looking into the issue. The issue comes down to a resumed compaction reporting an incorrect hash for the revision range (compactRev, revision). I don't think we should use the hash calculated when compaction is resumed midway. Is there any problem with implementing just that?

@ahrtr
Member

ahrtr commented May 30, 2023

@serathius I see you disagree with my comment above, but without detailed reasons (which I don't think is good). Actually I made two points, and I don't know which one you disagree with.

  1. Add a FirstRealRevision.
  2. Rename CompactRevision to FromCompactRevision, and Revision to ToCompactRevision. Actually each hash value is based on revision range (previousCompactionRev, currentCompactionRev). Renaming them can improve the readability. Of course, we can do this in a separate PR.

@serathius
Member

I think that, for a short-term fix for detecting data corruption, the implementation is complicated enough already. Would prefer to just fix the standing issue and refocus the effort on #13839.

Renaming them can improve the readability

The newly proposed names are more confusing than before. For calculating the hash during compaction there is no from and to. They are just the CompactRevision and Revision from the time when the compaction happened. Adding from and to would be misleading.

@ahrtr
Member

ahrtr commented May 30, 2023

Would prefer to just fix the standing issue

If you read the earlier comments, most of them are discussing how to fix the issue.

They are just the CompactRevision and Revision from the time when the compaction happened. Adding from and to would be misleading.

I don't agree with this. For the latest hash value, the names CompactRevision and Revision make some sense, but not for historical hash values, because up to 10 hashes are saved. The only blocker for the renaming is that the struct KeyValueHash is reused for both CompactHashCheck and PeriodicCheck; the renaming makes sense for CompactHashCheck, but not for PeriodicCheck. I will not insist on renaming because of that blocker.

@CaojiamingAlan
Contributor

I think we just need to add a new field FirstRealRevision to record the real first compacted revision. When a leader compares its local hash with any peer's hash, it should check both FirstRealRevision and CompactRevision at corrupt.go#L302.

I think this is incorrect if we consider only the CompactHashCheck. If one store is actually corrupted and drops some of the revisions, it cannot be detected: it is likely that the FirstRealRevisions are different for the corrupted store and the correct store, and the check is skipped. However, we can probably still catch this through the PeriodicCheck. I think the question here is whether false negatives are acceptable for a single corruption check. If they are, then this is a valid approach.

@ahrtr
Member

ahrtr commented May 30, 2023

If one store is actually corrupted and drops some of the revisions, it cannot be detected: it is likely that the FirstRealRevisions are different for the corrupted store and the correct store, and the check is skipped.

It's a valid point. Then the simplest way is to compare ScheduledCompactKeyName and FinishedCompactKeyName before performing each compaction operation. If they don't match, it means the previous compaction somehow did not finish, and accordingly we should skip calculating the hash in that case.
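
A minimal standalone sketch of that gating (the struct and its fields are stand-ins for the revisions stored under ScheduledCompactKeyName and FinishedCompactKeyName; none of this is actual etcd code):

package main

import "fmt"

// compactState mimics the two meta-bucket entries: the revision recorded when
// a compaction is scheduled, and the revision recorded when one fully finishes.
type compactState struct {
	scheduledCompactRev int64
	finishedCompactRev  int64
}

// previousCompactionFinished reports whether the last scheduled compaction ran
// to completion. If not, the compaction about to run is a resume over a
// partially deleted range, so its hash should be skipped.
func previousCompactionFinished(st compactState) bool {
	return st.scheduledCompactRev == st.finishedCompactRev
}

func main() {
	// Healthy case: the last scheduled compaction also finished, keep the hash.
	fmt.Println(previousCompactionFinished(compactState{scheduledCompactRev: 100, finishedCompactRev: 100})) // true

	// Interrupted case: a compaction was scheduled at revision 200, but the
	// member crashed before finishing it, so the resumed pass skips the hash.
	fmt.Println(previousCompactionFinished(compactState{scheduledCompactRev: 200, finishedCompactRev: 100})) // false
}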

CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue May 31, 2023
before writing hash to hashstore. If they do not match, then it means this compaction is interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm. See etcd-io#15919.

Also fix some typos and reorder the functions to improve readability.

Signed-off-by: caojiamingalan <alan.c.19971111@gmail.com>
CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue Jun 12, 2023
…tKeyName before writing hash to release-3.5.

Fix etcd-io#15919.
Check ScheduledCompactKeyName and FinishedCompactKeyName
before writing hash to hashstore.
If they do not match, then it means this compaction has once been interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm.

Signed-off-by: caojiamingalan <alan.c.19971111@gmail.com>
CaojiamingAlan added a commit to CaojiamingAlan/etcd that referenced this issue Jul 15, 2023
…tKeyName before writing hash to release-3.5.

Fix etcd-io#15919.
Check ScheduledCompactKeyName and FinishedCompactKeyName
before writing hash to hashstore.
If they do not match, then it means this compaction has once been interrupted and its hash value is invalid. In such cases, we won't write the hash values to the hashstore, and avoids the incorrect corruption alarm.

Signed-off-by: caojiamingalan <alan.c.19971111@gmail.com>
@serathius
Member

Should this be backported to v3.4?
