Describe the bug
Tempo had been running stably for some time with zstd compression. We decided to investigate snappy compression, switched the storage/trace/block/encoding field to 'snappy', and restarted the ingesters. Afterwards, Tempo started hitting many 'magic number' errors. The error below is from a compactor, but other components reported the same error:
level=error ts=2021-09-07T15:57:10.601766281Z caller=compactor.go:110 msg="error during compaction cycle" err="error iterating input blocks: error reading objects for records, blockID: 4644fe19-1bc4-495f-99da-be0d1831543a, err: invalid input: magic number mismatch"
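For reference, this is roughly the field that was changed (a minimal sketch of the relevant config section, assuming our jsonnet renders to the usual Tempo YAML; all other settings omitted):

```yaml
storage:
  trace:
    block:
      encoding: snappy  # previously zstd
```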
Inspecting the block, it looks like the data file was written with snappy as expected, but meta.json still says zstd. This lines up with the error message, which appears to come from zstd checking for its magic-number prefix and not finding it.
A couple more things to note about the CLI output below:
(1) Our ingesters cut blocks of fairly consistent size: on zstd they were ~350MB, and this block is ~600MB. This leads us to believe that the data file is indeed snappy.
(2) The size field is also wrong: it still shows the ~350MB zstd size rather than the true size of the data file.
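To double-check the mismatch by hand, here is a rough standalone sketch (not Tempo's code) that compares the encoding recorded in meta.json against the first bytes of the data file. It assumes the meta.json key is `encoding` and that the data file begins with a compressed frame at byte 0, which may not hold for Tempo's actual on-disk layout:

```go
// Rough diagnostic sketch: compare what meta.json claims the block encoding is
// against the first bytes of the data file. A zstd frame begins with the magic
// number 0xFD2FB528, stored on disk as the little-endian byte sequence
// 0x28 0xB5 0x2F 0xFD; snappy output will not start with it.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

var zstdMagic = []byte{0x28, 0xB5, 0x2F, 0xFD}

// blockMeta keeps only the field of interest; the real meta.json has more
// fields, and the exact key name here is an assumption.
type blockMeta struct {
	Encoding string `json:"encoding"`
}

func main() {
	// Path to the block directory containing meta.json and data,
	// e.g. .../<tenant>/4644fe19-1bc4-495f-99da-be0d1831543a
	blockDir := os.Args[1]

	// Read the encoding claimed by meta.json.
	raw, err := os.ReadFile(filepath.Join(blockDir, "meta.json"))
	if err != nil {
		panic(err)
	}
	var meta blockMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		panic(err)
	}

	// Read the first four bytes of the data file.
	f, err := os.Open(filepath.Join(blockDir, "data"))
	if err != nil {
		panic(err)
	}
	defer f.Close()
	prefix := make([]byte, 4)
	if _, err := io.ReadFull(f, prefix); err != nil {
		panic(err)
	}

	fmt.Printf("meta.json says encoding=%q, data file starts with zstd magic: %v\n",
		meta.Encoding, bytes.Equal(prefix, zstdMagic))
}
```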
To Reproduce
Steps to reproduce the behavior:
1. Switch the storage/trace/block/encoding field from zstd to snappy
2. Deploy r8 (which includes an ingester update and restart)
3. Perform queries or let compactors run
Expected behavior
It should be safe to change an ingester's block encoding across restarts: blocks written before the change should keep the previous encoding, and new blocks should be written with the new encoding.
Environment:
Infrastructure: Kubernetes
Deployment tool: jsonnet
Additional context
Here are the ingester logs for this block. After the restart, the ingester both replayed the WAL and rediscovered the complete block. It doesn't seem like it should have done both; only one should be necessary. Perhaps the error is an edge case in this replay/rediscovery logic.