
receive of raw encrypted incremental backup caused stacktrace and D state #7180

Closed
prometheanfire opened this issue Feb 16, 2018 · 12 comments

@prometheanfire
Contributor

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version latest
Linux Kernel 4.14.18-gentoo
Architecture x86_64
ZFS Version built from master on the 8th, somewhere around f54976d
SPL Version similarly built from master on the 8th

Describe the problem you're observing

Kernel backtrace and zfs receive hanging in D state.

Describe how to reproduce the problem

zfs send -Lwecp -I LOCAL_POOL@backup-201802091942 LOCAL_POOL@backup-201802161114 | ssh 1.2.3.4 zfs recv -duvs -o canmount=off REMOTE_POOL/remote-backups/LOCAL_POOL

Include any warning/errors/backtraces from the system logs

[400296.285606] VERIFY3(0 == zap_add(mos, dsobj, spa_feature_table[f].fi_guid, sizeof (zero), 1, &zero, tx)) failed (0 == 17)
[400296.285712] PANIC at dsl_dataset.c:802:dsl_dataset_activate_feature()
[400296.285803] Showing stack for process 9102
[400296.285807] CPU: 4 PID: 9102 Comm: txg_sync Not tainted 4.14.18-gentoo #1
[400296.285809] Hardware name: Supermicro X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F, BIOS 1.1 07/19/2013
[400296.285810] Call Trace:
[400296.285823]  dump_stack+0x46/0x65
[400296.285830]  spl_panic+0xc8/0x110
[400296.285841]  ? zap_add_impl+0x96/0x150
[400296.285845]  ? zap_add+0x78/0xa0
[400296.285853]  dsl_dataset_activate_feature+0x104/0x160
[400296.285858]  dsl_crypto_recv_key_sync+0x548/0x950
[400296.285864]  dsl_sync_task_sync+0xac/0x110
[400296.285868]  dsl_pool_sync+0x355/0x4c0
[400296.285874]  spa_sync+0x449/0xd80
[400296.285880]  txg_sync_thread+0x2de/0x540
[400296.285884]  ? txg_delay+0x1e0/0x1e0
[400296.285887]  ? __thread_exit+0x20/0x20
[400296.285890]  thread_generic_wrapper+0x6f/0x80
[400296.285897]  kthread+0x119/0x130
[400296.285901]  ? kthread_create_on_node+0x70/0x70
[400296.285905]  ? kthread_create_on_node+0x70/0x70
[400296.285910]  ret_from_fork+0x35/0x40
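For context on the panic above: the failed VERIFY3 means zap_add() returned 17 rather than 0 when dsl_dataset_activate_feature() tried to record the feature in the MOS. On Linux, error 17 is EEXIST, i.e. the feature-activation entry already exists, which lines up with the later comment that the code is confused about whether the dataset is encrypted. A quick sanity check of that errno value (a standalone Python sketch, not ZFS code):

```python
import errno
import os

# The VERIFY3 failure reads "(0 == 17)": zap_add() returned 17
# instead of 0. Decode that raw error number:
code = 17
print(errno.errorcode[code])  # symbolic name: EEXIST
print(os.strerror(code))      # human-readable: "File exists"
```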
@prometheanfire
Contributor Author

Additional info: the kernels on the sender and receiver are exactly the same (built once and reused).

@prometheanfire
Contributor Author

@tcaputi you may want in on this; am I using an unsupported option when sending?

@tcaputi
Contributor

tcaputi commented Feb 16, 2018

Can you tell me what datasets are encrypted on each system? Specifically is the sent dataset encrypted and are any parents of the received dataset encrypted?

@prometheanfire
Contributor Author

All datasets on the source system are encrypted, pool created with the following:

zpool create -O encryption=on -O keyformat=passphrase zfstest /dev/zvol/slaanesh-zp00/zfstest

The destination system has no encrypted datasets, just the raw receive from the first send to it (which it received correctly). ZFS commands are hanging, so I can't get the output of zfs list on the destination system, but the send/receive command was as follows.

zfs send -Lwecp LOCAL_POOL@backup-201802091942 | ssh 10.0.1.14 zfs recv -duvsF -o canmount=off REMOTE_POOL/remote-backups/LOCAL_POOL

@tcaputi
Contributor

tcaputi commented Feb 16, 2018

OK. I think this is probably related to #7117. This manifestation of it doesn't seem to be a very serious ASSERT and we could probably just ignore it, but I'd like to know why it's happening before I say so for sure.

@tcaputi
Contributor

tcaputi commented Feb 16, 2018

Did your original pool have any clones, by the way?

EDIT: nevermind. I see this is just a single dataset being sent, so this shouldn't matter.

@prometheanfire
Contributor Author

Any way to work around it (other than sending non-raw, unencrypted)?

@tcaputi
Contributor

tcaputi commented Feb 16, 2018

Looking into it now. I don't want to tell you something that causes permanent damage...

@tcaputi
Contributor

tcaputi commented Feb 18, 2018

So basically, this ASSERT is being caused because the code is confused about whether or not your dataset is encrypted. I have not been able to replicate your exact error, but I have replicated a couple of other small issues and I have a patch to fix them (which I will be making a PR for soon).

The 2 bugs I found have to do with zfs recv -F and the way it replaces existing datasets if necessary. Can I ask why you added that into your command? Was there an existing dataset before that was preventing the receive from working? If so, the patch might fix this as well.

@prometheanfire
Contributor Author

Honestly, my workflow for creating a new backup was to create the dataset and then receive -F over it. It's probably just paranoia or stupidity on my part, but I've never been able to create a dataset on initial receive. I'll watch for your PR though and let you know how it goes (on 4.14.20 with master as of a few hours ago).

@tcaputi
Contributor

tcaputi commented Mar 1, 2018

@prometheanfire Any thoughts for how we might go about getting this issue closed? I'm not sure if we were ever able to replicate the issue after these patches went through.

@prometheanfire
Contributor Author

I haven't been able to retest (I'm traveling), so I'm going to close it and will reopen (or open a new issue if I have to) if needed.
