composefs: random EINVALs #2042
Motivated by containers#2042, where we just get a bare `invalid argument` out of the entire storage stack. My offhand guess from skimming some of the code is that by far the most likely culprit is the raw `lgetxattr` call. It'd be useful to know that for sure. Signed-off-by: Colin Walters <walters@verbum.org>
|
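As an illustration of the idea in that commit message (attach context to the raw xattr call so a failure is never a bare `invalid argument`), here is a minimal Python sketch. It is not the c/storage code, which is written in Go; the helper name `lgetxattr_with_context` is hypothetical, and `os.getxattr` with `follow_symlinks=False` stands in for `lgetxattr`.

```python
import errno
import os


def lgetxattr_with_context(path: str, name: str) -> bytes:
    """Read an xattr without following symlinks, adding context on failure.

    Hypothetical helper for illustration only.
    """
    try:
        # follow_symlinks=False makes this the equivalent of lgetxattr(2).
        return os.getxattr(path, name, follow_symlinks=False)
    except OSError as err:
        code = errno.errorcode.get(err.errno, str(err.errno))
        # Re-raise with the operation, attribute name, and path attached,
        # so the caller can tell which call produced the errno.
        raise OSError(
            err.errno,
            f"lgetxattr({name!r}) on {path!r} failed with {code}: {err.strerror}",
        ) from err
```

With a wrapper like this, an EINVAL that surfaces at the top of the stack would at least name the syscall and the path involved.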
this is more helpful, thanks! The erofs image could be corrupted CC @cgwalters |
Hmm...you think possibly some container image -> mkcomposefs -> (internal mkfs.erofs-alike) -> kernel refusing to read it? I'm unfortunately not yet familiar with the podman test suite, but is this 100% reproducible, so basically setting up composefs for c/storage and running the relevant test? |
I've tried to get as close as possible to the CI setup but I've not managed to reproduce it locally yet :/ Might be worth testing with a version of podman that stores somewhere the erofs image that fails on mount, so we can analyze it. |
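The suggestion of keeping the erofs image that fails to mount could look roughly like the sketch below. Everything here is an assumption for illustration: `mount_or_preserve`, the `mount_fn` callback, and `debug_dir` are hypothetical names, and the real change would live in the Go code of c/storage or podman.

```python
import os
import shutil
from typing import Callable


def mount_or_preserve(
    image: str,
    target: str,
    mount_fn: Callable[[str, str], None],
    debug_dir: str,
) -> None:
    """Call mount_fn(image, target); on failure, keep a copy of the image.

    mount_fn is a hypothetical callback that raises OSError when the
    mount fails. The original error is always re-raised.
    """
    try:
        mount_fn(image, target)
    except OSError:
        # Save the image before any cleanup code can delete it, so it can
        # be inspected offline (for example with fsck.erofs or dump.erofs).
        os.makedirs(debug_dir, exist_ok=True)
        shutil.copy2(image, os.path.join(debug_dir, os.path.basename(image)))
        raise
```

In CI, `debug_dir` would need to be a location that is uploaded with the other test artifacts.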
Maybe:
|
a couple of improvements for composefs that could help to detect potential failures when creating the erofs image: |
If an image fails to mount, it could just have a corrupted on-disk superblock and/or root inode. |
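A very coarse first check on a suspect image is whether the erofs superblock magic is where the kernel expects it. The sketch below assumes the standard erofs layout (superblock at byte offset 1024, little-endian magic `0xE0F5E1E2`); the function name is hypothetical. It says nothing about the superblock checksum or the root inode, so a pass here does not prove the image is mountable.

```python
import struct

EROFS_SUPER_OFFSET = 1024          # byte offset of the erofs superblock
EROFS_SUPER_MAGIC_V1 = 0xE0F5E1E2  # stored little-endian on disk


def has_erofs_magic(image_path: str) -> bool:
    """Return True if the image carries the erofs magic at offset 1024."""
    with open(image_path, "rb") as f:
        f.seek(EROFS_SUPER_OFFSET)
        raw = f.read(4)
    if len(raw) < 4:
        # Truncated image: too short to even hold the magic.
        return False
    (magic,) = struct.unpack("<I", raw)
    return magic == EROFS_SUPER_MAGIC_V1
```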
Yeah, we probably need the dmesg to figure this one out. |
At the top of each log is a link to
|
I looked at the first one, and it's clearly an error in this code:
Where mountOpts is either empty or "noacl". But I don't see any errors reported from erofs in the journal around that time (or indeed ever). So, where is the EINVAL coming from? If the image was somehow corrupt, shouldn't erofs give some kernel log? Is it the loop device name that is invalid or something? |
The second one has: Now, we're failing to unmount too, with EINVAL. That is very strange... |
Anyway, erofs itself clearly doesn't return EINVAL on umount(). I'm not sure how it happens; maybe the directory is already unmounted, so it's no longer a mountpoint. |
Yeah, or maybe the mount failed. |
umount: |
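umount(2) documents EINVAL for a target that is not a mount point, which matches the "already unmounted, or the mount failed" theory. A small sketch of a pre-unmount check along those lines follows; the function name is hypothetical, and `os.path.ismount` is only a heuristic (it compares device and inode numbers with the parent directory, so it can miss bind mounts within one filesystem).

```python
import os


def explain_umount_einval(target: str) -> str:
    """Give a best-effort reason why umount(target) might fail."""
    if not os.path.exists(target):
        return f"{target}: path does not exist"
    if not os.path.ismount(target):
        # umount(2) returns EINVAL when the target is not a mount point.
        return f"{target}: not a mount point, so umount(2) would return EINVAL"
    return f"{target}: is a mount point; an EINVAL here needs another explanation"
```

Logging the result of such a check next to the failed unmount would show whether the earlier mount ever took effect.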
yeah, anyway, it'd be better to have some dmesg result, since it seems (I think) there are enough kernel prints in the mount failure path. |
@hsiangkao Well, the thing is that we see these in the journal logs:
So, should we not see if there were any other erofs error messages? |
I think if it's printed, this mount has already succeeded in erofs itself, see: so mount() won't return EINVAL, at least for this particular mount. Is that the exact time of the failure? |
@hsiangkao No, that is just an example of some dmesg output from the journal logs we have. What I meant was, if these successful mounts are reported (and they are), should not also failed mounts be reported in the logs? I have been unable to find any other erofs log output other than copies of the above example. |
If there is no "erofs" line in the kernel log, I guess EINVAL wasn't returned by erofs. Maybe it is what you mentioned just now: the loop device is invalid or the like, since it seems that erofs currently has error messages on all error paths in erofs_fc_fill_super(): |
oh, I'm not sure how errorfc() actually works now though... |
Hmm, yeah, errorfc() seems to log to the error buffer in the fs context, at least in some cases, and you can then extract these messages via the fd (as in e.g. https://github.com/torvalds/linux/blob/b446a2dae984fa5bd56dd7c3a02a426f87e05813/samples/vfs/test-fsmount.c#L20). Does this mean those errors are not logged though? |
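For reference, messages read back from a filesystem-context fd (as the linked `test-fsmount.c` sample does after `fsopen`) are, to my understanding, single lines of the form `<level> <text>`, where the level character is `e`, `w`, or `i`. The sketch below only parses such a line; it does not call `fsopen` itself, the function name is hypothetical, and the sample input in the usage note is made up rather than a real erofs message.

```python
from typing import Tuple

# Level prefixes as assumed from the kernel's fs-context logging.
LEVELS = {"e": "ERROR", "w": "WARNING", "i": "INFO"}


def parse_fc_message(raw: bytes) -> Tuple[str, str]:
    """Split one fs-context log line into (level, message)."""
    text = raw.decode("utf-8", errors="replace").rstrip("\n")
    level, sep, message = text.partition(" ")
    if sep and level in LEVELS:
        return LEVELS[level], message
    # Unrecognized format: hand the whole line back untouched.
    return "UNKNOWN", text
```

For example, `parse_fc_message(b"e example: bad option\n")` yields the level `ERROR` and the text `example: bad option`.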
I haven't tried it, but I guess those errorfc()s are rarely triggered: Anyway, I tend to think that EINVAL wasn't returned by erofs. It looks like an invalid device or something. |
If anything I'd expect it to be this: super.c: ret = -EINVAL; |
ping, what is the status of this?
|
I'm not totally sure this is all related to composefs; for example:
@edsantiago can you elaborate a bit on the background on this? Basically two questions:
|
Right, I agree that seems like the most likely source of an EINVAL from the erofs layer - and https://github.com/containers/composefs/releases/tag/v1.0.5 contains containers/composefs@76b4da5 which has a "possible" fix for this (though seeing it would just turn the question into how corrupt composefs erofs files were being generated). Given that some of the EINVALs we're seeing here are coming from what looks like generic VFS operations, I think there's either a kernel bug/regression, or somehow us using composefs tickles other bugs generically for the VFS. It'd be really useful to have a bit more data on these tests on other OS versions. |
We are only testing composefs in rawhide.
When composefs is enabled, it takes place on all tests. With the exception of any tests that use their own private |
thanks, the change in c/storage helped just a little bit. There are a bunch of reasons why |
Sometimes on commit, sometimes on diff