"Too many levels of symbolic links" when "cd"ing to snapshot subdir #816
Thanks for filing the bug, I've seen this once before and we didn't manage to run it down then. |
Not sure how helpful this is, but "me too". I have a pool which I have been using under CentOS 6.3 x86_64 (2.6.32), and there I have issues with a system hang when running find inside the .zfs subdirectory (with a load of snapshots present). I just thought I'd try the same pool under Ubuntu 12.04 x86_64 (3.2.0-23), and although I see no system hang, instead I get intermittent errors like this:
find: ‘/tank1/data1/.zfs/snapshot/2012.0608.0113.Fri.snapadm.weekly’: Too many levels of symbolic links
Command exited with non-zero status 1
There are no symbolic links though. Even without this error, I don't think the find command is finding everything it should. ZFS on Linux 0.6.0 rc9. Andy |
Hi, same problem here. |
Hi, in my case I found several symlinks in the folder (but the same folder under ext4 doesn't cause any problem). To find all symlinks you can use: sudo find /zpool/dataset -type l -exec ls -l {} \; |
for me, I get this randomly accessing files (via python) over NFS mounted zvol. OSError(40, 'Too many levels of symbolic links') |
If you're at all able to reproduce this it would be very helpful to get an strace. |
I'll attempt to get an strace |
Running the command in zsh under strace captures the failing step; the trace shows chdir() on the snapshot path returning ELOOP.
Running the "cd" again works OK. This is on Ubuntu 12.04, 64 bit, 3.2.0-27-generic, ZFS v0.6.0.65-rc9. |
It seems directly related to the number of files in a folder. If there are 300 files in a folder, no errors ever. If there are 7K files, then I get the error quite often. |
I am seeing the same symptom: I get the same backtrace as mkj, and I notice the same pattern as msmitherdc, in that it affects file systems containing lots of files. Likewise it only happens the first time I try entering a snapshot subdir within a certain time window; the second attempt right afterwards always seems to succeed. I did notice a difference in the reported ownership of the snapshot subdir. Before a failing attempt it is listed as belonging to root:root, with some generic permissions. Before the second, succeeding attempt, the ownership as well as the permissions actually match the existing ones on the top level of the file system in question. Some cached metadata from the first attempt, making all the difference the second time around?
Seeing this running a 64-bit Ubuntu 12.04, on the 3.2.0-30-generic kernel, with zfs 0.6.0.71. |
I wonder if this is simply due to the snapshot being slow to mount. The subsequent attempt would work because the snapshot was then successfully mounted. It would continue to work until the snapshot gets automatically unmounted due to inactivity.
The way the .zfs/snapshot directory was implemented is by mounting the required snapshot on demand. Basically, the traversal into the snapshot triggers the mount and will block in the system call until it completes. This makes the process transparent to the user and greatly simplifies the kernel code, since each snapshot can be treated as an individual mount point. However, perhaps there are some races which remain.
Incidentally, the permission issue you reported is just how the mount point is permissioned before the snapshot gets mounted on top, so that's to be expected.
The above strace output is valuable, but the ideal bit of debugging to have would be a call trace using ftrace or systemtap. We'd be able to see exactly where that ELOOP was returned in the kernel to chdir(). |
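If the slow-mount hypothesis is right, the on-demand mount itself should be observable from user space. A rough check, with placeholder pool and snapshot names, would be:
# Before the first access the snapshot should not be listed as mounted.
grep zfs /proc/mounts | grep snapshot
# Trigger the automount by traversing into the snapshot, then look again;
# a new mount entry under /tank/.zfs/snapshot/mysnap should appear.
ls /tank/.zfs/snapshot/mysnap > /dev/null
grep zfs /proc/mounts | grep snapshot
# After the idle timeout the entry should be gone again.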
I agree, I have this same problem and it's absolutely consistent: the first access gives the error (it doesn't require "cd", for example, "ls" gives the same message). The second and subsequent accesses are fine. In my case, it's not related to the number of files in the directory. Once it works, it works for "a while" (what is the inactivity timeout?) and then after some period, the error occurs again. This is on Ubuntu 12.04, kernel 3.2.0-31, ZFS v0.6.0.80-rc11. |
That "awhile" would be 5 minutes. By default that's the timeout to expire idle snapshots which were automounted. If you want to mitigate the issue for now you could crank this use by increasing the $ modinfo module/zfs/zfs.ko | grep expire parm: zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int) |
I'm being affected by this problem too. Is there anything I can do to help debug? Ubuntu 12.10; kernel 3.5.0-18-generic; ZOL 0.6.0-rc12. |
I've been digging into this problem. The process loops in follow_managed(), calling follow_automount() each time until it hits the 40 level limit, as shown by the following output from a custom systemtap script:
1355336265 ls(63225) kernel.function("follow_managed@/build/buildd/linux-3.2.0/fs/namei.c:797") zfs-auto-snap_daily-2012-12-08-0747 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
...
The follow_automount probe shows the dentry->d_flags and the path structure. Notice that the dentry and mnt pointers never change. I think that in order to exit the while loop in follow_managed(), the path->dentry pointer needs to point to the dentry for the root of the newly-mounted filesystem after the call to follow_automount(). This is taken care of in follow_automount() for the non-mount-collision case. I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
|
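For reference, a trimmed-down sketch of that kind of systemtap probe (not the exact script used above; it needs kernel debug symbols, and the snapshot path is a placeholder) could look like:
# Log every call to follow_automount() while a snapshot directory is listed.
sudo stap -e 'probe kernel.function("follow_automount") {
    printf("%d %s(%d) %s\n", gettimeofday_s(), execname(), pid(), probefunc())
}' -c 'ls /tank/.zfs/snapshot/mysnap'
# A burst of repeated follow_automount() hits for a single ls is the looping behavior described above.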
It's still not clear to me why this only sometimes fails.
You should be able to do this with |
Me neither. If the problem is as I described it seems like it should always fail. Unless the path pointer is shared and I just always "win" the race on my desktop. For me it always fails on my workstation, but I haven't reproduced it in a VM running the same kernel and ZFS versions.
Cool, I'll give that a try. Thanks |
Adding
|
Ensure that the path member pointers are associated with the newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise the follow_automount() function may be called repeatedly, leading to an incorrect ELOOP error return. This problem was observed as a 'Too many levels of symbolic links' error from user-space commands accessing an unmounted snapshot in the .zfs/snapshot directory. Issue openzfs#816
@cronnelly It would be great if anyone else having this issue could test the above patch before I submit a pull request. Thanks |
Up... down... I always get those confused. Based on your analysis it does look like this should resolve the issue. It would be great if some of the folks watching this issue could verify that the proposed one-line fix resolves the problem for them as well. |
Seems to do the trick. I have a server with a set of snapshots on which I could almost always trigger the "Too many levels of symbolic links" response. Now with this patch I haven't been able to reproduce the bug. Thanks! |
Works great for me. I was hitting this issue 100% of the time. Snapshot mounts are immediate now with no initial error. |
Likewise, the problem was completely repeatable and reproducible, and now it works perfectly. The only thing is, when I run "ls -l /xxxx/.zfs/snapshot/*" we end up with a very large number of "mount" commands running for a few minutes. Not that this is a normal operation, mind you, but it's not exactly scalable to huge numbers of snapshots. Thanks for the fix!! |
Thank you everyone, this fix was merged. |
This reverts commit 7afcf5b, which accidentally introduced a regression with the .zfs snapshot directory. While the updated code still correctly mounts the requested snapshot, it updates the vfsmount such that it references the original dataset vfsmount. The result is that the snapshot itself isn't visible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #816
Reopening this issue since the fix introduced a regression which wasn't initially caught. |
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change. One visible consequence of this change was that processes accessing automounted snapshots received an ELOOP error because they failed to wait for zfs.mount to complete. Closes openzfs#816
The real root cause for the racy behavior was identified and fixed. Thanks Ned. 761394b call_usermodehelper() should wait for process |
For some reason this is an issue for me.
Not sure if this issue has regressed or if it's a new one. I was just trying to do an strace and got a kernel panic; see the attached image. |
@behlendorf Sorry man just bumping this in case you didn't see it. |
That kernel panic is a duplicate. I saw that months ago and reported it. |
@drescherjm perhaps you can link to the other issue? |
I'm using zfs on Ubuntu Server 16.04.1 and I had the same issue with the symlink error when accessing the snapshots. I got the error after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04). After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.) it still did not work. I did a reboot and suddenly everything worked again - until I transferred new incremental snapshots. This led me to try unmounting and remounting the filesystem after each time I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, and I am no longer bothered by this bug. This is not a fix, it is only a workaround, but in case someone cannot get it working, even with the newest versions of everything, then try this! :) |
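For what it's worth, a minimal sketch of that workaround (host, pool, dataset, and snapshot names are placeholders) would be:
# Hypothetical script fragment: after each incremental receive, cycle the mount
# so that .zfs/snapshot behaves again.
ssh oldserver zfs send -i tank/data@prev tank/data@new | zfs receive -F backup/data
zfs unmount backup/data
zfs mount backup/data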
I thought this was fixed long ago. |
Me too. I realize this is an old issue, but I thought it could be nice to post my solution here too. Just some additional info:
I am experiencing this issue, unmount/mount workaround did it for me. |
This issue is getting long in the tooth, but it still exists on a newly installed, fully updated Ubuntu 16.04 with incrementally received snapshots. Normal snapshots work fine. The unmount/mount workaround does work, so it's certainly a cache issue. I'm sending my snaps using http://www.bolthole.com/solaris/zrep/ if that matters. It's easy to make a test case to reproduce it using this configuration method: http://www.bolthole.com/solaris/zrep/zrep.documentation.html#backupserver |
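Independently of zrep, a rough local reproduction of that setup (dataset names are placeholders; the error is only expected on affected ZFS versions) might be:
# Hypothetical reproduction: replicate a dataset incrementally, then enter the received snapshot.
zfs snapshot tank/data@s1
zfs send tank/data@s1 | zfs receive backup/data
zfs snapshot tank/data@s2
zfs send -i tank/data@s1 tank/data@s2 | zfs receive -F backup/data
# The first access to the received snapshot is where the reported error shows up:
ls /backup/data/.zfs/snapshot/s2
# The unmount/mount workaround from the comments above:
zfs unmount backup/data && zfs mount backup/data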
@chrwei are you able to reproduce this with Ubuntu 18.04? It's likely this was resolved in a newer version of ZFS. Can you check exactly which version you're running? |
I am on 0.6.5.6-0ubuntu20. I don't have any 18.04 and don't plan on it for some time. |
To add to this mystery, I have also found issues with this error mounting my ZFS pool via sshfs, with an unmount and remount fixing it as well. It only seems to affect zfs pools on my system, even with the same data. ZFS is running on Proxmox latest fully updated, and sshfs client is a fully updated Manjaro client. |
I'm really not sure where the problem lies. There are no symlinks in the entire path here, and the error does not always occur. The error goes away if I "split up" the chdir, as demonstrated below.
Linux dc 3.4.1-vs2.3.3.4 #2 SMP Sat Jun 23 16:39:09 MST 2012 x86_64 Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz GenuineIntel GNU/Linux
zfs/spl 0.6.0_rc9
-[root@dc]-[5.92/10.60/10.65]-66%-0d19h15m-2012-07-09T14:30:03-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world
bash: cd: /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world: Too many levels of symbolic links
-[root@dc]-[5.81/7.28/9.21]-64%-0d19h23m-2012-07-09T14:37:35-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/
-[root@dc]-[24.16/11.15/10.45]-64%-0d19h23m-2012-07-09T14:37:45-
-[/backup/1/minecraft501/.zfs/snapshot:#]- cd 20120703-1202/home/craft/bukkit/world
-[root@dc]-[22.95/11.12/10.44]-64%-0d19h23m-2012-07-09T14:37:51-
-[/backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world:#]-