"Too many levels of symbolic links" when "cd"ing to snapshot subdir #816

Closed
mooinglemur opened this issue Jul 10, 2012 · 40 comments

@mooinglemur

I'm really not sure where the problem lies. There are no symlinks in the entire path here, and the error does not always occur. The error goes away if I "split up" the chdir, as demonstrated below.

Linux dc 3.4.1-vs2.3.3.4 #2 SMP Sat Jun 23 16:39:09 MST 2012 x86_64 Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz GenuineIntel GNU/Linux

zfs/spl 0.6.0_rc9

-[root@dc]-[5.92/10.60/10.65]-66%-0d19h15m-2012-07-09T14:30:03-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world
bash: cd: /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world: Too many levels of symbolic links

-[root@dc]-[5.81/7.28/9.21]-64%-0d19h23m-2012-07-09T14:37:35-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/
-[root@dc]-[24.16/11.15/10.45]-64%-0d19h23m-2012-07-09T14:37:45-
-[/backup/1/minecraft501/.zfs/snapshot:#]- cd 20120703-1202/home/craft/bukkit/world

-[root@dc]-[22.95/11.12/10.44]-64%-0d19h23m-2012-07-09T14:37:51-
-[/backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world:#]-

@behlendorf
Contributor

Thanks for filing the bug. I've seen this once before, and we didn't manage to run it down then.

@ahmgithubahm

Not sure how helpful this is, but "me too".

I have a pool which I have been using under CentOS 6.3 x86_64 (2.6.32), and there I have issues with a system hang when running find inside the .zfs subdirectory (with a load of snapshots present). I just thought I'd try the same pool under Ubuntu 12.04 x86_64 (3.2.0-23), and although I see no system hang there, I instead get intermittent errors like this:

find: ‘/tank1/data1/.zfs/snapshot/2012.0608.0113.Fri.snapadm.weekly’: Too many levels of symbolic links
Command exited with non-zero status 1

There are no symbolic links though. Even without this error, I don't think the find command is finding everything it should.

ZFS on Linux 0.6.0 rc9.

Andy

@kattunga

Hi, same problem here.
When trying to cd into a snapshot, I get intermittent "Too many levels of symbolic links" errors; if I try again 2 minutes later, it works.
Using Ubuntu 12.04 64-bit, ZoL rc10.

@kattunga

Hi, in my case I found several symlink files in the folder (but the same folder under ext4 doesn't cause any problem).

To find all symlinks you can use: sudo find /zpool/dataset -type l -exec ls -l {} \;

@msmitherdc

For me, I get this randomly when accessing files (via Python) over an NFS-mounted zvol: OSError(40, 'Too many levels of symbolic links')

@behlendorf
Contributor

If you're at all able to reproduce this, it would be very helpful to get an strace of the failing command to see if this error is coming back from the kernel, and if so, which system call is responsible.
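Something along these lines should capture it (the dataset path, snapshot name, and output file below are only placeholders):

$ strace -f -tt -o /tmp/zfs-eloop.strace ls /tank/.zfs/snapshot/somesnap
$ grep ELOOP /tmp/zfs-eloop.strace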

@msmitherdc

I'll attempt to get an strace

@mkj

mkj commented Aug 19, 2012

Running in zsh:

evil:~/backup/vol/.zfs/snapshot ls
20100401/  20100513/  20100825/  20110302/  20110907/  20120330/
20100414/  20100608/  20101007/  20110329/  20111020/  20120722/
20100417/  20100609/  20101122/  20110501/  20111021/  20120801/
20100501/  20100714/  20101213/  20110613/  20111218/  20120808/
evil:~/backup/vol/.zfs/snapshot cd 20110329
cd: too many levels of symbolic links: 20110329
zsh: exit 1
evil:~/backup/vol/.zfs/snapshot

gives strace output for the failing step:

7118  09:53:55.903028 stat(".", {st_mode=S_IFDIR|0555, st_size=2, ...}) = 0
7118  09:53:55.903150 chdir("/home/matt/backup/vol/.zfs/snapshot/20110329") = -1 ELOOP (Too many levels of symbolic links)
7118  09:53:55.918222 stat(".", {st_mode=S_IFDIR|0555, st_size=3, ...}) = 0
7118  09:53:55.918346 chdir("20110329") = -1 ELOOP (Too many levels of symbolic links)

Running the "cd" again works OK. This is on Ubuntu 12.04, 64 bit, 3.2.0-27-generic, ZFS v0.6.0.65-rc9.

@msmitherdc

It seems directly related to the number of files in a folder. If there are 300 files in a folder, there are never any errors. If there are 7K files, then I get the error quite often.

@andreaso

andreaso commented Sep 9, 2012

I am seeing the same symptom; I get the same strace result as mkj, and notice the same pattern as msmitherdc in that it affects file systems containing lots of files. Likewise, it only happens the first time I try entering a snapshot subdir within a certain time window. The second attempt right afterwards always seems to succeed.

I did notice a difference in the reported ownership of the snapshot subdir. Before a failing attempt it is listed as belonging to root:root, with some generic permissions. Before the second, succeeding attempt, the ownership as well as the permissions actually match those on the top level of the file system in question. Is it some cached metadata from the first attempt that makes all the difference the second time around?

root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
dr-xr-xr-x 1 root root 0 Sep  9 14:01 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
-bash: cd: H21/: Too many levels of symbolic links
root@halleck:/home/andreas/.zfs/snapshot# ls -ld H21
drwxr-x--x 31 andreas andreas 49 Sep  8 22:29 H21
root@halleck:/home/andreas/.zfs/snapshot# cd H21/
root@halleck:/home/andreas/.zfs/snapshot/H21#

Seeing this running a 64-bit Ubuntu 12.04, on the 3.2.0-30-generic kernel, with zfs 0.6.0.71.

@behlendorf
Contributor

I wonder if this is simply due to the snapshot being slow to mount. The subsequent attempt would work because the snapshot was then successfully mounted. It would continue to work until the snapshot gets automatically unmounted due to inactivity.

The .zfs/snapshot directory is implemented by mounting the required snapshot on demand. Basically, traversal into the snapshot triggers the mount, and the system call blocks until it completes. This makes the process transparent to the user and greatly simplifies the kernel code, since each snapshot can be treated as an individual mount point. However, perhaps some races remain.
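One quick way to check whether that's what is happening (the dataset name below is just an example) is to watch /proc/mounts; an entry for the snapshot should appear after the first access and disappear again once the snapshot is automatically unmounted:

$ ls /tank/.zfs/snapshot/somesnap > /dev/null
$ grep '\.zfs/snapshot' /proc/mounts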

Incidentally, the permission issue you reported just reflects how the mount point is permissioned before the snapshot gets mounted on top, so that's to be expected.

The above strace output is valuable, but the ideal bit of debugging to have would be a call trace using ftrace or systemtap. We'd be able to see exactly where that ELOOP was returned in the kernel to chdir().
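For anyone willing to try, a minimal systemtap starting point might be something like the following one-liner (it needs the kernel debuginfo packages installed, and the probe point is only a guess based on the strace above, so it may need adjusting for your kernel):

$ sudo stap -e 'probe kernel.function("follow_automount").return { printf("%s: follow_automount returned %d\n", execname(), $return) }'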

@ldonzis

ldonzis commented Oct 9, 2012

I agree. I have this same problem and it's absolutely consistent: the first access gives the error (it doesn't require "cd"; for example, "ls" gives the same message). The second and subsequent accesses are fine. In my case, it's not related to the number of files in the directory. Once it works, it works for "a while" (what is the inactivity timeout?) and then, after some period, the error occurs again.

This is on Ubuntu 12.04, kernel 3.2.0-31, ZFS v0.6.0.80-rc11.

@behlendorf
Contributor

That "a while" would be 5 minutes; by default that's the timeout for expiring idle snapshots which were automounted. If you want to mitigate the issue for now, you can increase it via the zfs_expire_snapshot module option.

$ modinfo module/zfs/zfs.ko | grep expire
parm:           zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int)
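For example, to bump the timeout to an hour (the value is just an illustration), either of the following should work, depending on whether the parameter is writable at runtime on your build and whether you want the change to persist:

$ echo 3600 | sudo tee /sys/module/zfs/parameters/zfs_expire_snapshot
$ echo "options zfs zfs_expire_snapshot=3600" | sudo tee -a /etc/modprobe.d/zfs.conf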

@cronnelly

I'm being affected by this problem too. Is there anything I can do to help debug? Ubuntu 12.10; kernel 3.5.0-18-generic; ZOL 0.6.0-rc12.

@nedbass
Contributor

nedbass commented Dec 12, 2012

I've been digging into this problem. The process loops in follow_managed(), calling follow_automount() each time until it hits the 40-level limit, as shown by the following output from a custom systemtap script.

1355336265 ls(63225) kernel.function("follow_managed@/build/buildd/linux-3.2.0/fs/namei.c:797") zfs-auto-snap_daily-2012-12-08-0747 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
...

The follow_automount probe shows the dentry->d_flags and the path structure.

Notice that the dentry and mnt pointers never change. I think that in order to exit the while loop (shown below) the path->dentry pointer needs to point to the dentry for the root of the newly-mounted filesystem after the call to follow_automount(). This is taken care of in follow_automount() for the non-mount-collision case.

I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.

 804         /* Given that we're not holding a lock here, we retain the value in a   
 805          * local variable for each dentry as we look at it so that we don't see 
 806          * the components of that value change under us */                      
 807         while (managed = ACCESS_ONCE(path->dentry->d_flags),                    
 808                managed &= DCACHE_MANAGED_DENTRY,                                
 809                unlikely(managed != 0)) {                                        
 810                 /* Allow the filesystem to manage the transit without i_mutex   
 811                  * being held. */                                               
 812                 if (managed & DCACHE_MANAGE_TRANSIT) {                          
 813                         BUG_ON(!path->dentry->d_op);                            
 814                         BUG_ON(!path->dentry->d_op->d_manage);                  
 815                         ret = path->dentry->d_op->d_manage(path->dentry, false);
 816                         if (ret < 0)                                            
 817                                 break;                                          
 818                 }                                                               
 819                                                                                 
 820                 /* Transit to a mounted filesystem. */                          
 821                 if (managed & DCACHE_MOUNTED) {                                 
 822                         struct vfsmount *mounted = lookup_mnt(path);            
 823                         if (mounted) {                                          
 824                                 dput(path->dentry);                             
 825                                 if (need_mntput)                                
 826                                         mntput(path->mnt);                      
 827                                 path->mnt = mounted;                            
 828                                 path->dentry = dget(mounted->mnt_root);         
 829                                 need_mntput = true;                             
 830                                 continue;                                       
 831                         }                                                       
 832                                                                                 
 833                         /* Something is mounted on this dentry in another       
 834                          * namespace and/or whatever was mounted there in this  
 835                          * namespace got unmounted before we managed to get the 
 836                          * vfsmount_lock */                                     
 837                 }                                                               
 838                                                                                 
 839                 /* Handle an automount point */                                 
 840                 if (managed & DCACHE_NEED_AUTOMOUNT) {                          
 841                         ret = follow_automount(path, flags, &need_mntput);      
 842                         if (ret < 0)                                            
 843                                 break;                                          
 844                         continue;                                               
 845                 }                                                               
 846                                                                                 
 847                 /* We didn't change the current path point */                   
 848                 break;                                                          
 849         }     

@behlendorf
Contributor

It's still not clear to me why this only sometimes fails.

I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.

You should be able to do this with follow_down_one.

@nedbass
Contributor

nedbass commented Dec 12, 2012

It's still not clear to me why this only sometimes fails.

Me neither. If the problem is as I described, it seems like it should always fail, unless the path pointer is shared and I just always "win" the race on my desktop. It always fails on my workstation, but I haven't reproduced it in a VM running the same kernel and ZFS versions.

You should be able to do this with follow_down_one.

Cool, I'll give that a try. Thanks

@nedbass
Contributor

nedbass commented Dec 12, 2012

Adding follow_up(path) to zpl_snapdir_automount() fixes it for me.

diff --git a/module/zfs/zpl_ctldir.c b/module/zfs/zpl_ctldir.c
index 7dfaf6e..09585c4 100644
--- a/module/zfs/zpl_ctldir.c
+++ b/module/zfs/zpl_ctldir.c
@@ -356,6 +356,8 @@ zpl_snapdir_automount(struct path *path)
        if (error)
                return ERR_PTR(error);

+       follow_up(path);
+
        /*
         * Rather than returning the new vfsmount for the snapshot we must
         * return NULL to indicate a mount collision.  This is done because

nedbass added a commit to nedbass/zfs that referenced this issue Dec 13, 2012
Ensure that the path member pointers are associated with the
newly-mounted snapshot when zpl_snapdir_automount() returns.  Otherwise
the follow_automount() function may be called repeatedly, leading to an
incorrect ELOOP error return. This problem was observed as a 'Too many
levels of symbolic links' error from user-space commands accessing an
unmounted snapshot in the .zfs/snapshot directory.

Issue openzfs#816
@nedbass
Contributor

nedbass commented Dec 13, 2012

@cronnelly It would be great if anyone else having this issue could test the above patch before I submit a pull request. Thanks
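Roughly, for anyone testing from source (paths and file names below are just an example; 0.6.0-era builds also need a matching SPL):

$ cd zfs                              # existing ZFS source checkout
$ patch -p1 < automount-fix.patch     # the one-line diff above, saved locally
$ ./autogen.sh && ./configure && make
$ sudo make install                   # then reload the zfs module (or reboot) before retrying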

@behlendorf
Contributor

Up... down... I always get those confused. Based on your analysis it does look like this should resolve the issue. It would be great if some of the folks watching this issue could verify that the proposed one-line fix resolves the problem for them as well.

@andreaso

Seems to do the trick.

I have a server with a set of snapshots on which I could trigger the "Too many levels of symbolic links" response pretty much every time. Now with this patch I haven't been able to reproduce the bug.

Thanks!

@mgmartin

Works great for me. I was hitting this issue 100% of the time. Snapshot mounts are immediate now with no initial error.

@ldonzis

ldonzis commented Dec 13, 2012

Likewise, the problem was completely repeatable and reproducible, and now it works perfectly. The only thing is, when I run "ls -l /xxxx/.zfs/snapshot/*" we end up with a very large number of "mount" commands running for a few minutes. Not that this is a normal operation, mind you, but it's not exactly scalable to huge numbers of snapshots.

Thanks for the fix!!

@behlendorf
Contributor

Thank you everyone, this fix was merged.

behlendorf added a commit that referenced this issue Jan 9, 2013
This reverts commit 7afcf5b which
accidentally introduced a regression with the .zfs snapshot directory.
While the updated code still correctly mounts the requested snapshot, it
updates the vfsmount such that it references the original dataset
vfsmount. The result is that the snapshot itself isn't visible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #816
@behlendorf
Contributor

Reopening this issue since the fix introduced a regression which wasn't initially caught.

behlendorf reopened this Jan 9, 2013
nedbass added a commit to nedbass/zfs that referenced this issue Jan 10, 2013
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.

One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.

Closes openzfs#816
@behlendorf
Contributor

The real root cause for the racy behavior was identified and fixed. Thanks Ned.

761394b call_usermodehelper() should wait for process

@ioquatix

For some reason this is an issue for me: cd and other commands fail on snapshots. Funnily enough, it's to do with Minecraft too:

# cp -R /home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft ~/minecraft-backup-snapshot
cp: cannot stat ‘/home/.zfs/snapshot/backup-20140619-195402/gaming/minecraft’: Too many levels of symbolic links

Not sure if this issue has regressed or if it's a new issue.

I was just trying to do an strace and got a kernel panic. See the attached image.

@ioquatix

Here is the kernel panic, running zfs 0.6.4.2_r0_g44b5ec8_4.1.4_1-1:

[attached photo of the kernel panic: img_5281]

@ioquatix

@behlendorf Sorry man just bumping this in case you didn't see it.

@drescherjm

That kernel panic is a duplicate. I saw that months ago and reported it.

@ioquatix

@drescherjm perhaps you can link to the other issue?

@drescherjm

#3257

@pivot69

pivot69 commented Dec 8, 2016

I'm using ZFS on Ubuntu Server 16.04.1 and I had the same issue with the symlink error when accessing the snapshots. I got the error after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04).

After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.) it still did not work. I did a reboot and suddenly everything worked again, until I transferred new incremental snapshots.

This led me to try unmounting and remounting the filesystem each time after I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, roughly as sketched below, and I am no longer bothered by this bug.
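Something like this after each transfer, with my real pool and dataset names swapped for placeholders:

$ zfs receive -F backup/data          # the incremental stream, e.g. piped in over ssh
$ zfs umount backup/data && zfs mount backup/data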

This is not a fix, it is only a workaround. But in case someone cannot get it working, even with the newest versions of everything, then try this! :)

@drescherjm

I thought this was fixed long ago.

@pivot69

pivot69 commented Dec 8, 2016

Me too. I realize this is an old issue, but I thought it could be nice to post my solution here too.
It's the same issue as this: #4514

Just some additional info:
The ubuntu server sending snapshots (14.04) has the ubuntu-zfs package installed.
[ 1.570547] ZFS: Loaded module v0.6.5.7-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
The ubuntu server receiving snapshots (16.04.1) has zfs native
[ 17.440504] ZFS: Loaded module v0.6.5.6-0ubuntu10, ZFS pool version 5000, ZFS filesystem version 5

@eladik

eladik commented Mar 10, 2017

I am experiencing this issue; the unmount/mount workaround did it for me.

@chrwei

chrwei commented Jun 13, 2018

This issue is getting long in the tooth, but it still exists on a newly installed, fully updated Ubuntu 16.04 with incrementally received snapshots. Normal snapshots work fine. The unmount/mount workaround does work, so it's certainly a cache issue.

I'm sending my snaps using http://www.bolthole.com/solaris/zrep/ if that matters. It's easy to make a test case to reproduce it using this configuration method: http://www.bolthole.com/solaris/zrep/zrep.documentation.html#backupserver
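Stripped of zrep, the core of that test case is roughly this (pool and dataset names are placeholders, and I haven't re-verified this exact sequence):

$ zfs create tank/src
$ zfs snapshot tank/src@one
$ zfs send tank/src@one | zfs receive tank/dst
$ zfs snapshot tank/src@two
$ zfs send -i @one tank/src@two | zfs receive -F tank/dst
$ cd /tank/dst/.zfs/snapshot/two      # this is where I intermittently get "Too many levels of symbolic links"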

@behlendorf
Contributor

@chrwei are you able to reproduce this with Ubuntu 18.04? It's likely this was resolved in a newer version of ZFS; can you check exactly which version you're running with cat /sys/module/zfs/version? If you're still able to reproduce it with 0.7.x or newer, it would be helpful if you could put together a small script which reproduces the issue.

@chrwei

chrwei commented Jun 13, 2018

I am on 0.6.5.6-0ubuntu20.

I don't have any 18.04 and don't plan on it for some time.

@kdb424

kdb424 commented Jan 12, 2019

To add to this mystery, I have also hit this error when mounting my ZFS pool via sshfs, with an unmount and remount fixing it as well. It only seems to affect ZFS pools on my system, even with the same data. ZFS is running on a fully updated Proxmox, and the sshfs client is a fully updated Manjaro machine.
EDIT: ZFS Version 0.7.12-1

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023
Merge with `Allow MMP to bypass waiting for other threads`