"Too many levels of symbolic links" when "cd"ing to snapshot subdir #816
Thanks for filing the bug, I've seen this once before and we didn't manage to run it down then. |
Not sure how helpful this is, but "me too". I have a pool which I have been using under CentOS 6.3 x86_64 (2.6.32), and there I have issues with a system hang when running find inside the .zfs subdirectory (with a load of snapshots present). I just thought I'd try the same pool under Ubuntu 12.04 x86_64 (3.2.0-23), and although I see no system hang, instead I get intermittent errors like this:
find: ‘/tank1/data1/.zfs/snapshot/2012.0608.0113.Fri.snapadm.weekly’: Too many levels of symbolic links
Command exited with non-zero status 1
There are no symbolic links though. Even without this error, I don't think the find command is finding everything it should. ZFS on Linux 0.6.0 rc9. Andy |
Hi, same problem here. |
Hi, in my case I found several symlinks in the folder (but the same folder under ext4 doesn't cause any problem). To find all symlinks you can use: sudo find /zpool/dataset -type l -exec ls -l {} \; |
for me, I get this randomly accessing files (via python) over NFS mounted zvol. OSError(40, 'Too many levels of symbolic links') |
If you're at all able to reproduce this it would be very helpful to get an strace. |
I'll attempt to get an strace |
Running the command in zsh under strace captures the failing step; the trace shows chdir() on the snapshot path returning ELOOP.
Running the "cd" again works OK. This is on Ubuntu 12.04, 64 bit, 3.2.0-27-generic, ZFS v0.6.0.65-rc9. |
It seems directly related to the number of files in a folder. If there are 300 files in a folder, no errors ever. If there are 7K files, then I get the error quite often. |
I am seeing the same symptom: I get the same backtrace as mkj, and I notice the same pattern as msmitherdc, in that it affects file systems containing lots of files. Likewise it only happens the first time I try entering a snapshot subdir within a certain time window; the second attempt right afterwards always seems to succeed. I did notice a difference in the reported ownership of the snapshot subdir. Before a failing attempt it is listed as belonging to root:root, with some generic permissions. Before the second, succeeding attempt, the ownership as well as the permissions actually match the existing ones on the top level of the file system in question. Some cached metadata from the first attempt, making all the difference the second time around?
Seeing this running a 64-bit Ubuntu 12.04, on the 3.2.0-30-generic kernel, with zfs 0.6.0.71. |
I wonder if this is simply due to the snapshot being slow to mount. The subsequent attempt would work because the snapshot was then successfully mounted. It would continue to work until the snapshot gets automatically unmounted due to inactivity.
The way the .zfs/snapshot directory was implemented is by mounting the required snapshot on demand. Basically, the traversal into the snapshot triggers the mount and will block in the system call until it completes. This makes the process transparent to the user and greatly simplifies the kernel code, since each snapshot can be treated as an individual mount point. However, perhaps there are some races which remain.
Incidentally, the permission issue you reported is just how the mount point is permissioned before the snapshot gets mounted on top, so that's to be expected.
The above strace output is valuable, but the ideal bit of debugging to have would be a call trace using ftrace or systemtap. We'd be able to see exactly where that ELOOP was returned in the kernel to chdir(). |
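If the slow-mount hypothesis is right, the on-demand mount itself should be observable from user space. A rough check, with placeholder pool and snapshot names, would be:
# Before the first access the snapshot should not be listed as mounted.
grep zfs /proc/mounts | grep snapshot
# Trigger the automount by traversing into the snapshot, then look again;
# a new mount entry under /tank/.zfs/snapshot/mysnap should appear.
ls /tank/.zfs/snapshot/mysnap > /dev/null
grep zfs /proc/mounts | grep snapshot
# After the idle timeout the entry should be gone again.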
I agree, I have this same problem and it's absolutely consistent: the first access gives the error (it doesn't require "cd", for example, "ls" gives the same message). The second and subsequent accesses are fine. In my case, it's not related to the number of files in the directory. Once it works, it works for "a while" (what is the inactivity timeout?) and then after some period, the error occurs again. This is on Ubuntu 12.04, kernel 3.2.0-31, ZFS v0.6.0.80-rc11. |
That "awhile" would be 5 minutes. By default that's the timeout to expire idle snapshots which were automounted. If you want to mitigate the issue for now you could crank this use by increasing the $ modinfo module/zfs/zfs.ko | grep expire parm: zfs_expire_snapshot:Seconds to expire .zfs/snapshot (int) |
I'm being affected by this problem too. Is there anything I can do to help debug? Ubuntu 12.10; kernel 3.5.0-18-generic; ZOL 0.6.0-rc12. |
I've been digging into this problem. The process loops in follow_managed(), calling follow_automount() each time until it hits the 40 level limit, as shown by the following output from a custom systemtap script:
1355336265 ls(63225) kernel.function("follow_managed@/build/buildd/linux-3.2.0/fs/namei.c:797") zfs-auto-snap_daily-2012-12-08-0747 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
1355336265 ls(63225) kernel.function("follow_automount@/build/buildd/linux-3.2.0/fs/namei.c:717") 131264 {.mnt=0xffff880610de1a00, .dentry=0xffff88059b44b380}
...
The follow_automount probe shows the dentry->d_flags and the path structure. Notice that the dentry and mnt pointers never change. I think that in order to exit the while loop in follow_managed(), the path->dentry pointer needs to point to the dentry for the root of the newly-mounted filesystem after the call to follow_automount(). This is taken care of in follow_automount() for the non-mount-collision case. I'm thinking that if zfsctl_mount_snapshot() could get a pointer to the struct vfsmount for the newly-mounted snapshot, then it could update the struct path. But I'm not sure how to do that; it looks like lookup_mnt() is what we need, but it's not exported by the kernel.
|
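For reference, a trimmed-down sketch of that kind of systemtap probe (not the exact script used above; it needs kernel debug symbols, and the snapshot path is a placeholder) could look like:
# Log every call to follow_automount() while a snapshot directory is listed.
sudo stap -e 'probe kernel.function("follow_automount") {
    printf("%d %s(%d) %s\n", gettimeofday_s(), execname(), pid(), probefunc())
}' -c 'ls /tank/.zfs/snapshot/mysnap'
# A burst of repeated follow_automount() hits for a single ls is the looping behavior described above.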
It's still not clear to me why this only sometimes fails.
You should be able to do this with |
Me neither. If the problem is as I described it seems like it should always fail. Unless the path pointer is shared and I just always "win" the race on my desktop. For me it always fails on my workstation, but I haven't reproduced it in a VM running the same kernel and ZFS versions.
Cool, I'll give that a try. Thanks |
Adding
|
Ensure that the path member pointers are associated with the newly-mounted snapshot when zpl_snapdir_automount() returns. Otherwise the follow_automount() function may be called repeatedly, leading to an incorrect ELOOP error return. This problem was observed as a 'Too many levels of symbolic links' error from user-space commands accessing an unmounted snapshot in the .zfs/snapshot directory. Issue openzfs#816
@cronnelly It would be great if anyone else having this issue could test the above patch before I submit a pull request. Thanks |
Up... down... I always get those confused. Based on your analysis it does look like this should resolve the issue. It would be great if some of the folks watching this issue could verify that the proposed one-line fix resolves the problem for them as well. |
Seems to do the trick. I have a server with a set of snapshots on which I could almost always trigger the "Too many levels of symbolic links" response. Now with this patch I haven't been able to reproduce the bug. Thanks! |
Works great for me. I was hitting this issue 100% of the time. Snapshot mounts are immediate now with no initial error. |
Likewise, the problem was completely repeatable and reproducible, and now it works perfectly. The only thing is, when I run "ls -l /xxxx/.zfs/snapshot/*" we end up with a very large number of "mount" commands running for a few minutes. Not that this is a normal operation, mind you, but it's not exactly scalable to huge numbers of snapshots. Thanks for the fix!! |
Thank you everyone, this fix was merged. |
This reverts commit 7afcf5b, which accidentally introduced a regression with the .zfs snapshot directory. While the updated code still correctly mounts the requested snapshot, it updates the vfsmount such that it references the original dataset vfsmount. The result is that the snapshot itself isn't visible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #816
Reopening this issue since the fix introduced a regression which wasn't initially caught. |
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the process). A number of call sites used the number 1 instead of the constant name, so the behavior was not as expected on kernels with this change. One visible consequence of this change was that processes accessing automounted snapshots received an ELOOP error because they failed to wait for zfs.mount to complete. Closes openzfs#816
The real root cause for the racy behavior was identified and fixed. Thanks Ned. 761394b call_usermodehelper() should wait for process |
For some reason this is an issue for me.
Not sure if this issue has regressed or if it's a new one. I was just trying to do an strace and got a kernel panic; see the attached image. |
@behlendorf Sorry man just bumping this in case you didn't see it. |
That kernel panic is a duplicate. I saw that months ago and reported it. |
@drescherjm perhaps you can link to the other issue? |
I'm using zfs on Ubuntu Server 16.04.1 and I had the same issue with the symlink error when accessing the snapshots. I got the error after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04). After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.) it still did not work. I did a reboot and suddenly everything worked again - until I transferred new incremental snapshots. This led me to try unmounting and remounting the filesystem after each time I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, and I am no longer bothered by this bug. This is not a fix, it is only a workaround, but in case someone cannot get it working, even with the newest versions of everything, then try this! :) |
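For what it's worth, a minimal sketch of that workaround (host, pool, dataset, and snapshot names are placeholders) would be:
# Hypothetical script fragment: after each incremental receive, cycle the mount
# so that .zfs/snapshot behaves again.
ssh oldserver zfs send -i tank/data@prev tank/data@new | zfs receive -F backup/data
zfs unmount backup/data
zfs mount backup/data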
I thought this was fixed long ago. |
Me too. I realize this is an old issue, but I thought it could be nice to post my solution here too. Just some additional info:
I am experiencing this issue, unmount/mount workaround did it for me. |
This issue is getting long in the tooth, but it still exists on a newly installed, fully updated Ubuntu 16.04 with incrementally received snapshots. Normal snapshots work fine. The unmount/mount workaround does work, so it's certainly a cache issue. I'm sending my snaps using http://www.bolthole.com/solaris/zrep/ if that matters. It's easy to make a test case to reproduce it using this configuration method: http://www.bolthole.com/solaris/zrep/zrep.documentation.html#backupserver |
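Independently of zrep, a rough local reproduction of that setup (dataset names are placeholders; the error is only expected on affected ZFS versions) might be:
# Hypothetical reproduction: replicate a dataset incrementally, then enter the received snapshot.
zfs snapshot tank/data@s1
zfs send tank/data@s1 | zfs receive backup/data
zfs snapshot tank/data@s2
zfs send -i tank/data@s1 tank/data@s2 | zfs receive -F backup/data
# The first access to the received snapshot is where the reported error shows up:
ls /backup/data/.zfs/snapshot/s2
# The unmount/mount workaround from the comments above:
zfs unmount backup/data && zfs mount backup/data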
@chrwei are you able to reproduce this with Ubuntu 18.04? It's likely this was resolved in a newer version of ZFS. Can you check exactly which version you're running? |
I am on 0.6.5.6-0ubuntu20. I don't have any 18.04 and don't plan on it for some time. |
To add to this mystery, I have also found issues with this error mounting my ZFS pool via sshfs, with an unmount and remount fixing it as well. It only seems to affect zfs pools on my system, even with the same data. ZFS is running on Proxmox latest fully updated, and sshfs client is a fully updated Manjaro client. |
I'm really not sure where the problem lies. There are no symlinks in the entire path here, and the error does not always occur. The error goes away if I "split up" the chdir, as demonstrated below.
Linux dc 3.4.1-vs2.3.3.4 #2 SMP Sat Jun 23 16:39:09 MST 2012 x86_64 Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz GenuineIntel GNU/Linux
zfs/spl 0.6.0_rc9
-[root@dc]-[5.92/10.60/10.65]-66%-0d19h15m-2012-07-09T14:30:03-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world
bash: cd: /backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world: Too many levels of symbolic links
-[root@dc]-[5.81/7.28/9.21]-64%-0d19h23m-2012-07-09T14:37:35-
-[/backup/1/minecraft501/.zfs/snapshot/20120701-2302/home/craft/bukkit/world:#]- cd /backup/1/minecraft501/.zfs/snapshot/
-[root@dc]-[24.16/11.15/10.45]-64%-0d19h23m-2012-07-09T14:37:45-
-[/backup/1/minecraft501/.zfs/snapshot:#]- cd 20120703-1202/home/craft/bukkit/world
-[root@dc]-[22.95/11.12/10.44]-64%-0d19h23m-2012-07-09T14:37:51-
-[/backup/1/minecraft501/.zfs/snapshot/20120703-1202/home/craft/bukkit/world:#]-