
Reduce the reliance on PATH_MAX #27930

Merged 3 commits on Aug 27, 2015
Conversation

@barosl (Contributor) commented Aug 21, 2015

This PR rewrites the code that previously relied on PATH_MAX.

In my tests, even though the user gives the buffer length explicitly, both Linux's glibc and OS X's libc seem to obey the hard limit of PATH_MAX internally. So, to truly remove the limitation of PATH_MAX, the related system calls would have to be rewritten from scratch in Rust, which this PR does not try to do.

However, eliminating the need for PATH_MAX is still a good idea for various reasons: (1) the libraries might change their implementation in the future, and (2) some platforms, such as GNU Hurd, don't have a hard-coded PATH_MAX at all.

More details are in the commit messages.

Fixes #27454.

r? @alexcrichton

@rust-highfive (Collaborator)

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

}

let new_cap = buf.capacity() * 2;
buf.reserve(new_cap);
Member:

This is actually the extra space to reserve, so the buffer may be growing a little more rapidly than anticipated. As with read_to_end, this can probably just call buf.reserve(1) after setting the length to the capacity, letting the internal doubling logic take over.

Contributor Author:

The length of buf remains 0 until the end, so I think buf.reserve() will work as expected in this case. The problem is, as you said, that Vec::reserve has the internal doubling logic, so the capacity will grow as N, 4N, 16N, 64N, ....

The recently added getcwd has similar logic, but it uses an external variable to track the desired capacity, so the capacity will not quadruple on each iteration. However, every other Vec::reserve call will then be a no-op, because the previously increased capacity is already sufficient to meet the new requirement. This also means every other system call is wasted.

Correct me if I'm wrong, but the behavior of read_to_end_uninitialized seems a bit misleading too. What it will do is N, 2(N+1), 2(2(N+1))+1, ..., because the internal doubling is applied after the capacity is increased by one. Is this the expected behavior? I don't know much about the allocator, but I would guess that keeping the capacity at a power of two is ideal, and this behavior doesn't follow that.

There seems to be a more appropriate function for this purpose: Vec::reserve_exact. It has almost identical behavior except that it does not double the capacity internally, so it is useful when the caller can provide an appropriate new capacity and wants to prevent excessive memory allocation. What do you think of this?
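As an editorial sketch of the difference being discussed (capacity values are illustrative only; Vec merely guarantees a lower bound on capacity):

```rust
fn main() {
    // Vec::reserve may over-allocate (amortized doubling), while
    // Vec::reserve_exact requests only what is asked for; both merely
    // guarantee `capacity() >= len + additional`.
    let mut a: Vec<u8> = Vec::new();
    a.reserve(10);
    assert!(a.capacity() >= 10);

    // Tracking the desired capacity in an external variable, as getcwd does,
    // gives exact doubling when combined with reserve_exact:
    let mut want: usize = 512;
    let mut buf: Vec<u8> = Vec::with_capacity(want);
    for _ in 0..3 {
        want *= 2;
        buf.reserve_exact(want); // len is 0, so this requests exactly `want`
        assert!(buf.capacity() >= want);
    }
    assert!(buf.capacity() >= 4096);
}
```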

Member:

Ah, actually, because the length of the buffer isn't changing here (nor in the getcwd case), it's working as anticipated. The reserve function reserves space for n extra elements, but because the length of the vector is 0, it works out that you're just specifying the capacity.

In read_to_end_uninitialized the request is always rounded up to the next power of two, so it ends up just doubling the vector whenever it reallocates. So long as the len is updated to match the capacity here (and possibly in getcwd too), then reserve(1) should be all you need.

Contributor Author:

> In read_to_end_uninitialized the request is always rounded up to the next power of two so it ends up just doubling the vector whenever it reallocates.

Actually, what I said above is that it does not work like that! For example, if I add println!("{:?}", buf.capacity()) right after the buf.reserve(1) statement in the function, it outputs:

32
66
134
270
542
1086
2174

It doesn't produce powers of two, because 1 is added before the capacity is doubled. That's why I say it doesn't seem ideal. I believe modern allocators tend to work well when the requested size is a power of two, so the new capacity should be adjusted to a power of two if possible, but read_to_end_uninitialized doesn't seem to follow that. Is my observation correct?

Therefore my point is: to truly get a new capacity that is a power of two, one might have to use reserve_exact or something similar. If the behavior of read_to_end_uninitialized is unintended, it might also have to be modified to use it.

Member:

Ah yeah, I guess that makes sense. In any case, however, these should all follow the same strategy instead of reinventing the wheel. Let's have them all take the reserve(1) route to punt to Vec as much as possible for figuring out its reallocation strategy.
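The reserve(1) strategy being settled on here can be sketched as follows. Note this mirrors the pre-1.0 std idiom under discussion; today's code would likely use MaybeUninit rather than set_len over uninitialized bytes:

```rust
fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(16);
    let before = buf.capacity();
    // Pretend the whole capacity has been filled by a system call, then ask
    // for one more byte so Vec's amortized growth (doubling in practice)
    // picks the next capacity for us.
    unsafe { buf.set_len(buf.capacity()); }
    buf.reserve(1);
    assert!(buf.capacity() > before);
}
```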

Contributor Author:

Uhm, sorry for asking this again, but there's still a remaining concern. As the length is not actually set to the capacity at each iteration currently, calling buf.reserve(1) would be a no-op in my code. Do you think the length should be updated every time before calling reserve? Something like:

loop {
    let buf_read = try!(readlink(...));

    unsafe { buf.set_len(buf_read); } // This will be added

    if buf_read != buf.capacity() {
        unsafe { buf.set_len(buf_read); } // This should probably be removed
        buf.shrink_to_fit();

        return Ok(...);
    }

    buf.reserve(1);
}

That feels a bit odd to me because unlike read_to_end_uninitialized, we abandon the buffer immediately if readlink fails. But if you're OK with the code above, I'll follow that.

Plus, do you think getcwd might also have to be modified to use reserve(1)? This may have to be another PR, but if you like it, I can add another commit to address this.

Sorry for disturbing you, this shall be my last review request!

Member:

Yeah that sort of loop is fine to do, and it's fine by me if you want to share an implementation between getcwd and readlink. Thanks!

@barosl (Contributor Author) commented Aug 26, 2015

I addressed all of the comments and added another commit to modify getcwd. Please review it!

@@ -30,7 +30,7 @@ use sys::c;
 use sys::fd;
 use vec;

-const GETCWD_BUF_BYTES: usize = 2048;
+const GETCWD_BUF_BYTES: usize = 512;
Member:

It's OK to just go ahead and delete this constant and inline its value; I'm not sure it's buying us a whole lot up here.

@alexcrichton (Member):

Looks good to me, thanks @barosl! r=me after a few small nits and a squash

@barosl (Contributor Author) commented Aug 27, 2015

Done. Which commits do you think should be squashed? I have a feeling that the changes to libc should be kept separate, but what about the others? Should I combine the commit messages as well? For now, I rebased them onto the current HEAD.

@alexcrichton (Member):

Yeah, it's fine to leave libc separate, but feel free to squash the other four; so long as each commit message makes sense when taken in isolation, I'm fine with it. Thanks!

- Rewrite `std::sys::fs::readlink` not to rely on `PATH_MAX`

It currently has the following problems:

1. It uses `_PC_NAME_MAX` to query the maximum length of a file path in
the underlying system. However, the meaning of the constant is the
maximum length of *a path component*, not a full path. The correct
constant should be `_PC_PATH_MAX`.

2. `pathconf` *may* fail if the referenced file does not exist. This can
be problematic if the file which the symbolic link points to does not
exist but the link itself does. In this case, the current implementation
falls back to the hard-coded value of `1024`, which is not ideal.

3. There may exist platforms where there is no limit on file path
lengths at all. That's the reason why GNU Hurd doesn't define
`PATH_MAX`, in addition to having `pathconf` always return `-1`. On
these platforms, the content of the symbolic link can be silently
truncated if its length exceeds the hard-coded limit mentioned above.

4. The value obtained by `pathconf` may be outdated by the time
`readlink` is actually called. This is inherently racy.

This commit introduces a loop that gradually increases the length of the
buffer passed to `readlink`, eliminating the need for `pathconf`.
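A minimal POSIX-only sketch of such a loop (the function name, starting size, and FFI declaration are illustrative, not the code from this PR):

```rust
use std::ffi::CString;
use std::os::raw::c_char;

extern "C" {
    // POSIX readlink(2); declared here so the sketch is self-contained
    // (real code would use the libc crate's binding).
    fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: usize) -> isize;
}

/// Grow-and-retry readlink. POSIX readlink truncates silently, so the only
/// reliable signal that the buffer was large enough is a return value
/// strictly smaller than the capacity we passed in.
fn read_link_bytes(path: &str) -> std::io::Result<Vec<u8>> {
    let c_path = CString::new(path).unwrap();
    let mut buf: Vec<u8> = Vec::with_capacity(256);
    loop {
        let n = unsafe {
            readlink(c_path.as_ptr(), buf.as_mut_ptr() as *mut c_char, buf.capacity())
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        let n = n as usize;
        if n < buf.capacity() {
            // The target fit with room to spare, so it was not truncated.
            unsafe { buf.set_len(n); }
            buf.shrink_to_fit();
            return Ok(buf);
        }
        // Possibly truncated: mark the capacity as used and let Vec's
        // amortized growth pick the next (roughly doubled) size.
        unsafe { buf.set_len(buf.capacity()); }
        buf.reserve(1);
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("readlink_sketch");
    std::fs::create_dir_all(&dir)?;
    let link = dir.join("link");
    let _ = std::fs::remove_file(&link);
    std::os::unix::fs::symlink("/some/target/path", &link)?;
    assert_eq!(read_link_bytes(link.to_str().unwrap())?, b"/some/target/path");
    Ok(())
}
```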

- Remove the arbitrary memory limit of `std::sys::fs::realpath`

As per POSIX 2013, `realpath` will return a malloc'ed buffer if the
second argument is a null pointer.[1]

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/realpath.html
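A hedged FFI sketch of that behavior (hypothetical helper name; error handling simplified):

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_void};

extern "C" {
    // POSIX.1-2008 realpath(3): with a NULL second argument, libc mallocs a
    // buffer of the right size, removing any PATH_MAX assumption.
    fn realpath(path: *const c_char, resolved: *mut c_char) -> *mut c_char;
    fn free(ptr: *mut c_void);
}

fn canonicalize(path: &str) -> std::io::Result<String> {
    let c_path = CString::new(path).unwrap();
    unsafe {
        let ptr = realpath(c_path.as_ptr(), std::ptr::null_mut());
        if ptr.is_null() {
            return Err(std::io::Error::last_os_error());
        }
        let s = CStr::from_ptr(ptr).to_string_lossy().into_owned();
        free(ptr as *mut c_void); // the buffer was malloc'ed by libc
        Ok(s)
    }
}

fn main() {
    // "/" always resolves to itself.
    assert_eq!(canonicalize("/").unwrap(), "/");
}
```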

- Comment on functions that are still using `PATH_MAX`

There are some functions that only work in terms of `PATH_MAX`, such as
`F_GETPATH` in OS X. Comments are added to them for posterity.

- Make `std::sys::os::getcwd` call `Vec::reserve(1)` followed by
`Vec::set_len` to double the buffer. This is to align with other similar
functions, such as:

- `std::sys_common::io::read_to_end_uninitialized`
- `std::sys::fs::readlink`

Also, reduce the initial buffer size from 2048 to 512. The previous size
was introduced with 4bc26ce in 2013 and seems a bit excessive; it was
probably chosen because buffer doubling was not implemented back then.
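The resulting getcwd shape can be sketched like this (simplified: a real implementation would check that errno is ERANGE before retrying, and the helper name is illustrative):

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

extern "C" {
    // POSIX getcwd(3): returns NULL (with errno set to ERANGE) when the
    // supplied buffer is too small.
    fn getcwd(buf: *mut c_char, size: usize) -> *mut c_char;
}

fn current_dir_bytes() -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(512);
    loop {
        unsafe {
            if !getcwd(buf.as_mut_ptr() as *mut c_char, buf.capacity()).is_null() {
                // getcwd NUL-terminates on success; measure the string it wrote.
                let len = CStr::from_ptr(buf.as_ptr() as *const c_char).to_bytes().len();
                buf.set_len(len);
                buf.shrink_to_fit();
                return buf;
            }
            // Same doubling idiom as readlink: fill the capacity, then
            // reserve one more byte so Vec picks the next size.
            buf.set_len(buf.capacity());
            buf.reserve(1);
        }
    }
}

fn main() {
    let cwd = current_dir_bytes();
    // On Unix the working directory is always an absolute path.
    assert!(cwd.starts_with(b"/"));
}
```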
@barosl (Contributor Author) commented Aug 27, 2015

Done, squashed into three commits.

@alexcrichton (Member):

@bors: r+ 6065678

Thanks!

bors added a commit that referenced this pull request Aug 27, 2015
@bors (Contributor) commented Aug 27, 2015

⌛ Testing commit 6065678 with merge 79dd92f...

@bors bors merged commit 6065678 into rust-lang:master Aug 27, 2015
tbu- added a commit to tbu-/rust that referenced this pull request Aug 30, 2015
For Bitrig, NetBSD and OpenBSD the constant was incorrectly placed in posix01
when it actually belongs in posix08, so we move it. This is a
[breaking-change], but we already had one in rust-lang#27930.

Fix NetBSD's F_DUPFD_CLOEXEC constant.

For a similar feature detection, see this musl thread:
http://comments.gmane.org/gmane.linux.lib.musl.general/2963

This assumes that an int literal has type `c_int` for variadic functions.