
Reduce the reliance on PATH_MAX #27930

Merged 3 commits on Aug 27, 2015
Conversation

@barosl (Contributor) commented Aug 21, 2015

This PR rewrites the code that previously relied on PATH_MAX.

In my tests, even though the user gives the buffer length explicitly, both Linux's glibc and OS X's libc seem to obey the hard limit of PATH_MAX internally. So, to truly remove the limitation of PATH_MAX, the related system calls would have to be rewritten from scratch in Rust, which this PR does not try to do.

However, eliminating the need for PATH_MAX is still a good idea for various reasons: (1) the libraries might change their implementation in the future, and (2) some platforms, such as GNU Hurd, don't have a hard-coded PATH_MAX at all.

More details are in the commit messages.

Fixes #27454.

r? @alexcrichton

@rust-highfive (Collaborator)

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

}

let new_cap = buf.capacity() * 2;
buf.reserve(new_cap);
Member:

This is actually the extra space to reserve, so the buffer may be growing a little more rapidly than anticipated. As with read_to_end, this can probably just call buf.reserve(1) after setting the length to the capacity, letting the internal doubling logic take over.

Contributor Author:

The length of buf remains 0 until the end, so I think buf.reserve() will work as expected in this case. The problem is, as you said, that Vec::reserve has the internal doubling logic, so the capacity will grow as N, 4N, 16N, 64N, ....

The recently added getcwd has similar logic, but it uses an external variable to track the desired capacity, so the capacity will not quadruple on each iteration. However, every other Vec::reserve call will then be a no-op, because the previously increased capacity is already sufficient to meet the new requirement. This also means every other system call is wasted.

Correct me if I'm wrong, but the behavior of read_to_end_uninitialized seems a bit misleading too. What it will do is N, 2(N+1), 2(2(N+1))+1, ..., because the internal doubling is applied after the capacity is increased by one. Is this the expected behavior? I don't know much about the allocator, but I would guess that keeping the capacity at a power of two is ideal, and this behavior doesn't follow that.

There seems to be a more appropriate function for this purpose: Vec::reserve_exact. It has almost identical behavior except that it does not double the capacity internally, so it is useful when the caller can provide an appropriate new capacity and wants to prevent excessive memory allocation. What do you think of this?
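As an editorial sketch of the difference being discussed (capacity values are illustrative only; Vec merely guarantees a lower bound on capacity):

```rust
fn main() {
    // Vec::reserve may over-allocate (amortized doubling), while
    // Vec::reserve_exact requests only what is asked for; both merely
    // guarantee `capacity() >= len + additional`.
    let mut a: Vec<u8> = Vec::new();
    a.reserve(10);
    assert!(a.capacity() >= 10);

    // Tracking the desired capacity in an external variable, as getcwd does,
    // gives exact doubling when combined with reserve_exact:
    let mut want: usize = 512;
    let mut buf: Vec<u8> = Vec::with_capacity(want);
    for _ in 0..3 {
        want *= 2;
        buf.reserve_exact(want); // len is 0, so this requests exactly `want`
        assert!(buf.capacity() >= want);
    }
    assert!(buf.capacity() >= 4096);
}
```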

Member:

Ah, actually, because the length of the buffer isn't changing here (nor in the getcwd case), it's working as anticipated. The reserve function reserves space for n extra elements, but because the length of the vector is 0, it works out that you're just specifying the capacity.

In read_to_end_uninitialized the request is always rounded up to the next power of two, so it ends up just doubling the vector whenever it reallocates. So long as the len is updated to match the capacity here (and possibly in getcwd too), then reserve(1) should be all you need.

Contributor Author:

> In read_to_end_uninitialized the request is always rounded up to the next power of two so it ends up just doubling the vector whenever it reallocates.

Actually, what I said above is that it does not work like that! For example, if I add println!("{:?}", buf.capacity()) right after the buf.reserve(1) statement in the function, it outputs:

32
66
134
270
542
1086
2174

It doesn't produce powers of two, because 1 is added before the capacity is doubled. That's why I say it doesn't seem ideal. I believe modern allocators tend to work well when the requested size is a power of two, so the new capacity should be adjusted to a power of two if possible, but read_to_end_uninitialized doesn't seem to follow that. Is my observation correct?

Therefore my point is: to truly get a new capacity that is a power of two, one might have to use reserve_exact or something similar. If the behavior of read_to_end_uninitialized is unintended, it might also have to be modified to use it.

Member:

Ah yeah, I guess that makes sense. In any case, however, these should all follow the same strategy instead of reinventing the wheel. Let's have them all take the reserve(1) route to punt to Vec as much as possible for figuring out its reallocation strategy.
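The reserve(1) strategy being settled on here can be sketched as follows. Note this mirrors the pre-1.0 std idiom under discussion; today's code would likely use MaybeUninit rather than set_len over uninitialized bytes:

```rust
fn main() {
    let mut buf: Vec<u8> = Vec::with_capacity(16);
    let before = buf.capacity();
    // Pretend the whole capacity has been filled by a system call, then ask
    // for one more byte so Vec's amortized growth (doubling in practice)
    // picks the next capacity for us.
    unsafe { buf.set_len(buf.capacity()); }
    buf.reserve(1);
    assert!(buf.capacity() > before);
}
```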

Contributor Author:

Uhm, sorry for asking this again, but there's still a remaining concern. As the length is not actually set to the capacity at each iteration currently, calling buf.reserve(1) would be a no-op in my code. Do you think the length should be updated every time before calling reserve? Something like:

loop {
    let buf_read = try!(readlink(...));

    unsafe { buf.set_len(buf_read); } // This will be added

    if buf_read != buf.capacity() {
        unsafe { buf.set_len(buf_read); } // This should probably be removed
        buf.shrink_to_fit();

        return Ok(...);
    }

    buf.reserve(1);
}

That feels a bit odd to me because unlike read_to_end_uninitialized, we abandon the buffer immediately if readlink fails. But if you're OK with the code above, I'll follow that.

Plus, do you think getcwd might also have to be modified to use reserve(1)? This may have to be another PR, but if you like it, I can add another commit to address this.

Sorry for disturbing you, this shall be my last review request!

Member:

Yeah that sort of loop is fine to do, and it's fine by me if you want to share an implementation between getcwd and readlink. Thanks!

@barosl (Contributor Author) commented Aug 26, 2015

I addressed all of the comments and added another commit to modify getcwd. Please review it!

@@ -30,7 +30,7 @@ use sys::c;
 use sys::fd;
 use vec;

-const GETCWD_BUF_BYTES: usize = 2048;
+const GETCWD_BUF_BYTES: usize = 512;
Member:

It's OK to just go ahead and delete this constant and inline its value; I'm not sure it's buying us a whole lot up here.

@alexcrichton (Member):

Looks good to me, thanks @barosl! r=me after a few small nits and a squash

@barosl (Contributor Author) commented Aug 27, 2015

Done. Which commits do you think should be squashed? I have a feeling that the changes to libc should be kept separate, but what about the others? Should I combine the commit messages as well? For now, I rebased them onto the current HEAD.

@alexcrichton (Member):

Yeah, it's fine to leave libc separate, but feel free to squash the other four; so long as each commit message makes sense when taken in isolation, I'm fine with it. Thanks!

- Rewrite `std::sys::fs::readlink` not to rely on `PATH_MAX`

It currently has the following problems:

1. It uses `_PC_NAME_MAX` to query the maximum length of a file path in
the underlying system. However, the meaning of the constant is the
maximum length of *a path component*, not a full path. The correct
constant should be `_PC_PATH_MAX`.

2. `pathconf` *may* fail if the referenced file does not exist. This can
be problematic if the file which the symbolic link points to does not
exist but the link itself does. In this case, the current implementation
falls back to the hard-coded value of `1024`, which is not ideal.

3. There may exist platforms where there is no limit on file path
lengths at all. That's the reason why GNU Hurd doesn't define
`PATH_MAX`, in addition to having `pathconf` always return `-1`. On
these platforms, the content of the symbolic link can be silently
truncated if its length exceeds the hard-coded limit mentioned above.

4. The value obtained by `pathconf` may be outdated by the time
`readlink` is actually called. This is inherently racy.

This commit introduces a loop that gradually increases the length of the
buffer passed to `readlink`, eliminating the need for `pathconf`.
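A minimal POSIX-only sketch of such a loop (the function name, starting size, and FFI declaration are illustrative, not the code from this PR):

```rust
use std::ffi::CString;
use std::os::raw::c_char;

extern "C" {
    // POSIX readlink(2); declared here so the sketch is self-contained
    // (real code would use the libc crate's binding).
    fn readlink(path: *const c_char, buf: *mut c_char, bufsiz: usize) -> isize;
}

/// Grow-and-retry readlink. POSIX readlink truncates silently, so the only
/// reliable signal that the buffer was large enough is a return value
/// strictly smaller than the capacity we passed in.
fn read_link_bytes(path: &str) -> std::io::Result<Vec<u8>> {
    let c_path = CString::new(path).unwrap();
    let mut buf: Vec<u8> = Vec::with_capacity(256);
    loop {
        let n = unsafe {
            readlink(c_path.as_ptr(), buf.as_mut_ptr() as *mut c_char, buf.capacity())
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        let n = n as usize;
        if n < buf.capacity() {
            // The target fit with room to spare, so it was not truncated.
            unsafe { buf.set_len(n); }
            buf.shrink_to_fit();
            return Ok(buf);
        }
        // Possibly truncated: mark the capacity as used and let Vec's
        // amortized growth pick the next (roughly doubled) size.
        unsafe { buf.set_len(buf.capacity()); }
        buf.reserve(1);
    }
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("readlink_sketch");
    std::fs::create_dir_all(&dir)?;
    let link = dir.join("link");
    let _ = std::fs::remove_file(&link);
    std::os::unix::fs::symlink("/some/target/path", &link)?;
    assert_eq!(read_link_bytes(link.to_str().unwrap())?, b"/some/target/path");
    Ok(())
}
```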

- Remove the arbitrary memory limit of `std::sys::fs::realpath`

As per POSIX 2013, `realpath` will return a malloc'ed buffer if the
second argument is a null pointer.[1]

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/realpath.html
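A hedged FFI sketch of that behavior (hypothetical helper name; error handling simplified):

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_void};

extern "C" {
    // POSIX.1-2008 realpath(3): with a NULL second argument, libc mallocs a
    // buffer of the right size, removing any PATH_MAX assumption.
    fn realpath(path: *const c_char, resolved: *mut c_char) -> *mut c_char;
    fn free(ptr: *mut c_void);
}

fn canonicalize(path: &str) -> std::io::Result<String> {
    let c_path = CString::new(path).unwrap();
    unsafe {
        let ptr = realpath(c_path.as_ptr(), std::ptr::null_mut());
        if ptr.is_null() {
            return Err(std::io::Error::last_os_error());
        }
        let s = CStr::from_ptr(ptr).to_string_lossy().into_owned();
        free(ptr as *mut c_void); // the buffer was malloc'ed by libc
        Ok(s)
    }
}

fn main() {
    // "/" always resolves to itself.
    assert_eq!(canonicalize("/").unwrap(), "/");
}
```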

- Comment on functions that are still using `PATH_MAX`

There are some functions that only work in terms of `PATH_MAX`, such as
`F_GETPATH` in OS X. Comments are added to them for posterity.

- Make `std::sys::os::getcwd` call `Vec::reserve(1)` followed by
`Vec::set_len` to double the buffer. This is to align with other similar
functions, such as:

- `std::sys_common::io::read_to_end_uninitialized`
- `std::sys::fs::readlink`

Also, reduce the initial buffer size from 2048 to 512. The previous size
was introduced with 4bc26ce in 2013 and seems a bit excessive; it was
probably chosen because buffer doubling was not implemented back then.
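The resulting getcwd shape can be sketched like this (simplified: a real implementation would check that errno is ERANGE before retrying, and the helper name is illustrative):

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

extern "C" {
    // POSIX getcwd(3): returns NULL (with errno set to ERANGE) when the
    // supplied buffer is too small.
    fn getcwd(buf: *mut c_char, size: usize) -> *mut c_char;
}

fn current_dir_bytes() -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(512);
    loop {
        unsafe {
            if !getcwd(buf.as_mut_ptr() as *mut c_char, buf.capacity()).is_null() {
                // getcwd NUL-terminates on success; measure the string it wrote.
                let len = CStr::from_ptr(buf.as_ptr() as *const c_char).to_bytes().len();
                buf.set_len(len);
                buf.shrink_to_fit();
                return buf;
            }
            // Same doubling idiom as readlink: fill the capacity, then
            // reserve one more byte so Vec picks the next size.
            buf.set_len(buf.capacity());
            buf.reserve(1);
        }
    }
}

fn main() {
    let cwd = current_dir_bytes();
    // On Unix the working directory is always an absolute path.
    assert!(cwd.starts_with(b"/"));
}
```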
@barosl (Contributor Author) commented Aug 27, 2015

Done, squashed into three commits.

@alexcrichton (Member):

@bors: r+ 6065678

Thanks!

bors added a commit that referenced this pull request Aug 27, 2015
@bors (Contributor) commented Aug 27, 2015

⌛ Testing commit 6065678 with merge 79dd92f...

@bors bors merged commit 6065678 into rust-lang:master Aug 27, 2015
tbu- added a commit to tbu-/rust that referenced this pull request Aug 30, 2015
For Bitrig, NetBSD and OpenBSD the constant was incorrectly placed in posix01
when it actually belongs in posix08, so we move it. This is a
[breaking-change], but we already had one in rust-lang#27930.

Fix NetBSD's F_DUPFD_CLOEXEC constant.

For a similar feature detection, see this musl thread:
http://comments.gmane.org/gmane.linux.lib.musl.general/2963

This assumes that an int literal has type `c_int` for variadic functions.