Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose raw std{in,out,err} #148

Open
SUPERCILEX opened this issue Dec 16, 2022 · 25 comments
Open

Expose raw std{in,out,err} #148

SUPERCILEX opened this issue Dec 16, 2022 · 25 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@SUPERCILEX
Copy link

Proposal

Problem statement/motivation

It is well known that people want more control over stdout and friends (1 2 3 4 5 6 etc etc). Some want to avoid locking, others want to disable line buffer, others still want to change the buffer sizes, and so on.

Solution sketches

The current std* fns are fundamentally incompatible with the flexibility desired above. For example, you cannot both use buffering and not require locks. Thus, I believe the most flexible and simple solution is to simply expose the raw std* Read/Write impls. Users can then:

  • Add locking around their own data structure if they need it.
  • Add buffering (of any kind) around their own data structure if they need it.
  • Compose different variants of the std* fns as needed (maybe some bit of code wants buffering, but other parts don't)

In the future if we allow overriding the println and co streams, this idea will be complementary since you could build up the desired wrappers around stdout and replace the default stream with the one you just built.

Non-goals: this proposal does not suggest changing the current std* impls. rust-lang/rust#60673 is therefore not addressed by this proposal (but of course you can build your own version and use it in writeln).

Downsides: interaction between raw std* and normal std* may be a little weird. I view this as a non-issue in the same sense that files take an "it's your problem, figure it out" attitude. For example, I can open the same file path twice and it'll be my problem if my writes smash each other. Similarly, it's on the user to decide how they want to handle interleavings (and they might not even care b/c they know their program is single-threaded).

API

Expose the existing Std{in,out,err}Raw (and constructor functions) in std::io.

Links and related work

See links above.

@SUPERCILEX SUPERCILEX added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Dec 16, 2022
@thomcc
Copy link
Member

thomcc commented Dec 17, 2022

How should this work on Windows? We currently convert from utf8 to utf16 when you write to the console.

@SUPERCILEX
Copy link
Author

Any reason why that's bad? I don't see raw as meaning raw bytes stream, but rather no locking or buffering. If you want a raw byte stream, then you can grab the FD and convert it to a File. As a sidenote, I considered proposing implementing AsFd on Stdout (instead of the current StdoutLock) so you could manually convert it to a File, but that API ends up being really annoying to use since you need to use unsafe and wrap the file in a ManuallyDrop. Hence proposing letting you access raw stdout, but not the raw byte stream.

@thomcc
Copy link
Member

thomcc commented Dec 17, 2022

If we don't buffer at all, it becomes difficult to ensure the conversion succeeds, we'd have to panic if your write is a truncated UTF-8 sequence, even if you're later going to write the rest of the sequence. For example, "😀" is the bytes [240, 159, 152, 128].

But if you want to write [240, 159, 152] and then [128], that succeeds on Unix, but on Windows we'd need to either buffer or panic (and if we buffer, we pretty much have to lock).

@SUPERCILEX
Copy link
Author

Interesting, I poked around and it turns out windows is double buffered: https://github.com/rust-lang/rust/blob/master/library/std/src/sys/windows/stdio.rs.

I don't know how to word this properly then, but the proposal is to do the minimal amount of processing to still be considered a stdio stream as opposed to a byte stream. So windows would continue to be buffered for UTF-8 writes to work.

@the8472
Copy link
Member

the8472 commented Dec 17, 2022

None of the linked issues make a good argument for removing the mutex. The first one says that they're overkill for single-threaded programs. But on single-threaded programs a mutex lock-unlock is 1 uncontended atomic CAS and 1 uncontended atomic write, that's very cheap compared to the IO syscalls.

The rest should be covered by making buffering switchable between line, block and no buffering. rust-lang/rust#78515 (comment) outlines the libs team's preferred path forward on that.

@SUPERCILEX
Copy link
Author

Sure, but I can easily see someone wanting to use the kernel's locking: just print stuff to stdout and as long as it's a single write under the pipe buffer size, you'll get atomic writes (on Linux at least IIRC).

Also to be clear, I'm not suggesting removing the lock on Stdout, I'm saying that Stdout won't ever be able to satisfy everyone because it's a tailored use case and therefore we should offer as unfettered access as possible while acting as stdout.

@ChrisDenton
Copy link
Member

I don't think std will ever be able to satisfy everyone. Nor do I think it has to. And if it's independent of the normal std I/O then I wonder why it needs to be in std rather than a crate?

@m-ou-se
Copy link
Member

m-ou-se commented Jul 25, 2023

should offer as unfettered access as possible while acting as stdout

Note that StdoutRaw on windows isn't as 'unfettered' / 'raw' as one might expect. It still checks if the output is a console or not, and if it is, has a bunch of logic for using WriteConsoleW after converting the [u8] stream to UTF-16.

@SUPERCILEX
Copy link
Author

Right, hence why I'd like to have this be in the stdlib. A crate would have to replicate all the craziness in https://github.com/rust-lang/rust/blob/8327047b23dc1eb150fdfb42177939388661eb6d/library/std/src/sys/windows/stdio.rs. I don't know much about windows, but I presume the stdlib is doing that for a reason and a crate would have to mirror the stdlib implementation.

@the8472
Copy link
Member

the8472 commented Jul 25, 2023

Can you describe use-cases, how the API should look like? How it should behave under concurrent access?
We debated some sort of take() method that would extract the handle/descriptor and leave the equivalent of /dev/null in its place, but we don't even know if that would solve your use-cases.

Generally it is unclear what a reasonable "raw access" API would look like in a portable way or if non-portable solutions making more platform-specific guarantees would be preferred or ...

@ChrisDenton
Copy link
Member

As far as I understand this proposal, it's just like io::Stdout without the lock, right? It's not doing anything special.

@SUPERCILEX
Copy link
Author

how the API should look like?

My proposal is super simple: make these APIs public.

How it should behave under concurrent access?

Do nothing, let the OS and the user deal with it.

We debated some sort of take() method that would extract the handle/descriptor and leave the equivalent of /dev/null in its place, but we don't even know if that would solve your use-cases.

I'm not quite sure why the discussion is trying to prevent you from shooting yourself in the foot. Ignoring std{out,in,err} for a moment, I can very easily open two files to "foo.txt" and wreak havoc. That's a problem between me and the operating system. Getting back to std*, I could use OS APIs to dup those file descriptors and then mess with them outside the rust controlled environment, so exposing raw APIs doesn't change std* safety.

your use-cases

My use case is pretty simple: I have a single-threaded app that I want to be as small and efficient as possible. Currently, I have something like this that I call whenever I need to writeln!, but this is pretty annoying since the types differ.

    #[cfg(unix)]
    {
        use std::{fs::File, mem::ManuallyDrop, os::unix::io::FromRawFd};

        let mut stdin = ManuallyDrop::new(unsafe { File::from_raw_fd(0) });
        let mut stdout = ManuallyDrop::new(unsafe { File::from_raw_fd(1) });
        (stdin, stdout)
    }
    #[cfg(not(unix))]
    {
        use std::io;

        let mut stdin = io::stdin().lock();
        let mut stdout = io::stdout().lock();
        (stdin, stdout)
    }

@the8472
Copy link
Member

the8472 commented Jul 25, 2023

I'm not quite sure why the discussion is trying to prevent you from shooting yourself in the foot.

Because the existing stdout types either guarantee synchronized access (Stdout) or exclusive access (StdoutLock). Adding this API without any mitigations would break the guarantees of the other types.

And then there's the windows console issue which makes the raw type thread-unsafe.

So to uphold the guarantees of the existing types we need to do something else. Or we need to make them unsafe.

   let mut stdin = io::stdin().lock();
   let mut stdout = io::stdout().lock();

Ok and why not do that on unix too?

@SUPERCILEX
Copy link
Author

Because the existing stdout types either guarantee synchronized access (Stdout) or exclusive access (StdoutLock).

I'm confused, how can this possibly be a guarantee when I can dup the stdout descriptor?

And then there's the windows console issue which makes the raw type thread-unsafe.

As in at the OS level or in the type? If it's in the type, shouldn't we be able to make it not send?

Ok and why not do that on unix too?

If I'm calling this as a helper method in a loop, now I have to think about the locking cost and try to write my code so the streams get passed in from the top level to avoid the loop. So maybe this is a bit OCD, but with raw APIs I wouldn't have to think about this since there's no cost to creating them (on unix).

@the8472
Copy link
Member

the8472 commented Jul 25, 2023

I'm confused, how can this possibly be a guarantee when I can dup the stdout descriptor?

That is a good question. It used to require unsafe to do that through the RawFd API. With the BorrowedFd API that changed but maybe the implications of the change weren't considered here? Or maybe the wording of the guarantees should have been relaxed?

CC @sunfishcode

As in at the OS level or in the type? If it's in the type, shouldn't we be able to make it not send?

The type. And sure, we can do that, but that's exactly the kind of question that should be considered when proposing an API!

If I'm calling this as a helper method in a loop, now I have to think about the locking cost

As I said earlier, uncontended locks are extremely cheap. Much cheaper than the IO calls. And if you're fighting for every single instruction then hoisting it out of the innermost loop should be enough, that's what the .lock() method is intended for.

@ChrisDenton
Copy link
Member

I think the problem with proposals that are essentially just "make internal type public" is that the internal type was not designed to be public.

I think in cases like these it's better to treat it as if it's a completely new API and write out exactly the API surface. I don't think pointing to the current unix implementation is helpful (except as prior art) as it leaves a lot for people to guess at or speculate about.

@sunfishcode
Copy link
Member

@the8472 As I understand it, the justification for the mutex is to protect the buffer. It's plain memory safety. I guess the rationale for StdoutLock/StdinLock existing is that, well, we already have a mutex, so why not give users a little more control over it, in a way that's really useful in common cases?

There's an implied rule that libraries shall never touch std::io::Std* or println etc. unless they explicitly document that they do. No amount of mutexing can make that robust. Libraries by their nature aren't aware of what main is doing with its stdin/stdout. So StdinLock/StdoutLock aren't a Rust-wide guarantee of the absence of a particular kind of bug. They're just really useful for avoiding bugs in practice.

Perhaps a hypothetical new raw stdio API could also look to this rule, and just carefully document the hazards, which already exist in Rust.


Alternatively, I don't know how big of a scope Rust would be willing to consider here, but if you're talking completely new APIs...

use std::io::{StreamReader, StreamWriter};

fn main(stdin: StreamReader, stdout: StreamWriter) {
      // Wrap `stdin`/`stdout` in a `BufReader`/`BufWriter` if desired. And/or a `Mutex` if desired.
}

No implied buffering or mutexes (except the minimum needed for UTF-8 I/O on Windows, following what's discussed above), and the implied rule becomes an explicit rule, only except that std::io::Std* etc. would still exist. Rust has changed the signature of main before, allowing it to return Result. This is not a complete proposal, but I thought I'd sketch the idea in case there's interest.

@the8472
Copy link
Member

the8472 commented Jul 27, 2023

As I understand it, the justification for the mutex is to protect the buffer. It's plain memory safety.

That might be the original motivation. But the documentation promises locking. People may rely on that for things.

There's an implied rule that libraries shall never touch std::io::Std* or println etc. unless they explicitly document that they do. No amount of mutexing can make that robust.

It can still prevent interleaving of outputs! E.g. if you do json-lines based logging it's fine to have some non-json junk on some lines. But each line, no matter how long, shouldn't be torn.

I'll open a rust issue for this.

Edit: rust-lang/rust#114140

@sunfishcode
Copy link
Member

As I understand it, the justification for the mutex is to protect the buffer. It's plain memory safety.

That might be the original motivation. But the documentation promises locking. People may rely on that for things.

To clarify, I just meant that memory safety is the motivation for having a lock in the first place. Then, yes, given that it has a lock, the standard library gives users a little more control over it, because it's really useful to do so in practice.

There's an implied rule that libraries shall never touch std::io::Std* or println etc. unless they explicitly document that they do. No amount of mutexing can make that robust.

It can still prevent interleaving of outputs! E.g. if you do json-lines based logging it's fine to have some non-json junk on some lines. But each line, no matter how long, shouldn't be torn.

I'll open a rust issue for this.

Edit: rust-lang/rust#114140

What I meant there was that unless it makes special arrangements, a library has no idea if the main program is using its stdout for json-lines or jpeg or javascript or janky joke juxtapositions, or anything else, and acquiring a mutex doesn't help with that.

@the8472
Copy link
Member

the8472 commented Jul 27, 2023

Acquiring a mutex does protect the line from being torn. It doesn't protect what happens before/after that... but if you care about that you could hold the mutex for the whole program duration, then no other thread should be able to print. And AsFd bypasses that guarantee.

@sunfishcode
Copy link
Member

I'm specifically talking about the relationship between libraries and the main function.

A library has no idea if the stdout output is even line-oriented. It could be a jpeg, in which case any bytes written by a library could corrupt it, and the mutex isn't a solution to that problem.

@the8472
Copy link
Member

the8472 commented Jul 27, 2023

and the mutex isn't a solution to that problem.

But it is? Main can lock it and then exit(0) or leak it at the end of the scope. Then the library cannot acquire it and thus will not print a single byte for the runtime of the program.

@sunfishcode
Copy link
Member

Stdout's lock is reentrant, so it wouldn't stop library code running on the same thread from corrupting stdout. It would stop library code on other threads from corrupting stdout, but it would do so by stopping them from ever doing anything else ever.

Also, "I can acquire StdoutLock" does not imply "I know what format the stdout data is being written in, and I can add to it".

@the8472
Copy link
Member

the8472 commented Jul 27, 2023

I think this is a bug. But we should continue the discussion on the new issue.

@the8472
Copy link
Member

the8472 commented Aug 2, 2023

I think StdOut::replace(&mut self, with: Into<OwnedFd>) -> Result<OwnedFd, io::Error> (or OwnedHandle on windows) should avoid most of the questions.

Internally it would do a dup/dup2 dance to swap out the descriptors and the returned one would have a new value.

It doesn't guarantee exclusive access if someone used try_clone before, but at least it retains exclusive access going forward if nothing else did it previously. And it doesn't interfere with locking once it has been taken.

A use could then be let output: File = stdout.replace(OpenOptions::new().write(true).open("/dev/null")).unwrap().into().

For cross-platform convenience we could also add a take() method that uses the appropriate null device.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

6 participants