Maybe stream should be broken with None after an error #222

rib · 2020-01-09T21:29:03Z

Within a Tonic server, handling a request where a client opens a 'keep alive' stream, I got bitten today by only checking for None to determine if a client had disappeared and the stream was closed.

I.e. I had a loop like this to read the stream:

tokio::spawn(async move {
    while let Some(_) = keep_alive_stream.next().await {
        // SNIP: ensure server keeps multi-user session state alive
    }
});

but after closing the client, it ended up being considered an error within Tonic that actually left my server code stuck in an infinite loop after that client dissappeared, since I had neglected to check the inner Result status returned by .next().await.

Arguably it's just my bug that I ignored the result status, but also perhaps arguably the stream has ended, and maybe Tonic could recognise this and limit how many errors may be reported before marking the stream end with None?

Just for reference, this is the error I saw earlier:
tonic::codec::decode: decoder inner stream error: Status { code: Unknown, message: "error reading a body from connection: protocol error: stream no longer needed" }
Though it wasn't entirely clear to me what the error really meant.

and the preceding h2 tracing made it seem to me like a fairly clean break by the client:

Jan 09 17:34:22.950 DEBUG h2::codec::framed_read: received; frame=Reset { stream_id: StreamId(5), error_code: CANCEL }
Jan 09 17:34:22.951 DEBUG h2::codec::framed_read: received; frame=Reset { stream_id: StreamId(3), error_code: CANCEL }
Jan 09 17:34:22.951 DEBUG hyper::proto::h2::server: stream error: operation was canceled
Jan 09 17:34:22.964 DEBUG hyper::proto::h2::server: stream error: operation was canceled

The fix for my server was simple enough, after realising the issue - but filing this in case it makes sense to tweak Tonic's error handling to ensure a disconnected client won't leave a stream in an endless error state.

The text was updated successfully, but these errors were encountered:

LucioFranco · 2020-01-11T19:46:16Z

So I think in this case you probably want to use TryStreamExt::try_next but there is also https://docs.rs/tonic/0.1.0-beta.1/tonic/struct.Streaming.html#method.message which returns a Result on the outter. I'm not sure if the behavior should be to end the stream on the first error or not? In theory, you should be checking the error and exiting on your own and if you poll again it should error. But not sure exactly what the right behavior should be.

rib · 2020-01-11T23:13:12Z

Ah, neat, yeah try_next is a nice option for moving the result to the outside.

Thanks for the pointer about message too and I'll see about using one or the other here.

Yeah, I'm not entirely sure about the ideal way to handle stream termination here, but my gut feeling is that if Tonic knows that a stream has effectively been terminated (even if it's due to some error condition) then it should probably report None within a finite amount of time.

If Tonic knows that an error is fatal with respect to a stream then I guess after reporting the error it could internally mark the stream as terminated and that status would have higher precedence than any persistent error condition, and None could be sent on the next read.

Rust's stream polling seems to be designed to allow for streams to contain multiple errors and in general errors within a stream don't have to imply that the stream has ended.

Earlier versions of Stream (https://docs.rs/futures/0.2.1/futures/stream/trait.Stream.html) which used to return a Result from poll_next were documented with this:

Streams, like futures, also bake in errors through an associated Error type. An error on a stream does not terminate the stream. That is, after one error is received, another value may be received from the same stream (it's valid to keep polling). Thus a stream is somewhat like an Iterator<Item = Result<T, E>>, and is always terminated by returning None.

Now stream polling is even more orthogonal from error reporting, but the above idea still seems to makes sense with the final Stream API too.

Considering that at the lowest level stream polling doesn't care about errors flowing through. then I think that gives me some expectation that None would be the most reliable, highest precedence, indicator of a stream ending and errors are per-stream implementation details. I would think about checking None before checking for errors because the former is a more fundamental state for the stream and errors are optional (in general for rust I mean - not in this specific case).

Semantically it seems to me like Tonic is not terminating the stream for some stream-fatal error conditions, and it's instead transitioning the stream in to a perpetual error stream. Maybe that's fine (it's simple enough to handle if this is known) but maybe it goes against the design spirit of rust streams a bit, to require errors to be interpreted as stream terminators?

Playing devils advocate really ;-p I haven't really looked at lots of apis to have a good sense of what the best choice is and I'm probably biased now anyway :)

LucioFranco · 2020-01-14T01:38:42Z

So I am going to remove this from the 0.1 milestone. One, thing to think about here in the future is that we can't just treat all connections as being http2 and based on hyper. Thus we need to respect the traits, the traits don't define that if you error its over so I'm not sure if changing that behavior should be done here or at a more foundational level. I'd like to continue thinking about this but we can push it to a future release.

This relates to hyperium/tonic#222

sfackler · 2021-09-17T19:57:32Z

While you can use try_* methods to short circuit, the current behavior of infinitely returning the same error is extremely footgun-y. Doing a for_each_concurrent to process the stream rather than a try_for_each_concurrent will cause you to peg a core at 100% if an error occurs.

LucioFranco · 2021-09-20T14:48:50Z

Once a stream hits an error it will report it once then return None I believe this should fix this issue? Here is the relevant commit 737ace3

sfackler · 2021-09-20T14:50:42Z

I think that fixes this issue when an outgoing stream of responses reports errors, but not when an incoming stream of requests reports errors.

LucioFranco · 2021-09-20T15:57:43Z

incoming stream of requests reports errors

I'm not totally following, you mean a client sending messages on a client side stream -> server? There shouldn't be any errors in that case iirc. Also Monday so may not be thinking correctly just yet :D

sfackler · 2021-09-20T16:43:32Z

client.rs:

pub mod echo {
    tonic::include_proto!("grpc.examples.echo");
}

use echo::{echo_client::EchoClient, EchoRequest};
use futures::stream;
use tonic::transport::Endpoint;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let channel = Endpoint::from_static("http://[::1]:50051")
        .connect()
        .await?;

    let mut echo_client = EchoClient::new(channel);

    echo_client
        .client_streaming_echo(stream::repeat(EchoRequest {
            message: "hello".into(),
        }))
        .await;

    Ok(())
}

server.rs:

use futures::{Stream, StreamExt};
use std::pin::Pin;
use tonic::{transport::Server, Request, Response, Status};

pub mod echo {
    tonic::include_proto!("grpc.examples.echo");
}

use echo::{
    echo_server::{Echo, EchoServer},
    EchoRequest, EchoResponse,
};

type ResponseStream = Pin<Box<dyn Stream<Item = Result<EchoResponse, Status>> + Send + Sync>>;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let addr = "[::1]:50051".parse().unwrap();

    let echo = EchoServer::new(MyEcho::default());

    Server::builder().add_service(echo).serve(addr).await?;

    Ok(())
}

#[derive(Default)]
pub struct MyEcho;

#[tonic::async_trait]
impl Echo for MyEcho {
    async fn unary_echo(&self, _: Request<EchoRequest>) -> Result<Response<EchoResponse>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }

    type ServerStreamingEchoStream = ResponseStream;

    async fn server_streaming_echo(
        &self,
        _: Request<EchoRequest>,
    ) -> Result<Response<Self::ServerStreamingEchoStream>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }

    async fn client_streaming_echo(
        &self,
        request: Request<tonic::Streaming<EchoRequest>>,
    ) -> Result<Response<EchoResponse>, Status> {
        let mut stream = request.into_inner();
        while let Some(r) = stream.next().await {
            println!("{:#?}", r);
        }

        Ok(Response::new(EchoResponse {
            message: "hi".to_string(),
        }))
    }

    type BidirectionalStreamingEchoStream = ResponseStream;

    async fn bidirectional_streaming_echo(
        &self,
        _: Request<tonic::Streaming<EchoRequest>>,
    ) -> Result<Response<Self::BidirectionalStreamingEchoStream>, Status> {
        Err(Status::unimplemented("Not yet implemented"))
    }
}

Run both, then kill the client. The server will infinitely print

Err(
    Status {
        code: Unknown,
        message: "error reading a body from connection: broken pipe",
        source: None,
    },
)

LucioFranco · 2021-10-06T21:07:24Z

Btw I am still looking into this but was originally having trouble reproducing this in a test. So will continue to look into this.

LucioFranco · 2022-02-22T20:14:47Z

Ok here is an attempted fix that should in general just return None when we've seen an error from poll_data. So it should go Some(Err) thenNone.

LucioFranco modified the milestone: 0.1 Jan 12, 2020

nmdanny added a commit to nmdanny/DistributedProject that referenced this issue Mar 8, 2021

Fix tokio runtime starvation+log spamming due to TCP broken pipe errors

ed42ca1

This relates to hyperium/tonic#222

dzfranklin mentioned this issue Mar 27, 2021

Suggestion: Allow users to handle transport errors differently from app errors #587

Closed

LucioFranco added this to the 0.6 milestone Oct 6, 2021

LucioFranco added A-tonic C-bug Category: Something isn't working labels Oct 13, 2021

LucioFranco modified the milestones: 0.6, 0.7 Oct 20, 2021

LucioFranco mentioned this issue Feb 22, 2022

fix(codec): Return None after poll_data error #921

Merged

LucioFranco closed this as completed Feb 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Maybe stream should be broken with None after an error #222

Maybe stream should be broken with None after an error #222

rib commented Jan 9, 2020

LucioFranco commented Jan 11, 2020

rib commented Jan 11, 2020

LucioFranco commented Jan 14, 2020

sfackler commented Sep 17, 2021

LucioFranco commented Sep 20, 2021

sfackler commented Sep 20, 2021

LucioFranco commented Sep 20, 2021 •

edited

Loading

sfackler commented Sep 20, 2021

LucioFranco commented Oct 6, 2021

LucioFranco commented Feb 22, 2022

Maybe stream should be broken with None after an error #222

Maybe stream should be broken with None after an error #222

Comments

rib commented Jan 9, 2020

LucioFranco commented Jan 11, 2020

rib commented Jan 11, 2020

LucioFranco commented Jan 14, 2020

sfackler commented Sep 17, 2021

LucioFranco commented Sep 20, 2021

sfackler commented Sep 20, 2021

LucioFranco commented Sep 20, 2021 • edited Loading

sfackler commented Sep 20, 2021

LucioFranco commented Oct 6, 2021

LucioFranco commented Feb 22, 2022

LucioFranco commented Sep 20, 2021 •

edited

Loading