Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content-encoding aware string parsing & encoding #34

Open
yoshuawuyts opened this issue Dec 12, 2019 · 6 comments
Open

content-encoding aware string parsing & encoding #34

yoshuawuyts opened this issue Dec 12, 2019 · 6 comments

Comments

@yoshuawuyts
Copy link
Member

Similar to http-rs/surf#101 and http-rs/surf#108 we should make string parsing and encoding aware of the Content-Type header.

cc/ @goto-bus-stop; pinging you here since you contributed the patch to Surf; we're planning to move Tide and Surf both over to http-types, so figured it'd make sense to include the work you've done here.

@rylev
Copy link
Collaborator

rylev commented Dec 19, 2019

Talked with @yoshuawuyts, and we believe the point of integration is overwriting to the read_to_string method for AsynRead on Request and Response. Instead of returning an error if the underlying bytes are not utf-8, we'll make a best attempt to convert to utf-8 based on the content-type header

@goto-bus-stop
Copy link
Member

goto-bus-stop commented Feb 26, 2020

I think this should be a property of the Body now right? like res.into_body().read_to_string(&mut s)

i'm tinkering with making a encoding_rs_io but for AsyncRead, so you can do res.into_body().into_utf8_reader() to stream utf-8 transcoded bytes, or into_utf8_reader().read_to_string() to collect it all into a big string.

on the http-types end, the Body needs to be made aware of the actual mime type, since right now it appears to always be octet-stream?

@yoshuawuyts
Copy link
Member Author

@goto-bus-stop What I'm wondering about is: are there any cases where we wouldn't want to decode based on the encoding type?

I suspect the answer here might be "probably not", which is: maybe we should handle this out of the box? So the proposed API would change:

Previous proposal

let req = Request::new(Url::new("https://example.com"));
let reader = req.into_utf8_reader();
let mut s = String::new();
reader.read_to_string(&mut s).await?;

Current proposal

let req = Request::new(Url::new("https://example.com"));
let mut s = String::new();
req.read_to_string(&mut s).await?;

The biggest question remaining then seems: do we need an escape hatch? I'm not sure tbh. Perhaps there could be a flag on the Body that disables the decoder?

let req = Request::new(Url::new("https://example.com"));

let mut body = req.into_body();
body.decoder(false);

let mut s = String::new();
body.read_to_string(&mut s).await?;

@yoshuawuyts
Copy link
Member Author

on the http-types end, the Body needs to be made aware of the actual mime type, since right now it appears to always be octet-stream?

When encoding a stream we indeed don't assume anything, and set octet-stream. But for strings we correctly set text/plain; utf-8, and JSON similarly application/json; utf-8.

When decoding a stream we don't do anything with text encodings. Most of my previous comment was about decoding only. Is there anything about encoding you think we should be doing?

@goto-bus-stop
Copy link
Member

goto-bus-stop commented Mar 2, 2020

for encoding we probably don't need non-utf8 support (certainly not initially). it would be kinda strange in 2020 to write a server in a cutting-edge language and framework if you need to service clients that don't support utf8 (do they exist?).

i was primarily wondering about decoding as well, since the Body does not know what its encoding is when it receives a response (assuming http-types use on the client-side). I think the Response type could apply the mime type to the Body from inside the set_header function but i'm not totally sure if that's correct.

I don't think we need a custom escape hatch since you can already get the raw bytes. If you really want to use utf8 for some reason even if the server explicitly advertises a different encoding, you can do String::from_utf8(body.read_to_end()), i guess.

@yoshuawuyts
Copy link
Member Author

I think the Response type could apply the mime type to the Body from inside the set_header function but i'm not totally sure if that's correct.

I think we could indeed use set_header for this. There's a slight nuance because Request is shared between both client and server, and we only want to use encodings on the client. But that's perhaps a matter of exposing a flag we can enable/disable on one side, but not the other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants