Use appropriate charset in body_string()
#108
Oh I love this a lot. Thanks so much! 💯
Found a website that doesn't use UTF-8 in the year of our lord: https://shop-pro.jp/
Unfortunately, it still doesn't work with this PR, because it has a bunch of bytes that probably are UTF-8 or some other encoding…
Another fallback strategy that folks could use would be to use encoding_rs to just decode it with the detected encoding anyways and accept that there will be Unicode replacement characters (U+FFFD) in the output, like Firefox does:
I've had some trouble finding good examples… a lot of sites that use non-UTF-8 encodings just don't add a
For comparison, the cURL CLI doesn't handle encoding at all, so both of the above sites "work" but output garbage if you're using a terminal emulator that uses UTF-8.
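To make the "decode anyway and accept replacement characters" fallback concrete, here is a minimal sketch using only the standard library. It covers the UTF-8 case only; for other encodings a real decoder such as encoding_rs would be needed. The helper name `decode_lossy` is made up for this illustration and is not part of Surf.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode `bytes` as UTF-8 without ever failing.
/// Invalid sequences become U+FFFD; the flag reports whether that happened.
fn decode_lossy(bytes: &[u8]) -> (Cow<'_, str>, bool) {
    // Strict validation tells us whether any replacement will be needed.
    let had_errors = std::str::from_utf8(bytes).is_err();
    // Lossy decoding borrows valid input and allocates only when it must patch it.
    (String::from_utf8_lossy(bytes), had_errors)
}
```

A caller that prefers garbled-but-readable output over an error can ignore the flag; a caller that wants strictness can treat `true` as a failure.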
I'm not sure how to get the
anyhow is not the right abstraction for libraries, thiserror / snafu are.
So thinking about the above I think this is probably a good start as is:
@CryZe it's the right fit here though, as middleware can return arbitrary errors which are up to the user, and we don't want to enforce a generic param to capture any specific error.
@goto-bus-stop the summary you provided makes sense! Thanks for checking this against real-world examples; agree with utf-8 being a sensible default. Thank you!
This is an initial attempt at addressing #101. Opening for comments—feel free to critique the direction etc and tell me to throw it all out if it's not what you're looking for! Leaving a few checkboxes…
A new default feature "encoding" opts in to using the appropriate character encoding.
encoding_rs is used. If you know you're only talking to an API that always uses utf-8, you can turn off this feature and save the binary size that is taken up by the encoding tables.
TextDecoder API. AFAIK this always makes a copy, so you can turn it off if you don't want to copy big strings between JS and wasm.
String::from_utf8() directly if the charset is utf-8, and only use TextDecoder for other less common encodings.
Turning off the "encoding" feature reverts to the current behaviour, which is assuming utf-8.
This patch also assumes utf-8 if no charset is specified in the response headers.
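The charset selection described above (read the charset from the response headers, default to utf-8 when none is given, take the String::from_utf8() fast path for utf-8) could look roughly like the sketch below. It is not the code in this patch: the function names, the `BodyError` type, and the hand-written ISO-8859-1 branch are all assumptions for illustration, with the ISO-8859-1 branch standing in for a real decoder such as encoding_rs or TextDecoder.

```rust
/// Hypothetical helper: pull the `charset` parameter out of a Content-Type
/// value such as "text/html; charset=Shift_JIS", lowercased.
fn charset_from_content_type(value: &str) -> Option<String> {
    value
        .split(';')
        .skip(1) // the first segment is the media type itself
        .filter_map(|param| {
            let (name, val) = param.split_once('=')?;
            if name.trim().eq_ignore_ascii_case("charset") {
                Some(val.trim().trim_matches('"').to_ascii_lowercase())
            } else {
                None
            }
        })
        .next()
}

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum BodyError {
    InvalidUtf8,
    UnsupportedCharset(String),
}

/// Decode a response body, assuming utf-8 when no charset is specified.
fn body_string(bytes: Vec<u8>, content_type: Option<&str>) -> Result<String, BodyError> {
    let charset = content_type
        .and_then(charset_from_content_type)
        .unwrap_or_else(|| "utf-8".to_string());
    match charset.as_str() {
        // Fast path: validate in place and reuse the buffer, no copy.
        "utf-8" | "utf8" => String::from_utf8(bytes).map_err(|_| BodyError::InvalidUtf8),
        // Stand-in for a real decoder: strict ISO-8859-1 maps each byte
        // to the code point with the same value.
        "iso-8859-1" => Ok(bytes.into_iter().map(char::from).collect()),
        other => Err(BodyError::UnsupportedCharset(other.to_string())),
    }
}
```

With the "encoding" feature turned off, only the utf-8 arm would remain, which matches the current behaviour.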
This patch introduces an error type that's quite unlike anything Surf has right now, so I'm not sure if that direction fits in with the design. I also wrapped the new error type in an io::Error of kind io::ErrorKind::InvalidData everywhere, which is mostly cargo culting; I don't think that's actually necessary? I carried it over from the old body_string() implementation, where it did make sense.
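For readers unfamiliar with the wrapping pattern mentioned above, this is a rough sketch of putting a custom error inside an io::Error with kind InvalidData and getting it back out. The `DecodeError` struct, its fields, and the two helper functions are assumptions made for this example; the actual type in the patch may look different.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type: which encoding failed, and the raw bytes so the
/// caller can retry with another strategy.
#[derive(Debug)]
struct DecodeError {
    encoding: String,
    data: Vec<u8>,
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "could not decode body as {}", self.encoding)
    }
}

impl Error for DecodeError {}

/// Wrap the custom error in an io::Error of kind InvalidData.
fn wrap(err: DecodeError) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, err)
}

/// Recover the custom error from the io::Error, if that is what it holds.
fn unwrap_decode_error(err: &io::Error) -> Option<&DecodeError> {
    err.get_ref()?.downcast_ref::<DecodeError>()
}
```

The cost of the wrapper is that callers need the downcast step to reach the original bytes, which is one reason to question whether the wrapping is needed at all.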