IRI output #323

ghost · 2023-08-20T14:04:40Z

If I run this command, I get encoded output:

> xh -h https://wiktionary.org/wiki/Ῥόδος
HTTP/2.0 301 Moved Permanently
location: "https://www.wiktionary.org/wiki/\xe1\xbf\xac\xcf\x8c\xce\xb4\xce\xbf\xcf\x82"

but with this Go code:

package main

import (
   "net/http"
   "os"
)

func main() {
   req, _ := http.NewRequest("HEAD", os.Args[1], nil)
   res, _ := new(http.Transport).RoundTrip(req)
   println(res.Header.Get("location"))
}

I get expected output:

> go run http.go https://wiktionary.org/wiki/Ῥόδος
https://www.wiktionary.org/wiki/Ῥόδος

https://wikipedia.org/wiki/Internationalized_Resource_Identifier

The text was updated successfully, but these errors were encountered:

blyxxyz · 2023-08-20T18:50:29Z

This one's tricky. Historically header values were defined to be ISO-8859-1. But the latest recommendation is this:

A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

And that's how the library we use handles it. It only allows string conversion if it's pure-ASCII.
And otherwise we print it in the form you see there.

HTTPie decodes it as ISO-8859-1, which is wrong in this case:

$ http -h https://wiktionary.org/wiki/Ῥόδος
HTTP/1.1 301 Moved Permanently
[...]
location: https://www.wiktionary.org/wiki/á¿¬ÏÎ´Î¿Ï

And that's also how the dev tools of Firefox and Chromium handle it:

So I see a few options:

Change nothing, pedantically follow the RFC.
Try to decode as UTF-8 (with std::str::from_utf8(value.as_bytes())). Other bytes are somewhat unlikely to be valid UTF-8, so this won't have that many false negatives, hopefully.
Try to decode as ISO-8859-1, to match the browser tools and HTTPie.

I'm torn between all three, I don't know what's best.

(The current behavior deserves a mention on #322.)

ghost · 2023-08-20T19:06:26Z

since a header line is expected to end in a newline, to me that puts header values in the category of "text data", rather than "binary data". I think it would ultimately be a net positive to at least try to decode the values as UTF-8. if I visit the original link in a browser:

https://wiktionary.org/wiki/Ῥόδος

the values are not escaped in the address bar. also especially if Greek was someones native language, they would probably prefer to hand write the URL without encoding, and view it that way as well. so ideally to me this behavior should persist throughout the stack. however as was keenly pointed out, its not required by the spec, so I will respect if the decision is "do nothing".

blyxxyz · 2023-08-20T22:24:45Z

We know it's text, but we don't know the text's encoding (maybe ISO-8859-1, maybe UTF-8, maybe even some other ASCII superset), and it can be dangerous to guess.

It kind of pains me but I now think that decoding as ISO-8859-1 is the way to go. If you send an HTTP request from a browser then this is the result:

> let r = await fetch('https://wiktionary.org/wiki/Ῥόδος', { redirect:'manual' })
> r.headers.get('location')
'https://www.wiktionary.org/wiki/á¿¬Ï\x8CÎ´Î¿Ï\x82'

At my day job I use xh to test APIs for use in the browser. It can be a big headache if something looks fine on one side and falls apart on the other. It's best not to sweep the brokenness under the rug.

It would still be nice to somehow separately display the common-sense URL that the redirect points to. But I don't want to mislead people into thinking they can reliably extract it from the header.

ghost · 2023-08-20T22:41:51Z

ISO-8859-1 is hands down the worst option of the three. the original representation is lost, and the hex values aren't returned either, making that part of the URL essentially meaningless garbage to a human, and only useful to a computer. its a lose lose situation. please reconsider that opinion.

blyxxyz · 2023-08-20T22:54:36Z

The one benefit is that you know what to expect if you try to handle the header programmatically. IMO that's very important, maybe important enough to outweigh everything else.

But the current output also hints that something is wrong, so maybe we don't need to go that far.

ghost · 2023-08-20T23:13:19Z

not sure if it helps, but I also have my own encoding that might work here:

https://github.com/1268/strconv/blob/main/encode.go

its similar to percent-encoding, but will only encode instances of the escape character (%) or "binary data" defined (by me) as control characters, excluding whitespace. it has the nice property that it can be decoded by a normal percent-encoding function.

blyxxyz · 2024-06-21T21:49:44Z

Maybe we could show both versions if it successfully decodes as UTF-8? Something like

location: "https://www.wiktionary.org/wiki/\xe1\xbf\xac\xcf\x8c\xce\xb4\xce\xbf\xcf\x82" (https://www.wiktionary.org/wiki/Ῥόδος)

or

location: https://www.wiktionary.org/wiki/á¿¬ÏÎ´Î¿Ï (https://www.wiktionary.org/wiki/Ῥόδος)

or

location (ISO 8859-1): https://www.wiktionary.org/wiki/á¿¬ÏÎ´Î¿Ï
location (UTF-8): https://www.wiktionary.org/wiki/Ῥόδος

Possibly with a warning to explain the issue a little more?

ducaale · 2024-06-21T22:45:28Z

The third option seems the least ambiguous for users.

Possibly with a warning to explain the issue a little more?

If our warning included the header value in both bytes and ISO-8859-1, I wonder if we could then have only the UTF-8 decoded value in the main output. What do you think?

Interestingly enough, cURL decodes the header value as UTF-8:

$ curl -v https://wiktionary.org/wiki/Ῥόδος
...
< HTTP/2 301
< date: Fri, 21 Jun 2024 21:53:52 GMT
< server: mw-web.eqiad.main-65687cfdb7-lhsw8
< location: https://www.wiktionary.org/wiki/Ῥόδος
< expires: Sun, 21 Jul 2024 21:53:52 GMT
< content-length: 251
< content-type: text/html; charset=iso-8859-1
< vary: X-Forwarded-Proto
< age: 39
< x-cache: cp3071 miss, cp3071 hit/3
< x-cache-status: hit-front
...
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://www.wiktionary.org/wiki/Ῥόδος">here</a>.</p>
</body></html>
* Connection #0 to host wiktionary.org left intact

blyxxyz · 2024-06-21T23:06:12Z

I feel like a lot of people would ignore or not notice the warning if they see that the main output looks good. So I'd like to make the mangled/undecoded version as prominent or more prominent than the fixed one.

One scenario I'm thinking of is:

User gets mangled value in browser
User tries to debug with xh
xh shows that the browser is not wrong
- and hopefully helps the user understand the fundamental problem

cURL decodes the header value as UTF-8

I believe it doesn't decode it at all and just sends the bytes straight to your terminal, which then treats them like UTF-8.

Notice also that the charset is declared as iso-8859-1 but you get Greek characters in the output anyway. If it decoded it you'd get a mangled web page body as well, like xh gives (and a web browser would show):

$ xh https://wiktionary.org/wiki/Ῥόδος
HTTP/2.0 301 Moved Permanently
[...]
<p>The document has moved <a href="https://www.wiktionary.org/wiki/á¿¬ÏŒÎ´Î¿Ï‚">here</a>.</p>

unless you redirect the output, in which case we also pass through the bytes as is:

$ xh https://wiktionary.org/wiki/Ῥόδος | cat
xh: warning: HTTP 301 Moved Permanently
[...]
<p>The document has moved <a href="https://www.wiktionary.org/wiki/Ῥόδος">here</a>.</p>

blyxxyz · 2024-06-24T20:59:49Z

I'm now thinking

location: https://www.wiktionary.org/wiki/á¿¬ÏÎ´Î¿Ï (UTF-8: https://www.wiktionary.org/wiki/Ῥόδος)

with the mangled version colored red. For people with colors enabled (almost everyone?) that'd hopefully make it clear both that something known to be undesirable is going on and that the parenthetical is not part of the header value.

We could even turn the text "UTF-8" into a hyperlink to some page that explains this issue. Not all terminals support that but it's not as spammy as a warning. (Cargo has also been using hyperlinks for a while now with supports-hyperlinks.)

ducaale mentioned this issue Aug 27, 2023

Differences from HTTPie #322

Open

This comment was marked as off-topic.

Sign in to view

zuisong mentioned this issue Jun 21, 2024

decode header value with utf-8 #375

Merged

1 task

blyxxyz mentioned this issue Jul 4, 2024

Decode headers as latin1/UTF-8, show real reason phrase #377

Merged

ducaale closed this as completed in #377 Sep 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IRI output #323

IRI output #323

ghost commented Aug 20, 2023

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023 •

edited by ghost

Loading

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023 •

edited by ghost

Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

blyxxyz commented Jun 21, 2024

ducaale commented Jun 21, 2024 •

edited

Loading

blyxxyz commented Jun 21, 2024

blyxxyz commented Jun 24, 2024

IRI output #323

IRI output #323

Comments

ghost commented Aug 20, 2023

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023 • edited by ghost Loading

blyxxyz commented Aug 20, 2023

ghost commented Aug 20, 2023 • edited by ghost Loading

This comment was marked as off-topic.

This comment was marked as off-topic.

blyxxyz commented Jun 21, 2024

ducaale commented Jun 21, 2024 • edited Loading

blyxxyz commented Jun 21, 2024

blyxxyz commented Jun 24, 2024

ghost commented Aug 20, 2023 •

edited by ghost

Loading

ghost commented Aug 20, 2023 •

edited by ghost

Loading

ducaale commented Jun 21, 2024 •

edited

Loading