We can see here that in the output of `ipfs object data`, the 4th byte is `c3`. In the output of `ipfs object get`, this byte appears as three bytes instead: `ef bf bd`, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character (�). This signifies an encoding problem.
Now, when querying the API using curl (`http://localhost:5001/api/v0/object`), I get the same character, but encoded as a JSON escape sequence instead (`\ufffd`).
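To make the symptom concrete, here is a minimal Python sketch of what seems to be happening. It is only an illustration, not the actual go-ipfs code path: the payload `b"abc\xc3"` and the `"Data"` key are assumptions chosen so that the 4th byte is a lone `c3`.

```python
import json

# Hypothetical payload: three ASCII bytes followed by a lone 0xc3, which
# opens a two-byte UTF-8 sequence that is never completed.
raw = b"abc\xc3"

# Lenient UTF-8 decoding substitutes U+FFFD for the malformed byte.
text = raw.decode("utf-8", errors="replace")

# Re-encoding to UTF-8 yields the three bytes ef bf bd in place of c3.
reencoded = text.encode("utf-8")

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the replacement character shows up as the \ufffd escape sequence.
as_json = json.dumps({"Data": text})

print(reencoded.hex(" "))
print(as_json)
```

Once the substitution has happened, the original `c3` byte cannot be recovered from either representation.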
I would expect the byte to be encoded as U+00C3 (Ã), which is `c3 83` in UTF-8, but this is open to debate. We have two choices here:
1. Map every byte above 127 to the corresponding Unicode character in the U+0080..U+00FF range. Genuine multi-byte UTF-8 characters would then be re-encoded byte by byte and would no longer be recognizable.
2. The current choice: assume the binary data is UTF-8 and replace every malformed UTF-8 sequence (a lone `c3` is one of them) with the Unicode replacement character (U+FFFD, �). This is a destructive operation.
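The two choices above can be sketched in Python as follows. The function names and the sample payload are hypothetical, and Latin-1 decoding is used here only because it happens to map each byte to the code point of the same value. The point of the sketch is that the first choice is reversible while the second is not.

```python
def map_bytes_to_latin1(raw: bytes) -> str:
    """First choice: map each byte 0x00..0xff to the code point U+0000..U+00FF."""
    return raw.decode("latin-1")


def decode_utf8_lossy(raw: bytes) -> str:
    """Second choice: treat the data as UTF-8 and replace malformed sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical payload whose 4th byte is a lone 0xc3.
raw = b"abc\xc3"

as_latin1 = map_bytes_to_latin1(raw)
as_lossy = decode_utf8_lossy(raw)

# The first choice round-trips: the original bytes can be recovered.
print(as_latin1.encode("latin-1") == raw)

# The second choice does not: 0xc3 has been replaced by U+FFFD.
print(as_lossy.encode("utf-8") == raw)
```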
In a file like `/ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme`, for a character like ╗ (U+2557), which appears in UTF-8 as `e2 95 97`, this implies:
- For the first solution, it will be encoded as three characters: U+00E2, U+0095 and U+0097. In UTF-8 these are `c3 a2`, `c2 95` and `c2 97`, which display as â followed by two invisible control characters.
- For the second solution, it will be encoded in UTF-8 as its own character ╗ (`e2 95 97`).
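The byte values in this example can be checked with a few lines of Python. This is just a worked verification of the numbers above, with Latin-1 decoding standing in for the first solution and lenient UTF-8 decoding for the second. The variable names are illustrative.

```python
box = "\u2557"  # the box-drawing character ╗

# Bytes as they would be stored in the file.
raw = box.encode("utf-8")
print(raw.hex(" "))

# First solution: every byte becomes its own code point in U+0000..U+00FF,
# and each of those is then encoded to UTF-8 separately.
first = raw.decode("latin-1")
first_utf8 = first.encode("utf-8")
print([f"U+{ord(ch):04X}" for ch in first])
print(first_utf8.hex(" "))

# Second solution: the sequence is valid UTF-8, so the character survives intact.
second = raw.decode("utf-8", errors="replace")
print(second == box)
```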
I would say JSON is not suitable for representing binary data, but on the web we might not have a choice. Perhaps we should think more about which option is best here. Perhaps we should not even try to encode binary data in JSON and should instead tell people to use some other format.
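If the data does have to travel inside JSON, one commonly used workaround is to Base64-encode the bytes and carry them as an ASCII string. The sketch below is only one possible approach and an assumption on my part, not something the API currently does. The `"Data"` key and the payload are hypothetical.

```python
import base64
import json

# Hypothetical binary payload containing a byte that is not valid UTF-8.
raw = b"abc\xc3"

# Base64 output is plain ASCII, so it is always a valid JSON string.
payload = json.dumps({"Data": base64.b64encode(raw).decode("ascii")})
print(payload)

# The receiver reverses both steps and gets the exact original bytes back.
decoded = base64.b64decode(json.loads(payload)["Data"])
print(decoded == raw)
```

The cost is roughly a one-third size increase, and the client needs to know that the field is Base64 rather than text.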
I'm trying to decode a unixfs node I got from the Gateway API in the browser, but I can't figure out how the data is encoded.