How to know if a string was parsed as utf-8? #406

pboettch · 2016-12-29T11:55:57Z

For my schema validator I needed to check the length of a string value. std::string::length() returns the number of bytes (char elements), not the number of characters, so it gives the wrong result when the string contains multibyte UTF-8 sequences.

I wrote my own function, which works for both ASCII and UTF-8.

Could I do it differently? Should nlohmann::json somehow tell me (via a method) that a Unicode string has been parsed?

nlohmann · 2016-12-29T14:31:15Z

I do not really understand the issue. Can you please provide an example where std::basic_string::length is not sufficient?

pboettch · 2016-12-29T16:51:14Z

Of course. This code

std::cerr << "string: " << instance << ", "
          << "length: " << instance.get<std::string>().length() << ", "
          << "size: " << instance.get<std::string>().size() << ", "
          << "utf8-size: " << utf8_length(instance) << "\n";

gives

string: "💩💩", length: 8, size: 8, utf8-size: 2

on

"data":"\uD83D\uDCA9\uD83D\uDCA9"

The validator expects 2. I know this is not a JSON-HPP issue so I'm unsure who to blame ;-) .
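To make the numbers above easier to follow, here is a minimal sketch of why the two escaped surrogate pairs end up as 8 bytes. The helper names combine_surrogates and utf8_byte_count are made up for illustration; they are not part of nlohmann::json, and the sketch assumes well-formed surrogate pairs.

```cpp
#include <cstdint>

// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low)
{
    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}

// Hypothetical helper: number of bytes UTF-8 needs to encode a code point.
// Assumes code_point is a valid Unicode scalar value.
int utf8_byte_count(std::uint32_t code_point)
{
    if (code_point < 0x80u) return 1;
    if (code_point < 0x800u) return 2;
    if (code_point < 0x10000u) return 3;
    return 4;
}
```

Under these assumptions, \uD83D\uDCA9 combines to U+1F4A9, which takes 4 bytes in UTF-8, so two of them give the reported size of 8.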

nlohmann · 2016-12-29T23:56:15Z

I see. I wonder if your function is actually correct in counting the UTF-8 characters - is it really so simple?

nlohmann · 2016-12-30T11:34:29Z

(Yes, it seems to be so simple ;-) http://stackoverflow.com/questions/7298059/how-to-count-characters-in-a-unicode-string-in-c)
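For reference, the approach from the linked answer can be sketched roughly like this. This is not necessarily the function used in the validator, the name count_code_points is made up, and it assumes the input is valid UTF-8 (no validation is performed).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of nlohmann::json.
// Counts Unicode code points in a string that is assumed to hold valid UTF-8.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

With a library value this would presumably be called as count_code_points(instance.get<std::string>()). Note that it counts code points, which is not always the same as what a user perceives as characters.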

nlohmann · 2016-12-30T11:37:32Z

From my point of view, I think this counting issue is out of scope of this library. Though a "count UTF-8 character" function is handy, I fear that it may bloat the API.

pboettch · 2017-01-03T08:30:50Z

This library silently parses Unicode escapes and UTF-8 strings into a std::string. Thus, one should never use size() or length() (both return the byte count) to check the string length, but a function similar to the one I'm using. Always.

A method (bool is_utf8()) could indicate whether this is a UTF-8-string or not. This information could then be used to check the size in a correct manner.

Maybe explaining it in the documentation is enough.

nlohmann · 2017-01-03T21:52:41Z

I don't quite understand: JSON is defined to use Unicode (though this library only supports UTF-8), so I would not know what to return for is_utf8 other than true. I understand that you'd like either a proper character/glyph/whatever count (which std::string::size() will not be able to provide) or at least a bool contains_multibyte_encoded_codepoints() function.

Am I wrong?
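A check along those lines could be sketched as a free function on the user side. This is only an illustration of the idea, not an API of the library, and it assumes the string holds UTF-8, where every byte of a multibyte sequence has its high bit set.

```cpp
#include <algorithm>
#include <string>

// Hypothetical user-side helper, not part of nlohmann::json.
// Returns true if the (assumed UTF-8) string contains any non-ASCII byte,
// i.e. at least one code point that is encoded with more than one byte.
bool contains_multibyte_encoded_codepoints(const std::string& s)
{
    return std::any_of(s.begin(), s.end(),
                       [](unsigned char byte) { return byte >= 0x80; });
}
```

If this returns false, size() and the code point count agree, because the string is pure ASCII.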

pboettch · 2017-01-03T22:14:32Z

I'd like to know which counting method I need to apply based on what and how it has been parsed into the std::string.

The utf-8-counting method works, but needs to be located on the user-side.

How to prevent users in the future from falling into the same trap as I did? How many users really need the real character-count and are not aware of multibyte-encoding-problems?

jaredgrubb · 2017-01-03T22:45:44Z

std::string has no concept of encoding. You can put UTF8, ISO8859-1, UCS2, UTF32, or whatever you like into a std::string. You have to keep track of the encoding external to the string (or, better, just assume UTF8 everywhere and convert from/to it at the "boundaries" of your program). If your program has to handle data and doesn't know what the encoding is, there are algorithms that can try to guess, but they're not foolproof and you're in scary territory at that point. There are very few cases where you should be unsure of what you're getting -- a text editor or web browser is a good legitimate example, but there are many bad ones, and you should never guess without giving the user a UI to confirm what you've done.

I don't think adding Unicode tools to a JSON library is helpful. It's a slippery slope (for example, counting code points can include or not include the "combining" modifiers like ◌ͤ, and then there is handling surrogates, collation, normalization, locales, etc). There are entire C++ libraries for Unicode handling because it's hard, and if you need them, you should use them -- even for "simple" UTF8.
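As a small illustration of the combining-modifier point: the same visible "é" can be stored as one code point (U+00E9) or as two ("e" followed by the combining acute accent U+0301), so a plain code point count gives different answers for text that looks identical. The helper name code_point_count is made up, and the sketch assumes valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by skipping UTF-8 continuation bytes.
// Assumes valid UTF-8; it performs no normalization and no grapheme clustering.
std::size_t code_point_count(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}

// Precomposed form: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
const std::string precomposed = "\xC3\xA9";
// Decomposed form: "e" followed by U+0301 (COMBINING ACUTE ACCENT).
const std::string decomposed = "e\xCC\x81";
```

Here code_point_count(precomposed) is 1 while code_point_count(decomposed) is 2. Getting a count that matches what users see needs normalization or grapheme segmentation from a dedicated Unicode library.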

nlohmann · 2017-01-03T23:09:03Z

I agree with @jaredgrubb. All the library can do is to document that it in fact stores strings as UTF-8 and that the user has an interface to the stored bytes as std::string. Anything beyond this (i.e., providing a string type with a nice Unicode-friendly interface) is out of scope of a JSON library.

pboettch · 2017-01-04T06:33:33Z

Coming back to my original question: How to know if a string was parsed as utf-8? The answer is: you don't, but you should assume that within this library a std::string value is always UTF-8-encoded (and so may contain multibyte sequences) and take the necessary precautions.

nlohmann · 2017-01-04T08:22:24Z

So it's a documentation issue?

nlohmann · 2017-01-04T12:00:24Z

I shall add notes to the documentation about the encoding of the stored strings.

nlohmann · 2017-01-04T17:39:34Z

Added a note to the readme and the string type (http://nlohmann.github.io/json/classnlohmann_1_1basic__json_ab63e618bbb0371042b1bec17f5891f42.html#ab63e618bbb0371042b1bec17f5891f42).

nlohmann added the kind: question label Dec 29, 2016

nlohmann added the state: please discuss label Dec 30, 2016

nlohmann added documentation and removed state: please discuss labels Jan 4, 2017

nlohmann added this to the Release 2.0.11 milestone Jan 4, 2017

nlohmann self-assigned this Jan 4, 2017

nlohmann added a commit that referenced this issue Jan 4, 2017

📝 added documentation wrt. UTF-8 strings #406

4765070

nlohmann closed this as completed Jan 4, 2017

nlohmann modified the milestones: Release 2.0.11, Release 2.1.0 Jan 24, 2017
