How to know if a string was parsed as utf-8? #406
Comments
I do not really understand the issue. Can you please provide an example where |
Of course. This code

    std::cerr << "string: " << instance << ", "
              << "length: " << instance.get<std::string>().length() << ", "
              << "size: " << instance.get<std::string>().size() << ", "
              << "utf8-size: " << utf8_length(instance) << "\n";

gives
on "data":"\uD83D\uDCA9\uD83D\uDCA9". The validator expects 2. I know this is not a JSON-HPP issue, so I'm unsure who to blame ;-) . |
I see. I wonder if your function is actually correct in counting the UTF-8 characters - is it really so simple? |
(Yes, it seems to be so simple ;-) http://stackoverflow.com/questions/7298059/how-to-count-characters-in-a-unicode-string-in-c) |
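To make the counting approach discussed above concrete, here is a minimal sketch. It assumes the input is valid UTF-8 and counts code points, not user-perceived characters. The name `count_utf8_code_points` is hypothetical: it is not part of nlohmann::json and is not necessarily the `utf8_length` function used earlier in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not a library API):
// counts Unicode code points in a string that is assumed to hold valid UTF-8.
std::size_t count_utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

For the value from the example, two U+1F4A9 code points stored as the UTF-8 bytes `F0 9F 92 A9 F0 9F 92 A9`, `size()` and `length()` both report 8, while this function returns 2.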
From my point of view, I think this counting issue is out of scope of this library. Though a "count UTF-8 character" function is handy, I fear that it may bloat the API. |
This library parses Unicode and UTF-8-strings silently into a A method ( Maybe explaining it in the documentation is enough. |
I don't quite understand: JSON is defined to use Unicode (though this library only supports UTF-8), so I would not know what except Am I wrong? |
I'd like to know which counting method I need to apply based on what and how it has been parsed into the The utf-8-counting method works, but needs to be located on the user-side. How to prevent users in the future from falling into the same trap as I did? How many users really need the real character-count and are not aware of multibyte-encoding-problems? |
I don't think adding Unicode tools to a JSON library is helpful. It's a slippery slope (for example, counting code points can include or not include the "combining" modifiers like ◌ͤ, handling surrogates, collation, normalization, locales, etc). There are entire C++ libraries for Unicode handling because it's hard, and if you need them, you should use them -- even for "simple" UTF8. |
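The point about combining modifiers can be illustrated with a small sketch. The helper `code_points` and the constant `combined` are hypothetical names introduced only for this example, and the input is assumed to be valid UTF-8. The string holds the letter "a" followed by U+0364 (COMBINING LATIN SMALL LETTER E), which renders as a single visible character.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration): counts code points
// by counting every byte that is not a UTF-8 continuation byte (10xxxxxx).
std::size_t code_points(const std::string& utf8)
{
    return static_cast<std::size_t>(std::count_if(
        utf8.begin(), utf8.end(),
        [](unsigned char byte) { return (byte & 0xC0) != 0x80; }));
}

// "a" followed by U+0364, which is encoded in UTF-8 as the bytes CD A4.
const std::string combined = "a\xCD\xA4";
```

Here `combined.size()` is 3 and `code_points(combined)` is 2, yet a reader sees one character. Counting user-perceived characters (grapheme clusters) needs Unicode segmentation rules, which is the kind of work a dedicated Unicode library handles.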
I agree with @jaredgrubb. All the library can do is to document that it in fact stores strings as UTF-8 and the user has an interface to the stored bytes as |
Coming back to my original question: How to know if a string was parsed as utf-8? The answer is: you don't, but you should assume that within this library a std::string value is always UTF-8 encoded, so it may contain multibyte sequences, and take the necessary precautions. |
So it's a documentation issue? |
I shall add notes to the documentation about the encoding of the stored strings. |
Added a note to the readme and the string type (http://nlohmann.github.io/json/classnlohmann_1_1basic__json_ab63e618bbb0371042b1bec17f5891f42.html#ab63e618bbb0371042b1bec17f5891f42). |
For my schema-validator I needed to check the length of a string value. std::string::length() returns the number of bytes, which is not the character count if the string is UTF-8.
I wrote my own function, which works for ASCII and UTF-8.
Could I do it differently? Should nlohmann::json somehow inform me (with a method) that a Unicode string had been parsed?