parse error #3403

Closed
1 of 5 tasks
SigmaStar opened this issue Mar 24, 2022 · 14 comments
Labels
solution: invalid the issue is not related to the library

Comments

@SigmaStar

What is the issue you have?

Some JSON strings are well formatted but cannot be parsed, just because they contain unreadable data.

Please describe the steps to reproduce the issue.

js.parse(R"({"key_name": "\udd281\u00c4(\u3ca3\u7724\u114a\u7722"})");
Parsing fails when the value under key_name is the string above. This should not be an error; the same input works with Python's json.loads function.

Can you provide a small but working code example?

js.parse(R"({"key_name": "\udd281\u00c4(\u3ca3\u7724\u114a\u7722"})");

What is the expected behavior?

It should simply be loaded into a JSON value.

And what is the actual behavior instead?

It reports an error.

Which compiler and operating system are you using?

Visual Studio 2019

Which version of the library did you use?

  • latest release version 3.10.5
  • other release - please state the version: ___
  • the develop branch

If you experience a compilation error: can you compile and run the unit tests?

  • yes
  • no - please copy/paste the error message below
@SigmaStar
Author

I think there should be a switch. Sometimes I just want to store the literal string in the object's string value. I don't care whether it's a valid UTF-8 string or not.

@nlohmann
Owner

For reference, this is the exception:

[json.exception.parse_error.101] parse error at line 1, column 20: syntax error while parsing value - invalid string: surrogate U+DC00..U+DFFF must follow U+D800..U+DBFF; last read: '"\udd28'

The text describes it: the input is invalid UTF-8. The library is strict about this, as the JSON standard requires it.

@nlohmann nlohmann added solution: invalid the issue is not related to the library and removed kind: bug labels Mar 24, 2022
@SigmaStar
Author

No, it's not an exception case. I mean that I put that invalid UTF-8 string there on purpose. For now I have simply changed the code that handles Unicode escapes to
add('\u'); break;
instead.

@nlohmann nlohmann reopened this Mar 24, 2022
@nlohmann
Owner

I do not understand what you mean. Can you please explain in detail what you tried, what happened, and what you would have expected?

@SigmaStar
Author

Actually, this is part of a report from Cuckoo Sandbox. Many people store binary data as Unicode escapes in JSON, and not all of it is readable or decodable, so I think there should be a switch or something similar to disable Unicode string parsing. I think this would make nlohmann/json the best. Right now I am using a very crude workaround:

Change line 6721 to:
// unicode escapes
case 'u':
{
    add('\u');
    break;

I can't upload pictures, sorry.

@nlohmann
Owner

So then your original issue remains that the library throws a parse error in case of invalid UTF-8, right?

The library supports multiple binary formats (https://json.nlohmann.me/features/binary_formats/) which in turn support binary values (https://json.nlohmann.me/features/binary_values/).

@SigmaStar
Author

No, the file I tried to parse is just plain text in JSON format. Only part of this JSON contains an invalid UTF-8 string.

@nlohmann
Owner

Understood. But then your file is invalid JSON, hence the exception.

@SigmaStar
Author

Ahh, no. The file was produced by a Python program. In Python these strings are valid and work well. In Python the \u prefix denotes a Unicode escape, and it can hold any value from 0x0000 to 0xffff without restriction. So I think this project should behave the same way Python does.

@SigmaStar
Author

I have no time right now, but later I can modify json::parse and add a new parameter to choose whether \u escapes are decoded or kept as literal strings. May I open a pull request later?

@gregmarr
Contributor

"Ahh, no. The file was produced by a Python program. In Python these strings are valid and work well."

That doesn't mean that it created valid JSON. That just means that it created something that it could understand.

@SigmaStar
Author

SigmaStar commented Mar 25, 2022

But this is actually the case. You don't know the encoding of the value. For example:
In Python, json.dumps({'a':'中国'}) creates '{"a": "\u4e2d\u56fd"}',
but parsing that with jsonpp creates: {"a":"涓浗"}
The problem is that we shouldn't just treat the text as UTF-8. There are many encodings for Chinese and other languages, for example GBK, GB2312, UTF-16 LE/BE, and UNICODE.
The best way to avoid this is to add a switch that ignores strings starting with \u and lets users choose how they want to process the string.
This is obviously not a text-editor kind of program, so it cannot support every encoding.
Again, it would be best if jsonpp behaved the same as the JSON parser in Python or fastjson in Java.
By the way, not parsing \u strings would significantly improve the overall performance of jsonpp.
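For context on the garbled output above: \u4e2d and \u56fd name the code points U+4E2D and U+56FD, and a parser that stores strings as UTF-8 turns each into three bytes. The sketch below uses only the C++ standard library to show that encoding step; encode_utf8 is a hypothetical name for this illustration and not a function of the library discussed here. Output like 涓浗 is consistent with those UTF-8 bytes being displayed by a console that assumes GBK, which is an assumption about the reporter's environment rather than something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte: 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes: 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes: 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes: 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x4E2D) yields the bytes E4 B8 AD and encode_utf8(0x56FD) yields E5 9B BD. Printing those six bytes on a terminal set to UTF-8 shows 中国; the string itself is the same either way, only the display differs.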

@SigmaStar
Author

Oops, a hard-coded string causes an exception.
The following code throws:
json js;
js = json::parse(R"({"a": "中国"})");
cout << js << endl;

@nlohmann
Owner

The library supports only UTF-8, see https://json.nlohmann.me/home/faq/#parse-errors-reading-non-ascii-characters
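A quick way to tell whether a byte string would meet that requirement is to validate it before parsing. The sketch below is a standalone validator written against the standard library only; is_valid_utf8 is a hypothetical name and not a function of the library. One plausible cause of the exception with the hard-coded literal, assuming the source file was saved in a legacy code page such as GBK, is that the literal then holds the bytes D6 D0 B9 FA rather than UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true if `bytes` is well-formed UTF-8.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) { ++i; continue; }             // ASCII
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else return false;                            // stray continuation or invalid lead byte
        if (i + len > n) return false;                // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;     // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;                       // overlong encoding
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;  // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;                 // encoded surrogate
        i += len;
    }
    return true;
}
```

The UTF-8 bytes for 中国 (E4 B8 AD E5 9B BD) pass this check, while the GBK bytes (D6 D0 B9 FA) fail at the second byte, since D0 is not a continuation byte.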
