Strict mode for JSON parsing #2295
Comments
As mentioned in #494 (comment), that only seems to apply to lenient mode.
|
Thanks @Marcono1234, I've updated the earlier comment to talk about JSON arrays rather than objects. |
Now I'm very confused. Apparently trailing commas are already disallowed in both JSON objects and JSON arrays? So the list quoted in my earlier comment is correct without addition. Anyway I still do think we should have a strict mode, that you can easily get using |
My project group and I will try to resolve this issue for a software engineering course at KTH, Sweden. What should be the interaction between |
@marten-voorberg, thanks for the proposal! I think there are several things that need to happen here, not necessarily in order:
I've created a branch
Concerning the API, I am actually thinking it might be cleaner to use an
Another more general approach would be to have an
public enum Leniency {
  COMMENTS, CAPITALIZATION, ESCAPES, ...;
  public static final Set<Leniency> LENIENT = EnumSet.of(...);
  public static final Set<Leniency> DEFAULT = EnumSet.of(...);
  public static final Set<Leniency> NONE = EnumSet.noneOf(Leniency.class);
}
Then I'm not sure that this extra generality would really be worthwhile. It would allow users to specify exactly which non-standard constructs they need to accept, while rejecting everything else. But I'm inclined to think you either accept the exact standard or you accept any random crap (which is what lenient mode means today). The main reason for |
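As a rough illustration of the set-based idea in the comment above, here is a minimal, self-contained sketch. The `Leniency` constants are the three named in the comment; the `ALL` set and the `LeniencyCheck` helper are hypothetical names invented for this example and are not part of Gson.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum modelled on the comment above; not Gson API.
enum Leniency {
  COMMENTS, CAPITALIZATION, ESCAPES;

  // Accept every listed non-standard construct.
  static final Set<Leniency> ALL =
      Collections.unmodifiableSet(EnumSet.allOf(Leniency.class));
  // Accept none of them, i.e. the strict end of the scale.
  static final Set<Leniency> NONE =
      Collections.unmodifiableSet(EnumSet.noneOf(Leniency.class));
}

// Hypothetical helper that a parser could consult before accepting a construct.
final class LeniencyCheck {
  private final Set<Leniency> allowed;

  LeniencyCheck(Set<Leniency> allowed) {
    // Copy into a fresh EnumSet; EnumSet.copyOf would reject an empty non-EnumSet input.
    this.allowed = EnumSet.noneOf(Leniency.class);
    this.allowed.addAll(allowed);
  }

  boolean permits(Leniency construct) {
    return allowed.contains(construct);
  }

  public static void main(String[] args) {
    LeniencyCheck commentsOnly = new LeniencyCheck(EnumSet.of(Leniency.COMMENTS));
    System.out.println(commentsOnly.permits(Leniency.COMMENTS));
    System.out.println(commentsOnly.permits(Leniency.ESCAPES));
  }
}
```

The extra generality is visible in `commentsOnly`: it accepts one construct and rejects the rest, which is exactly the case the comment doubts is worth supporting.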
@eamonnmcmanus Thanks for the response! I've submitted a WIP PR which throws exceptions when trying to parse any of the non-spec compliant values discussed in the issue description. The API for |
@marten-voorberg, maybe it would also be interesting to add unit tests for the following:
|
The JsonWriter also has a
|
That behavior is actually allowed by the specification, see RFC 8259, section 7:
Maybe there are use cases where users don't want
|
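For reference, RFC 8259, section 7 requires escaping only the quotation mark, the reverse solidus, and the control characters U+0000 through U+001F. Any other character may be escaped but does not have to be. The sketch below shows that minimal rule in isolation. `MinimalJsonEscaper` is a hypothetical name, and this is not how Gson's `JsonWriter` is implemented.

```java
// Hypothetical helper: escapes only what RFC 8259, section 7 says MUST be escaped.
final class MinimalJsonEscaper {
  static String escape(String s) {
    StringBuilder out = new StringBuilder(s.length() + 2);
    out.append('"');
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"') {
        out.append("\\\"");           // quotation mark
      } else if (c == '\\') {
        out.append("\\\\");           // reverse solidus
      } else if (c < 0x20) {
        // Control character: emit a six-character hexadecimal escape.
        out.append("\\u00")
            .append(Character.forDigit(c >> 4, 16))
            .append(Character.forDigit(c & 0xF, 16));
      } else {
        out.append(c);                // everything else passes through unchanged
      }
    }
    out.append('"');
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(escape("a=b"));
    System.out.println(escape("say \"hi\""));
  }
}
```

A writer that escapes more than this still produces valid JSON, which is why additional escaping is a question of output style, not conformance.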
I agree with @Marcono1234: we can escape I also agree that RFC 8259 is a better reference than the ECMA spec I linked to earlier. Only one small caveat: there is an erratum with a grammar that seems to suggest that I also found this excellent resource about tricky areas in JSON parsing, including an excellent set of tests. |
You mean the erratum itself is wrong, right? Because the current JSON grammar seems consistent with the text above of it, saying "the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters". So to my understanding,
Out of curiosity I have locally adjusted that project to run with the latest Gson version from
(Hacky) JSONTestSuite integration and results
import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
import com.google.gson.*;
import com.google.gson.stream.*;

public class Test {
  public static boolean parseJsonElement(String s, boolean lenient) throws IOException {
    Gson gson = new Gson();
    TypeAdapter<JsonElement> adapter = gson.getAdapter(JsonElement.class);
    JsonReader reader = new JsonReader(new StringReader(s));
    reader.setLenient(lenient);
    try {
      adapter.read(reader);
      return reader.peek() == JsonToken.END_DOCUMENT;
    } catch (JsonParseException | MalformedJsonException | EOFException e) {
      e.printStackTrace();
      return false;
    }
  }

  private static boolean jsonReaderSkip(String s) throws IOException {
    JsonReader reader = new JsonReader(new StringReader(s));
    try {
      reader.skipValue();
      return reader.peek() == JsonToken.END_DOCUMENT;
    } catch (JsonParseException | MalformedJsonException | EOFException e) {
      e.printStackTrace();
      return false;
    }
  }

  public static void main(String[] args) {
    Path path = Paths.get(args[1]);
    String s = null;
    try {
      s = Files.readString(path);
    } catch (CharacterCodingException e) {
      System.out.println("Skipping testcase with malformed UTF-8 for: " + path);
      e.printStackTrace();
      System.exit(0);
    } catch (Throwable t) {
      t.printStackTrace();
      System.exit(2);
    }

    try {
      boolean isValid = false;
      if (args[0].equals("parse-lenient")) {
        isValid = parseJsonElement(s, true);
      } else if (args[0].equals("parse-strict")) {
        isValid = parseJsonElement(s, false);
      } else if (args[0].equals("skip")) {
        isValid = jsonReaderSkip(s);
      } else {
        System.out.println("Invalid arg: " + args[0]);
        System.exit(2);
      }

      if (isValid) {
        System.out.println("valid");
        System.exit(0);
      } else {
        System.out.println("invalid");
        System.exit(1);
      }
    } catch (Throwable t) {
      t.printStackTrace();
      System.exit(2);
    }
  }
}
Then adjust
programs = {
    "gson-lenient":
        {
            "url":"...",
            "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "parse-lenient"]
        },
    "gson-strict":
        {
            "url":"...",
            "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "parse-strict"]
        },
    "gson-skip":
        {
            "url":"...",
            "commands":["java", "-cp", "gson-2.10.2-SNAPSHOT.jar", "Test.java", "skip"]
        }
}
(assumes you are using JDK >= 11)
Results: |
It would be nice to have a common behavior for escaping or not escaping |
@vitorpamplona, do you really have to recreate the JSON document with Gson? Is there no way you can directly access the original JSON file and calculate the hash over that? Because there might also be other incompatibility issues, such as different floating point number parsing and conversion of floating point numbers to string between languages, and preserving the order of object properties. Personally I think if such a setting is really needed it should not be tied to the strictness mode since this is not really related to strictness. Maybe it could be part of the newly added |
The issue is that the JSON is being passed around in multiple servers before it gets to me. It all depends on how those servers recreate that JSON in every step along the way. The JSON is just the serialization of the data model I have to work with. It shouldn't matter what they do. On my side, what I have to do is to receive the JSON, parse it into the data model, and rebuild the JSON as per the protocol's hashing spec (no escaping). I bumped into this thread because today our app was rejecting a message with |
@vitorpamplona I agree with @Marcono1234 that I don't think that dealing with If we want to start treating |
I will go back to my initial point. Whatever you choose to do, it would be nice to have a common behavior/setting name for escaping or not escaping |
Some serialization formats do define a canonical form, so that the same input will always produce the exact same output. Plain JSON is not one of those formats. There are all sorts of things that can vary: you can insert whitespace between tokens; you can write a quote in a string as There is a standard JSON Canonicalization Scheme (JCS), which defines exactly what bytes should be produced for any given object. However, I think that's different from what we're trying to do here. We could imagine later having a |
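To make the point about non-canonical output concrete, the sketch below hashes three JSON texts that denote the same value but differ in whitespace or in how one character is written. The class name `EquivalentJsonHashes`, the sample documents, and the choice of SHA-256 are all assumptions made for illustration; the thread does not say which hash the protocol in question uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo: equivalent JSON texts are different byte sequences.
final class EquivalentJsonHashes {
  static String sha256Hex(String text) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256")
          .digest(text.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(Character.forDigit((b >> 4) & 0xF, 16))
            .append(Character.forDigit(b & 0xF, 16));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is mandatory on every Java platform, so this should not happen.
      throw new IllegalStateException("SHA-256 is unavailable", e);
    }
  }

  public static void main(String[] args) {
    String compact = "{\"k\":\"a=b\"}";          // character written literally
    String escaped = "{\"k\":\"a\\u003db\"}";    // same character as a hexadecimal escape
    String spaced = "{ \"k\": \"a=b\" }";        // same value with extra whitespace
    System.out.println(sha256Hex(compact).equals(sha256Hex(escaped)));
    System.out.println(sha256Hex(compact).equals(sha256Hex(spaced)));
  }
}
```

Any hash computed over the serialized text therefore depends on every such choice the writer makes, which is the gap a canonicalization scheme such as JCS is meant to close.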
I agree, but like it or not, JSON is an extremely common canonicalization form. Most of us don't have any choice in the matter. However, the issue here is not about an overarching canonical form for JSON. It's much, much simpler. It's simply a practically-helpful way of escaping Strings from the native representation of every language into the JSON String. That's it. No big dreams. All it takes is lib devs to come together and agree on 2-3 schemes, with commonly supported options, and ways to activate and deactivate them in the lib. It's not a standard. Just a lib implementation consensus. |
It seems to me that the case you described requires a strict canonical format, which I think should be JCS. If we just canonicalize what string literals look like we're only solving half the problem. |
Then solve half of the problem. I can guarantee you the other half is fully solved already :) |
@eamonnmcmanus, this can be closed now since #2437 was merged, right? Since this issue was only about parsing, I guess it would be good to create a separate issue if you want any change to how Gson emits JSON data, as mentioned above by Éamonn:
|
As noted in the documentation for JsonReader.setLenient, there are a number of areas where Gson accepts JSON inputs that are not strictly conformant, even when lenient mode is off:
To which we could add that it allows a trailing comma in JSON arrays, for example ["foo", "bar",] as noted in #494.
We could imagine that GsonBuilder would acquire a method setStrict(), mutually exclusive with setLenient(), and that JsonReader would likewise acquire setStrict(boolean) and isStrict(). We can't make strict mode the default, for fear of breaking existing users, but we could recommend it.
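One way to read "mutually exclusive" in the proposal is that requesting both modes on the same builder fails fast. The sketch below models only that rule. `ReaderOptions` is a hypothetical stand-in, not `GsonBuilder`, and it makes no claim about the API that was eventually merged.

```java
// Hypothetical stand-in for the proposed builder methods; not Gson's actual API.
final class ReaderOptions {
  private boolean strict;
  private boolean lenient;

  ReaderOptions setStrict() {
    if (lenient) {
      throw new IllegalStateException("setStrict() cannot be combined with setLenient()");
    }
    strict = true;
    return this;
  }

  ReaderOptions setLenient() {
    if (strict) {
      throw new IllegalStateException("setLenient() cannot be combined with setStrict()");
    }
    lenient = true;
    return this;
  }

  boolean isStrict() {
    return strict;
  }

  boolean isLenient() {
    return lenient;
  }

  public static void main(String[] args) {
    ReaderOptions options = new ReaderOptions().setStrict();
    System.out.println(options.isStrict());
  }
}
```

Neither flag is set by default in this sketch, mirroring the constraint that strict mode cannot become the default without breaking existing users.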