Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lint: consider the JSON parsing/deserialization design #312

Open
ee7 opened this issue May 6, 2021 · 14 comments
Open

lint: consider the JSON parsing/deserialization design #312

ee7 opened this issue May 6, 2021 · 14 comments
Labels
cmd: lint kind: design Discussing overall direction

Comments

@ee7
Copy link
Member

ee7 commented May 6, 2021

Main options:

  1. std/json
    1a. The approach so far: parse into a JsonNode and work only with that.
    1b. Parse into a JsonNode, then unmarshall into some object using to.
    1c. Plus std/jsonutils
  2. Araq/packedjson - keeps everything as a string. Lower memory usage than std/json, and sometimes faster.
  3. planetis-m/eminim - deserializes using std/streams directly to an object. Doesn't fully support object variants, but maybe that isn't a problem for us.
  4. status-im/nim-json-serialization - deserializes using nim-faststreams directly to an object. Probably the most mature third-party option. Currently has a large dependency tree, including chronos and bearssl.
  5. treeform/jsony - deserializes from string directly to an object.

(Note that disruptek/jason is serialization-only).

There are also some more obscure ones that I haven't tried, and don't know anything about:

Some of the above are possibly too lenient or require special handling in some edge cases.

Summary:

Library Permits a trailing comma? Permits comment? Duplicate key handling
Ruby stdlib json Uses last value
std/json Uses last value
std/json patched Uses last value
packedjson Uses first value
eminim Uses last value
json_serialization Uses last value
jsony Uses last value

For example:

  • There is no correct behaviour for a duplicate key - one library may produce an error, another may silently use the value of the first key, and another may silently use the value of the last key.
  • std/json permits a trailing comma, and comments with // and /* */. This is the main reason that it took a while to tick the boxes for "the file must be valid JSON" in lint: implement remaining linting rules #249. But we now have own patched std/json with stricter parsing. configlet lint must exit with a non-zero exit code for a trailing comma because the Ruby library that parses it later produces an error for a trailing comma.
  • Some libraries may use a default value when the key is missing, which we might want to distinguish from e.g. a value that is the empty string.
  • Edge cases around a literal null.

See also:

I'd suggest that jsony or nim-json-serialization might be best in the long-term. But maybe it's better to stick with the current approach until we've implemented all the linting rules, and refactor it later.

One advantage of the current approach is that it's more low-level, which might better ensure that we're "checking the JSON file itself" rather than "checking that each value is valid when parsed with library X".

@ErikSchierboom
Copy link
Member

ErikSchierboom commented May 6, 2021

I would also be in favor of trying to use the standard library first, and see how far we can take it. I don't mind us writing a bit of verbose code.

@ee7
Copy link
Member Author

ee7 commented May 7, 2021

@ErikSchierboom updated the top post with some investigation of behavior in edge cases.

We probably want to define "valid JSON" in the spec, and perhaps explicitly forbid duplicate keys.

@ErikSchierboom
Copy link
Member

I'd suggest that jsony or nim-json-serialization might be best in the long-term. But maybe it's better to stick with the current approach until we've implemented all the linting rules, and refactor it later.

I don't know, both jsony and nim-json-serialization seem to be maintained by relatively few people and I'd be hesitant to use those libraries instead of the built-in JSON library. Its also telling that so far, no track has actually had this issue with trailing commas, which leads me to believe that it is not that big of a deal.

But it's probably simple to use a modified std/json with stricter parsing.

You mean forking the existing code? Would this be something that you could PR to Nim itself?

@ee7
Copy link
Member Author

ee7 commented May 9, 2021

But it's probably simple to use a modified std/json with stricter parsing.

You mean forking the existing code?

I meant that we do some workaround such that import std/json instead uses our own modified version of lib/pure/json.nim and/or lib/pure/parsejson.nim. We can do this by either:

  1. Adding the modified file(s) to this repo, and when running our GitHub Actions workflows, replace file(s) in the Nim installation directory.
  2. Adding the modified file(s) to this repo, then add a call to patchFile in our config.nims file. This is better because it also affects the local development environment in the same way.
  3. Apply some transformation to the AST of the relevant procs at compile-time. This is clever, but less obvious and less self-documenting.
  4. Like 3, but import the Nim compiler and apply the transformation as an extra compiler pass. Again, probably too clever.

Would this be something that you could PR to Nim itself?

Yes, it's possible to add it to Nim itself. But it wouldn't be available until Nim 1.6.0 anyway, which might take a while. (We shouldn't build a configlet release with the devel Nim compiler).

It would be added an opt-in strict mode, since making parseJson or parseFile strict by default would completely break backwards compatibility.

I'd suggest we should do option 2 in the meantime regardless. The main downside is that we wouldn't immediately get upstream bug fixes in the patched files, unless we backport the latest changes manually. But such upstream changes to std/json would rarely affect us anyway, and backporting should be trivial (given that our diff is probably small) if necessary.

@ErikSchierboom
Copy link
Member

@ee7 I think that sounds like a good plan! 👍

@ee7 ee7 added kind: design Discussing overall direction cmd: lint labels May 9, 2021
ee7 added a commit that referenced this issue Jun 10, 2021
The Nim stdlib JSON modules parse JSON in a lenient way: they silently
ignore:
- a comma after the last item in an array or object
- a line comment with `//` 
- a potentially-multi-line comment with `/* */`

This is useful for parsing real-world JSON, but is bad for our use case
of validating JSON. JSON that contains a trailing comma or such comments
is technically invalid [1], and less portable. Most JSON parsers produce
an error for a trailing comma, including the Ruby library that Exercism
uses. Therefore `configlet lint` should produce an error for a trailing
comma, so that a track doesn't merge a PR that contains one.

This commit adds our own copies of the Nim JSON modules to our repo, and
patches them to raise an exception for these cases. Each module was
copied from the latest version on Nim's `version-1-4` branch. In the
future, we will pull in upstream changes to these files when we update
the Nim version.

This approach is better than calling something like `jq` because it:
- doesn't require/prefer the user to have an extra program installed
- doesn't special-case running in CI
- doesn't require refactoring our codebase
- is much faster

We use the somewhat-obscure `nimscript.patchFile` [2] mechanism to
override the location of the `json` and `parsejson` modules, so that we
can keep `std/json` in our import statements.

Instead of the `patchFile` approach, it would also be possible to apply
some transformation to the AST of the relevant procs at compile-time,
but that's less readable and harder to maintain.

Our approach might be obsoleted in the future - for example, if we ever
move to some other JSON library, or if the upstream `parseJson` gains
e.g. an `allowTrailingCommas = true` argument.

[1] See e.g. https://www.json.org/json-en.html
[2] https://nim-lang.org/docs/nimscript.html#patchFile%2Cstring%2Cstring%2Cstring

Closes: #345
See also: #312
@ee7
Copy link
Member Author

ee7 commented Sep 12, 2021

In the meantime, we:

  • Merged 15c8403, which forks std/json and std/parsejson
  • Started using jsony for the "multi-key" checks in configlet lint

However, I'm still undecided about the best overall design for a refactor (not a high priority). For example, we could:

  1. Use only std/json. Here's, it's probably best to do a first pass to check the types, then use json.to to get an object.
  2. Do all the "single-key" checks with std/json, then more complex checks with jsony. This is the current approach.
  3. Check only the types using std/json, and do all the other checks after deserializing via jsony to an object.
  4. Avoid std/json entirely, and use only jsony. This is attractive from a performance perspective: we avoid the allocation of some dynamic JsonNode, which consumes nearly all of the configlet lint runtime, and instead directly populate an object. Performance isn't my top priority, but this would probably also help make the codebase more robust/readable/maintainable.

The latter two options have the downside of increasing our dependence on a non-stdlib package. However: the jsony source code is pretty short, and the author is a prolific and well-known member of the Nim community, who parses a lot of JSON.

The latter options also give us less control over error messages. For example, if we use jsony only, the most straightforward implementation means we'll only get one error message for a file that has a type error as well as other problems. And unless jsony gains a strict mode, we'll have to fork it to disallow at least:

  • Trailing commas (assuming that stays necessary for later parsing by Ruby)
  • Different key name capitalization

@ErikSchierboom
Copy link
Member

Avoid std/json entirely, and use only jsony. This is attractive from a performance perspective: we avoid the allocation of some dynamic JsonNode, which consumes nearly all of the configlet lint runtime, and instead directly populate an object. Performance isn't my top priority, but this would probably also help make the codebase more robust/readable/maintainable.

I honestly don't care for performance as configlet is already incredibly fast. Robustness/readability/maintainability are much more important for configlet.

The latter options also give us less control over error messages. For example, if we use jsony only, the most straightforward implementation means we'll only get one error message for a file that has a type error as well as other problems.

So if I'm interpreting this correctly, jsony is different from std/json in that it returns an error message if a type mismatch between the JSON content and the type to serialize to occurs? And in that case jsony only returns one (the first?) error? If so, I'd be totally fine with that. We'd be able to remove tons of validation code and type errors should be quite rare.

And unless jsony gains a strict mode, we'll have to fork it to disallow at least:

Would that be a lot of work?

@ee7
Copy link
Member Author

ee7 commented Apr 15, 2022

jsony is different from std/json in that it returns an error message if a type mismatch between the JSON content and the type to serialize to occurs?

Yes. We can also fail fast for e.g. seeing a slug that is not kebab-case.

And in that case jsony only returns one (the first?) error?

Yes. Although I imagine it's technically possible to output all the type mismatches - but probably not worth it.

If so, I'd be totally fine with that. We'd be able to remove tons of validation code and type errors should be quite rare.

I was thinking the same. Another advantage is better error messages: we'd find type errors at the time of parsing, and so we still have direct access to line number information.

Would that be a lot of work?

I'd guess/hope that it wouldn't be too bad. It might even be simpler than having a separate first pass that checks the JSON is valid and that key name capitalization is correct. We could also try maintaining a patch, if the diff is small (this is what we do with cligen's parseopt3.nim currently).

@ErikSchierboom
Copy link
Member

Another advantage is better error messages: we'd find type errors at the time of parsing, and so we still have direct access to line number information.

This is a very important point.

We could also try maintaining a patch, if the diff is small

This has worked out well with parseopt3.nim. That file is from the standard library though, isn't it? I'm asking, because that file probably changes less than the jsony source code (see its commits).

I've looked at what should be patched:

  • Trailing commas (assuming that stays necessary for later parsing by Ruby)
  • Different key name capitalization

The first one is still required, as I've just checked it with the latest Ruby version.
The second one, what is that about? Is jsony strict about the casing of keys?

@ee7
Copy link
Member Author

ee7 commented Apr 20, 2022

This has worked out well with parseopt3.nim. That file is from the standard library though, isn't it?

It's from cligen/parseopt3, which is indeed derived from std/parseopt.

For us, one of the main differences is that parseopt3 supports separating a short option and its value with a space, like:

configlet sync -e bob

(see a897d05cb0b4 for background).

I'm asking, because that file probably changes less than the jsony source code

Yes, jsony will probably see more churn than parseopt3. But, in the same way that cligen receives lots of work that doesn't touch parseopt3, maybe a patch would touch only relatively stable code (from the below, it only needs to forbid trailing commas for a jsony-only approach).


[checking for trailing commas] is still required, as I've just checked it with the latest Ruby version.

OK - thanks.

Is jsony strict about the casing of keys?

It's not completely strict, but it's stricter than I thought/remembered.

It turns out that the only looseness is "the value of a snake_case JSON key does set the value of a camelCase nim object field".

See the jsony docs, and the relevant jsony code.

I've tried to illustrate below how jsony behaves. Feel free to stare at this:

import pkg/jsony

type
  ObjA = object
    foo_bar: int

  ObjB = object
    anotherField: int

func init(T: typedesc[ObjA | ObjB], s: string): T =
  fromJson(s, T)

# Summary: jsony is stricter when the object field name is snake_case style.

func main =
  block:
    let t = ObjA.init """{"foo_bar": 1}"""
    doAssert t.foo_bar == 1

  # The value of a camelCase JSON key DOES NOT set the value of a corresponding snake_case field.
  block:
    let t = ObjA.init """{"fooBar": 1}"""
    doAssert t.foo_bar == 0 # The default value.

  # And other capitalization is also not accepted.
  block:
    let t = ObjA.init """{"foo_Bar": 1}"""
    doAssert t.foo_bar == 0

  block:
    let t = ObjA.init """{"foobar": 1}"""
    doAssert t.foo_bar == 0

  # ----------------------------------------------------------------------------
  # The value of a snake_case JSON key DOES set the value of a corresponding camelCase field.
  block:
    let t = ObjB.init """{"another_field": 1}"""
    doAssert t.anotherField == 1

  block:
    let t = ObjB.init """{"anotherField": 1}"""
    doAssert t.anotherField == 1

  # But other capitalization is not accepted.
  block:
    let t = ObjB.init """{"another_Field": 1}"""
    doAssert t.anotherField == 0

  block:
    let t = ObjB.init """{"anotherfield": 1}"""
    doAssert t.anotherField == 0

main()

Summary: I think that unpatched parsing with jsony alone is sufficient to check JSON key names (and everything except trailing commas), as long as our spec for Exercism JSON files has no uppercase character in any key name, and we do one of these:

  • Use snake_case for our Nim object field names that jsony uses (and silence the stylecheck hint - as we already do e.g. here).
  • Use camelCase for our Nim object field names that jsony uses, and patch or fork jsony's default parsing of JSON objects.
  • Use camelCase for our Nim object field names that jsony uses, and provide a different parseHook for our specific objects.

I'd suggest the first. Which leaves us doing one of these:

  1. Parse only with jsony, and patch it to error for a trailing comma.
  2. Do almost everything with jsony, but do one pass with our existing patched std/json just to error for a trailing comma.
  3. Do almost everything with jsony, but do one pass with some other Nim code, just to error for a trailing comma.
  4. Do almost everything with jsony, but make configlet lint call some non-Nim code, just to error for a trailing comma. For example, we could run find . -name '*.json' -exec jq '.' {} + > /dev/null when the CI environment variable exists (or when jq is installed, which it is in CI). This is simple, but means that a trailing comma may be detected only in CI, and not locally.
  5. Do almost everything with jsony, but add an org-wide workflow that runs the jq command in 4. I think this is bad.
  6. Parse only with jsony, and use a different ruby library (or use the current one, but change its options/patch it).

I think 1 is best, but we can do 2, 3, or 4 as a first implementation if it turns out that 1 is difficult.

There is some subtlety though: if a user runs configlet fmt when there is a trailing comma, should configlet error or remove it? What about configlet sync? We could consider being permissive in what we accept, and strict in what we output (robustness principle). So maybe a jsony patch with configurable trailing comma behavior...

@ErikSchierboom
Copy link
Member

I think 1 is best, but we can do 2, 3, or 4 as a first implementation if it turns out that 1 is difficult.

Agreed.

There is some subtlety though: if a user runs configlet fmt when there is a trailing comma, should configlet error or remove it? What about configlet sync? We could consider being permissive in what we accept, and strict in what we output (robustness principle). So maybe a jsony patch with configurable trailing comma behavior...

I wouldn't mind erroring on a trailing commas for configlet sync or configlet fmt, as the official spec does not support trailing commas so we would just be following the spec :)

@kotp
Copy link
Member

kotp commented Apr 22, 2022

@ErikSchierboom do you have reference link to the "does not support trailing commas"? I did not find mention of trailing commas, so I was not able to figure out if it is simply not mentioned or not supported or disallowed? I checked (searched for the word "trailing") v1.0 and v1.1 as it is here and did not find any mention of this. I might have missed it though.

@ee7
Copy link
Member Author

ee7 commented Apr 22, 2022

@kotp That link is a spec for JSON APIs. For the standards for JSON itself, see the railroad diagrams on https://www.json.org:

object
array

And:

They're allowed in JavaScript, though. See:

@kotp
Copy link
Member

kotp commented Apr 22, 2022

Thanks @ee7. I saw that as well, though still can not find anything about trailing commas being either unsupported or allowed, or even a "should" statement regarding this. The https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Trailing_commas document states that it (JSON) disallows trailing commas but does not show where that information is made known.

So I would vote to not, if it is true that it is disallowed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd: lint kind: design Discussing overall direction
Projects
None yet
Development

No branches or pull requests

3 participants