Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow, and strip, comments (#, // or /*) #2548

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

liquidaty
Copy link

No description provided.

@liquidaty
Copy link
Author

Possibly this failure has nothing to do with the actual code changes. Here is what the log says:

C:\msys64\usr\bin\bash -lc "pacman-key --verify msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig"
==> Checking msys2-keyring-r21.b39fb11-1-any.pkg.tar.xz.sig...
gpg: no valid OpenPGP data found.
gpg: the signature could not be verified.
Please remember that the signature file (.sig or .asc)
should be the first file given on the command line.
==> ERROR: The signature identified by msys2-keyring-r21.b[39](https://ci.appveyor.com/project/stedolan/jq/builds/46391082/job/ngd77yg97ie5ls1f#L39)fb11-1-any.pkg.tar.xz.sig could not be verified.
Command exited with code 1
cat config.log
cat: config.log: No such file or directory

@itchyny
Copy link
Contributor

itchyny commented Jun 3, 2023

Ref: #1785

@nicowilliams
Copy link
Contributor

Oof, I'm not sure that C-style comments in jq code are a good idea, but I understand that if we were to accept them in JSON inputs then maybe we should accept them in jq.

The other thing is that we really do need an option for JSON input strictness.

I find JSON comments to be useless. We couldn't preserve them in jq because... what do we attach them to? I guess maybe we could have has_comment/0/get_comment/0, but there's problems with that: to what values do comments attach? do we want the bloat hit on jv? do we want to output comments attached to values?

So what I do when I need to write comments in JSON texts is this:

{
   "comment0": "This string is a comment",
   "a":"whatever",
   "comment1": "This is another comment, and the only reason I change the comment name is in case there's any processors out there that reject duplicate keys"
}

which does not allow me to have comments inside arrays, but hey, it works.

@wader
Copy link
Member

wader commented Jul 5, 2023

@nicowilliams there is a proposal by @pkoppstein about a strictness option #2643

@itchyny
Copy link
Contributor

itchyny commented Jul 5, 2023

I'm very worried about the conflict against the alternative operator.

@nicowilliams
Copy link
Contributor

I'm very worried about the conflict against the alternative operator.

Yes, we really can't have // comment, that much is clear, but I don't think I want even /* comment */ in the jq language, though I'd be ok with stripping those from input JSON texts.

@liquidaty
Copy link
Author

liquidaty commented Jul 5, 2023

@nicowilliams @wader @itchyny I think it's worth further dividing the target use cases / proposals into sub-scopes, given that the arguments for and against may be different for each.

  1. Stripping (as opposed to parsing) only:
  • 1a. Strip from JSON input. All we are doing is allowing JSON input to have comments, and ignoring those comments. Everything downstream from there is the same as if the input JSON never had comments to begin with. Example: echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -c yields {"hi":"there"}
    • Note 1: the concern re alternative operator conflict is N/A here
    • Note 2: the original intended scope of this issue Allow, and strip, comments (#, // or /*) #2548 is limited to this sub-scope, and was not intended to extend any further
  • 1b. Strip from comments from jq filters. Same as above but for filters rather than JSON input. Example: echo '{ "hi": "there" }' | jq '. /* this is a comment */ ' -c yields {"hi":"there"}
  1. Parsing comments, which then become part of the parse tree and are available to be included in the output
  • 2a. Parse from JSON input. Example (with extraneous spaces in the JSON input intentionally added): echo '{ "hi": "there" /* this is a comment */ }' | jq '.' -c yields {"hi":"there" /* this is a comment */ }
    • Raises several issues, including where the comment attaches to and whether the comment output style should reflect the original style
  • 2b. Parse from jq filter.
    - Raises several issues, including where the comment attaches to, whether the comment output style should reflect the original style, and C++-style comments potentially conflicting with the alternative operator

Can we all agree that the limited scope of simply stripping comments from JSON input (1a above) is cut & dry? The comments can just be ignored and we are done. It seems to me that all the various concerns that have been raised are only relevant in the other 3 scopes (1b, 2a, 2b)

@nicowilliams
Copy link
Contributor

Yes, ignoring JSON "comments" on input is fine, but I would like a command-line option for strictness, and we'll probably want strict and non-strict versions of fromjson so that jq code can also benefit from strictness. The next question is: which must come first, options for strictness of comment stripping?

I understand that some users have an urgent need to strip out JSON comments. I also imagine that some users like to validate JSON with jq and don't want jq to become more permissive than it is right now. That's the tension here for me.

All the other concerns have to do with the jq language and/or with how to represent comments internally. We can certainly punt on those and not attempt to add new types of comments to the jq language, and we can also not attempt to preserve comments in JSON parsing. (I suspect we'll never want to preserve comments in JSON parses for the simple reason that it's way too bloating for jv.)

@itchyny
Copy link
Contributor

itchyny commented Jul 6, 2023

Instead of extending the current lenient parser, and instead of just skipping the JavaScript-like comments, I would like to make the parser conforming to well-defined specifications. My preference is to make the default parser to follow RFC 8259, and support --format=json5 following https://spec.json5.org/ (if we really want to support JSON with comments), and maybe --format=yaml for https://yaml.org/spec/, and --format=jq-json for transition from the current lenient parser (if someone really want).

@liquidaty
Copy link
Author

@itchyny I think it is a great idea to target well-defined specifications. Re json5, the immediate impact with of that, with respect to this particular issue, is that # would no longer be a comment delimiter, and both C- and C++-style comments would remain in scope.

Beyond that, json5 would also support single- or multi-line strings strings that are unquoted, or that are enveloped in single quotes, and while I agree that those would be useful features, I would also propose treating them as a separate issue / PR for practical purposes. In particular, looking at jv_parse.c it seems to me that the changes necessary to accommodate these string parsing features will be significantly more invasive, and even more so if parsing yaml (so much so that if I were to take on those tasks-- which I probably cannot do, alas-- I might find it easier to swap out the JSON parser to use a parser generator rather than a handwritten one-- in which case that parser change might itself be a standalone issue/PR).

@nicowilliams
Copy link
Contributor

Note that JSON5 is neither an ECMA standard nor an IETF standard. Not that jq only implements standards, mind you, just that JSON5 isn't one. That shouldn't stop us adopting optional support for JSON5, but we probably don't want it to be the default. Indeed, I would like to have JSON5 support if for no other reason that object keys that are identifiers can be left unquoted, that and the trailing comma feature -- these going to be nice to have.

BTW, one thing that @stedolan told me long ago is that all these command-line options are kinda ugly, that we should aim to make it possible to handle all of them in jq code. So for example, -s is the same as -n + starting the jq program with [inputs] | , and not using -s is the same as using -n and starting the program with inputs | . Now, to really make that possible for most / all jq command-line options will take a lot of work, perhaps including finishing the co-routines features.

Now, for jq code itself one of the things we'll need is a form of fromjson/tojson for each supported format. And I agree that it'd be very nice to have not just spec conformance, but also YAML support, and maybe even XML. So maybe we need fromjson5/tojson5.

@nicowilliams
Copy link
Contributor

Also, I think we need to separate all the parsers for all flavors of JSON. I'd rather have lots of code duplication than to have lots of branches complicating and slowing down parsing. Ditto printing. The price to pay in bloat might be annoying, so we might need ./configure options for disabling various features.

@liquidaty
Copy link
Author

OK this PR has been updated and now covers the following:

  • C- and C++-style comments (but not # comments) are stripped
  • this behavior is only enabled when a --strip-comments option is enabled
  • added a couple tests in tests/commentstest

I understand there may not yet be consensus around the use of a --strip-comments option, but figured I'd include it for now and you can decide when you merge if you don't want it, or want to change it (or lmk and I can update the PR).

src/lexer.l Outdated Show resolved Hide resolved
src/main.c Outdated Show resolved Hide resolved
@nicowilliams
Copy link
Contributor

@liquidaty ah, do please get rid of the merge commit. Neither @stedolan nor I like them -- we prefer linear history upstream. If you don't know how to get rid of it I could walk you through it, or I could force push to this PR.

@itchyny
Copy link
Contributor

itchyny commented Jul 9, 2023

Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.

@nicowilliams
Copy link
Contributor

Will this PR be going to 1.7? We need more discussions on this topic. As a user, it does not mention the formal definition about the comment syntax, so it's not clear which form of comments are stripped; e.x. multiline comments can be nested or not. Dealing with comments in JSON will soon raise a feature request to re-format a JSON file with keeping comments, but I'm afraid such request is hard to implement within the current parser.

Great question! I think yeah, it's probably best to wait till after 1.7 to integrate this.

Because the jv representation of JSON values really can't represent comments without a lot of bloat, I think we'll never agree to preserve comments.

@liquidaty
Copy link
Author

@liquidaty ah, do please get rid of the merge commit [...] If you don't know how to get rid of it I could walk you through it, or I could force push to this PR

@nicowilliams I unfortunately do not know (for sure) how to get rid of it and would be happy to try, but I think at least as likely to make it worse than better. Please feel free to walk me thru it or force push yourself, whatever is more convenient for you

@liquidaty
Copy link
Author

Will this PR be going to 1.7? We need more discussions on this topic

probably best to wait till after 1.7 to integrate this

@itchyny @nicowilliams: regarding the limited scope of just stripping C- and C++-style behavior is (i.e. 1a from #2548 (comment)), then given:

  • how uncontroversial that limited scope is
  • how many users want this (this has got to be a top-10 most asked-for feature, maybe top 3, maybe top 1: see e.g. JSON comments-- again #1571, Comments in Json #402, support comments in json #695, Comments #1398), and
  • easy it would be to include in 1.7 (since we already have here a fully working PR that passes all checks, with the only remaining difference of opinion being over what description is used in help and/or man pages-- and I'm fairly confident that the majority of users looking for this solution won't mind whatever outcome you decide on that)

then, I hope you will consider, for your own sakes as well as that of others waiting on this feature: just accept it in 1.7, thereby putting the whole bunch of users looking for this at ease, and buying yourself plenty of time to later decide on all the other issues in the broader comment-related scope

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 10, 2023

@liquidaty here's what I did:

  • git fetch origin pull/2548/head:pr2548
  • git checkout pr2548
  • git rebase -i origin/master
    • fix conflicts
  • the merge commit was dropped automatically by the rebase
  • reviewed and noticed that one commit was mostly the reversal of the other
  • git rebase -i origin/master
    • mark 669cac5 for edit, save, let the rebase run
    • git checkout -f HEAD^ -- src/main.c
    • git commit --amend
    • git checkout 669cac5 -- src/main.c
    • git commit --fixup HEAD^
    • git rebase --continue
  • git rebase -i origin/master again
    • move the update --strip-comments description; remove lexer changes commit to right after add c-style comment stripping from jq program input, then save and let the rebase run
    • notice that now there are no diffs between allow, and strip, comments (#, // or /*) and update --strip-comments description; remove lexer changes
  • git rebase -i origin/master again and mark those to commits to be droped, let it run
  • notice that the last commit should be part of the previous commit, so git rebase -i origin/master once again and squash the last commit into the previous (I could just have done git checkout HEAD^ && git commit --amend instead), but also mark the first commit for reword and capitalize the first letter of its synopsis, then do the same for the now last commit
  • git remote add liquidaty https://github.com/liquidaty/jq
  • git remote set-url --push liquidaty git@github.com:liquidaty/jq
  • git push -f liquidaty HEAD:strip-comments

@nicowilliams
Copy link
Contributor

@liquidaty you'll want to review the new PR HEAD, and you'll want to fetch it as well and compare to your local branch so you can use git diff to see the differences. The only differences should be: a) all the things in jqlang/jq's master that you didn't already have, b) the lexer changes that got undone.

@liquidaty
Copy link
Author

@nicowilliams awesome! super helpful and definitely not something I would have done correctly, so thank you.

I do see some minor changes that were dropped, that I probably should add back in-- they are minor in that they will not show up in the current checks where input is piped directly into jq, but without these changes added back in, they would show up if, like in jq_test.c, the input was first parsed using jv_parse(), which relies on the return code being 0 and not OK.

Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?

@nicowilliams
Copy link
Contributor

Just to confirm, if I now just make those changes (touching a few lines to jv_parse.c to return 0 instead of OK) to origin/strip-comments and push them as a new commit, it will retain the history the way you prefer, correct?

You'll want to apply those changes on top of the commit I pushed. Something like:

  • git fetch origin pull/2548/head:pr2548
  • git checkout pr2458
  • git add ...
  • git commit -m ...
  • git push $your_repos_name pr2548:strip-comments

and then whatever your old branch was in your local clone, you can now abandon it, delete it, or make its HEAD be the same as pr2548's (git branch -f $that_local_branch pr2548).

@liquidaty
Copy link
Author

@nicowilliams good to go now, thanks for your help!

@nicowilliams
Copy link
Contributor

One reason to wait till after 1.7 for this is that I've been fuzzing the current state of the parser, so I'd rather not touch the parser until after 1.7 :)

@liquidaty
Copy link
Author

@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered

@nicowilliams
Copy link
Contributor

@nicowilliams I see. Totally off-topic: out of curiosity, what fuzzing tools are you using? am new to the art of fuzzing and looking to learn more about how it's being done in practice. Also have been considering suggesting some simd-related improvements to jq's parser, so might come in handy to understand the fuzzing context in which such suggestions might be considered

I'm using AFL, because I know how to use it and I've used it before. But see #2255 and #2688.

Some fuzzers like LibFuzzer and HongFuzz require that you provide a function that takes a pointer to data and a length, and then you link with their object files to produce an executable that fuzzes that function, and when you run it you have to provide an initial test corpus. AFL requires (mostly) that you build with a version of gcc or clang that instruments the code, and then you you run the executable under afl-fuzz with options that specify an initial test corpus.

@liquidaty liquidaty mentioned this pull request Jul 10, 2023
@liquidaty
Copy link
Author

@nicowilliams thank you, much appreciated

@nicowilliams
Copy link
Contributor

@liquidaty can you rebase this?

@wader
Copy link
Member

wader commented Feb 17, 2024

If it's ok i can rebase this but not sure if i can push to the PR branch?

@God-damnit-all
Copy link

God-damnit-all commented Feb 24, 2024

If it's ok i can rebase this but not sure if i can push to the PR branch?

What do you mean? Just rebase it on a fork and submit it as a PR, it'll create a "branch" automatically. You don't need to ask for permission. Maybe I'm misunderstanding something, though.

@wader
Copy link
Member

wader commented Feb 24, 2024

@ImportTaste Sure could do that. But it would nice to push a rebase to this PR (2548). How to is described here https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/committing-changes-to-a-pull-request-branch-created-from-a-fork and i think i have persimmon to do it but it's probably good manners to ask first

@God-damnit-all
Copy link

@ImportTaste Sure could do that. But it would nice to push a rebase to this PR (2548). How to is described here https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/committing-changes-to-a-pull-request-branch-created-from-a-fork and i think i have persimmon to do it build it's probably good manners to ask first

Well, given the lack of response, you may want to just go ahead and do it, I'm guessing the maintainers aren't looking too closely at a stale PR where the submitter has been MIA for a while now.

@wader wader force-pushed the strip-comments branch 2 times, most recently from b274bda to b7a41bb Compare February 24, 2024 17:41
@wader
Copy link
Member

wader commented Feb 24, 2024

Rebased and pushed. Hope i got things right. There were two commits that had conflicting or related changes:

71e7bcd Revert "lexer: temporarily revert #\ patch; keep CR in comment bug fix"
3a1ba0c Only care about RS when parsing with --seq

@wader wader changed the title allow, and strip, comments (#, // or /*) Allow, and strip, comments (#, // or /*) Feb 24, 2024
src/jv_parse.c Outdated
@@ -703,11 +702,7 @@ static pfunc scan_json(struct jv_parser* p, char ch, jv* out) {
presult answer = 0;
p->last_ch_was_ws = 0;
if (p->st == JV_PARSER_NORMAL) {
if(ch == '#') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So line comments are gone? Can you squash this with the other commit?

Copy link
Member

@wader wader Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap can do that. When i rebased i tried to keep the existing commits as intact as possible as a first step.

I can't find any discussion why it was removed. Do we think ppl will expect this to skip #-line comments also? json5 seems to have them but the various "jsonc" variants seem not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Remember to update PR title and commits to reflect if # is support or not

Copy link
Author

@liquidaty liquidaty Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually see that to begin with in the original PR code at https://github.com/liquidaty/jq/blob/strip-comments/src/jv_parse.c, so I'm not sure how it got there to begin with before it was removed-- but maybe I'm confusing myself and I put it in some earlier or later version than I'm now looking at.

Looking back and trying to refresh my memory on this, I think I removed the # line comments only because I was trying to make the PR as narrow as possible to get consensus on having the PR accepted.

I don't believe there was any technical reason to remove it (other than having the same flaw in the return value that all of the similar code had and that was subsequently fixed)-- so to reinstate, it should just be:

static pfunc scan_json(struct jv_parser* p, char ch, jv* out) {
  ...
    if(ch == '/' && (p->flags & JV_PARSE_STRIP_COMMENTS)) {
      p->scan = scan_slash_comment;
      return answer;
    }
+    if(ch == '#' && (p->flags & JV_PARSE_STRIP_COMMENTS)) {
+      p->scan = scan_line_comment;
+      return answer;
+    }

PS I have no opinion on whether it should be reinstated or not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it seems i was confused about json5 supporting #-comments (it does not) my opinion is that we can skip support them for now

@@ -649,7 +649,7 @@ static int stream_is_top_num(struct jv_parser* p) {
static pfunc scan_line_comment(struct jv_parser* p, char ch, jv* out) {
if(ch == '\n')
p->scan = scan_json;
return OK;
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have some commentary in the commit message about why this change?

Copy link
Member

@wader wader Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to understand the 0 -> OK change. OK is static const presult OK = "output produced"; so i guess something else must have changed before the change? only reference i can find is in #2548 (comment). Will digg more later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see that, but I'm not sure why that's needed, and anyways, there's places where we look for OK that need to be changed correspondingly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc the reason for this is fairly straightforward, which is that in the below block of jv_parse.c:

jv jv_parser_next(struct jv_parser* p) {
...
  while (!msg && p->curr_buf_pos < p->curr_buf_length) {
    ...
    msg = scan(p, ch, &value);
  }

the desired behavior, when a comment is encountered, is to simply ignore it and keep on going. Therefore the while loop must keep on going, which requires that !msg is true. Any other result will break the while loop, which would be incorrect.

That said, given that the return type of scan() is a pointer, it might be better for the scan_xxx() functions to all return NULL where they currently are returning of 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liquidaty thanks for the explanation. think i tried to return OK instead when played around a bit with the code and pretty sure it also worked for some reason, was confusing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be useful to define IGNORE_AND_CONTINUE as a presult equal to NULL (or alternatively, not null, with other corresponding changes e.g. !msg becomes msg == IGNORE_AND_CONTINUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants