Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revamp sub/3 to resolve most issues with gsub (and sub with "g") #2624

Closed
wants to merge 4 commits into from
Closed

revamp sub/3 to resolve most issues with gsub (and sub with "g") #2624

wants to merge 4 commits into from

Conversation

pkoppstein
Copy link
Contributor

The primary purpose of this commit is to rectify most problems with gsub (and also sub with "g"), and in particular
#1425 (\b), #2354 (lookahead), and #2532 (regex == "^(?!cd ).*$|^cd "; ""))

The revision also partly resolves #2148 and #1206 in that gsub no longer loops infinitely, but because the new gsub depends on match(_;"g"), the behavior when regex == "" is sometimes non-standard.

Since the new sub/3 relies on uniq/1, that has been added as well; this should be a non-issue as the line count of the revised builtin.jq has been significantly reduced overall.

Also, _nwise/1 has been tweaked to take advantage of TCO.

The primary purpose of this commit is to rectify most problems with
gsub (and also sub with "g"), and in particular
#1425 (\b), #2354 (lookahead), and #2532 (regex == "^(?!cd ).*$|^cd "; ""))

The revision also partly resolves #2148 and #1206 in that gsub no
longer loops infinitely, but because the new gsub depends on
match(_;"g"), the behavior when regex == "" is sometimes non-standard.

Since the new sub/3 relies on uniq/1, that has been added as well;
this should be a non-issue as the line count of the revised builtin.jq
has been significantly reduced overall.

Also, _nwise/1 has been tweaked to take advantage of TCO.
space-efficiency enhancement
The prior attempt to enhance _nwise was botched.
@pkoppstein pkoppstein requested a review from itchyny June 21, 2023 03:08
Copy link
Contributor

@itchyny itchyny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my late response, but I reply in short.

  • This PR changes the behavior of "a" | sub("a"; "b","c") and "aaa" | gsub("a"; "b","c"). We don't want to change these behaviors just to fix the empty matching issue.
  • Why do you need to uniq the matches? I confirmed that make check passed without it, but could you give any examples that require uniq here?
  • Having unique/0 and uniq/1 is confusing. Also maybe we will need unique_by/2 too? If you add a public (oops, poor nwise!) filter, needs tests for itself, and manual updates.
  • I want to fix space inconsistency in the test case queries but this is just a nitpick.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jun 25, 2023

@itchyny - Thanks for your review. Most of the issues you raise can be easily dealt with
if the core issues are resolved, so I'll focus on those.

  1. The functionality of uniq is necessary because of the non-standard behavior of match.
    Specifically, omitting it results in:
Test #23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root issue here is that '"qux" | match("(?=u)"; "g")' emits two identical JSON objects
rather than just one.

  1. uniq mimics the functionality and name of the very standard
    Unix/Linux utility, so amongst the *ix crowd at least, there's little
    room for confusion. For the rest, the confusion should be greatly
    alleviated by the different signatures: uniq/1 versus unique/0

That is, uniq/1 is stream-oriented (like *ix) whereas unique/0 is array-oriented (like its
close relative, sort/0).

  1. The changed behavior of sub("a"; "b","c") is an unintended
    side-effect that unfortunately may not be so easy to resolve.

The reason is that we cannot simply use s as $s | ... as we would
normally do, because the string interpolation that's required in
general must be performed in the context of the result of the
".captures" object, as can be seen from some of the test cases, e.g.

"Abcabc" | gsub("(?<x>.)[^a]*"; "+\(.x)-")

I'll include a possible workaround in my next post,
but in my opinion, it would be acceptable to introduce relatively
minor changes in behavior for the sake of an overwhelmingly
better version of gsub, especially if the behavior
in question was not documented as a feature, and
and especially for a pre-1.7 release.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jun 26, 2023

@itchyny - This is an addendum to the previous post, and is in response to your remark:

... needs tests for itself, and manual updates.

In my (as yet still private) jq repository, I've made changes to the manual including additional examples, and noticed that when I regenerate tests/man.test, the result is quite different from the old man.test, as though that file has not been regenerated recently. This leads to the question:

In general, when changes to manual.yml are made, do you want the man.test file regenerated as well?

Also, there was one line in the regenerated man.test file that was wrong: one occurrence of:

input: 'null'

was mistranslated as None. I spent some time checking the manual.yml file and verified that other occurrences of the exact same line were not mistranslated. Very odd. Have you ever noticed such a problem?

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jun 28, 2023

@itchyny - I seem to have resolved the main issue - the streaming semantics when the second argument of sub or gsub is a stream. I'll do a bit more testing and then will probably replace this PR with another one.

Two questions:
(1) Should I regenerate man.test ?

(2) Using gojq, the query "p" | sub("(?<a>x)"; "a", "b") only produces one result. By the usual jq laws of "regularity", it should produce two. What should jq do?

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jun 29, 2023

I'm closing this PR in favor of #2641

@itchyny - This new PR is intended to address all the issues that have been raised, but please note that it follows gojq in the example above, so e.g. "p" | sub("q"; "a", "b") only produces one result. However, it would only requite a small one-line tweak to get it to produce one result for each value in position 2.

@pkoppstein pkoppstein closed this Jun 29, 2023
pkoppstein added a commit to pkoppstein/jq that referenced this pull request Jun 29, 2023
… uniq(stream)

The primary purpose of this commit (which supercedes PR
jqlang#2624) is to rectify most problems
with `gsub` (and also `sub` with the "g" option), in particular jqlang#1425
('\b'), jqlang#2354 (lookahead), and jqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partly resolves jqlang#2148 and jqlang#1206 in that `gsub` no
longer loops infinitely; however, because the new `gsub` depends
critically on match(_;"g"), the behavior when regex == "" is sometimes
non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact that `sub` and
`gsub` are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than
"aaa" as would be standard.  This is nevertheless better than the
infinite loop behavior of jq 1.6 in this case.

With one exception (as explained in [*2]), the new gsub is implemented
as though match/2 behavior is correct.  That is, bugs in `gsub`
behavior will most likely have their origin in `match/2`.

[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test jqlang#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here is `match`: if `match` is fixed, then gsub would not need `untie`.

The addition of `uniq` as a top-level function should be a non-issue
relative to general concern about builtins.jq bloat: the line count of
the new builtin.jq is significantly reduced overall, and the number of
defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g. jqlang#513 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gsub loops infinitely with "^" or ""
2 participants