Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split output array length inconsistency #552

Closed
joelpurra opened this issue Aug 15, 2014 · 12 comments
Closed

Split output array length inconsistency #552

joelpurra opened this issue Aug 15, 2014 · 12 comments
Assignees
Labels
Milestone

Comments

@joelpurra
Copy link
Contributor

Splitting a string where the string to split on matches the last character in the input string returns an array that's shorter than other results. Compare with a split on the first character.

echo '"xyz"' | jq 'split("x"), split("y"), split("z")'
[
  "",
  "yz"
]
[
  "x",
  "z"
]
[
  "xy"
]

Shouldn't it return an empty string instead of omitting the last result?

@pkoppstein
Copy link
Contributor

@joelpurra - Hi, Joel. You'll be glad to know, I hope, that the current (regex-based) version of split conforms with your expectations -- at least in this particular case:

$ jq --version
jq-1.4-132-g287e2c9

$ jq -nc '"xyz" | split("z")'
["xy",""]

Another interesting case is: '"x" | "split("x")` -- should that produce an array of length 0 (as does ruby), 1 (jq 1.4) or 2 (regex-based jq)?

The approach currently taken in jq is not to make a "special case" out of any of these edge cases -- i.e. to let the chips fall where they may.

By the way, I hope you will reconsider using a recent version of jq 1.4+. There are so many really useful enhancements, and there's no downside that I know of except having to build it. At worst, you might be tempted to "git pull" at some point :-)

Note to @nicowilliams: We ought to document the changed behavior of split/1 if we retain it as is.

@joelpurra
Copy link
Contributor Author

@pkoppstein: good to know about this change, thanks!

I'd run more recent versions if I wasn't using at least one server system that's out of my control. Given time constraints on finishing my degree, I don't think I'll get to it just yet. Let's hope I'll find a job where I can work with more json data processing, and hopefully help out with some jq development =)

And yes, updated documentation, please. Perhaps running a test suite from 1.4 on the 1.5 code base and vice versa, to spot differences?

@nicowilliams
Copy link
Contributor

I need to revisit the split thing, yes. Also, I do think I screwed up by
not making explode output a stream rather than an array (for a
sufficiently large input string this would make a huge difference).

If we're going to match modules to jq version numbers (and if we add a
logical import-on-jq-version directive) then we might as well permit more
backwards-incompatible breaks anyways. But I'm generally loathe to making
such changes.

(I'm likely to introduce an explodes that outputs a stream, and an
implode/1 that consumes a stream output by a closure.)

@pkoppstein
Copy link
Contributor

@nicowilliams - Did you know that somehow some parts of v1.4+ have escaped into the wild -- http://stedolan.github.io/jq/manual/ includes sections on match, test, sub, gsub ...

On the one hand, this is a bit scary. On the other, it would be great if jq 1.5 could be released sooner rather than later, even if it did not contain everything we'd want.

As for streams, arrays, and backwards compatibility - personally, I'm comfortable with having split and splits, explode and explodes, etc, indefinitely, but having these pairs also provides a path forward that is consistent with stedolan's recommendations regarding the introduction of major changes.

@joelpurra
Copy link
Contributor Author

I'll revisit this, as I just now experienced problems with length and join/1 for the split strings in the original issue. Please note that both the original and this issue have been (re-)tested using jq --version jq-1.5rc1.

jq-1.5rc1

Original split/1 issue from above

split("z") produces a single element array instead of two.

echo '"xyz"' | jq 'split("x"), split("y"), split("z")'
[
  "",
  "yz"
]
[
  "x",
  "z"
]
[
  "xy"
]

length versus join/1 inconsistencies

What happens if I compare length and join/1? Further inconsistencies. Notice the difference in the first example - length 2, but join doesn't have a dot in it. It seems empty strings are counted by length, but not used by join/1.

echo '"xyz"' | jq 'split("x"), split("y"), split("z") | length'
2
2
1
echo '"xyz"' | jq 'split("x"), split("y"), split("z") | join(".")'
"yz"
"x.z"
"xy"

Expected behavior

echo '"xyz"' | jq 'split("x"), split("y"), split("z")'
[
  "",
  "yz"
]
[
  "x",
  "z"
]
[
  "xy",
  ""
]
echo '"xyz"' | jq 'split("x"), split("y"), split("z") | length'
2
2
2
echo '"xyz"' | jq 'split("x"), split("y"), split("z") | join(".")'
".yz"
"x.z"
"xy."

Comparison to javascript

"xyz".split("x")
// ["", "yz"]

"xyz".split("y")
// ["x", "z"]

"xyz".split("z")
// ["xy", ""]
"xyz".split("x").length
// 2

"xyz".split("y").length
// 2

"xyz".split("z").length
// 2
"xyz".split("x").join(".")
// ".yz"

"xyz".split("y").join(".")
// "x.z"

"xyz".split("z").join(".")
// "xy."

As you can see, empty strings are counted and joined.

@joelpurra
Copy link
Contributor Author

It seems @pkoppstein's confirmed fix doesn't work - but guess it's because of #576.

@joelpurra joelpurra reopened this Jan 10, 2015
@nicowilliams
Copy link
Contributor

Thanks for the bug report! keep banging on it. I'll fix this sometime
this coming week, and if i can fix the remaining module system problems in
a couple of hours then we'll have a 1.5rc2 soon enough.

@joelpurra
Copy link
Contributor Author

Released a bugfix package, which will alleviate problems for now.
https://github.com/joelpurra/jq-bugfix-jq-552

import "joelpurra/jq-bugfix-jq-552" as BugfixJq552;

# These tests fail for vanilla split/join, but they are detected and shows debug output.
#
# BugfixJq552::warningSplit: Single string to two empty parts. SHOWS DEBUG OUTPUT.
"a" | BugfixJq552::warningSplit("a") # Expected [ "", "" ], Actually [ "" ]

# BugfixJq552::warningSplit: Single string to two parts z. SHOWS DEBUG OUTPUT.
"xyz" | BugfixJq552::warningSplit("z") # [ Expected "xy", "" ], Actually [ "xy" ]

# BugfixJq552::warningJoin: Single string to two empty parts. SHOWS DEBUG OUTPUT.
[ "", "" ] | BugfixJq552::warningJoin("a") # Expected "a", Actually ""

# BugfixJq552::warningJoin: Two parts x. SHOWS DEBUG OUTPUT.
[ "", "yz" ] | BugfixJq552::warningJoin("x") # Expected "xyz", Actually "yz"


# Instead use the functions which attempt to fix these inconsistencies.
#
# BugfixJq552::attemptFixSplit: Single string to two empty parts.
"a" | BugfixJq552::attemptFixSplit("a") # [ "", "" ]

# BugfixJq552::attemptFixSplit: Single string to two parts z.
"xyz" | BugfixJq552::attemptFixSplit("z") # [ "xy", "" ]

# BugfixJq552::attemptFixJoin: Single string to two empty parts.
[ "", "" ] | BugfixJq552::attemptFixJoin("a") # "a"

# BugfixJq552::attemptFixJoin: Two parts x.
[ "", "yz" ] | BugfixJq552::attemptFixJoin("x") # "xyz"

@joelpurra
Copy link
Contributor Author

@nicowilliams: this fix broke another case, as detected by joelpurra/jq-bugfix-jq-552.

Updated the package with additional tests (including plain split/join tests) and fixes for the current version 60d1ecb. See tests/all.sh for test cases and jq/main.jq for how I worked around them. Might add some more tests later.

These splits now work:

# split: Single string to two empty parts.
jq -n '"a" | split("a")' # Expected [ "", "" ]

# split: Single string to two parts z.
jq -n '"xyz" | split("z")' # [ Expected "xy", "" ]

This split does not work:

# split: Single string split with empty string.
jq -n '"a" | split("")' # Expected [ "a" ], Actually [ "a", "" ]

These joins do not work:

# join: Single string to two empty parts.
jq -n '[ "", "" ] | join("a")' # Expected "a", Actually ""

# join: Two parts x.
jq -n '[ "", "yz" ] | join("x")' # Expected "xyz", Actually "yz"

@nicowilliams
Copy link
Contributor

On Tue, Jan 13, 2015 at 06:56:58AM -0800, Joel Purra wrote:

@nicowilliams: this fix broke another case, as detected by joelpurra/jq-bugfix-jq-552.

Thanks.

Shouldn't splitting on the empty string yield an array of single-character (well, single-codepoint; jq doesn't know about characters) strings, with an empty string in the front and an empty one at the end?

"Obviously" in this case we can just drop the leading and trailing empty strings. But it's not so obvious to me.

This split does not work:

# split: Single string split with empty string.
jq -n '"a" | split("")' # Expected [ "a" ], Actually [ "a", "" ]

Why shouldn't we expect ["","a",""]?

These joins do not work:

# join: Single string to two empty parts.
jq -n '[ "", "" ] | join("a")' # Expected "a", Actually ""
# join: Two parts x.
jq -n '[ "", "yz" ] | join("x")' # Expected "xyz", Actually "yz"

I didn't touch join. I'll take a look. These are definitely bugs.

@joelpurra
Copy link
Contributor Author

@nicowilliams: split seems to work now, thanks!

@joelpurra
Copy link
Contributor Author

@nicowilliams wrote:

Shouldn't splitting on the empty string yield an array of single-character (well, single-codepoint; jq doesn't know about characters) strings, with an empty string in the front and an empty one at the end?

Yes - forgot this was the standard thing for split on an empty string, and didn't get to writing multi-character test cases. "abc" | split("") # ["a","b","c"] Looks good, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants