faster/simpler version of transpose, with stream-oriented min and max #2679

Closed
pkoppstein wants to merge 2 commits

@pkoppstein pkoppstein commented Jul 9, 2023

Define stream-oriented min/1 max/1 so that a much faster and much shorter implementation of transpose can be provided, while preserving symmetry between min and max.

(Nothing in this change would preclude adding stream-oriented versions of min_by and max_by in the future.)

See #2472

@pkoppstein pkoppstein requested a review from itchyny July 9, 2023 07:56
src/builtin.jq Outdated
@@ -6,6 +7,8 @@ def sort_by(f): _sort_by_impl(map([f]));
def group_by(f): _group_by_impl(map([f]));
def unique: group_by(.) | map(.[0]);
def unique_by(f): group_by(f) | map(.[0]);
def min(s): reduce s as $x (first(s); if $x < . then $x else . end);
def max(s): reduce s as $x (first(s); if $x > . then $x else . end);

@itchyny itchyny Jul 9, 2023

Using the stream twice causes a problem with input. For example, users will expect echo 1 | jq -n 'min(input)' to behave the same as echo 1 | jq -n '[input] | min', but the former yields a break error.
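
To make the concern concrete, here is a small sketch, assuming a POSIX shell and a jq binary on the PATH. The name min_twice is hypothetical; it copies the draft min(s) from the diff above. Three inputs are supplied so that the second read succeeds instead of failing, which makes the extra consumption visible in the result.

```shell
# Hypothetical reproduction: min_twice mirrors the draft min(s) from this PR.
# The argument is evaluated once in first(s) and again as the reduce source.
draft=$(printf '3\n1\n2\n' | jq -n '
  def min_twice(s): reduce s as $x (first(s); if $x < . then $x else . end);
  min_twice(input)')

# Reference spelling: collect the single input into an array, then use min/0.
reference=$(printf '3\n1\n2\n' | jq -n '[input] | min')

echo "draft:     $draft"      # min_twice read two inputs (3, then 1)
echo "reference: $reference"  # [input] read only the first input (3)
```

The two spellings disagree because the draft reads a second value from standard input that the caller never asked for.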


@nicowilliams nicowilliams Jul 9, 2023

If the thunk s is side-effect-free, then this is OK, but a) we do have some side-effect-having builtins as @itchyny points out, b) if and when we add an FFI this will become more of a problem.

I suggest: def min(xs): reduce xs as $x ([]; if length == 0 then [$x] elif $x < .[0] then setpath([0]; $x) else . end) | select(length > 0)[0];. Similarly for max/1. We use this reduce xs as $x ([]; ...) trick where if the reduction state ends up being the empty array then that means there were no xs (xs was empty).
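
The suggested definition can be tried as follows. This is a sketch, assuming a POSIX shell and jq on the PATH. The name min_once is hypothetical, chosen so that nothing shadows an existing builtin; the body is the definition quoted above, only reindented.

```shell
# The single-pass definition suggested above, under the hypothetical name min_once.
min_once_def='
  def min_once(xs):
    reduce xs as $x ([];
      if length == 0 then [$x]
      elif $x < .[0] then setpath([0]; $x)
      else . end)
    | select(length > 0)[0];'

# The stream argument is evaluated exactly once, so only one input is read.
one_input=$(printf '3\n1\n2\n' | jq -n "$min_once_def min_once(input)")

# Wrapping each call in [...] makes the "no output" case visible as [].
edge_cases=$(jq -nc "$min_once_def [min_once(1, 2, 0.1)], [min_once(empty)], [min_once(null)]")

echo "$one_input"
echo "$edge_cases"
```

The empty-array state distinguishes "no items seen" from "the minimum is null", which is why min_once(null) still yields null while min_once(empty) yields nothing.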

BTW, I like the Haskell convention of using one letter for the item and "pluralizing" it to get the name of the stream we're iterating. xs is the plural of "x" ($x). I will be adopting that convention in my jq code.

Contributor

And not evaluating twice to produce the first $x also improves performance even for side-effect-free xs. Probably not much in most cases, but hey.

min(1,2,0.1)
null
0.1

Contributor

Add tests with max, and also with empty as its argument.

Contributor Author

Done.

src/builtin.jq Outdated
| reduce range(0; $max) as $j
([]; . + [reduce range(0;$length) as $i ([]; . + [ $in[$i][$j] ] )] )
end;
def transpose: [range(0; max(.[]|length)) as $i | [.[][$i]]];
Contributor

This is very readable. Thanks!

It turns out that the boxing technique is noticeably faster than using setpath.

Also revise transpose to use max/0 as that is faster than using the new max(s).
@@ -6,6 +7,19 @@ def sort_by(f): _sort_by_impl(map([f]));
def group_by(f): _group_by_impl(map([f]));
def unique: group_by(.) | map(.[0]);
def unique_by(f): group_by(f) | map(.[0]);
# max(s) and min(s) use boxing technique for the sake of `input`:
def max(s):
reduce (s|[.]) as $x (null;
Contributor

The version in my earlier comment didn't allocate an array for every $x.

Contributor Author

Yes, but setpath can slow it down, as it did in my tests. Fixing input would be so much nicer!

Contributor

What do you mean "fixing input would be so much nicer"?

As for setpath... we're using it on a zero-or-one element array, so it should not allocate at all after the first time... I'll check that out later.

itchyny commented Jul 9, 2023

Add a test that min(null) yields null.

@pkoppstein

@itchyny @nicowilliams - Thanks for your reviews. I've done some performance comparisons, and have updated this branch accordingly.

Specifically, according to my one-platform timings:

  • it turns out that the simple boxing technique for min(s) and max(s) is faster than using setpath;
  • it turns out that using max/0 is faster than using max(s) after all.

So transpose now uses max/0, which still makes it MUCH faster than the old (1.6) transpose, but it does
mean that max/1 is not actually needed.
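
For readers who want to see the shape of a max/0-based transpose, here is a sketch, assuming a POSIX shell and jq on the PATH. The name transpose_sketch is hypothetical, and the body is an illustration of the idea rather than the exact text of the updated diff. The `// 0` fallback is an assumption added so that an empty outer array, where max/0 yields null, produces an empty result.

```shell
transposed=$(jq -nc '
  # Hypothetical name; a sketch of a max/0-based transpose, not the PR diff.
  # The widest row decides how many output rows are built.
  def transpose_sketch:
    [range(0; map(length) | max // 0) as $i | [.[][$i]]];
  ([[1, 2], [3]] | transpose_sketch),
  ([]            | transpose_sketch)')
echo "$transposed"
```

Short rows are padded with null because indexing an array past its end yields null in jq.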

In the spirit of supporting stream-orientation, though, I propose
leaving min/1 and max/1 in. If and when input is fixed ( :-) ), we can revisit the implementation.

Should there be an ER to change input so that it just emits empty at EOS? The current behavior gives rise to so many headaches....

itchyny commented Jul 9, 2023

The last part can be .[]; or is using select and .[0] faster?

@pkoppstein

@itchyny wrote:

The last part can be .[]

But we have to take care of the case of null. (null[] is an error.)
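
The null case can be seen in a short sketch, assuming a POSIX shell and jq on the PATH. The name boxmax_sketch is hypothetical. It follows the boxed reducer discussed in this thread, but with an explicit else branch so that it also parses on jq 1.6.

```shell
boxed=$(jq -nc '
  # Hypothetical name; each item is boxed as [item], and the state starts as null.
  # After the reduce, the state is null (empty stream) or a one-element array.
  def boxmax_sketch(s):
    reduce (s | [.]) as $x (null;
      if . == null or $x > . then $x else . end)
    | select(.)[0];
  [boxmax_sketch(1, 3, 2)], [boxmax_sketch(null)], [boxmax_sketch(empty)]')
echo "$boxed"

# Iterating null is an error in jq, which is why a bare .[] is not enough here.
if jq -n 'null | .[]' >/dev/null 2>&1; then iterate_null=ok; else iterate_null=error; fi
echo "$iterate_null"
```

Because a boxed null is the array [null], which is truthy, select(.) only discards the untouched null state and a genuine null maximum survives.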

@nicowilliams

it turns out that the simple boxing technique for min(s) and max(s) is faster than using setpath;

That's rather surprising.

@pkoppstein

@nicowilliams wrote:

That's rather surprising

As I mentioned somewhere, I was just using some simple test cases.

As also mentioned, I'm hoping this kerfuffle will lead to a rethinking of input, which is as best I can tell the sole source of actual and immediate concerns.

@nicowilliams

As also mentioned, I'm hoping this kerfuffle will lead to a rethinking of input, which is as best I can tell the sole source of actual and immediate concerns.

There is no problem with input (or inputs for that matter). Yes, it has a side-effect, and that side-effect is all-important.

def max(s):
reduce (s|[.]) as $x (null;
if . == null then $x
else if $x > . then $x end # for speed
Contributor

Omitting the else clause is a new feature, so perhaps we shouldn't use it in builtins? I'm OK with this anyway.

pkoppstein commented Jul 10, 2023

@nicowilliams wrote:

There is no problem with input (or inputs for that matter).

That's an odd thing to say here, as the whole basis of this kerfuffle is the difference between the two. It would simplify so many things if input had been specified to behave more like inputs.

However, I hold little hope that input will change any time soon. But is it too late to revisit isempty? If it was free of side-effects, it would be much more useful. Could that be accomplished for 1.7?

@nicowilliams

@nicowilliams wrote:

There is no problem with input (or inputs for that matter).

That's an odd thing to say here, as the whole basis of this kerfuffle is the difference between the two. It would simplify so many things if input had been specified to behave more like inputs.

Both input and inputs have side-effects, therefore you wouldn't want to have min(input), min(inputs), max(input), max(inputs), etc. evaluate the stream more than once.

However, I hold little hope that input will change any time soon. But is it too late to revisit isempty? If it was free of side-effects, it would be much more useful. Could that be accomplished for 1.7?

There is no need for input to change, so it won't. isempty/1 only has side-effects if the given thunk has side-effects, so be careful how you use it, but that's not a problem with isempty/1 itself, and I can't see how we could "fix" it. The key is that jq builtins should not do multiple evaluation of their argument thunks.

@nicowilliams

Before I added input and inputs, jq only had pure functions, therefore multiple evaluation was safe -if wasteful- then. Now that a few builtins have side-effects (input, inputs, debug, stderr, halt, halt_error), multiple evaluation is not safe for jq builtins to do.

When there is no choice but to do multiple evaluation that should be documented in the builtin's docs. In this case there is an alternative to multiple evaluation.

Now, a pure-functions-only jq was very useful, and a jq that has impure functions is a bit harder to use in some cases -like this one-, but the bit of impurities I added allow an online alternative to jq -s which is jq -n with a program that uses input and/or inputs. An online alternative to jq -s was critical to being able to reduce over some (or all) inputs without first having to slurp all inputs. Essentially jq -n w/ a jq program that uses input/inputs is a lot like awk with BEGIN/END, but in a more generic way. I spent a lot of time figuring out how to do this, and I do believe that the side-effect-having input/inputs was the best solution -- I did explore adding awk-like BEGIN/END programs, and it felt very unnatural.

I admit that losing purity is sad, but mature functional programming languages generally have ways to have side-effects. The trick is generally to segregate those (like IO in Haskell) so that most code can be pure, but some code can be impure.

What I'd like to do about this at some point is have function attributes indicating whether a function is pure, and then we could have an ispure/1 that indicates whether a given expression is pure. Then we'd mark input and inputs as impure. It would also be nice to be able to declare a formal argument as having to be pure. Hopefully we can get that done at some point. In the meantime, what we need to do as maintainers, is avoid multiple evaluation in jq builtins.

pkoppstein commented Jul 10, 2023

@nicowilliams wrote:

There is no need for input to change

Yes, I understand that my original proposal for min/1 and max/1 is fundamentally incompatible with input,
and that we can't have a builtin that has builtin surprises. Thanks.

pkoppstein commented Jul 10, 2023

@nicowilliams - FYPI:

These are the u+s timings I got for nicomax (your original version), a setpath-free version thereof, and the "boxed" version. All three defs are shown below.

"nicomax(range(0;1000000))"
user	0m1.795s
sys	0m0.016s

"simplifiedmax(range(0;1000000))"
user	0m1.280s
sys	0m0.013s

"boxmax(range(0;1000000))"
user	0m0.924s
sys	0m0.008s

The curious thing is that when computing MAX(range(0;$n) | - .),
the ordering by u+s times is the same!


Using jq-1.6-226-g7d424fd-dirty

def nicomax(xs):
  reduce xs as $x ([];
    if length == 0 then [$x] elif $x > .[0] then setpath([0]; $x) else . end)
  | select(length > 0)[0];

# simplified nicomax
def simplifiedmax(xs):
  reduce xs as $x ([];
    if length == 0 or $x > .[0] then [$x] else . end)
  | select(length > 0)[0];

def boxmax(s):
  reduce (s|[.]) as $x (null;
    if . == null or $x > . then $x end )
  | select(.)[0];

@nicowilliams

Sorry, but I don't understand what is so important about input raising an error on EOS.

That's not the issue. The issue is that input and inputs consume inputs. This example should help:

printf '1\n2\n3\n' | jq -n '"Here is the first input: \(input)","Here is the second input: \(input)"'

Now, suppose you have a function min/1 whose argument is inputs:

printf '1\n5\n3\n'|./jq -n '
def min(s):
  reduce (s|[.]) as $x (null;
    if . == null then $x
    else if $x < . then $x end # for speed
    end )
  | select(.)[0];
min(inputs)'

But now suppose that min/1 evaluated s once for one output, then again for all outputs, as your min/1 had it earlier... then the above would print 3, not 1!

End of stream has nothing to do with anything. It's the side-effect of reading from the input stream. It's a lot like getc() in C, but at least in C there's an ungetc() whereas jq does not have a way to undo the side-effects of any of its very few side-effect-having builtins.

@pkoppstein

@nicowilliams - Rest assured, I understand that my original proposal for min/1 and max/1 is fundamentally incompatible with input, and that we can't have a builtin that has builtin surprises. Thanks.

itchyny commented Jul 11, 2023

After all, what's the benefit of using reduce? On my laptop, the reduce definition of min runs 2.5 times slower than the following definition.

def min(s): [s] | if . != [] then min else empty end;
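
A quick way to exercise this collect-then-min/0 definition is sketched below, assuming a POSIX shell and jq on the PATH. The name min_collect is hypothetical, used only to avoid shadowing anything; the body is the one-liner above.

```shell
collected=$(printf '3\n1\n2\n' | jq -nc '
  # Hypothetical name; the stream is evaluated once, into an array,
  # and the C-coded min/0 does the comparison work.
  def min_collect(s): [s] | if . != [] then min else empty end;
  [min_collect(inputs)], [min_collect(empty)]')
echo "$collected"
```

The trade-off is memory: the whole stream is materialized before min/0 runs, whereas the reduce-based versions keep only the current best item.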

@nicowilliams

After all, what's the benefit of using reduce? On my laptop, the reduce definition of min runs 2.5 times slower than the following definition.

def min(s): [s] | if . != [] then min else empty end;

That's pretty funny, and sad. Surely there's some number of outputs of s at which this gets slower. Reduce ought to be fast...

@nicowilliams

Well, and of course min/0 is C-coded...

pkoppstein commented Jul 12, 2023

@itchyny wrote:

what's the benefit of using reduce?

Indeed, in the case of transpose, I reverted to using max/0.

And indeed, jq's reduce is lamentably slow.

But I thought it would be nice to leave the stream-oriented min and max in because, as @nicowilliams has often elaborated/explained/emphasized, stream-orientation is a goal in itself, e.g. to avoid memory issues. Consider also the potential short-circuiting possibilities. And the overhead of boxing is not so great.


Postscript re gojq:

It's interesting that reduce is faster than [] in gojq. So perhaps there's hope for jq after all.

Here are some timings, with apologies for the slightly cryptic presentation. (All the programs use range(0; $n) for 10000000 as $n.)

gojq []
99999999
user 0m55.905s
sys 0m7.554s

gojq max/0
99999999
user 1m0.376s
sys 0m8.268s

gojq max using reduce naively
99999999
user 0m42.617s
sys 0m1.067s

@pkoppstein

This PR is superseded by #2758
which only alters transpose.

Streaming versions of min/max can be the subject of a future PR.
