
Add the option to iteratively encode JSON. #29

Merged — clokep merged 7 commits into master from clokep/iter-methods on Aug 13, 2020
Conversation

@clokep (Member) commented Aug 6, 2020

This expands the APIs available from canonicaljson to include methods that return iterators (essentially calling iterencode instead of encode).
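For illustration, an iterator-returning variant along these lines might look roughly like the sketch below. This is not the exact code in the PR; the function name comes from the discussion further down, and the encoder settings are assumptions about what the canonical encoder configures.

```python
import json

# Illustrative canonical-encoding settings (compact separators, sorted keys);
# the real module configures and owns its own encoder instance.
_canonical_encoder = json.JSONEncoder(
    ensure_ascii=False,
    separators=(",", ":"),
    sort_keys=True,
)

def iterencode_canonical_json(json_object):
    # Yield the canonical JSON encoding as a series of UTF-8 byte chunks,
    # instead of materialising the whole encoded document at once.
    for chunk in _canonical_encoder.iterencode(json_object):
        yield chunk.encode("utf-8")
```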

@clokep (Member Author) commented Aug 6, 2020

This is related to matrix-org/synapse#6998.

@clokep marked this pull request as ready for review August 6, 2020 18:45
@clokep requested a review from a team August 6, 2020 18:45
@richvdh (Member) left a comment

A few thoughts here.

Nowadays, the speed benefit of setting ensure_ascii=True and fixing up afterwards is much reduced on simplejson, and non-existent on stdlib json (see #9 (comment)), and the thing about U+2028 is incorrect as of simplejson 3.14.0.

I think it would be entirely reasonable to drop the _unascii optimisation, with the expectation that anyone looking for optimal performance should use stdlib json.

I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.

In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)
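For readers unfamiliar with the optimisation being discussed, the trade-off boils down to something like the following stdlib-only illustration; this is not canonicaljson's actual code, just the observable difference between the two ensure_ascii modes.

```python
import json

obj = {"body": "naïve ☃"}

# With ensure_ascii=True the encoder emits \uXXXX escapes, which the
# fix-up step then had to convert back into raw UTF-8 afterwards.
print(json.dumps(obj, ensure_ascii=True))   # {"body": "na\u00efve \u2603"}

# With ensure_ascii=False the encoder emits the non-ASCII text directly,
# so no fix-up pass is needed before encoding to UTF-8.
print(json.dumps(obj, ensure_ascii=False))  # {"body": "naïve ☃"}
```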

@clokep (Member Author) commented Aug 7, 2020

> Nowadays, the speed benefit of setting ensure_ascii=True and fixing up afterwards is much reduced on simplejson, and non-existent on stdlib json (see #9 (comment)), and the thing about U+2028 is incorrect as of simplejson 3.14.0.
>
> I think it would be entirely reasonable to drop the _unascii optimisation, with the expectation that anyone looking for optimal performance should use stdlib json.

Just to make sure I understand -- this would require bumping the minimum version of simplejson to 3.14.0 for correctness? I'll check this as a separate PR.

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.
>
> In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)

We could also define the interface for the new functions as returning strings, not bytes, that would just differ from the old interface. 😄

My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.
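The pitfall being described is ordinary Python name-binding: a name brought in with `from module import x` is bound to whatever object exists at import time, so a later rebinding of the module attribute does not reach the caller. A self-contained sketch with stand-in names (not canonicaljson's actual module layout):

```python
import types

# Stand-in for a module that exposes a module-level encoder object.
fake_module = types.ModuleType("fake_canonicaljson")
fake_module.canonical_encoder = "encoder backed by stdlib json"

def set_json_library(name):
    # Rebinds the module-level attribute to a new encoder object.
    fake_module.canonical_encoder = f"encoder backed by {name}"

# Like "from canonicaljson import _canonical_encoder": the caller's name
# is bound to the object as it exists right now.
encoder = fake_module.canonical_encoder

set_json_library("simplejson")

print(encoder)                        # encoder backed by stdlib json  (stale)
print(fake_module.canonical_encoder)  # encoder backed by simplejson
```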

@clokep (Member Author) commented Aug 7, 2020

> Just to make sure I understand -- this would require bumping the minimum version of simplejson to 3.14.0 for correctness? I'll check this as a separate PR.

From some testing locally this seems to be true. See #30

@clokep (Member Author) commented Aug 10, 2020

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.
> In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)

> We could also define the interface for the new functions as returning strings, not bytes, that would just differ from the old interface. 😄

> My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.

I think this is the remaining open question about this PR. I agree that it is probably more efficient to encode one longer string than many shorter ones (although I'll try to do some quick perf checks of this).

@richvdh (Member) commented Aug 10, 2020

> My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.

fair point. maybe better to do as you suggest and just define the new functions to return strings. Or possibly: supply two variants: iterencode_canonical_json for symmetry with the existing interface, and iterencode_canonical_json_str for efficiency.

@clokep (Member Author) commented Aug 10, 2020

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.

I was curious to benchmark this a bit: https://gist.github.com/clokep/20c7cf34006099120bea5bbbb1c76c97

The results with Python 3.7.7 were:

Running benchmarks...
   encode once (large obj)...
      first run: 0.000002
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.138103)
   encode once (small objs)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137819)
   encode each (large obj)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137533)
   encode each (small objs)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137760)

I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!
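For reference, not the linked gist itself, but a minimal sketch of the kind of comparison being made: one str.encode on a joined result against many small encodes whose bytes are joined afterwards.

```python
import timeit

chunks = ['{"key": "value"}'] * 1000

def encode_once():
    # Join the str chunks first, then a single encode on the large result.
    return "".join(chunks).encode("utf-8")

def encode_each():
    # Encode each small chunk separately and join the resulting bytes.
    return b"".join(chunk.encode("utf-8") for chunk in chunks)

for fn in (encode_once, encode_each):
    best = min(timeit.repeat(fn, number=2000, repeat=5))
    print(f"{fn.__name__}: best of 5 runs of 2000 loops: {best:.6f}s")
```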

@richvdh (Member) commented Aug 11, 2020

> I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!

interesting. slightly surprising. In that case maybe we just stick with what you've got. Up to you, really.

@clokep (Member Author) commented Aug 11, 2020

> I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!

> interesting. slightly surprising. In that case maybe we just stick with what you've got. Up to you, really.

I think keeping the return type consistent is nice. We can use map() or something if you specifically want to avoid the yield loop, but I think perf wise they should be similar.
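For comparison, the map()-based form mentioned here might look something like this sketch; `_canonical_encoder` again stands in for the module's private encoder, and the settings shown are assumptions.

```python
import json

_canonical_encoder = json.JSONEncoder(
    ensure_ascii=False, separators=(",", ":"), sort_keys=True
)

def iterencode_canonical_json(json_object):
    # Same byte-chunk iterator as the for/yield version, but expressed
    # with map() instead of an explicit generator loop.
    return map(lambda chunk: chunk.encode("utf-8"),
               _canonical_encoder.iterencode(json_object))
```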

@clokep requested a review from @richvdh August 11, 2020 18:30
@richvdh (Member) left a comment

lgtm!

@clokep merged commit e40bd75 into master Aug 13, 2020
@clokep deleted the clokep/iter-methods branch August 13, 2020 17:12