
Add the option to iteratively encode JSON. #29

Merged — clokep merged 7 commits into master from clokep/iter-methods on Aug 13, 2020
Conversation

@clokep (Member) commented Aug 6, 2020

This expands the APIs available from canonicaljson to include methods that return iterators (essentially calling iterencode instead of encode).
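For illustration, an iterator-returning variant along these lines might look roughly like the sketch below. This is not the exact code in the PR; the function name comes from the discussion further down, and the encoder settings are assumptions about what the canonical encoder configures.

```python
import json

# Illustrative canonical-encoding settings (compact separators, sorted keys);
# the real module configures and owns its own encoder instance.
_canonical_encoder = json.JSONEncoder(
    ensure_ascii=False,
    separators=(",", ":"),
    sort_keys=True,
)

def iterencode_canonical_json(json_object):
    # Yield the canonical JSON encoding as a series of UTF-8 byte chunks,
    # instead of materialising the whole encoded document at once.
    for chunk in _canonical_encoder.iterencode(json_object):
        yield chunk.encode("utf-8")
```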

@clokep (Member Author) commented Aug 6, 2020

This is related to matrix-org/synapse#6998.

@clokep marked this pull request as ready for review August 6, 2020 18:45
@clokep requested a review from a team August 6, 2020 18:45
@richvdh (Member) left a comment

A few thoughts here.

Nowadays, the speed benefit of setting ensure_ascii=True and fixing up afterwards is much reduced on simplejson, and non-existent on stdlib json (see #9 (comment)), and the thing about U+2028 is incorrect as of simplejson 3.14.0.

I think it would be entirely reasonable to drop the _unascii optimisation, with the expectation that anyone looking for optimal performance should use stdlib json.

I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.

In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)
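For readers unfamiliar with the optimisation being discussed, the trade-off boils down to something like the following stdlib-only illustration; this is not canonicaljson's actual code, just the observable difference between the two ensure_ascii modes.

```python
import json

obj = {"body": "naïve ☃"}

# With ensure_ascii=True the encoder emits \uXXXX escapes, which the
# fix-up step then had to convert back into raw UTF-8 afterwards.
print(json.dumps(obj, ensure_ascii=True))   # {"body": "na\u00efve \u2603"}

# With ensure_ascii=False the encoder emits the non-ASCII text directly,
# so no fix-up pass is needed before encoding to UTF-8.
print(json.dumps(obj, ensure_ascii=False))  # {"body": "naïve ☃"}
```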

@clokep (Member Author) commented Aug 7, 2020

> Nowadays, the speed benefit of setting ensure_ascii=True and fixing up afterwards is much reduced on simplejson, and non-existent on stdlib json (see #9 (comment)), and the thing about U+2028 is incorrect as of simplejson 3.14.0.
>
> I think it would be entirely reasonable to drop the _unascii optimisation, with the expectation that anyone looking for optimal performance should use stdlib json.

Just to make sure I understand -- this would require bumping the minimum version of simplejson to 3.14.0 for correctness? I'll check this as a separate PR.

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.
>
> In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)

We could also define the interface for the new functions as returning strings, not bytes, that would just differ from the old interface. 😄

My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.
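The pitfall being described is ordinary Python name-binding: a name brought in with `from module import x` is bound to whatever object exists at import time, so a later rebinding of the module attribute does not reach the caller. A self-contained sketch with stand-in names (not canonicaljson's actual module layout):

```python
import types

# Stand-in for a module that exposes a module-level encoder object.
fake_module = types.ModuleType("fake_canonicaljson")
fake_module.canonical_encoder = "encoder backed by stdlib json"

def set_json_library(name):
    # Rebinds the module-level attribute to a new encoder object.
    fake_module.canonical_encoder = f"encoder backed by {name}"

# Like "from canonicaljson import _canonical_encoder": the caller's name
# is bound to the object as it exists right now.
encoder = fake_module.canonical_encoder

set_json_library("simplejson")

print(encoder)                        # encoder backed by stdlib json  (stale)
print(fake_module.canonical_encoder)  # encoder backed by simplejson
```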

@clokep (Member Author) commented Aug 7, 2020

> Just to make sure I understand -- this would require bumping the minimum version of simplejson to 3.14.0 for correctness? I'll check this as a separate PR.

From some testing locally this seems to be true. See #30

@clokep (Member Author) commented Aug 10, 2020

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.
> In short: can we avoid the for/yield loops? (maybe we should just make _canonical_encoder and _pretty_encoder public?)

> We could also define the interface for the new functions as returning strings, not bytes, that would just differ from the old interface. 😄

> My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.

I think this is the remaining open question about this PR. I agree that it is probably more efficient to encode one longer string than many shorter ones (although I'll try to do some quick perf checks of this).

@richvdh (Member) commented Aug 10, 2020

> My concern with making _canonical_encoder and _pretty_encoder public is if someone tries to import them, then use the set_json_library function...you have a separate reference to them and they won't be updated properly.

fair point. maybe better to do as you suggest and just define the new functions to return strings. Or possibly: supply two variants: iterencode_canonical_json for symmetry with the existing interface, and iterencode_canonical_json_str for efficiency.

@clokep (Member Author) commented Aug 10, 2020

> I also suspect (without evidence) that it is more efficient to do a single str.encode("utf-8") on a larger result than it is to do many of them on smaller results.

I was curious to benchmark this a bit: https://gist.github.com/clokep/20c7cf34006099120bea5bbbb1c76c97

The results with Python 3.7.7 were:

Running benchmarks...
   encode once (large obj)...
      first run: 0.000002
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.138103)
   encode once (small objs)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137819)
   encode each (large obj)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137533)
   encode each (small objs)...
      first run: 0.000001
      2000000 loops, best of 5: 0.000000 sec per loop (best total 0.137760)

I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!
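For reference, not the linked gist itself, but a minimal sketch of the kind of comparison being made: one str.encode on a joined result against many small encodes whose bytes are joined afterwards.

```python
import timeit

chunks = ['{"key": "value"}'] * 1000

def encode_once():
    # Join the str chunks first, then a single encode on the large result.
    return "".join(chunks).encode("utf-8")

def encode_each():
    # Encode each small chunk separately and join the resulting bytes.
    return b"".join(chunk.encode("utf-8") for chunk in chunks)

for fn in (encode_once, encode_each):
    best = min(timeit.repeat(fn, number=2000, repeat=5))
    print(f"{fn.__name__}: best of 5 runs of 2000 loops: {best:.6f}s")
```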

@richvdh (Member) commented Aug 11, 2020

> I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!

interesting. slightly surprising. In that case maybe we just stick with what you've got. Up to you, really.

@clokep (Member Author) commented Aug 11, 2020

> I think the tl;dr is that it doesn't matter too much which way we do it? Not sure if this changes your opinion or not!

> interesting. slightly surprising. In that case maybe we just stick with what you've got. Up to you, really.

I think keeping the return type consistent is nice. We can use map() or something if you specifically want to avoid the yield loop, but I think perf wise they should be similar.
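For comparison, the map()-based form mentioned here might look something like this sketch; `_canonical_encoder` again stands in for the module's private encoder, and the settings shown are assumptions.

```python
import json

_canonical_encoder = json.JSONEncoder(
    ensure_ascii=False, separators=(",", ":"), sort_keys=True
)

def iterencode_canonical_json(json_object):
    # Same byte-chunk iterator as the for/yield version, but expressed
    # with map() instead of an explicit generator loop.
    return map(lambda chunk: chunk.encode("utf-8"),
               _canonical_encoder.iterencode(json_object))
```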

@clokep requested a review from @richvdh August 11, 2020 18:30
@richvdh (Member) left a comment

lgtm!

@clokep merged commit e40bd75 into master Aug 13, 2020
@clokep deleted the clokep/iter-methods branch August 13, 2020 17:12