Validate that integers are in the proper range while decoding JSON. #7356
SyTest PR up at matrix-org/sytest#860
# Decode to Unicode so that simplejson will return Unicode strings on
# Python 2
try:
    content_unicode = content_bytes.decode("utf8")
except UnicodeDecodeError:
    logger.warning("Unable to decode UTF-8")
    raise SynapseError(400, "Content not JSON.", errcode=Codes.NOT_JSON)
I believe this hunk is not necessary anymore due to Synapse no longer supporting Python 2.
The main problem with this is the existence of historical events (and hence rooms) which do not honour the spec. We'll essentially be breaking the ability to join those rooms, or to backfill in them if the homeserver is already joined. This may be acceptable collateral... but I'm not completely sure it is. We should probably at least check what the failure modes are? Also... does this mean that we end up using the pure-Python JSON parsing implementation in simplejson rather than the optimised C version? That would probably be prohibitive in terms of performance overhead.
I'll have to look into this more; I had only considered being able to load events that have already been stored in your database. I'll test making some broken rooms and see what happens.
I did a little bit of benchmarking and this seems to have a negligible effect on performance, but I am not sure how representative my dataset is. It still uses the C speedups, but the change would cause the parser to bail out of a fast path for converting the string to an int (see this portion of …). Looking at the specification and some sample events, it seems like the vast majority of JSON values are strings, not integers, so I'm hoping the effect of this change would be minimal.
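For readers following the performance discussion, here is a minimal sketch of how a range check can be attached to integer parsing. It uses the standard library `json` module and its `parse_int` hook rather than simplejson, and the names `strict_parse_int` and `loads_strict` are hypothetical; this is an illustration of the general technique, not the code in this PR.

```python
import json

# Bounds assumed from the Matrix canonical JSON rules:
# integers must lie in [-(2**53) + 1, (2**53) - 1].
MAX_INT = 2**53 - 1
MIN_INT = -(2**53) + 1


def strict_parse_int(literal: str) -> int:
    """Hypothetical parse_int hook: reject integers outside the allowed range."""
    value = int(literal)
    if not MIN_INT <= value <= MAX_INT:
        raise ValueError(f"integer out of range: {literal}")
    return value


def loads_strict(data: str):
    """Decode JSON, raising ValueError on any out-of-range integer literal."""
    # The hook is only invoked for integer literals, so payloads that are
    # mostly strings pay very little for the extra check.
    return json.loads(data, parse_int=strict_parse_int)
```

Because the hook only runs for integer literals, the extra cost should scale with the number of integers in the payload rather than with its total size, which is consistent with the observation above that most JSON values in events are strings.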
@richvdh and I discussed this a bit in #synapse-dev:matrix.org and there are a few major issues with it:
I'm going to close this PR and attempt a different approach.
Per the Matrix specification, canonical JSON needs to limit integers in JSON to the range [-(2^53) + 1, (2^53) - 1] to match RFC 7159.
This PR enforces this restriction in Synapse for all "incoming" JSON data that is deserialized by Synapse, including:
This is mostly done by modifying the core areas that parse JSON from HTTP queries and HTTP responses.
This does NOT make any changes to:
My hope is that this will mean nothing in Synapse will break if a database already has bad data in it.
Note that I do not believe this check really belongs in the "canonicaljson" package, since the APIs should not be using these invalid values anyway. After this change, any time we generate a canonicaljson value it should be after the JSON has been validated, so it should already abide by these rules.
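As a rough illustration of the "validate first, then generate canonical JSON" ordering, the sketch below checks an already-decoded Python structure before it would be handed to an encoder. The function name `validate_int_range` is hypothetical and this is not Synapse or canonicaljson code; it only assumes the integer bounds stated in the Matrix specification.

```python
from typing import Any

# Bounds assumed from the Matrix canonical JSON rules.
MAX_INT = 2**53 - 1
MIN_INT = -(2**53) + 1


def validate_int_range(value: Any) -> None:
    """Raise ValueError if any integer nested inside `value` is out of range."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        if not MIN_INT <= value <= MAX_INT:
            raise ValueError(f"integer out of range: {value}")
    elif isinstance(value, dict):
        for item in value.values():
            validate_int_range(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            validate_int_range(item)
    # Strings, floats, and None are left to other checks.
```

Calling such a check at the point where incoming data is deserialized would mean that anything later passed to a canonical JSON encoder has already been vetted, so the encoder itself would not need to repeat it.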