refactor(database): Make `zlib`-compressed values more explicit
#4167
base: master
Putting this into The time measurements for this patch will be ongoing soon...
This prevents the escape of a terrible exception about the database having closed the connection in `IDLE_TIMEOUT`...
…, just fall back to the strongest one
Putting this back into draft because there are some weird performance characteristics on large real-world databases. I'll investigate...
Caution: Deferred until #4171 is implemented.
Alright, it looks like this patch, as it stands right now, while correct in isolation, is intractable to execute in a timely manner for large enough databases (think multiple millions of rows). While each individual action often takes less than 1 second, so many records are affected (this is a per-record update that can't be grouped or optimised further) that the migration takes somewhere around 30 minutes per database. That would be fine if you had one database or maybe two, but not if you have something like 50 that each take half an hour and the migrations are executed in series. See #4171.
Motivation
There are several columns that contain arbitrary binary data or arbitrary strings compressed with `zlib`. Unfortunately, this compression and the corresponding decompression are done by hand in the API handlers (mostly `report_server.py` and `mass_store_run.py`). The fact that the columns are compressed is not indicated by the schema file (which usually just refers to them as `Column(Binary)`), and it is sort of a magical happenstance that one can figure out that the value in the database is in fact compressed, only by looking at the places where the column is accessed when handling API requests.
This is bad because during development, it is easy to miss that we're dealing with a compressed value and end up writing patches that emit gibberish over the API.
Moreover, this implicitness of the conversion results in weird output when someone ends up manually inspecting the database contents, either directly or via a dump.
In PostgreSQL, we get a codepoint-encoded hexadecimal representation, which is technically safe, but absolutely unreadable:
SQLite (at least the `sqlite3` CLI on Ubuntu) is actually much worse, because it dumps the full raw buffer to `stdout`. This could kill the terminal in extreme cases; even when it does not, the output is still completely unreadable, only now you are tricked into believing you see something there.
Changes
This patch introduces the `ZLibCompressed` family of column types, which are all implemented as SQLAlchemy type decorators. With them, the type system in the ORM layer is used to transcode between the client-side value (used in Python variables and query expressions) and the database-side value (the `zlib`-compressed equivalent).
The transcoding is transparent to client code: API handlers should no longer need to `import zlib` and accurately compress and decompress during queries. Importantly, because of `zlib`'s type signature (it operates on `bytes`), if the stored value is a string (as opposed to a raw blob), the client code no longer needs to `.encode()` and `.decode()` either: this is taken care of by using the `ZLibCompressedString` type.
In addition, to aid developers and server operators in understanding what is going on inside the database, a small, explicit tag will be affixed to every value that is passed through the newly introduced column types. This header (e.g., `zlib[text,9]@`) is prepended after the compression and is stored in an uncompressed form, and indicates that the actual payload itself is `zlib`-compressed.
Implemented types
- `ZLibCompressedBlob`: This is the core that deals with storing an arbitrary buffer (`bytes`).
- `ZLibCompressedString`: `.encode()`s and `.decode()`s `str` to and from `bytes`, and then delegates to `ZLibCompressedBlob`.
- `ZLibCompressedSerialisable`: Custom (user-defined) serialisation and deserialisation of arbitrary types to and from `str`, which then falls back to `ZLibCompressedString`.
- `ZLibCompressedJSON`: Uses the built-in `json` module's serialisation logic to handle arbitrary objects that can be represented as JSON.
Example
Simply put, the code currently in use follows this simplified pattern:
This will now be simplified to:
A cell will look like this when observed through the dump:
or