refactor(database): Make `zlib`-compressed values more explicit #4167

whisperity · 2024-02-14T18:36:22Z

Motivation

There are several columns that contain arbitrary binary data or arbitrary strings compressed with zlib. Unfortunately, this compression and the corresponding decompression is done by hand in the API handlers (mostly report_server.py and mass_store_run.py). The fact that the columns are compressed is not indicated by the schema file (usually just referring to them as Column(Binary)), and it is sort of a magical happenstance that one can figure out that the value in the database is in fact something compressed, only by looking at the places the column is accessed when handling API.

This is bad because during development, it is easy to make the mistake of not knowing that we're dealing with a compressed value and end up writing patches that emit gibberish over the API.

Moreover, this implicitness of the conversion results in weird output when someone ends up manually inspecting the database contents, either directly or via a dump.

In PostgreSQL, we get a codepoint-encoded hexadecimal representation, which is technically safe, but absolutely unreadable:

  612 | \x78dab591b14e03310c865f252fe0466d255455624095d85840cca79ccfbda64de2602745c7d313ca2195816e0c59fecfc9673b360b1f2df24078203c91a0b7bb970788e4b40a454a4561107f2681d57ab3be2eb57df56
1b0bb96ece624f8dee6a91c38fdaaec9063e46431f8459e8c4b2e4c1f646c6074c16612f55a9aaa1bf8f2cccd0eae41f7ce72d2ec905a1cd135dfd0c431fb4017a94b832e8ecac9c071796780ff4d8a9d90d6d0705fc7b6d5445093
ba3dc1be262c9e931a783300f3f4a206834b23143f4c2d46974b337ce3d61a702db99646d43919d5d812f3d7d1bc8c1b599d1ba1e4fa40e68610e61f00e4b4f7e39572fbe7aded336596f2c442af17f4f843ee8b54fa043edfd30d

SQLite (at least the sqlite3 CLI on Ubuntu) is actually much worse, because it dumps to stdout the full raw buffer. This could kill the terminal in extreme cases, but it is still completely unreadable, but now you are tricked into believing you see something there.

593|xڵ�Knð
          D��
             0B9B`(눖 /�܄=}
                          4���낀 8C̴X<ᅰ"ƠOɌ�(7ϡ�^�����繄 oO���J
�F +<IuH*'tʳ                                               g[�v)y��챤 T�Ŷu1.��|������K���ޗG̟
N5Dz@] ㋛ �򯐬�d�j+�)

Changes

This patch introduces the ZLibCompressed family of column types which are all implemented as SQLAlchemy type decorators. With them, the type system in the ORM layer is used to transcode the client-side value (used in Python variables and query expressions) and the database-side value (the zlib-compressed equivalent).

The transcoding is transparent to client code: API handlers should no longer need to import zlib and accurately compress and decompress during queries. Importantly due to zlib's type signature, if the stored value is a string (as opposed to a raw blob), the client code no longer needs to .encode() and .decode() either: this is taken care of by using the ZLibCompressedString type.

In addition, to aid developers and server operators in understanding what is going on inside the database, a small, explicit tag will be affixed to every value that is passed through the newly introduced column types. This header (e.g., zlib[text,9]@) is prepended after the compression and is stored in an uncompressed form, and indicates that the actual payload itself is zlib-compressed.

Implemented types

ZLibCompressedBlob: This is the core that deals with storing an arbitrary buffer (bytes).
ZLibCompressedString: .encode() and .decode() str to and from bytes, and then piggybacks down to ZLibCompressedBlob.
ZLibCompressedSerialisable: Custom (user-defined) serialisation and deserialisation of arbitrary types to and from str, then falls back ZLibCompressedString.
- ZLibCompressedJSON: Using the built-in json module's serialisation logic to handle arbitrary objects that can behave as if they were JSONs.

Example

Simply speaking, the currently used codes follow this simplified format:

import zlib

class Entity:
  field = Column(Binary)  # Actually a string, compressed with ZLib, hopefully in every case...

def store(value: str):
  db.add(Entity(field=zlib.compress(str.encode())))

def query(value_filter: str) -> List[str]:
  filter_value_in_db = zlib.compress(str.encode())

  return list(map(lambda ent: zlib.decompress(ent.field).decode(),
    db.query(Entity)
      .filter(Entity.field == filter_value_in_db)
      .all()
  ))

will now be trivially simplified to:

# Note: no import or use of zlib module!

class Entity:
  field = Column(ZLibCompressedString)  # No "actually!" here, this is explicit!

def store(value: str):
  db.add(Entity(field=str))

def query(value_filter: str) -> List[str]:
  return list(
    db.query(Entity)
      .filter(Entity.field == value_filter)
      .all()
  )

A cell will look like this when observed through the dump:

593|zlib[text,9]@xڵ�Knð
                       D��
                          0B9B`(눖 /�܄=}
                                       4���낀 8C̴X<ᅰ"ƠOɌ�(7ϡ�^�����繄 oO���J
�F +<IuH*'tʳ                                                            g[�v)y��챤 T�Ŷu1.��|������K���ޗG̟
N5Dz@] ㋛ �򯐬�d�j+�)

or

593 | zlib[text,9]@x\332\265\217Kn\3030\014D\257\242\0130B\033\247Y\0279B\017`(\024\353\310\326/\244\334\304=}\3514\013\257\272\353\202\2008C\314\033\331\312e\264X<\341\205p"\306`O\037\357\220\310\311\314\224(7\001\317\341\213\030^\367]\267=\265\3479DoO\252\234\236J\014g[\227v)y\277\275\354\261\244T\262\305\030vu1.\273\270|\223\261\261\240\213\266\022K\220\246\250\336\227G\314\237\015\266F\177+<IuH*'t\312\363\012N5Dz@]\366\262\033\245d\003\343\313\233\201\362oP\354\231d\216j+\223)\000\362\232\357\000\211[w0p5\000\317\237\263\030\214.\017\320\202_TFW\233\246\377\332Z\013\312\334\352\334\324\021\347x\020c[\252\353\034\247\341&\323\365h\200\314\232\013t\257\335\001\320\2007\236>\235\362\327\247P\226\320\264\375\272\320\275\255\305\177\000\205\365\264F

…migrations

whisperity · 2024-02-19T15:14:00Z

Putting this into 6.23.2 so far, because this should not be merged together with #4089 to ensure that the migrations that need to be performed when the server is upgraded to not take too much time.

The time measurements for this patch will be ongoing soon...

…lumns...

This prevents the escape of a terrible exception about the database having closed the connection in `IDLE_TIMEOUT`...

…, just fall back to the strongest one

whisperity · 2024-02-22T17:00:29Z

Putting this back into draft because there are some weird performance characteristics on large real-world databases. I'll investigate...

whisperity · 2024-02-25T14:35:07Z

Caution

Deferred until #4171 is implemented.

Alright, it looks like this patch, as it stands right now, while correct in isolation, is intractable to execute in a timely manner for large enough (think multiple millions of rows) databases. While each individual action often takes less than 1 second, the number of records affected (because this is a per-record update that can't be grouped or optimised further), it becomes somewhere around 30 minutes per database. Which would be fine if you had one database or maybe two, but not if you have like 50 that each take half an hour, and then these migrations are executed in series. See #4171.

whisperity added 2 commits February 14, 2024 19:26

wip(database): ZLibCompressed-types

9dd067b

fix: Generate the tag properly

d1c5817

whisperity added database 🗄️ Issues related to the database schema. server 🖥️ refactoring 😡 ➡️ 🙂 Refactoring code. WIP 💣 Work In Progress labels Feb 14, 2024

whisperity mentioned this pull request Feb 14, 2024

feat(server): Store information about available checkers to the database #4089

Merged

whisperity added 12 commits February 15, 2024 13:36

chore(database): Unit tests for ZLibCompressed

d8c50cc

feat(migration): Infrastructure skeleton for progress logging during …

669fb8c

…migrations

feat(migration): Upgrade & Downgrade logic for zlib columns

c8fbf19

feat(migration): Per record generic migration wrapper

aff4d23

feat(migration): analysis_info.analyzer_command

a0e9568

chore: PyLint...

c292220

chore: PyCodeStyle...

867064f

fix: Exception format, ctor is not LOG()...

4eea972

refactor: analyzer_statistics (version, failed_files)

fa99f19

chore: Type-annotate DB model entity constructors for changed fields

9452e6e

refactor: file_contents (content, blame_info)

f239e94

chore: PyCodeStyle...

7e2719f

whisperity removed the WIP 💣 Work In Progress label Feb 19, 2024

whisperity modified the milestones: release 6.25.0, release 6.23.2 Feb 19, 2024

whisperity marked this pull request as ready for review February 19, 2024 15:14

whisperity requested review from bruntib and vodorok as code owners February 19, 2024 15:14

whisperity requested review from dkrupp and Szelethus February 19, 2024 15:14

fix: Do NOT convert None into "null" first for DBMS-NULLABLE co…

fe85da1

…lumns...

whisperity added 3 commits February 21, 2024 17:19

fix: Do not keep config DB connection alive throughout the migration

bee99bc

This prevents the escape of a terrible exception about the database having closed the connection in `IDLE_TIMEOUT`...

fix: Do not do multiple rounds of compression to find the exact ratio…

e4dd52e

…, just fall back to the strongest one

chore: PyLint...

20704ac

whisperity marked this pull request as draft February 22, 2024 17:00

whisperity added 2 commits February 22, 2024 19:39

refactor: Try to speed the queries a bit, and add profiling printouts

455dedc

chore: Do not query ID column, we already know that value...

f72b5c7

whisperity modified the milestones: release 6.23.2, release 6.25.0 Mar 2, 2024

whisperity marked this pull request as ready for review March 9, 2024 13:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(database): Make `zlib`-compressed values more explicit #4167

refactor(database): Make `zlib`-compressed values more explicit #4167

whisperity commented Feb 14, 2024 •

edited

Loading

whisperity commented Feb 19, 2024 •

edited

Loading

whisperity commented Feb 22, 2024

whisperity commented Feb 25, 2024

refactor(database): Make zlib-compressed values more explicit #4167

Are you sure you want to change the base?

refactor(database): Make zlib-compressed values more explicit #4167

Conversation

whisperity commented Feb 14, 2024 • edited Loading

Motivation

Changes

Implemented types

Example

whisperity commented Feb 19, 2024 • edited Loading

whisperity commented Feb 22, 2024

whisperity commented Feb 25, 2024

refactor(database): Make `zlib`-compressed values more explicit #4167

refactor(database): Make `zlib`-compressed values more explicit #4167

whisperity commented Feb 14, 2024 •

edited

Loading

whisperity commented Feb 19, 2024 •

edited

Loading