Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(database): Make zlib-compressed values more explicit #4167

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

whisperity
Copy link
Contributor

@whisperity whisperity commented Feb 14, 2024

Motivation

There are several columns that contain arbitrary binary data or arbitrary strings compressed with zlib. Unfortunately, this compression and the corresponding decompression is done by hand in the API handlers (mostly report_server.py and mass_store_run.py). The fact that the columns are compressed is not indicated by the schema file (usually just referring to them as Column(Binary)), and it is sort of a magical happenstance that one can figure out that the value in the database is in fact something compressed, only by looking at the places the column is accessed when handling API.

This is bad because during development, it is easy to make the mistake of not knowing that we're dealing with a compressed value and end up writing patches that emit gibberish over the API.

Moreover, this implicitness of the conversion results in weird output when someone ends up manually inspecting the database contents, either directly or via a dump.

In PostgreSQL, we get a codepoint-encoded hexadecimal representation, which is technically safe, but absolutely unreadable:

  612 | \x78dab591b14e03310c865f252fe0466d255455624095d85840cca79ccfbda64de2602745c7d313ca2195816e0c59fecfc9673b360b1f2df24078203c91a0b7bb970788e4b40a454a4561107f2681d57ab3be2eb57df56
1b0bb96ece624f8dee6a91c38fdaaec9063e46431f8459e8c4b2e4c1f646c6074c16612f55a9aaa1bf8f2cccd0eae41f7ce72d2ec905a1cd135dfd0c431fb4017a94b832e8ecac9c071796780ff4d8a9d90d6d0705fc7b6d5445093
ba3dc1be262c9e931a783300f3f4a206834b23143f4c2d46974b337ce3d61a702db99646d43919d5d812f3d7d1bc8c1b599d1ba1e4fa40e68610e61f00e4b4f7e39572fbe7aded336596f2c442af17f4f843ee8b54fa043edfd30d

SQLite (at least the sqlite3 CLI on Ubuntu) is actually much worse, because it dumps to stdout the full raw buffer. This could kill the terminal in extreme cases, but it is still completely unreadable, but now you are tricked into believing you see something there.

593|xڵ�Knð
          D��
             0B9B`(눖 /�܄=}
                          4���낀 8C̴X<ᅰ"ƠOɌ�(7ϡ�^�����繄 oO���J
�F +<IuH*'tʳ                                               g[�v)y��챤 T�Ŷu1.��|������K���ޗG̟
N5Dz@] ㋛ �򯐬�d�j+�)

Changes

This patch introduces the ZLibCompressed family of column types which are all implemented as SQLAlchemy type decorators. With them, the type system in the ORM layer is used to transcode the client-side value (used in Python variables and query expressions) and the database-side value (the zlib-compressed equivalent).

The transcoding is transparent to client code: API handlers should no longer need to import zlib and accurately compress and decompress during queries. Importantly due to zlib's type signature, if the stored value is a string (as opposed to a raw blob), the client code no longer needs to .encode() and .decode() either: this is taken care of by using the ZLibCompressedString type.

In addition, to aid developers and server operators in understanding what is going on inside the database, a small, explicit tag will be affixed to every value that is passed through the newly introduced column types. This header (e.g., zlib[text,9]@) is prepended after the compression and is stored in an uncompressed form, and indicates that the actual payload itself is zlib-compressed.

Implemented types

  • ZLibCompressedBlob: This is the core that deals with storing an arbitrary buffer (bytes).
  • ZLibCompressedString: .encode() and .decode() str to and from bytes, and then piggybacks down to ZLibCompressedBlob.
  • ZLibCompressedSerialisable: Custom (user-defined) serialisation and deserialisation of arbitrary types to and from str, then falls back ZLibCompressedString.
    • ZLibCompressedJSON: Using the built-in json module's serialisation logic to handle arbitrary objects that can behave as if they were JSONs.

Example

Simply speaking, the currently used codes follow this simplified format:

import zlib

class Entity:
  field = Column(Binary)  # Actually a string, compressed with ZLib, hopefully in every case...

def store(value: str):
  db.add(Entity(field=zlib.compress(str.encode())))

def query(value_filter: str) -> List[str]:
  filter_value_in_db = zlib.compress(str.encode())

  return list(map(lambda ent: zlib.decompress(ent.field).decode(),
    db.query(Entity)
      .filter(Entity.field == filter_value_in_db)
      .all()
  ))

will now be trivially simplified to:

# Note: no import or use of zlib module!

class Entity:
  field = Column(ZLibCompressedString)  # No "actually!" here, this is explicit!

def store(value: str):
  db.add(Entity(field=str))

def query(value_filter: str) -> List[str]:
  return list(
    db.query(Entity)
      .filter(Entity.field == value_filter)
      .all()
  )

A cell will look like this when observed through the dump:

593|zlib[text,9]@xڵ�Knð
                       D��
                          0B9B`(눖 /�܄=}
                                       4���낀 8C̴X<ᅰ"ƠOɌ�(7ϡ�^�����繄 oO���J
�F +<IuH*'tʳ                                                            g[�v)y��챤 T�Ŷu1.��|������K���ޗG̟
N5Dz@] ㋛ �򯐬�d�j+�)

or

593 | zlib[text,9]@x\332\265\217Kn\3030\014D\257\242\0130B\033\247Y\0279B\017`(\024\353\310\326/\244\334\304=}\3514\013\257\272\353\202\2008C\314\033\331\312e\264X<\341\205p"\306`O\037\357\220\310\311\314\224(7\001\317\341\213\030^\367]\267=\265\3479DoO\252\234\236J\014g[\227v)y\277\275\354\261\244T\262\305\030vu1.\273\270|\223\261\261\240\213\266\022K\220\246\250\336\227G\314\237\015\266F\177+<IuH*'t\312\363\012N5Dz@]\366\262\033\245d\003\343\313\233\201\362oP\354\231d\216j+\223)\000\362\232\357\000\211[w0p5\000\317\237\263\030\214.\017\320\202_TFW\233\246\377\332Z\013\312\334\352\334\324\021\347x\020c[\252\353\034\247\341&\323\365h\200\314\232\013t\257\335\001\320\2007\236>\235\362\327\247P\226\320\264\375\272\320\275\255\305\177\000\205\365\264F

@whisperity whisperity removed the WIP 💣 Work In Progress label Feb 19, 2024
@whisperity whisperity modified the milestones: release 6.25.0, release 6.23.2 Feb 19, 2024
@whisperity
Copy link
Contributor Author

whisperity commented Feb 19, 2024

Putting this into 6.23.2 so far, because this should not be merged together with #4089 to ensure that the migrations that need to be performed when the server is upgraded to not take too much time.

The time measurements for this patch will be ongoing soon...

@whisperity whisperity marked this pull request as ready for review February 19, 2024 15:14
This prevents the escape of a terrible exception about the database
having closed the connection in `IDLE_TIMEOUT`...
@whisperity
Copy link
Contributor Author

Putting this back into draft because there are some weird performance characteristics on large real-world databases. I'll investigate...

@whisperity whisperity marked this pull request as draft February 22, 2024 17:00
@whisperity
Copy link
Contributor Author

Caution

Deferred until #4171 is implemented.

Alright, it looks like this patch, as it stands right now, while correct in isolation, is intractable to execute in a timely manner for large enough (think multiple millions of rows) databases. While each individual action often takes less than 1 second, the number of records affected (because this is a per-record update that can't be grouped or optimised further), it becomes somewhere around 30 minutes per database. Which would be fine if you had one database or maybe two, but not if you have like 50 that each take half an hour, and then these migrations are executed in series. See #4171.

@whisperity whisperity modified the milestones: release 6.23.2, release 6.25.0 Mar 2, 2024
@whisperity whisperity marked this pull request as ready for review March 9, 2024 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
database 🗄️ Issues related to the database schema. refactoring 😡 ➡️ 🙂 Refactoring code. server 🖥️
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant