
Compressed Materialization #7644

Merged: 138 commits merged into duckdb:feature on Jun 23, 2023

Conversation

@lnkuiper (Contributor)

This PR implements the CompressedMaterializationOptimizer, which compresses and decompresses data on the fly during execution when there is a materializing operator such as a sort, join, or aggregate.

This is useful when we, for example, have the following table:

┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│   300 │ alice   │
│   301 │ bob     │
│   302 │ eve     │
│   303 │ mallory │
│   304 │ trent   │
└───────┴─────────┘

Here the id column has type int32, which has a width of 4 bytes. The maximum id we can store in this type is 2,147,483,647, but we only store 300 through 304. We keep statistics on each column in the catalog, and using these statistics, we can convert this column to a uint8 at runtime by subtracting the minimum value (300), bringing it to a range of 0 to 4. This reduces the width of the column down to 1 byte.
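
As a rough standalone illustration of the idea (not DuckDB's actual implementation; the function names here are made up), compression is just a subtraction against the statistics' minimum, and decompression adds it back:

```cpp
#include <cstdint>
#include <vector>

// Sketch of min-subtraction compression: given the column minimum from
// catalog statistics, int32 values in a small range narrow to uint8.
std::vector<uint8_t> CompressInt32ToUint8(const std::vector<int32_t> &input, int32_t min_value) {
	std::vector<uint8_t> result;
	result.reserve(input.size());
	for (auto value : input) {
		// Statistics guarantee that (value - min_value) fits in a uint8
		result.push_back(static_cast<uint8_t>(value - min_value));
	}
	return result;
}

// Decompression adds the minimum back to recover the original value
int32_t DecompressUint8ToInt32(uint8_t compressed, int32_t min_value) {
	return static_cast<int32_t>(compressed) + min_value;
}
```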

This is only useful when there could be memory pressure, which is sometimes the case for materializing operators.

We can also compress the name column. The maximum string length is 7, but the string_t type used for strings during execution is 16 bytes wide. We can compress these strings to a uint64, which is only 8 bytes wide, by encoding the names like so:

alice   -> alice005
bob     -> bob00003
eve     -> eve00003
mallory -> mallory7
trent   -> trent005

The length of the name is stored within a single byte in the uint64. Then, we flip the bytes around so that they are properly comparable and sortable as uint64 on big-endian machines:

alice005 -> 500ecila
bob00003 -> 30000bob
eve00003 -> 30000eve
mallory7 -> 7yrollam
trent005 -> 500tnert
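
As a rough sketch of this encoding (illustrative only, not the code in this PR): rather than flipping bytes in memory, the value can be built arithmetically so that the first character lands in the most significant byte and the length in the least significant byte, which yields the same ordering regardless of endianness:

```cpp
#include <cstdint>
#include <string>

// Sketch: pack a string of at most 7 characters into a uint64 so that
// plain integer comparison agrees with lexicographic string comparison.
// Character 0 goes in the most significant byte, the length in the
// least significant byte.
uint64_t PackString(const std::string &str) {
	// Caller must guarantee str.size() <= 7 (known from statistics)
	uint64_t result = static_cast<uint64_t>(str.size());
	for (size_t i = 0; i < str.size(); i++) {
		result |= static_cast<uint64_t>(static_cast<uint8_t>(str[i])) << (8 * (7 - i));
	}
	return result;
}

std::string UnpackString(uint64_t packed) {
	auto length = packed & 0xFF;
	std::string result;
	for (uint64_t i = 0; i < length; i++) {
		result += static_cast<char>((packed >> (8 * (7 - i))) & 0xFF);
	}
	return result;
}
```

The zero padding already makes a prefix sort before any longer string ("bob" before "bobby"); the length byte additionally distinguishes strings that differ only in trailing NUL characters.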

Currently, this is only implemented for sorting and aggregates, as these operators have only one input, which makes compression more straightforward. If materializing operators are chained, the optimizer removes redundant subsequent compressions, allowing the compressed data to flow from one operator to the next.

This can also be applied to joins, but this is harder to get right, and heuristics are likely needed to not negatively impact performance when the build side fits in memory. I've left this for a future PR.

Performance

Sorting:

SELECT * FROM lineitem ORDER BY l_shipdate;
SF    Old      New
 1    0.63s    0.52s
10    58.0s    30.5s

The lineitem table benefits a lot from this compression.

The query

SELECT count(*) FROM (SELECT DISTINCT * FROM lineitem);

which is an aggregate, shows a similar performance improvement.

TPC-H Q1 performance is also improved, by ~30%, as we can now group by integers rather than strings.

Other changes

  1. I've shuffled the order of the optimizers around a bit, because Compressed Materialization creates projections, which impacts the ColumnLifetimeAnalyzer optimizer.
  2. By assuming that a column can appear only once in a GROUP BY, this PR also exposed that other optimizers, namely Deliminator and RemoveUnusedColumns, sometimes introduce duplicate group columns when eliminating columns. I've added a RemoveDuplicateGroups optimizer to remove these.
  3. I've refactored the Deliminator optimizer, which is now much more readable/maintainable.
  4. Join statistics are now pushed down as filters: if we join a table with a very small range of ids against a table with a very large range of ids, we can create a filter from the statistics we have and push it into the scan of the large table. This can greatly speed up specific queries (requested in #4974: Doing HASH_JOIN instead of SEQ_SCAN even when driving table has single record). A toy sketch of the idea follows this list.
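
As a toy sketch of that idea (hypothetical names; the real optimizer builds a filter expression and pushes it into the scan operator), the min/max statistics of the small side's join key imply a range predicate for the large side:

```cpp
#include <cstdint>
#include <vector>

// Sketch: rows whose join key falls outside the build side's [min, max]
// statistics can never find a match, so the scan of the large table can
// filter them out early.
struct ColumnStats {
	int64_t min;
	int64_t max;
};

std::vector<int64_t> FilteredScan(const std::vector<int64_t> &large_table_keys,
                                  const ColumnStats &build_side_stats) {
	std::vector<int64_t> result;
	for (auto key : large_table_keys) {
		if (key >= build_side_stats.min && key <= build_side_stats.max) {
			result.push_back(key);
		}
	}
	return result;
}
```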

Happy to receive feedback! No rush though, this is a pretty big PR.

@lnkuiper (Contributor, Author)

So there's still a regression, but it passes our regression test threshold!

benchmark/h2oai/group/q02.benchmark
Old timing: 0.357944
New timing: 0.436101

I think this PR is finally good to go when CI passes.

@lnkuiper (Contributor, Author)

Somehow, changes meant for a different branch ended up in this one. I think I've removed all of it now

@lnkuiper (Contributor, Author)

> Somehow, changes meant for a different branch ended up in this one. I think I've removed all of it now

Apologies, I somehow contaminated my own feature branch, which caused these issues.

@lnkuiper lnkuiper marked this pull request as draft June 21, 2023 07:24
@lnkuiper lnkuiper marked this pull request as ready for review June 21, 2023 07:24
@lnkuiper lnkuiper marked this pull request as draft June 22, 2023 09:08
@lnkuiper lnkuiper marked this pull request as ready for review June 22, 2023 09:08
@lnkuiper (Contributor, Author)

I think this is ready to go!

@Mytherin merged commit 983659b into duckdb:feature on Jun 23, 2023
@Mytherin (Collaborator)

Thanks!

@v1gnesh (Contributor) commented Sep 27, 2023

Happy to see the mention of 'big-endian machines' 👍

Please let me know if the gang needs access to one, running either Linux on Z or z/OS (reference here, here, & here).

GH Actions currently doesn't have a native runner for s390x, but VM(s) can be made available (for direct use or indirectly reaching them via some other GH Action) in case it's helpful.
