
Compressed Materialization #7644

Merged: 138 commits merged into duckdb:feature on Jun 23, 2023

Conversation

@lnkuiper (Contributor)

This PR implements the CompressedMaterializationOptimizer, which compresses and decompresses data on the fly during execution when there is a materializing operator such as a sort, join, or aggregate.

This is useful when we, for example, have the following table:

┌───────┬─────────┐
│  id   │  name   │
│ int32 │ varchar │
├───────┼─────────┤
│   300 │ alice   │
│   301 │ bob     │
│   302 │ eve     │
│   303 │ mallory │
│   304 │ trent   │
└───────┴─────────┘

Here the id column has type int32, which has a width of 4 bytes. The maximum id we can store in this type is 2,147,483,647, but we only store 300 through 304. We keep statistics on each column in the catalog, and using these statistics, we can convert this column to a uint8 at runtime by subtracting the minimum value (300), bringing it to a range of 0 to 4. This reduces the width of the column down to 1 byte.
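
As a rough standalone illustration of the idea (not DuckDB's actual implementation; the function names here are made up), compression is just a subtraction against the statistics' minimum, and decompression adds it back:

```cpp
#include <cstdint>
#include <vector>

// Sketch of min-subtraction compression: given the column minimum from
// catalog statistics, int32 values in a small range narrow to uint8.
std::vector<uint8_t> CompressInt32ToUint8(const std::vector<int32_t> &input, int32_t min_value) {
	std::vector<uint8_t> result;
	result.reserve(input.size());
	for (auto value : input) {
		// Statistics guarantee that (value - min_value) fits in a uint8
		result.push_back(static_cast<uint8_t>(value - min_value));
	}
	return result;
}

// Decompression adds the minimum back to recover the original value
int32_t DecompressUint8ToInt32(uint8_t compressed, int32_t min_value) {
	return static_cast<int32_t>(compressed) + min_value;
}
```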

This is only useful when there could be memory pressure, which is sometimes the case for materializing operators.

We can also compress the name column. The maximum string length is 7, but the string_t type used for strings during execution is 16 bytes wide. We can compress these strings to a uint64, which is only 8 bytes wide, by encoding the names like so:

alice   -> alice005
bob     -> bob00003
eve     -> eve00003
mallory -> mallory7
trent   -> trent005

The length of the name is stored within a single byte in the uint64. Then, we flip the bytes around so that they are properly comparable and sortable as uint64 on big-endian machines:

alice005 -> 500ecila
bob00003 -> 30000bob
eve00003 -> 30000eve
mallory7 -> 7yrollam
trent005 -> 500tnert
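
As a rough sketch of this encoding (illustrative only, not the code in this PR): rather than flipping bytes in memory, the value can be built arithmetically so that the first character lands in the most significant byte and the length in the least significant byte, which yields the same ordering regardless of endianness:

```cpp
#include <cstdint>
#include <string>

// Sketch: pack a string of at most 7 characters into a uint64 so that
// plain integer comparison agrees with lexicographic string comparison.
// Character 0 goes in the most significant byte, the length in the
// least significant byte.
uint64_t PackString(const std::string &str) {
	// Caller must guarantee str.size() <= 7 (known from statistics)
	uint64_t result = static_cast<uint64_t>(str.size());
	for (size_t i = 0; i < str.size(); i++) {
		result |= static_cast<uint64_t>(static_cast<uint8_t>(str[i])) << (8 * (7 - i));
	}
	return result;
}

std::string UnpackString(uint64_t packed) {
	auto length = packed & 0xFF;
	std::string result;
	for (uint64_t i = 0; i < length; i++) {
		result += static_cast<char>((packed >> (8 * (7 - i))) & 0xFF);
	}
	return result;
}
```

The zero padding already makes a prefix sort before any longer string ("bob" before "bobby"); the length byte additionally distinguishes strings that differ only in trailing NUL characters.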

Currently, this is only implemented for sorting and aggregates, as these operators have only one input, which makes compression more straightforward. If materializing operators are chained, the optimizer removes redundant subsequent compressions, allowing the compressed data to flow from one operator to the next.

This can also be applied to joins, but this is harder to get right, and heuristics are likely needed to not negatively impact performance when the build side fits in memory. I've left this for a future PR.

Performance

Sorting:

SELECT * FROM lineitem ORDER BY l_shipdate;
SF    Old      New
 1    0.63s    0.52s
10    58.0s    30.5s

The lineitem table benefits a lot from this compression.

The query

SELECT count(*) FROM (SELECT DISTINCT * FROM lineitem);

which is an aggregate, shows a similar performance improvement.

TPC-H Q1 performance is also improved, by ~30%, as we can now group by integers rather than strings.

Other changes

  1. I've shuffled the order of the optimizers around a bit, because Compressed Materialization creates projections, which impacts the ColumnLifetimeAnalyzer optimizer.
  2. By assuming that a column can appear only once in a GROUP BY, this PR also exposed that other optimizers, namely Deliminator and RemoveUnusedColumns, sometimes introduce duplicate group columns when eliminating columns. I've added a RemoveDuplicateGroups optimizer to remove these.
  3. I've refactored the Deliminator optimizer, which is now much more readable/maintainable.
  4. Join statistics are now pushed down as filters: if we join a table with a very small range of ids against a table with a very large range of ids, we can create a filter from the statistics we have and push it into the scan of the large table. This can greatly speed up specific queries (requested in #4974: Doing HASH_JOIN instead of SEQ_SCAN even when driving table has single record). A toy sketch of the idea follows this list.
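
As a toy sketch of that idea (hypothetical names; the real optimizer builds a filter expression and pushes it into the scan operator), the min/max statistics of the small side's join key imply a range predicate for the large side:

```cpp
#include <cstdint>
#include <vector>

// Sketch: rows whose join key falls outside the build side's [min, max]
// statistics can never find a match, so the scan of the large table can
// filter them out early.
struct ColumnStats {
	int64_t min;
	int64_t max;
};

std::vector<int64_t> FilteredScan(const std::vector<int64_t> &large_table_keys,
                                  const ColumnStats &build_side_stats) {
	std::vector<int64_t> result;
	for (auto key : large_table_keys) {
		if (key >= build_side_stats.min && key <= build_side_stats.max) {
			result.push_back(key);
		}
	}
	return result;
}
```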

Happy to receive feedback! No rush though, this is a pretty big PR.

@lnkuiper (Contributor, Author)

So there's still a regression, but it passes our regression test threshold!

benchmark/h2oai/group/q02.benchmark
Old timing: 0.357944
New timing: 0.436101

I think this PR is finally good to go when CI passes.

@lnkuiper (Contributor, Author)

Somehow, changes meant for a different branch ended up in this one. I think I've removed all of it now

@lnkuiper (Contributor, Author)

> Somehow, changes meant for a different branch ended up in this one. I think I've removed all of it now

Apologies, I somehow contaminated my own feature branch, which caused these issues.

@lnkuiper lnkuiper marked this pull request as draft June 21, 2023 07:24
@lnkuiper lnkuiper marked this pull request as ready for review June 21, 2023 07:24
@lnkuiper lnkuiper marked this pull request as draft June 22, 2023 09:08
@lnkuiper lnkuiper marked this pull request as ready for review June 22, 2023 09:08
@lnkuiper (Contributor, Author)

I think this is ready to go!

@Mytherin merged commit 983659b into duckdb:feature on Jun 23, 2023
@Mytherin (Collaborator)

Thanks!

@v1gnesh (Contributor) commented Sep 27, 2023

Happy to see the mention of 'big-endian machines' 👍

Please let me know if the gang needs access to one, running either Linux on Z or z/OS (reference here, here, & here).

GH Actions currently doesn't have a native runner for s390x, but VM(s) can be made available (for direct use or indirectly reaching them via some other GH Action) in case it's helpful.
