feat(storage): maintain per table bloom filter inside a SST #7187

Closed
wants to merge 5 commits into from

Contributor

@wcy-fdu wcy-fdu commented Jan 4, 2023

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Since we have removed table_id and vnode from the bloom filter key, there is a corner case in which two tables stored in the same SST have the same key. To handle it, we can maintain a per-table bloom filter inside an SST.
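
For illustration only, here is a minimal sketch of the idea, not the code in this PR. `ToyBloom`, `PerTableBloom`, `probe_positions`, and the sizes 1024 and 3 are hypothetical stand-ins; only `table_id`, `user_key_hashes`, and the `BTreeMap` grouping come from the discussion.

```rust
use std::collections::BTreeMap;

/// Probe positions for one key hash (LevelDB-style double hashing).
fn probe_positions(key_hash: u32, k: u32, nbits: usize) -> impl Iterator<Item = usize> {
    let delta = key_hash.rotate_left(15);
    (0..k).map(move |i| key_hash.wrapping_add(i.wrapping_mul(delta)) as usize % nbits)
}

/// Toy bloom filter, a stand-in for the real filter stored in the SST (assumption).
struct ToyBloom {
    bits: Vec<bool>,
    k: u32,
}

impl ToyBloom {
    fn from_hashes(hashes: &[u32], nbits: usize, k: u32) -> Self {
        let mut bits = vec![false; nbits];
        for &h in hashes {
            for p in probe_positions(h, k, nbits) {
                bits[p] = true;
            }
        }
        Self { bits, k }
    }

    fn may_contain(&self, key_hash: u32) -> bool {
        probe_positions(key_hash, self.k, self.bits.len()).all(|p| self.bits[p])
    }
}

/// One filter per table_id, built from key hashes grouped by table.
struct PerTableBloom {
    filters: BTreeMap<u32, ToyBloom>,
}

impl PerTableBloom {
    fn build(user_key_hashes: &BTreeMap<u32, Vec<u32>>) -> Self {
        let filters = user_key_hashes
            .iter()
            .map(|(&table_id, hashes)| (table_id, ToyBloom::from_hashes(hashes, 1024, 3)))
            .collect();
        Self { filters }
    }

    /// True when the SST certainly holds no such key for this table.
    fn surely_not_have(&self, table_id: u32, key_hash: u32) -> bool {
        self.filters
            .get(&table_id)
            .map_or(true, |f| !f.may_contain(key_hash))
    }
}
```

With a single shared filter, a key hash written by table 1 also matches a lookup from table 2. With one filter per table, the lookup from table 2 can be rejected.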

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

part of #6391

@wcy-fdu wcy-fdu marked this pull request as draft January 4, 2023 08:59
@wcy-fdu wcy-fdu changed the title feat(storage): maintain per table bloom filter inside a SST feat(storage): maintain per table bloom filter inside a SST(WIP) Jan 4, 2023
@codecov

codecov bot commented Jan 5, 2023

Codecov Report

Merging #7187 (ad1fc49) into main (80abae3) will increase coverage by 0.02%.
The diff coverage is 79.74%.

❗ Current head ad1fc49 differs from pull request most recent head 66b412c. Consider uploading reports for the commit 66b412c to get more accurate results

@@            Coverage Diff             @@
##             main    #7187      +/-   ##
==========================================
+ Coverage   73.13%   73.15%   +0.02%     
==========================================
  Files        1055     1054       -1     
  Lines      168781   168518     -263     
==========================================
- Hits       123430   123274     -156     
+ Misses      45351    45244     -107     
Flag Coverage Δ
rust 73.15% <79.74%> (+0.02%) ⬆️

Impacted Files Coverage Δ
src/storage/src/hummock/state_store_v1.rs 67.41% <0.00%> (-0.93%) ⬇️
src/storage/src/hummock/sstable/mod.rs 96.61% <96.55%> (-0.12%) ⬇️
src/storage/src/hummock/mod.rs 88.02% <100.00%> (+0.06%) ⬆️
src/storage/src/hummock/sstable/builder.rs 94.11% <100.00%> (+0.36%) ⬆️
src/storage/src/hummock/sstable/writer.rs 100.00% <100.00%> (ø)
src/sqlparser/src/ast/data_type.rs 82.97% <0.00%> (-9.53%) ⬇️
src/frontend/src/handler/create_source.rs 59.30% <0.00%> (-4.58%) ⬇️
src/storage/src/hummock/sstable/bloom.rs 96.59% <0.00%> (-2.28%) ⬇️
...tend/src/optimizer/plan_node/stream_materialize.rs 93.00% <0.00%> (-1.65%) ⬇️
src/storage/src/hummock/sstable_store.rs 63.77% <0.00%> (-1.14%) ⬇️
... and 61 more

@wcy-fdu wcy-fdu marked this pull request as ready for review January 5, 2023 06:19
@wcy-fdu wcy-fdu changed the title feat(storage): maintain per table bloom filter inside a SST(WIP) feat(storage): maintain per table bloom filter inside a SST Jan 5, 2023
@wcy-fdu wcy-fdu requested review from Li0k and hzxa21 January 5, 2023 07:17
@@ -58,7 +59,7 @@ use super::{HummockError, HummockResult};

const DEFAULT_META_BUFFER_CAPACITY: usize = 4096;
const MAGIC: u32 = 0x5785ab73;
-const VERSION: u32 = 1;
+const VERSION: u32 = 2;
Collaborator

To make the changes introduced by this PR backward compatible, modifying the VERSION const is not enough. We also need to do the following:

  1. Change SstableMeta::decode to use a different implementation to decode the bloom filter depending on the version.
  2. When checking the bloom filter in surely_not_have_hashvalue, use a different implementation depending on the meta version.

Contributor Author

Indeed, but this will introduce some duplicated code. Maybe we can update the version and remove the duplicated code later.

Collaborator

This is unavoidable if we want to ensure backward compatibility unless we fully deprecate a released version.

Collaborator

We can keep the BTreeMap structure for the bloom filter in SstableMeta and simply put a single entry in the BTreeMap for version 1. Then we use if-else in the decode and surely_not_have_hashvalue implementations to decide how to populate and check the BTreeMap.
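
A rough sketch of that suggestion, under stated assumptions: the byte layouts for both versions, the `Reader` helper, the sentinel `V1_FILTER_KEY`, and the function names `decode_bloom` and `filter_for` are all hypothetical and do not reflect the real `SstableMeta` encoding.

```rust
use std::collections::BTreeMap;

/// Hypothetical sentinel key for the single filter of a version-1 SST.
const V1_FILTER_KEY: u32 = u32::MAX;

/// Minimal bounds-checked reader over a byte slice.
struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        let end = self.pos.checked_add(n)?;
        let out = self.buf.get(self.pos..end)?;
        self.pos = end;
        Some(out)
    }

    fn u32(&mut self) -> Option<u32> {
        let b = self.take(4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Assumed layouts (little-endian):
///   version 1: [len: u32][filter bytes]
///   version 2: [count: u32] then count * ([table_id: u32][len: u32][filter bytes])
fn decode_bloom(version: u32, buf: &[u8]) -> Option<BTreeMap<u32, Vec<u8>>> {
    let mut r = Reader { buf, pos: 0 };
    let mut filters = BTreeMap::new();
    if version == 1 {
        // Old format: one filter for the whole SST, stored under the sentinel key.
        let len = r.u32()? as usize;
        filters.insert(V1_FILTER_KEY, r.take(len)?.to_vec());
    } else {
        // New format: one filter per table.
        let count = r.u32()?;
        for _ in 0..count {
            let table_id = r.u32()?;
            let len = r.u32()? as usize;
            filters.insert(table_id, r.take(len)?.to_vec());
        }
    }
    Some(filters)
}

/// Pick the filter to check, again branching on the meta version.
fn filter_for(version: u32, filters: &BTreeMap<u32, Vec<u8>>, table_id: u32) -> Option<&[u8]> {
    let key = if version == 1 { V1_FILTER_KEY } else { table_id };
    filters.get(&key).map(|f| f.as_slice())
}
```

Both versions end up in the same in-memory map, so only the decode path and the lookup key differ between them.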

Comment on lines +219 to +229
let entry = self.user_key_hashes.entry(table_id);

match entry {
    std::collections::btree_map::Entry::Vacant(e) => {
        e.insert(vec![key_hash]);
    }
    std::collections::btree_map::Entry::Occupied(mut e) => {
        let current_key_hashes = e.get_mut();
        current_key_hashes.push(key_hash);
    }
};
Collaborator

minor: can be simplified to:

self.user_key_hashes
    .entry(table_id)
    .or_default()
    .push(key_hash);

Contributor Author

cool, got it.

@@ -353,7 +372,8 @@ impl SstableMeta {
.map(| tombstone| 16 + tombstone.start_user_key.encoded_len() + tombstone.end_user_key.encoded_len())
.sum::<usize>()
+ 4 // bloom filter len
+ self.bloom_filter.len()
+ 8 * self.bloom_filter.len()
Collaborator

table_id is u32. Should this be 4 * self.bloom_filter.len()?

@@ -353,7 +372,8 @@ impl SstableMeta {
.map(| tombstone| 16 + tombstone.start_user_key.encoded_len() + tombstone.end_user_key.encoded_len())
.sum::<usize>()
+ 4 // bloom filter len
+ self.bloom_filter.len()
Collaborator

Prior to this PR, we used bloom_filter.len() to calculate the bloom filter size, but that no longer holds. Please find all usages of bloom_filter.len() and change them accordingly. I did a quick search, and we need to change them in builder.rs and sst_dump.rs.
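
To make the point concrete, here is a small sketch that assumes the filters are held as a `BTreeMap<u32, Vec<u8>>` keyed by table_id. The helper names and the encoded layout (a u32 entry count, then a u32 table_id, a u32 length, and the filter bytes per entry) are hypothetical. On a map, `len()` returns the number of tables, so the byte size has to be summed over the values.

```rust
use std::collections::BTreeMap;

/// Total bytes of filter data across all tables (what `len()` used to mean).
fn bloom_filter_bytes(filters: &BTreeMap<u32, Vec<u8>>) -> usize {
    filters.values().map(Vec::len).sum()
}

/// Encoded size under the assumed layout described above.
fn encoded_bloom_size(filters: &BTreeMap<u32, Vec<u8>>) -> usize {
    4 // entry count (u32)
        + filters
            .values()
            .map(|f| 4 + 4 + f.len()) // table_id + length prefix + filter bytes
            .sum::<usize>()
}
```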

@Little-Wallace
Contributor

Is there any test that shows this feature improves the bloom filter hit rate?

@Little-Wallace
Contributor

I do not think maintaining a complex bloom filter is necessary...

@wcy-fdu
Contributor Author

wcy-fdu commented Jan 5, 2023

After some offline discussion, we can XOR the key hash with table_id and avoid maintaining a BTreeMap in SstableMeta.
I will apply the new solution later.
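
A minimal sketch of the XOR idea, assuming the value being mixed is the 32-bit key hash. The names `table_scoped_hash` and `SingleFilter` are hypothetical, and the `HashSet` is only a stand-in for the SST's single bloom filter; the actual change is in the follow-up PR and may differ.

```rust
use std::collections::HashSet;

/// Fold the table id into the key hash so that the same key written by
/// different tables maps to different filter entries.
fn table_scoped_hash(table_id: u32, key_hash: u32) -> u32 {
    key_hash ^ table_id
}

/// Stand-in for the single filter of an SST (exact set, no false positives).
struct SingleFilter {
    hashes: HashSet<u32>,
}

impl SingleFilter {
    fn new() -> Self {
        Self { hashes: HashSet::new() }
    }

    fn add(&mut self, table_id: u32, key_hash: u32) {
        self.hashes.insert(table_scoped_hash(table_id, key_hash));
    }

    fn may_contain(&self, table_id: u32, key_hash: u32) -> bool {
        self.hashes.contains(&table_scoped_hash(table_id, key_hash))
    }
}
```

This keeps one filter per SST and needs no extra metadata. It does not remove collisions entirely, since two different (table_id, key hash) pairs can still XOR to the same value, but it separates the case of one key shared by two tables.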

@wcy-fdu
Contributor Author

wcy-fdu commented Jan 20, 2023

I opened a new PR because the solution changed, so I am closing this one.
The new implementation: #7502

@wcy-fdu wcy-fdu closed this Jan 20, 2023
@wcy-fdu wcy-fdu deleted the wcy/per_table_bloom_filter branch February 20, 2023 09:44