-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NEW] Valkey-Bloom: BloomFilter support for Valkey. #407
Comments
@KarthikSubbarao we are continuing the goodform.io modules as native valkey modules too. Personally I don't think the lack of activity relates to the license - it's more that the code is essentially done and that all modules generally get little attention once mature - but we're just speculating here. Can we find a way to co-exist? I have used naming like valkey-bloom (all lower case) and the module shared library valkeybloom.so for a simple transition for users (this module will be in Fedora soon with this naming convention as we transition away from Redis). This matches up with the other goodform.io modules like valkey-search, valkey-json, valkey-graph, and so on. Would it be possible to name this new module in a way that highlights the differences perhaps? (e.g. Valkey-Bloom-Rust?) |
Given your precedence, I think we shouldn't overwrite your naming. If you want to translate the names to valkey-*, I think we should respect that.
We could call it |
Thanks @KarthikSubbarao for creating this. This is one of the most popular modules and I've seen users used various alternatives like lua scripts, custom application around Key questions :
|
@hpatro there is an existing python-based test framework (BSD licensed) from the early days that has been kept and used with all of the goodform modules. The earlier version is named 'rmtest' (Redis Modules Test) and I've been working on transitioning it to 'vkmtest' (ValKey Modules Test). Maybe it'll work for the Rust module testing too - you can find the initial version here: https://github.com/goodform/valkey-module-test |
@natoscott That is something I am very interested in taking over (specifically because I want a python based testing framework for the main project) if you have any interest in offloading the maintenance of it. Ideally it could be re-usable across all projects that run Valkey (or Redis even). |
This isn't the question we should answer here. Can you make a separate issue for it? |
@madolson happy to either work with you on it or have you take it over - I have alot on my plate (as I'm sure you do!) but I can definitely still dedicate some time to it. This test framework is also packaged in Fedora and I'd like to upload it to pypi for ease of use within the Valkey modules too. |
@KarthikSubbarao another possibility if you're super keen on ValkeyBloom and not something with 'Rust' in the name would be for me to use valkey-module-bloom for the existing modules. In hindsight I see I've used that prefix for -test and -sdk (python and C respectively) and that convention could be used on the C modules also perhaps? Anyway, let me know your thoughts, I'm happy to change it at this early stage. There was also mention of a new implementation of ValkeyJSON (not sure if its using Rust) from someone at Alibaba IIRC - so this naming issue may not be an isolated problem. |
Cool! Not an immediate something to figure out, but would love to collaborate on this. |
Here we are #408 |
I like a name that highlights the differences in behavior but not one that gives the slightest hint about how it is implemented. |
@valkey-io/core-team I guess maybe ask for a vote if we want to adopt this and continue developing it as an official bloom module? This is not committing to a specific date for when we will release it, just to start the ball rolling for a module based distribution. Some things to consider. There are other modules like the good form modules. I believe Alibaba also has a module that implements bloom that they have not open sourced. |
Regarding naming, I though we had sort-of decided to reserve the Anyhow, I hope both can co-exist and that they're made API compatible. In that way, users don't need to worry about the differences running on their distro vs running against a hosted database as a service. @KarthikSubbarao How complete is your module? I'm fine with adding it, if you (or anyone else) promises to maintain it actively. My name suggestion is "ValkeyBF", picking up the |
@zuiderkwast I think @KarthikSubbarao's bloom filter is licensed under BSD 3-clause and it is the one being proposed here. My vote is yes on the same conditions as Viktor mentioned above: 1) full command compat; 2) active maintenance. Name wise, my preference would be Valkey-Bloom. |
@PingXie Do you want to clarify what you mean with |
Yes it is, but we're also discussing the already-exising AGPL "valkey-bloom" module here. @natoscott It's good that you're willing to name the AGPL module "valkey-module-bloom" and we can name the BSD licensed module "Valkey-Bloom". That's no collission. Even if we allow projects under the Valkey unbrella to be AGPL, it might be good to avoid it for modules that are to be included in the "Valkey+" (name TBD) package, which will be a container containing Valkey + some official modules. |
yeah existing commands being fully compatible is good for now. Also the maintainers (whoever they are) agree that eventual full compat (meaning new commands as well) is a p0 goal by default. We can always discuss exceptions on a case-by-case basis. "incremental perfection" (R) :-) |
What is done:
What is remaining:
This Module supports every Bloom Filter command (from ReBloom) except for the BF.LOADCHUNK and BF.SCANDUMP and the commands have been implemented with ReBloom compatibility. The reason for not implementing the two cmds is because the Module provides the ability to load and save BloomModule data type items during RDB load and save. BF.LOADCHUNK and BF.SCANDUMP are APIs to load BloomModule data types through commands, but since we will provide RDB save & load and also AOF Rewrite, having specific commands for the same purpose was not considered as required. This can always be re-evaluated if we think it is useful
I would be glad to help with maintenance of the Module by addressing issues and having discussions on missing aspects that we would like to build into the Module's functionality and testing |
Full command compat is one of the point I wanted clarification on for all the future modules we're planning to build/accept. As we don't have any data points for Redis Modules, one can't be sure which API(s) were really used. Do you think it's wise to build full compatibility? Right now the changes which @KarthikSubbarao has made supports all the bloom filter related API(s) but leaves out some of the other probabilistic filter(s). I think we should not strive for full command compatibility to accept a Module. Rather accept one if it meets the performance/memory/language/coding standards aspect of the project. We can always improve/add as per user(s) request. |
We can argue the opposite way too without concrete data and this would become pure speculation at the end. If there is a legit reason to not be fully compatible we can always take an exception but I think it is important to aim at a higher compat bar so that existing Redis users can migrate their workload seamlessly to Valkey. Any incompatibility adds adoption friction and they add up. I am not saying a module needs to be bit by bit compatible in order to be adopted under Valkey. I am talking about directional alignment on helping all customers move on to Valkey with minimum possible friction. |
Well the bloom filter module proposed here has all the bloom filter commands implemented. Remaining commands, technically don't fit under bloom filter they would ideally be under probabilistic filter. @KarthikSubbarao could we also list out the remaining commands not built yet? |
This got me thinking about the on-disk format compatibility, which would be another very valuable property. Though I can see it being harder to achieve.
I agree. Along the compat topic, I would also like the module maintainer to provide migration best practices, when applicable. |
Realized the other probabilistic filter/algorithm commands are each under different command namespace like Cuckoo filter commands s are under |
Ok, so we can eventually have separate modules for cuckoo and minsketch. Seems reasonable. |
I think we should start with first principals and decide what we want the APIs to look, and then decide if we want to be API compatible with Redis. You are starting with the assumption that our users are migrating from Redis, but that need not be the case. They also might be net new developers, and we want to build the right application for them. We may want to alter the APIs to better suite those users. It should always be evaluated case by case, and should not be a general tenet. I also would bias to skipping APIs that don't make a lot of sense. For example, I know in the search modules they implemented functionality like |
On the compat topic, we have a lot of issues to deal with the issue that Redis RDB OP code has changed. I documented the issue here: #645. We don't have a good compatibility story in general with Redis. |
@valkey-io/core-team Any TSC interested in helping shape this up? I think this would be a nice module to start with and help set the baseline for other modules in the future. |
I'm not particularly interested in spending time with this, but I'm in favor of accepting it, with the relevant bloom filter commands being rebloom-compatible. It's no problem that it excludes non-bloom probabilistic filters (they can be provided by another module in the future) and the obsolete commands (dump/load). |
+1. I am in favor of accepting this module too. |
@hpatro I already spoke with Ping and Viktor privately, I will take this module support, Thanks |
Hello - I wanted to post an update here regarding the work remaining and also follow up on the next steps regarding the valkey-bloom Module's review. From this list, do we want to close out on all these items before the Module can be accepted into the Valkey project? Or instead, would we want to pick and address the "high priority" items from this list as a requirement? (And continue addressing the remaining as follow up issues) Issues remaining to close on:
Performance ComparisonDefault parameters:
Starting the server (pinned to 2 cores):
Creating the BloomObject & Running the benchmark (pinned to 1 core):
Performance Comparison (Non Scaling)
Regarding performance comparison - valkey-bloom performs roughly the same as Rebloom during Non Scaling tests. |
Thanks for sharing the information. Reference all above comments, I prefer you can begin with the following items.
I always want to build a proper framework then add more features here, then other contributors could work on the same way. After above 2 tasks are done, let us check which work we should do on next step, how do you think? |
I don't understand. Does this mean that the data is lost on AOF rewrite? |
Currently, we have not yet implemented the AOF callback and I was hoping to first discuss this here. If we handle AOF rewrite by saving commands such as This is not an issue with RDB Load and Save because we save the raw BloomObject's byte vector data during RDB Save and we are able to restore this during RDB Load. For AOF rewrite to support saving the exact state of the Bloom object (including the items that were "set"), we need to include the dump in the AOF and will need to support a command that can restore this data. ReBloom supports a BF.LOADCHUNK command to restore a bloom object from its dump |
Thanks. It sounds to me that LOADCHUNK is needed. Is seems wrong to me to assume a BF is only volatile cache data. |
As we discussed in the meeting, now we are blocked by 2 issues:
Please draft these 2 points in our rfc https://github.com/valkey-io/valkey-rfc @KarthikSubbarao Thanks |
Hello - I have documented the ValkeyBloom feature as an RFC here: valkey-io/valkey-rfc#4 Please do take a look when possible |
I believe the current testing framework in use is https://github.com/RedisLabsModules/RLtest |
@hwware @zuiderkwast @KarthikSubbarao |
That issue mentions a limit of 512MB for DUMP/RESTORE due to the protocol limits. It's for any type, so not specific to bloom filters. A large string or hash has the same problem. This is configurable though. In valkey.conf, I find this comment:
(We really should change "mb" to "MB" though, because in the metric standard "m" = milli, "M" = mega and in internet standards define "b" = bit, "B" = byte. Millibit doesn't make much sense.) |
@madolson @zuiderkwast @KarthikSubbarao @gkorland I am working on a new rust bloom filter implementation which I think valkey can benefit from. It will have the following features:
Is there a timeline for the release of the module? |
The problem/use-case that the feature addresses
Bloom filters are a space efficient probabilistic data structure that can be used to “check” whether an element exists in a set (with a defined false positive), and to “add” elements to a set. While checking whether an item exists, false positives are possible, but false negatives are not possible. https://en.wikipedia.org/wiki/Bloom_filter
Description of the feature
Valkey-Bloom is a Rust Valkey-Module which brings a native and space efficient probabilistic Module data type to Valkey. With this, users can create filters (space-efficient probabilistic Module data type) to add elements, perform “check” operation to test whether an element exists, check cardinality / INFO, auto scale their filters, reserve filters, perform RDB Save and load operations, etc.
Valkey-Bloom is built using
bloomfilter::Bloom
(https://crates.io/crates/bloomfilter which has a BSD-2-Clause license).It is compatible with the BloomFilter (
BF.*
) command APIs of redislabs/rebloom from Redis Ltd. which has over 10M image pulls on Docker and is compatible with several client libraries.The following commands are supported.
We would like to bring Valkey-Bloom into the valkey-io project as an open source Valkey-Module that is free to use, contribute to, etc.
Alternatives you've considered
A bloom filter module does exist today for Redis - https://github.com/goodform/rebloom. However, it uses an AGPL-3.0 license which has additional obligations that are are difficult to meet for many of the active contributors who are looking to provide Valkey as a service. AGPL is also widely disallowed by company open source program offices (including Amazon). Given that this package has not been significantly modified since it was created six year ago, it seems likely that the license is part of the issue.
The text was updated successfully, but these errors were encountered: