Roaring Bitmaps for indexing #25
That should be suggested to the Dgraph folks and not here, since this is only a wrapper. |
@ivanjaros This wrapper seems to provide indexing functionality. Badger itself knows nothing about indexing. So I assume this is the correct suggestion here. |
No, that is not how this works. I don't even have to look at the source code, since I have done this multiple times. The underlying DB is a key-value store (KV DB), in this case Badger. Badger uses its own way of handling keys and values: it uses an LSM tree and keeps keys in memory and values on disk (https://github.com/dgraph-io/badger#design). If you are experiencing "slowness", then I think you might have a wrong schema or something like that (are you using indices properly?). |
@ivanjaros There's probably a misunderstanding somewhere, maybe in what we all refer to as indexes. For context, I'm well aware of what Badger is and how LSM trees work. I only discovered this wrapper today, though, and I may be wrong, but a quick test inserting a few objects makes me think that the original purpose of this issue is indeed valid. I assume that @evanoberholster refers to the functionality of this library for building indexes on particular struct fields via the badgerhold:"index" struct tag. This is functionality provided by this library, not by Badger itself. The way it works is by keeping a reverse index from the index name and value to the list of keys of all the records that contain that indexed value. So for a type:

```go
type Person struct {
	Name     string
	Division string `badgerhold:"index"`
}
```

inserting a record with

```go
store.Insert("alex", Person{Name: "Alex", Division: "Engineering"})
```

produces, roughly speaking, two kinds of keys in Badger: the key for the record itself, and an index key for Division = "Engineering" whose value is a keyList, essentially a list of the keys of every record that shares that Division value.

So, what I understand @evanoberholster suggests is to replace this keyList with a Roaring Bitmap, which seems like a valid suggestion for a reverse index. Though I believe in this case integer keys would need to be used for all the records being stored. |
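To make the suggestion concrete, here is a minimal, hypothetical sketch of what a reverse-index value could look like as a Roaring Bitmap (using https://github.com/RoaringBitmap/roaring), assuming uint32 record keys as noted above. This is not badgerhold code; the names and layout are made up for illustration.

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// One bitmap per indexed value, e.g. all record keys with Division == "Engineering".
	engineering := roaring.New()
	engineering.AddMany([]uint32{1, 2, 3, 50000})

	// A second index, e.g. all record keys with Active == true.
	active := roaring.New()
	active.AddMany([]uint32{2, 3, 4})

	// Composing indices is cheap: AND / OR / AND-NOT directly on the bitmaps.
	both := roaring.And(engineering, active) // Engineering AND Active
	fmt.Println(both.ToArray())              // [2 3]

	// The bitmap serializes compactly, so it could be stored as the index value.
	buf, err := engineering.ToBytes()
	if err != nil {
		panic(err)
	}
	fmt.Println("serialized index size:", len(buf), "bytes")
}
```

The serialized bitmap would then replace the gob-encoded keyList as the value stored under the index key.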
@burdiyan Sorry that I was not clearer when opening the issue. Yes, you assume correctly, and that is the reason for opening this issue. There are more performant ways of doing reverse-indexing, especially when needing to process multiple indices for one query. In some personal projects I have started to include Roaring Bitmaps due to their compactness and their performance when composing queries (AND, NOT, OR). The challenges that I have experienced with using Roaring Bitmaps as reverse-indices are:
@ivanjaros As per Badger's own description, it is an embeddable, persistent key-value (KV) database written in pure Go. Badger is really not the place to implement reverse-indexing. Reverse-indexing could also be applied to https://github.com/timshannon/bolthold, as it should be implemented at the KV-abstraction level and not as part of the database itself. |
Dgraph opened a new repository called "sroar" (Serialized Roaring Bitmaps): https://github.com/dgraph-io/sroar. I believe that this is what would best be implemented in badgerhold; its main selling point is that the bitmaps can be operated on directly in their serialized form, avoiding marshalling and unmarshalling on every read and write. |
I've also experienced this issue. In particular, when an index value is shared by tens of thousands of keys, badgerhold ends up burning a significant amount of CPU in indexUpdate, since every insert has to decode the existing keyList for that index value, append one key, and encode and write the whole list back out.

This is fine if the list of keys in that index value is up to, say, a hundred or a thousand. Beyond that, the cost of indexUpdate really starts adding up; inserting 50k records with a common index field value can take tens of seconds on my laptop.

Personally, I think there's a simpler solution than bitmaps. Since every insert must happen on a unique key, we could use prefix-based entries in the database, one per (index value, record key) pair, along the lines of bhIndex:Person:Division:<encoded value>:<record key>, each with an empty value.

Inserts and deletes should then be near-instantaneous; we can compute the precise index key to put or delete directly. Iteration should also be possible, by seeking to the common prefix bhIndex:Person:Division:<encoded value>:. I don't think an encoded index value containing the separator character ":" needs to be a problem, as the value could, for example, be base64-encoded. |
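A rough sketch of what those prefix-based entries could look like on top of Badger's transaction API. This is not the actual patch; the key layout, helper names, and the badger/v3 import path are assumptions.

```go
// Package indexsketch illustrates one-entry-per-key reverse indexes.
package indexsketch

import (
	badger "github.com/dgraph-io/badger/v3"
)

// indexKey builds bhIndex:<type>:<field>:<encoded value>:<record key>.
func indexKey(typ, field string, value, recordKey []byte) []byte {
	k := []byte("bhIndex:" + typ + ":" + field + ":")
	k = append(k, value...)
	k = append(k, ':')
	return append(k, recordKey...)
}

// addIndexEntry writes one small key per (value, record) pair, so an insert
// or delete never has to rewrite a whole keyList.
func addIndexEntry(txn *badger.Txn, typ, field string, value, recordKey []byte) error {
	return txn.Set(indexKey(typ, field, value, recordKey), nil)
}

// recordKeysForValue collects all record keys sharing one indexed value by
// seeking to the common prefix and iterating.
func recordKeysForValue(txn *badger.Txn, typ, field string, value []byte) [][]byte {
	prefix := append([]byte("bhIndex:"+typ+":"+field+":"), value...)
	prefix = append(prefix, ':')

	opts := badger.DefaultIteratorOptions
	opts.PrefetchValues = false // the values are empty; only keys matter here
	it := txn.NewIterator(opts)
	defer it.Close()

	var keys [][]byte
	for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
		k := it.Item().KeyCopy(nil)
		keys = append(keys, k[len(prefix):]) // the record key is the suffix
	}
	return keys
}
```

The write path stays constant per record; the cost moves to queries, which now iterate one key per matching record instead of decoding a single list.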
@timshannon I should add - this is something that I need for a side project (where we're trying to stay away from a full SQL database), so if the design sounds fine to you, I could contribute a PR :) |
This is a quite important improvement. With a few thousand entries under one index value, any operation on that kind of record becomes very, very slow. I would actually say many people out there are suffering from this without knowing it; it took me a while to discover that indexes were slowing down my whole application... Note that this is a breaking change, so you probably want to keep the legacy indexing mechanism available (and make the new one optional) or publish a major version with a notice about breaking compatibility. |
Yeah, this is a definite issue. See timshannon/bolthold#105 in bolthold. My plan there was to use paging, similar to how SQL Server does it, but that might be more complicated than necessary. Your prefix solution has the benefit of being simpler than paging, but needs more reads for each index. The tradeoff may be worth it. I've been really busy with work stuff, but feel free to submit a PR. If it passes the test suite and doesn't break compatibility, then it'll probably get merged. Thanks, |
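The paging approach isn't spelled out in the thread; purely as an illustration, and with every name and size invented here, paged index keys could look something like this:

```go
package main

import "fmt"

// maxKeysPerPage is an arbitrary page capacity for this sketch.
const maxKeysPerPage = 1024

// pageKey addresses page n of the key list for one indexed value, e.g.
// bhIndex:Person:Division:<encoded value>:page:0000000003
func pageKey(typ, field string, value []byte, page uint64) []byte {
	k := []byte(fmt.Sprintf("bhIndex:%s:%s:", typ, field))
	k = append(k, value...)
	return append(k, fmt.Sprintf(":page:%010d", page)...)
}

func main() {
	fmt.Printf("%s\n", pageKey("Person", "Division", []byte("Engineering"), 3))
}
```

An insert would still have to read the last page to see whether it has room, which is the extra bookkeeping compared to the one-entry-per-key prefix scheme; in exchange, queries read far fewer keys.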
I think a false positive is a real possibility in a database with many (tens of thousands of) indexed values. Also, if the index is a string, the Gob-encoded bytes would also be a string, so an index string containing ":" would break the logic. On the other side, using base64 would increase the local disk space required for the indexing by a significant amount (around 2 or 3 times more, probably).

I've got two solutions, but I think the first one is the best.

**Use a hash**

Use a hash function for the index. The size of the hash is always the same (e.g. 32 bytes), so we don't need the separator ":" between the index and the key. There are very cheap, standard hash functions which could be used, such as blake2 or keccak256. My feeling is that this approach would also save some disk space, since most of the Gob-encoded indexes would be bigger than 32 bytes.

EDIT: this is probably the best shot (standard library and designed for hash tables): https://golang.org/pkg/hash/maphash/

**Add the index size**

Add the index size just before the index itself. The size could be a 32-bit (4-byte) integer, which should be enough for specifying the size of an index. Knowing the size up front, we can parse the raw bytes safely. We must be sure, however, that bhIndex, someType and someField never contain the separator byte ":". |
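A minimal sketch of the fixed-size-hash layout described above, assuming index keys of the form bhIndex:<type>:<field>:<32-byte hash>:<record key>. It uses crypto/sha256 from the standard library purely for illustration; note that hash/maphash seeds are per-process, so maphash values would not be stable across restarts of an on-disk index. All names here are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashedIndexKey hashes the encoded index value to a fixed 32 bytes, so no
// separator is needed between the value and the record key that follows it.
func hashedIndexKey(typ, field string, encodedValue, recordKey []byte) []byte {
	sum := sha256.Sum256(encodedValue)
	k := []byte("bhIndex:" + typ + ":" + field + ":")
	k = append(k, sum[:]...)
	return append(k, recordKey...)
}

func main() {
	k := hashedIndexKey("Person", "Division", []byte("Engineering"), []byte("alex"))
	fmt.Printf("%d bytes: %x\n", len(k), k)
}
```

A faster hash such as blake2b would be a drop-in replacement; only the fixed output length matters for the key layout.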
> Your prefix solution has the benefit of being simpler than paging, but needs more reads for each index. The tradeoff may be worth it.

I agree that it's a bit of a tradeoff. The prefix solution keeps single reads and writes fast, but it might make large iterations/queries a bit slower, as we'd need to iterate over many of these prefixed index keys. If you're confident that you'll be able to implement paging in a way that keeps writes fast, I think that might be the best solution overall. One would have to do some benchmarking to be sure. In that case, I might then implement my prefix-based indexing outside of badgerhold, so as not to get in the way of paging.

> If it passes the test suite, and doesn't break compatibility, then it'll probably get merged.

Keeping compatibility with the existing index sounds reasonable, but also tricky. Do we only use the prefix-based index for newly created indexes? Do we transition an existing index into a prefix index on the first write, making it potentially very expensive? Do we add some sort of "re-index" API that deletes all indexes and re-creates them with the prefix method? |
My plan was for all new indexes to use the new format, and for any existing indexes to use the old method. I was going to prefix the byte values so I could tell the difference when reading and writing them. Since we're talking about possibly multiple versions, it's probably a good idea to have something like an index version enum that is used as the prefix (sketched below).
|
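Purely as an illustration of the "index version enum used as a prefix" idea, with all names invented here and the question of detecting legacy, unprefixed gob values left open:

```go
package main

import "fmt"

// indexVersion is a hypothetical format tag stored as the first byte of new
// index entries, so different index layouts can coexist in one database.
type indexVersion byte

const (
	indexVersionKeyList indexVersion = iota + 1 // gob-encoded keyList
	indexVersionPrefix                          // one entry per (value, key) pair
	indexVersionPaged                           // paged key lists
)

// tagIndexValue prepends the format version to an encoded index value.
func tagIndexValue(v indexVersion, encoded []byte) []byte {
	return append([]byte{byte(v)}, encoded...)
}

func main() {
	tagged := tagIndexValue(indexVersionPrefix, []byte("encoded entry"))
	fmt.Printf("version=%d payload=%q\n", tagged[0], tagged[1:])
}
```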
why do you have thousands of indices???? |
@ivanjaros what is meant is that, when a field is indexed and thousands of records are inserted with the same value for that field, then the insertion cost is quadratic, as the index for that same field value is kept as a growing list. |
@timshannon Here's the patch that replaces the existing indexing with what I suggested above: vocdoni@1226c2c. I haven't posted a PR because it is not backwards-compatible with existing indexes. I see three paths forward:
I still think option 1 is best, as we could end up with a single indexing mechanism rather than 2 or more. |
As a previous user of badgerhold, I had performance issues with large indices. After much research, I have started to use Roaring Bitmaps for indexing.
I would like to suggest that badgerhold use Roaring Bitmaps for indexing. I believe this would provide significant performance improvements.
https://github.com/RoaringBitmap/roaring
https://sudonull.com/post/28381-Go-Bitmap-Indexes-Wild-Speed-Search-Badoo-Blog