Profile and improve map performance #50
To start people off, here is a flamegraph of […]. Looking at the profile, the stack on the right (ignoring the […]) […]
When looking at the cost of inserts, the biggest cost I see is allocation ([…]).
Interesting. The […]. Overall, I'm more concerned with the comparison to […].
Re: […] Edit: […]
If creating and dropping […]
@jhinch that could work, though the tricky part is going to be to find the right way to free them. If we can figure that out though, that would probably help a lot.
I'd say the […]
That's a good observation, and yes, I think that could work! That should give a very nice speedup for resizes indeed. Anyone want to write up a PR?
Thinking some more about this, I think the trick will be to allocate a […]
As an update on […]
@jonhoo That's what I meant, yeah. Sorry if that was unclear from my previous, rather tired comment 😅 Could you expand on the problem you see with existing references at the point of […]?
@domenicquirl Ah, but notice that the drop code uses […]
I found some time to try and implement this in #56. The result on my machine is a very significant performance improvement:
[…]
When working on it I noticed you were also probably referring to other references from bins in the same table, which I did not consider immediately. Still, destroying it immediately is also what I do at the moment, under the assumption that at the point of […]
Ah, no, what I was thinking about was making sure that all the pointers that point to the shared […]. You can assume that all references to the table were indeed in a previous epoch. This was the core of the argument in a9c6890, and is why all returned references are tied to both the lifetime of […]
I wonder how different these numbers would look with jemalloc... We're certainly being bitten by the cost of allocation here, which […]
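If someone wants to try that, a minimal way to swap in jemalloc for a benchmark run (assuming the `jemallocator` crate) is:

```rust
// Assumes the `jemallocator` crate is added as a dependency.
// Routes all heap allocations through jemalloc instead of the system allocator.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```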
It is particularly bad that we have to do two allocations for every put -- one for the value and one for the node. We could at least do a little better by not allocating for the node if we end up doing a replacement. Keeping a "graveyard" of allocated nodes […]
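To make the graveyard idea concrete, here is a rough sketch; the names are made up, and a real version would have to be concurrent and epoch-aware rather than `&mut`-based:

```rust
// Rough sketch only: all names are hypothetical, and a real implementation
// would need to be concurrent and integrated with epoch-based reclamation.
struct Node<K, V> {
    key: K,
    value: V,
}

struct Graveyard<K, V> {
    dead: Vec<Box<Node<K, V>>>, // allocations retired by removals/replacements
}

impl<K, V> Graveyard<K, V> {
    // On insert: re-use a retired allocation if one is available,
    // saving one of the two allocations per put.
    fn alloc(&mut self, key: K, value: V) -> Box<Node<K, V>> {
        match self.dead.pop() {
            Some(mut node) => {
                node.key = key;
                node.value = value;
                node
            }
            None => Box::new(Node { key, value }),
        }
    }

    // On remove/replace: keep the allocation around instead of freeing it.
    fn bury(&mut self, node: Box<Node<K, V>>) {
        self.dead.push(node);
    }
}
```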
Thinking some more about this, I don't think a graveyard will help us much. Unless the user does a lot of removals or replacements (which I don't think are that common — updates maybe?), there probably won't be much garbage for us to re-use, and it won't be worth the bookkeeping overhead.
I wrote a first draft of a more "serious" concurrent benchmark: https://github.com/jonhoo/bustle
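Driving it looks roughly like this (the exact builder methods, and the `FlurryTable` adapter implementing bustle's `Collection` trait, are assumptions on my part):

```rust
use bustle::{Mix, Workload};

fn main() {
    // Mix::read_heavy() is 94% reads, 2% inserts, 3% updates, 1% removals.
    Workload::new(10, Mix::read_heavy())
        .initial_capacity_log2(29) // table starts with 2^29 slots
        .prefill_fraction(0.75)    // populate 75% of the keys before measuring
        .run::<FlurryTable>();     // hypothetical adapter for flurry's map
}
```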
That's a lot of time taken by guards, even though it appears you are running this against […]
Yup, that's one guard pin for each iteration. I have a version that just repins every ~1k operations, and it does much better.
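The pattern looks roughly like this (a sketch, assuming flurry's `guard()`/`get(key, &guard)` API and crossbeam-epoch's `Guard::repin`):

```rust
fn read_loop(map: &flurry::HashMap<u64, u64>, keys: &[u64]) {
    // One pin for the whole loop instead of one per operation.
    let mut guard = map.guard();
    for (i, key) in keys.iter().enumerate() {
        let _ = map.get(key, &guard);
        // Re-pin every ~1k operations so the global epoch can still advance
        // and garbage can be freed, without paying the pin cost on every call.
        if i % 1024 == 1023 {
            guard.repin();
        }
    }
}
```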
Interesting. For how much we've looked at […]. Also, how many elements are in the map at max? I'm wondering about […]. For a safety check, […]
This workload is very read-heavy. The mix includes inserts (2%), updates (3%), and deletes (1%). In this particular benchmark the table had an initial capacity of 2^29, with 75% of the keys populated before the benchmark began. Then the mix above ran 402,653,184 operations across 10 threads in 13.201507468 s (time/op = 32 ns). I agree the lack of transfer seems odd here, not sure what that's about. I'm not terribly surprised that […]
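(For scale: 402,653,184 ops / 13.2015 s ≈ 30.5 M ops/s aggregate across the 10 threads, which is where the ~32 ns wall-clock time per operation comes from; per thread that is roughly 3 M ops/s.)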
If the table is presized to be this large and actually allocates the bins beforehand, then it never needs to resize. Regarding […], I don't have a good idea for the guard check in […]
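For illustration, if the benchmark harness presizes along these lines (assuming flurry's `with_capacity` constructor), no resize should ever be triggered and no transfer frames should show up in the profile:

```rust
// Illustrative presizing: 2^29 slots up front, so subsequent inserts
// never have to grow the table.
let map: flurry::HashMap<u64, u64> = flurry::HashMap::with_capacity(1 << 29);
```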
Ah, no, the table is initialized with a capacity of […]. I think that just suggests that most lookups find bins of length 1, where the first element matches. That would mean that we very rarely end up walking a bin through […]. Hmm, yeah, I'm not sure. Having a private "I promise this […]" […]
Yeah, but still, if a capacity of […]. Long bins are also interesting for hash collisions, because then we don't have a resize. But hopefully this picture is similar, at least once we have tree bins. Is holding a […]? If we wrap […], I think we should either […]
I personally am in favour of the second option. What do you think?
Ah, but the […]. Yes, you will still need to check when the […]. One option is to have a method like […]. I'm not entirely sure how you are suggesting the second option would work? Some kind of trait that we implement for both "our" guard and the epoch guard, with a method that returns whether it's been checked? That's probably going to cause all sorts of soundness issues, like the user implementing the trait themselves for their own type. I prefer requiring that the user must use our […]. One thing we have to be very careful with is that a user doesn't take a "checked" […]
No no no, I did not want to implement this as a trait 😅 I was also assuming that the collector will stay the same, since we currently explicitly don't provide a […]
Getting the cost of […]. A naive idea would be to count the maps and give each an increasing ID, and then store in the Guard which IDs it has already been checked against. This has the obvious issue of allowing only a (possibly large, but) finite number of maps to be created. But what we could do is have a pool of N IDs, which can be assigned to a map and re-used if that map is dropped. A map with an ID could use the cheaper check against the ID after checking any Guard once, while maps without an ID default to the expensive check until they are assigned an ID (or forever, in the worst case). A map without an ID could try to obtain one whenever (or every n times) it has to perform the expensive check (rough sketch after this comment). Remaining considerations with this: […]
I hope this at least furthers the discussion and maybe produces some more ideas 😉
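Here is the promised sketch; every name is hypothetical and the concurrency story is hand-waved, it is just meant to make the bookkeeping concrete:

```rust
use std::sync::Mutex;

// Pool of map IDs; one bit per possible ID in the guard's `checked` mask.
const POOL_SIZE: u8 = 64;

struct IdPool {
    free: Mutex<Vec<u8>>, // IDs not currently assigned to any map
}

impl IdPool {
    fn new() -> Self {
        IdPool { free: Mutex::new((0..POOL_SIZE).collect()) }
    }
    // Called by a map that wants to upgrade to the cheap check.
    fn acquire(&self) -> Option<u8> {
        self.free.lock().unwrap().pop()
    }
    // Called when a map with an ID is dropped, so the ID can be re-used.
    fn release(&self, id: u8) {
        self.free.lock().unwrap().push(id);
    }
}

// Hypothetical wrapper guard that remembers which maps it was validated for.
struct CheckedGuard {
    guard: crossbeam_epoch::Guard,
    checked: u64, // bit i set => already validated against the map with ID i
}

impl CheckedGuard {
    fn already_checked(&self, map_id: u8) -> bool {
        self.checked & (1 << map_id) != 0
    }
    fn mark_checked(&mut self, map_id: u8) {
        self.checked |= 1 << map_id;
    }
}
```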
So, here's where this all gets weird to me — checking that a […] should really just be a single comparison.
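For concreteness, this is the kind of check I have in mind, using crossbeam-epoch's `Guard::collector`; whether our internal check looks exactly like this is an assumption:

```rust
use crossbeam_epoch::{Collector, Guard};

// Sketch: `Collector` equality compares the underlying handles by pointer,
// so this should amount to one branch plus one pointer comparison.
fn guard_matches(map_collector: &Collector, guard: &Guard) -> bool {
    guard.collector().map_or(false, |c| c == map_collector)
}
```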
That is weird indeed. Why then would it take up so much time? But yeah, if we not only change it to our own check, but have to introduce additional branching and bookkeeping overhead, it's likely that would not be faster. It just seems too much for a single comparison, given how involved some of the map methods are 🤔
It could be that it just happens so often. Maybe accessing the […]
We just merged #43, which gives us benchmarks! Currently we have benches against `dashmap` and `hashbrown`.

First runs of the benchmarks seem to indicate that retrieving values from the map is quite performant, but inserting into the map is rather slow compared to both other maps. Preliminary results:

- `insert` and `insert_erase` run in the order of one ms, compared to 10-20 us in `hashbrown`. `get` holds up reasonably well though, still ~20 us to ~5-6, but given the added concurrency this seems much better than `insert`. Both of these use one guard across all operations of one benchmark run and are relatively consistent across input distributions.
- `dashmap` inserts in something between 1 ms with 1 thread and .8 ms for multiple threads; however, what's probably more important is throughput. This for some reason is not in the report produced by `criterion`, but was on the order of 100 MElements/s. We compare with about 25-12 ms depending on thread count, which is so slow it takes several minutes to run any benchmark and puts throughput at something like 5, maybe 10 MElements/s. This is using 1 guard per map operation, which admittedly is not what you would do in a scenario such as the benchmark. With only one guard, we `insert` in 14-11 ms, which is better, but still considerably slower than `dashmap`. Note that this does not compare directly due to the different setup of the threads.
- `dashmap` gets 700-500 us, while we get 1.5 ms on one thread, 800 us on 2 threads, and go down to below 300 us on 8 threads, even with `pin()` for every call to `get`. With only one guard, this improves to 800 to a bit more than 200 us, depending on thread count.

It would be interesting to find out if these results are reproducible and, if so, what causes inserts to be so slow. @jonhoo suggests graphing the results of `perf record` as a profiling method in the discussion in #43; maybe this can be used as a starting point for the analysis.