PERF/TST: Figure out what's going on with vb_suite test result instability (frame_reindex_* in particular) #3158
Comments
(I suspect the problem is due to positive but non-deterministic performance optimizations, which means that, if we can't resolve the source of the non-determinism, we have to choose between less consistent but better average performance and more consistent but worse average performance, which is not a nice choice.)
It doesn't seem like it's just your machine; I see the same variance here.
The changes quoted (before the percentage values) are natural logs of the ratios (target/base). This is based on three commits. It's possible my p-values are off, but in any case, just grepping the results makes it clear that the large variance was already present.
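For illustration, a minimal sketch of that metric (the function name is mine, not something in vb_suite):

```python
import numpy as np

def log_ratio_change(target, base):
    """Natural log of the ratio target/base, as quoted before the percentages.

    Negative values mean the target commit is faster than the base.
    """
    return np.log(target / base)

# e.g. a benchmark that went from 2.0 ms down to 1.5 ms
print(log_ratio_change(1.5, 2.0))   # ~ -0.29, i.e. ~25% faster
```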
Added -n to control the number of vb runs to average over for each vb.
Pushed some more bells and whistles to test_perf. vb_suite lets the timeit module choose how many times to run each benchmark. I ran a few iterations; still variable (though much quicker about it).
OK, good to know it's not just my machine that has this issue. I'm also not sure if it's bimodal with a large enough sample size.
I pushed another update to test_perf; it now sets CPU affinity to protect against cold-cache surprises.
Comparing before and after, it definitely seems to have made an impact:
before:
after:
There are still outliers in the data (and more in the entire dataset), but that's definitely an improvement.
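For reference, a minimal sketch of what the pinning amounts to; os.sched_setaffinity is Linux-only (Python 3.3+), and test_perf itself may use a different mechanism:

```python
import os

# Pin this process to a single core so the scheduler can't migrate it
# mid-benchmark and cold-start the CPU caches on another core.
os.sched_setaffinity(0, {0})          # pid 0 = current process, core #0
print(os.sched_getaffinity(0))        # -> {0}
```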
FYI, the vbenches like frame_add, frame_add_no_ne and such are using numexpr (or not, for comparison).
OK, good to know. After disabling the ext4 journal commit interval and rerunning another 25 times, a few vbenches still stand out. Some of these are IO functions, which I'd expect to be a bit wonky. @jreback: got any more insights?
Nice! That looks much better.
There is another source of variability here: the groupby timing in simple_compress is related to how random the input is, and thus how many groups are formed.
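A minimal sketch of that effect (illustrative only, not the actual simple_compress vbench): without a fixed seed, the number of distinct groups, and hence the work groupby does, changes from run to run.

```python
import numpy as np
import pandas as pd

def make_groupby_input(n=100000, seed=None):
    # Unseeded input -> a different set of group labels every run,
    # so the groupby has a different amount of work to do each time.
    rng = np.random.RandomState(seed)
    return pd.DataFrame({"key": rng.randint(0, n // 10, size=n),
                         "val": rng.randn(n)})

df = make_groupby_input(seed=1234)        # fixed seed -> stable group count
print(df["key"].nunique())
print(df.groupby("key")["val"].mean().head())
```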
Any idea why it's the "no_ne" vbenches that are misbehaving?
They are really large (100k rows), probably cache-missing a lot.
Also, the multi_ne vbenches are 3 boolean ops, but because the first 2 ops are boolean ops among random numbers, maybe the 3rd op has more variance. Just a guess.
Actually, I trimmed those down to 20000 rows because they were slowing down the whole suite. Knocking frame_multi_and_no_ne down to 5000/1000 rows doesn't do much.
At least it's now a breeze to test a hypothesis.
Looks like the first result is always off (cache, probably); without it, things look better.
@jreback, is numexpr completely out of the code path in the no_ne vbenches?
It should be completely bypassed in the no_ne case; it's a different code path.
I guess this might be due to other libraries rather than pandas, but just out of curiosity, do we know if pandas actually uses circular references anywhere? I feel like everything I've seen is a directed acyclic graph, at worst. If there really are any reference cycles, we ought to remove them or convert them to use weak refs in one direction, imho. @wesm, do you know off the top of your head?
This could be too sensitive to outliers to be a useful headline sort metric... why don't you do something like std / mean instead, or interquartile range / mean? (See the sketch just after this comment.)
I found out in my experimental SIMD proxy testing that testing very large arrays can be misleading anyway, btw, so maybe we shouldn't be doing this (or should supplement the tests with other array sizes)... will explain in a post soon.
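A minimal sketch of the two dispersion metrics being compared (helper names are mine):

```python
import numpy as np

def cv(samples):
    """std / mean: simple, but a single hiccup inflates it."""
    s = np.asarray(samples, dtype=float)
    return s.std(ddof=1) / s.mean()

def iqr_over_mean(samples):
    """Interquartile range / mean: robust to one-off outliers."""
    s = np.asarray(samples, dtype=float)
    q75, q25 = np.percentile(s, [75, 25])
    return (q75 - q25) / s.mean()

timings = [1.02, 0.99, 1.01, 1.00, 3.50]       # one outlier run
print(cv(timings), iqr_over_mean(timings))     # cv blows up, iqr/mean barely moves
```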
I'll do that. But I think outliers would produce false positives rather than false negatives; the interquartile range would obscure the outliers, wouldn't it? That would produce false negatives.
I just think complete outliers could always be due to one-time hiccups on the host system; what's annoying is the consistent overall variability. But I guess if you're OK with false positives and have the underlying data anyway, it'd be fine.
Well, it's not a lot of data, and I'm actively looking for causes, so I think it's OK. std/mean it is. It'd be good if you and Jeff could repeat the test on your own machines: pull master + setup.py develop.
See new comment on #3146.
Everything looks good to me; I think either the affinity, the gc thing, or both did it :) Here are the "worst" offenders:
which isn't much... maybe I got lucky though, will try again with more runs.
The burn-in helped some too... I did another 25 runs on an old laptop; a few stragglers, but all in all, pretty solid. If you or Jeff come back with another positive report, I think we can close this.
Burn-in? quae? Anyway, if it is resolved, we should systematically try to figure out what actually made the difference: if gc is really having an effect, then we need to fix cycles, or hopefully get someone else to. I looked at the linked issue about malloc_trim.
It was partly cargo cult, I admit; I just wanted to make sure. Burn-in: do foo + N runs, and throw away the foo results. I thought I was seeing the first run come out consistently slower. As to finding out the exact cause in excruciating detail, you're the man for the job.
OK, I'll look into it; it'll be a learning experience =D
Second run (100 iterations) looks fine as well.
Hooray.
For completeness, on my run (with your above call) I got some errors like this:
I got that running in a venv with no numexpr. Pushed an update; the gc is now disabled during vbenches.
Oh, right, gc will have to run anyway to verify the lack of cycles. Although without cycles, memory usage should stay bounded and hopefully not trigger whatever heuristics CPython uses to decide when to run gc, which ought to be fairly conservative since gc is really just a backup memory-management method anyway (but I suppose it might not be, or we might be using up enough memory, even in a bounded fashion, to trigger them).
Some of the vbs are memory hungry (and were even more so until a recent commit). Anyway, as I said above, gc is now disabled during the benchmark. Btw, the gc theory would also explain why some vbs are more prone than others: the memory-hungry ones would be the ones that trigger a collection mid-run.
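A minimal sketch of the idea (not the actual test_perf code): disable the cyclic collector around the timed region so its allocation-count heuristics can't fire mid-measurement, then restore it afterwards.

```python
import gc
import time

def time_without_gc(func, number=100):
    # The cyclic collector fires based on allocation-count thresholds
    # (see gc.get_threshold()), which adds noise to memory-hungry vbenches.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        start = time.perf_counter()
        for _ in range(number):
            func()
        return (time.perf_counter() - start) / number
    finally:
        if was_enabled:
            gc.enable()

print(gc.get_threshold())                        # (700, 10, 10) by default
print(time_without_gc(lambda: sum(range(10000))))
```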
But in theory all of that memory goes back immediately after each iteration? I guess it depends on the vagaries of Python's allocator/deallocator (particularly free-store management) whether that really keeps the memory usage bounded across iterations, though.
I don't think the memory is released back to the OS after an iteration. Instead the Python process keeps it and tries to allocate from the new pool. This causes fragmentation over time. In theory it would be better to run each iteration from a top-level script (that is not Python) to have a constant allocation regime.
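A rough sketch of that idea: drive each iteration from a fresh interpreter so allocator state (and any fragmentation) can't carry over between runs. The snippet being timed is a stand-in, not a real vbench.

```python
import subprocess
import sys

SNIPPET = """
import timeit
print(timeit.timeit("sum(range(10000))", number=1000))
"""

def run_in_fresh_interpreter(snippet=SNIPPET):
    # Every call starts a brand-new Python process with a clean heap,
    # so one iteration's allocations can't fragment the next one's pool.
    out = subprocess.check_output([sys.executable, "-c", snippet])
    return float(out.decode())

print([run_in_fresh_interpreter() for _ in range(5)])
```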
Right, it might not go back to the OS (that's actually an argument for the burn-in making a difference, though: in theory only the first cycle will request new pages from the OS and every subsequent iteration will just reuse the same memory, if all the stars align just right).
The malloc_trim issue was all about a libc optimization behaving unexpectedly, IIUC. My guess would be that memory pressure/pool fragmentation is not the issue here. As Jeff said, and as I mentioned earlier as an idea, running each test in a new process would rule that out. There's also a blog post that suggests actively managing GC can yield perf wins in certain situations.
Hmm, anyway, it's not a big deal, but I think we might be talking about different things at this point. I was just under the impression that in the absence of cycles, all the memory allocated by a test should be released by the end of it, in theory, so each iteration past the first one should have stable memory usage if it's running the same tests each time. Even with stable memory usage, though, it's possible that gc will run because of whatever heuristics CPython uses to decide it's time to start a trace, so it's definitely better to isolate that as a variable anyway. But if memory usage is not stable and it's because of cycles, we ought to eliminate them, if possible, that's all.
I agree with the first part, and test_perf now calls gc.collect() between vbenches. Are we still talking about vbench variability, or thinking out loud about #2659?
Well, I don't think #2659 is really an issue--just because the OS reports higher memory usage doesn't mean it has any practical effect on performance unless things start swapping, in which case glibc is probably smart enough to give back memory on demand. I just think cycles ought to be eliminated if they're there, independently of how much they're contributing to other issues or not.
Sounds good.
Can we close this, then?
Nice job on all this forensics. I'd say it's safe to close if there are no objections. I think I got rid of all the circular references in pandas--the main issue is the wily "engine" objects inside the Index classes, which utilize weakrefs. One of these days, when we migrate all the index object code to Cython/C, we can clean up that mess.
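For anyone curious what that pattern looks like in miniature (illustrative names, not the actual pandas internals): the child object holds only a weak reference back to its owner, so no cycle is created and plain refcounting frees everything promptly.

```python
import weakref

class Engine(object):
    def __init__(self, owner):
        # Weak back-reference: Engine <-> owner no longer forms a cycle
        # that only the cyclic GC could reclaim.
        self._owner = weakref.ref(owner)

    def owner(self):
        return self._owner()          # None once the owner has been freed

class Owner(object):
    def __init__(self):
        self.engine = Engine(self)

o = Owner()
print(o.engine.owner() is o)          # True
del o                                 # reclaimed by refcounting alone, no gc needed
```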
Closing |
Hmm... well, turning off affinity doesn't seem to have any effect, actually, and neither does doing the gc.collect(), as it shouldn't, because apparently gc is already disabled during the vbenches anyway.
When running multiple runs against head, you're doing this all within one interpreter session, right? Not 100% sure yet, but I'm guessing that's what's helping the most, especially with dropping the burn-in runs. But we can't do that when running two different commits, no? I'm guessing this means we might have to run multiple runs on each commit, with burn-in, and compare the entire sets, to get the best results... @y-p, can we do this already conveniently with your recent changes, or is there a bit more scripting required for that?
So basically, I did nothing except the libc trim, which I didn't think had any effect either. The burn-in too might have been a red herring, I'm not sure, as I didn't measure it carefully. You can control the burn-in with -b (default 1), and -N will run head multiple times and give you the results across runs. I've been trying various combinations of -c, -n, -N and -b, and I can't get something consistently better.
Yeah, I'm not totally sure what it is either, but presumably if we use whatever methodology you're using for the multiple-runs-against-head tests in the normal comparison runs as well, then we should be OK.
So, Panel didn't respect OrderedDict up to about 5 minutes ago, and now it does.
Example:
with the following:
These random outliers are making it hard to figure out what's really happening (the optimization I wanted to test seems to improve things by 25%-40% in a C testbed, but I can't get clean results.)
These tests aren't doing anything fancy either, so if they're affected, it's possible the entire suite is affected as well (just less dramatically...)
Possibly some combination of #3129 and #3146 (cache and alignment) is related, or maybe some other source of non-determinism in Python and numpy. I can't reproduce this instability in C by trying to proxy it either, by randomly adjusting byte alignment, but I haven't tried screwing around with memory paging yet.
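For what it's worth, a numpy-level version of the alignment proxy (a sketch, not the C testbed): carve arrays out of a byte buffer at a chosen offset so the data pointer is deliberately misaligned, then time the same operation on both.

```python
import timeit
import numpy as np

def offset_array(n, offset, dtype=np.float64):
    # Slice a raw byte buffer at `offset` and reinterpret it, so the data
    # pointer is misaligned relative to the dtype when offset % itemsize != 0.
    itemsize = np.dtype(dtype).itemsize
    buf = np.zeros(n * itemsize + offset, dtype=np.uint8)
    arr = buf[offset:offset + n * itemsize].view(dtype)
    arr[:] = np.random.randn(n)
    return arr

aligned = offset_array(1000000, 0)
misaligned = offset_array(1000000, 1)
print(aligned.flags.aligned, misaligned.flags.aligned)   # True, False

for name, a in [("aligned", aligned), ("misaligned", misaligned)]:
    t = timeit.timeit(lambda: a + a, number=100)
    print(name, t)
```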