Use strong caches diligently #13400
Comments
comment:1
This example seems fairly typical:
In 5.3b2:
In 5.3b2+tickets:
You can see where a lot of that time is going: into category construction. With the stricter collection, these spaces need to be created time and again. On the plus side, if we run instead
we get with 5.3b2
and 5.3b2+tickets
so it's actually faster! (I've tried; these numbers seem representative.) I have no explanation for what is faster here. Perhaps just some dictionary … So apart from …
comment:2
I think the ticket description is a bit too broad. I hope you don't mind that I change the title? If you do, please change it back. The reason for the name change I'm suggesting is the observation that quite a few objects are created during startup of Sage and are then garbage collected before even the Sage prompt appears. I think that this indicates potential trouble. I find the following with sage-5.2.beta3+#12313 and its dependencies: when Sage starts, 37 different objects appearing as values in a weak value dictionary get removed. Some of the deleted values occur 3 or 8 times. In particular, the "Groupoid with underlying set Symbolic Ring" is deleted 3 times, and the "Groupoid with underlying set Rational Field" is deleted 8 times, before Sage even starts. In addition, both … Perhaps one should emphasize that … Would it be useful to have a switch for …? Perhaps that could even become a context, so that the computation of the echelon form could look like … (a sketch of such a context follows below).
Concerning startup, I think 37 objects are easy to take care of. Perhaps one should add strong references to them in the modules creating them.
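To make the context idea above concrete, here is a minimal sketch of how such a temporary strong cache could look, written against a plain WeakValueDictionary rather than Sage's actual MonoDict/TripleDict caches; the names strongly_cached and cache_value are made up for illustration:

    import weakref
    from contextlib import contextmanager

    _cache = weakref.WeakValueDictionary()   # stands in for Sage's weak caches
    _strong_refs = None                      # holds strong refs while active

    @contextmanager
    def strongly_cached():
        """Keep every value entering _cache alive until the context exits."""
        global _strong_refs
        _strong_refs = []
        try:
            yield
        finally:
            _strong_refs = None   # drop the refs; values are collectable again

    def cache_value(key, value):
        """Store value weakly, and also strongly while the context is active."""
        _cache[key] = value
        if _strong_refs is not None:
            _strong_refs.append(value)
        return value

With something like this, an echelon form computation could be wrapped as "with strongly_cached(): A.echelon_form()", so that all intermediate parents survive until the computation is done.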
comment:3
Since the profile above shows what seems to be an insane number of calls to …
it's pretty scary to see how often the system goes off on an investigation of completely trivial questions. For instance, just the line
triggers 140 lines like
I'm assuming you'd have to try really hard to get the category system to produce an arbitrarily long list of categories, and that the … 5.3b2 + tickets + cached is_subcategory:
That shaves about
so apparently some spaces get stuck in cycles, but not all.
comment:4
Replying to @nbruin:
That shouldn't be the problem, because ...
... it is cached, mainly. Are we talking about the same version of Sage? Namely, with #11943, the set of super categories is created once and for all, stored as an attribute, and then testing subcategories just amounts to testing set containment (a rough model follows at the end of this comment). And before merging #11943, I think … Before looking into the set of supercategories, … Since … Actually, I am pretty sure that the categories involved in that computation are all different (namely, algebras over different base rings or so). If they are all different, then what could be done about it? One could improve the …
The question is: where did you put the print statement? Namely, the actual work is done in …
My impression is that this is more of a problem than … The question is whether we look into the code and do that manually. Say, by creating a list of all matrix spaces occurring in the computation, where the list either resides at module level (if the matrix spaces are used in different functions/methods) or locally (if there is just a single function creating these matrix spaces). Or perhaps it is easier to create a context in which matrix spaces (or all weakly cached values) are temporarily strongly cached.
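For reference, here is a rough model of the #11943 behaviour described at the start of this comment: the set of supercategories is computed once, stored on the category, and the subcategory test reduces to a containment check. Class and attribute names are only illustrative:

    class Category:
        def __init__(self, direct_supers=()):
            supers = set(direct_supers)
            for S in direct_supers:              # take the transitive closure
                supers |= S._set_of_super_categories
            # computed once and for all; the frozenset makes the test cheap
            self._set_of_super_categories = frozenset(supers)

        def is_subcategory(self, C):
            return C is self or C in self._set_of_super_categories

    rings = Category()
    fields = Category([rings])
    assert fields.is_subcategory(rings)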
comment:5
Putting a print statement into the lazy attribute
actually prints nothing. Hence, all the work for the subcategory test seems to be done in the …
comment:6
Replying to @simon-king-jena:
That said: there are precisely two …
comment:7
Here is the effect of cythonising the subcategory hook:
comment:8
Here is how one can cythonise the default
With patch:
However, that doesn't much improve the performance of
which (together with the fact that …
comment:9
I just noticed: apparently in your example, the is_subcategory function is not the usual one (sage/categories/category.py) but the one in sage/categories/covariant_functorial_construction.py. It says

    def is_subcategory(self, C):
        """
        .. todo:: doctests + explain why this method is needed
        """
        if C is self:
            return True
        return any(X.is_subcategory(C) for X in self._super_categories)

Actually I wonder why it is needed, myself. Nicolas, is it not possible to use the default is_subcategory, by #11943? Anyway, I think the real speed-up will not come from is_subcategory, but from more intelligent caching.
comment:10
The following is on top of sage-5.2.beta3 plus #12313 and its dependencies. When I remove the custom is_subcategory method in covariant_functorial_construction.py and apply attachment: trac_13400_subclass_hook_cython.patch, I get
When I apply my patch but keep the covariant functorial constructions as they are, I get
When I do nothing on top of #12313, I get
Again, it isn't exactly a breakthrough. I haven't timed it pre-#12313 yet.
comment:11
Replying to @simon-king-jena:
In that case you might want to check whether that ticket was properly merged in the … What I see is

    def is_subcategory(self, C):
        """
        .. todo:: doctests + explain why this method is needed
        """
        if C is self:
            return True
        return any(X.is_subcategory(C) for X in self._super_categories)

if I add … [EDIT: removed, because I gave a sample above] … a lot of them, and as you can see, all "static" categories. None of those funny …
multimodular echelonization loop (5.3b2 + tickets + cached is_subcategory):
The profile looks as above. I'm puzzled as to why I'm now consistently getting 2-3 seconds worse than before. Did some recompilation put things in unfortunate spots? Strongly caching matrix spaces: wouldn't help much with the current design. It might be an idea to keep a cache of good fields for multimodular stuff, that …
comment:12
Hm, simply deleting
As you can see, it's really the construction of the quotient rings (i.e., the finite fields) that is expensive now. That would get better with a multimodular field cache.
comment:13
Replying to @simon-king-jena:
Oops, honestly I don't remember at all; I should have done this todo! In any case: if all tests pass after removing this method, please go ahead. Cheers,
comment:14
ptest with CovariantFunctor.is_subcategory removed:
The main thing now would be to see if the next one on the list,
needs to cost as much as it does. And if the general one has to, whether the path for creating finite fields should be short-cut.
comment:15
Replying to @nbruin:
Yes, I was talking about the is_subcategory method in sage/categories/category.py. I didn't know that the covariant functorial constructions have their own method (which may be deletable, according to Nicolas, provided the tests pass).
Just for clarification: the … Anyway, if deleting the method is an option, then we should try it. After all, it isn't cached at all, in contrast to what is done in the default is_subcategory method. Plus, if it turns out that my experimental patch gives more speed, I may consider making it a proper patch.
comment:16
Replying to @nbruin:
OK, then I think we should remove it.
If I remember correctly, improving the initialisation of finite fields was part of #11900. We could consider putting stuff from … And we should test whether "small" finite fields shouldn't better be cached. I see that the elements of an …
versus
Hence, I suggest that …
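A sketch of what that suggestion could look like: a module-level dict pinning GF(q) for small q, so that repeated construction becomes a dictionary lookup. The bound and the name _SMALL_FIELD_CACHE are assumptions for illustration, not Sage's actual mechanism:

    from sage.all import GF

    _SMALL_FIELD_CACHE = {}
    _SMALL_BOUND = 500        # pin only fields below some prudent bound

    def small_GF(q):
        """Return GF(q), strongly cached for small q."""
        if q >= _SMALL_BOUND:
            return GF(q)
        try:
            return _SMALL_FIELD_CACHE[q]
        except KeyError:
            K = _SMALL_FIELD_CACHE[q] = GF(q)
            return K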
comment:17
Replying to @simon-king-jena:
Yes, if that speeds things up, it might benefit quite a bit of Sage. Almost any considerable computation over ZZ (and, by first clearing denominators, over QQ) benefits enormously from a multimodular approach, and I expect a lot of computations in Sage already make use of that. It would be great if these could be computed using the general framework, so construction of finite fields and conversion between ZZ and GF(p) should be blindingly fast. Basically anything goes. This case is different from all other rings because it is so close to the metal.
With a prudently chosen bound, this would probably not lead to excessive memory use. I'm wondering whether there's any benefit, though. For elliptic curve computations and other arithmetic-geometric things there might be: one often has to look at the reduction modulo all primes below a certain bound, so if one does so repeatedly, the gain can be substantial. This might explain why the elliptic curves code is relatively sensitive to these things. Although I think the most heavy-duty applications of this get seconded to GP/Pari wholesale anyway, so they would not involve Sage finite fields at all. It would not help multimodular stuff at all, though, because there one usually chooses primes that are as large as possible while still allowing for the fastest arithmetic on the processor. Usually one would randomize the choice of moduli, to amortize the cost of choosing "wrong" moduli. However, that's only theoretically wise. For an actual application, when you choose a prime a little below 2^31 (or whatever you need to still be able to do a full multiplication and a mod computation afterwards; it's REALLY worth delving into your CPU's instruction set to see if a "multiply into two 64-bit registers" combined with a "remainder of a 128-bit modulo a 64-bit number" exists, because it gives you twice the number of bits for almost the same price - hopefully we are already using libraries that do that for us), it's NEVER going to be a bad one, so using it all the time is not bad (OK, on a 4GHz machine, in a computation that takes 10^6 cycles, it would be happening about 1.25 times a month). What might be a good idea is to have a prime iterator
which could always start by giving primes from a fixed set (in fixed order or not) and then continue with generating new ones. If we strongly cache the fields belonging to those first ones, we'd get most of the benefit at limited memory cost. Of course, it does mean changing all prime selection loops for multimodular stuff to use this iterator; a sketch follows at the end of this comment. The next bit is tuning the crossover points between naive/multimodular/padic methods for most of these. The main thing this does is mitigate the effects if finite field construction cannot be made lightning fast in general. We still need conversion between ZZ and GF to be lightning fast, even if that means special-casing this in relevant code paths. It would be nice if derived conversions between matrix algebras and polynomial rings were similarly fast.
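Here is a sketch of such an iterator, under the assumption that one would pre-seed a fixed block of large primes and strongly cache their prime fields; the name multimodular_fields and the block size are invented for illustration:

    from sage.all import GF, previous_prime

    _FIXED_PRIMES = []
    _p = 2**30
    for _ in range(15):                    # a fixed block of large primes
        _p = previous_prime(_p)
        _FIXED_PRIMES.append(_p)

    # strong references: these fields survive garbage collection
    _FIXED_FIELDS = [GF(p) for p in _FIXED_PRIMES]

    def multimodular_fields():
        """Yield prime fields for multimodular algorithms: first the
        strongly cached block, then freshly generated ones."""
        for K in _FIXED_FIELDS:
            yield K
        p = _FIXED_PRIMES[-1]
        while True:
            p = previous_prime(p)
            yield GF(p)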
comment:18
OOPS! It turns out the default algorithm for these matrices is "padic", which is a bad choice, as these examples show. It's particularly bad for low-rank examples such as this one. If we take a Vandermonde matrix, it's a little better
but multimodular is a clear winner. There the fields DO get generated in a deterministic order:
so if it were an issue we should store finite fields for some of those p (and also for
so it seems that with Thierry's code fixed, constructing the matrix rings … The REAL surprise is that in the padic variant the primes get selected via this
Random numbers are very expensive! So just changing that (in two places) to a … (sketched below). In short, after picking off the … Furthermore, perhaps the implementation of … I suspect there are some places where we can benefit from more tuned/intelligent …
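To illustrate the kind of change meant here (this is a sketch, not the actual patch): replacing the per-prime call to random_prime with a deterministic walk avoids the RNG cost entirely:

    from sage.all import previous_prime

    def deterministic_primes(start=2**15):
        """Yield primes below start in decreasing order, with no RNG cost."""
        p = start
        while True:
            p = previous_prime(p)
            yield p

    # versus the expensive variant, roughly:
    #     p = random_prime(2**15, lbound=2**14)   # fresh RNG work per prime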
comment:19
OK, if the echelon form thingy can simply be optimized by a better heuristic for choosing between padic and multimodular, and if it has not so much to do with coercion and caching, the question is whether we split the ticket into one for coercion and caching and another for the echelon form. Just for information: with #12313, make ptest took 2155.0 seconds on my machine; with an update of the patch from here it took 2152.6 seconds; but vanilla sage-5.2.beta3 took only 1991.3 seconds. Hence, there is indeed more work to do, and I think one should start by finding out which tests showed the worst regression.
comment:20
In these tests, vanilla is at least one second faster than #12313:
In these tests, #12313+#13400 is at least one second faster than vanilla:
comment:21
Perhaps the routine I wrote for extracting the execution times of the individual tests was wrong. I tried again. In the following, I give the time t1 in sage-5.2.beta3 versus the time t2 with #12313 and the patch from here, followed by the name of the test. In the following tests, patched Sage is slower by at least 5 seconds, with a regression of at least 15% (…
Here are the tests that became faster in the patched version by at least 3 seconds, improving by at least 7% (…
The worst regression seems to be in devel/sage/sage/combinat/partition_algebra.py, a regression of more than 70%:
comment:22
I've extracted the … vanilla 5.3b2:
5.3b2 + ticket + Thierry's fixed code
The worst change in cumulative time (it would probably be better to sort the profile data on that to see which part of the code is responsible - after that you can see which callees from that code use much time) seems to be in
but note that the NUMBER of calls here isn't much different, so it doesn't look like we're creating parents that much more often here (that would always show up in Python profiles, right?). On the other hand, … Ah, and I finally found what the slash means and what
means. I can imagine that these attributes take more time when they are called for the first time on a parent, so perhaps this showing up IS an indication that new parents are being made.
comment:23
I think this illustrates quite nicely what the problems are here:
One creates a parent … Conceptually, one would probably want an "element-like" object first, with delayed "parent" properties. Only when the user insists on treating the thing as a parent do the parent-like properties get initialized (sketched below). I don't know whether it's worth having something like this. I suspect that if you want to access elements of …
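A self-contained sketch of that "element first, parent later" idea; the class and the stand-in for the expensive initialisation are invented for illustration:

    class LazyParent:
        def __init__(self, construction_data):
            self._data = construction_data
            self._category = None          # parent machinery not yet set up

        def _init_as_parent(self):
            # stands in for the expensive work: category setup, coercion
            # maps, generators, homset machinery, ...
            self._category = ("category of", self._data)

        def category(self):
            if self._category is None:     # initialise on first parent-like use
                self._init_as_parent()
            return self._category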
Attachment: trac_13400_subclass_hook_cython.patch.gz (Cythonise subcategory_hook. Experimental, and a bit off-topic.)
comment:24
I experimented a bit further, towards an improved initialisation of finite fields. I have already mentioned the idea of caching small fields: the elements of a small field are cached, so why should one not cache the field itself? That's in attachment: trac_13400_cache_small_rings.patch. And I created a short-cut for building ideals in ZZ. That's useful because one needs an ideal in ZZ while initialising a finite field. It has a noticeable effect. With only the first two patches:
Note that the time constantly drops - why is that? With all three patches:
Hence, the quick way of creating an ideal is much faster, and when using it in the creation of finite fields, it yields a speed-up of … In a next step, one could try to unravel the …
Apply trac_13400_subclass_hook_cython.patch, trac_13400_cache_small_rings.patch, trac_13400_quick_ZZ_ideal.patch
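For the record, the short-cut presumably amounts to something like the following sketch (not the patch itself): skip the generic Ideal(...) factory, which does a lot of argument normalisation, and build the principal ideal directly. This assumes Ideal_pid accepts (ring, gens) as Ideal_generic does; check the actual patch before relying on it:

    from sage.all import ZZ
    from sage.rings.ideal import Ideal_pid

    def quick_ZZ_ideal(n):
        """Return the ideal n*ZZ without going through the Ideal factory."""
        return Ideal_pid(ZZ, [ZZ(n)])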
comment:25
Are you taking the following into account?
The multimodular code I have seen uses … I did not quite understand the following:
So …
comment:26
Replying to @nbruin:
Apparently I do, since the timings improve in the same way:
With the patch, one has
Hence the speedup is
Actually, I have wondered about that myself. I guess it is just a tribute to very old code. IIRC, someone (William Stein?) argued as follows: it is actually typical that coercion goes from "less structure / simpler" (e.g., from a base ring) to "more structure / more complicated" (e.g., to a polynomial ring or to a matrix space over the base ring). And: if we have an object O in some category (say, GF(5) in the category of fields), and an object X that is obtained from O by a forgetful functor (say, Integers(5) in the category of rings), and we have a in O and b in X, then we typically want a+b to belong to the "richer" structure - otherwise, we would have worked in X right away.
comment:27
That looks like quite an impressive gain. Given that you already have a good way of analysing doctest timing per file, would you see what the effect is? "Premature optimization is the root of all evil" (Knuth): we shouldn't actually be optimizing doctests blindly. We can use them as a guide to find gross inefficiencies and repair those. My general feeling is that reducing the overhead of producing finite fields and related structures should be a big benefit generally, but we should only go down that path if we can corroborate that. I was a bit surprised to see that caching "popular multimodular" fields didn't make much of a difference, after fixing Thierry's code, for an example where I thought it would. It may still be a good idea, but I don't have evidence that it matters.
comment:28
Replying to @nbruin:
That's why I didn't pack everything into one patch. Keep it modular...
comment:29
OK, this is more about tuning the rational echelon form code, but since the issue came up here, I'll post some preliminary timings:
Timings are reported as
so it seems that the cross-overs could use some tuning. In particular, for non-maximal-rank matrices it seems to take a while longer to overtake classical. It would be great if someone were able to tune this code properly. It would be very useful if Sage echelonized small matrices over arbitrary base fields quickly in most cases (and did not make silly choices), because that allows writing linear-algebra-intensive algorithms in a base-field-agnostic way.
comment:30
Above, I mentioned … With sage-5.3.beta2, #12313 and the patches from here, I get
In unpatched sage-5.3.beta2, one gets
I thought this was supposed to be fixed with #13370, but sage-5.3.beta2 plus #12313 plus the patches from here plus #13370 yields:
My impression is that, again, caching is to blame. Namely, since the weak caches from #715 and #12313 compare by identity, not equality, repeating … (see the toy example below). But have we not recently dealt with the same problem on another ticket? I currently can't find it.
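A toy illustration of the identity-versus-equality problem (plain Python, not the actual MonoDict/TripleDict code): a cache keyed by object identity misses keys that are equal but not identical, so the expensive construction is repeated:

    class IdKeyCache:
        """Cache whose keys compare by identity, like the #715/#12313 caches."""
        def __init__(self):
            self._d = {}

        def get(self, key):
            return self._d.get(id(key))

        def set(self, key, value):
            self._d[id(key)] = value

    cache = IdKeyCache()
    k1, k2 = [1, 0, 1], [1, 0, 1]     # equal keys ...
    cache.set(k1, "expensive result")
    assert cache.get(k2) is None      # ... but k2 is not k1: cache miss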
comment:31
Replying to @simon-king-jena:
It was #13378.
Prevent small finite rings from being garbage collected
Attachment: trac_13400_cache_small_rings.patch.gz Provide a short-cut for the creation of ideals in ZZ
comment:32
Attachment: trac_13400_quick_ZZ_ideal.patch.gz I have updated the last two of my patches. The doctests should all pass with them.
With tickets #715, #11521, #12215, #12313, parent structures no longer have the eternal life they used to have. That affects run-times, because the chances are now much higher that a parent has to be rebuilt, whereas before it was just looked up in a cache. It would be nice if we could mitigate much of this.
(filed under coercion because that seems to be one of the main components in this)
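One possible mitigation, sketched here as an assumption rather than a plan from the ticket: pin frequently rebuilt parents with module-level strong references so the weak caches keep returning the same objects. The choice of parents is only an example:

    from sage.all import GF, MatrixSpace, ZZ

    # strong references: as long as this module is alive, these parents
    # cannot be garbage collected, so repeated construction is a cheap
    # cache lookup instead of a rebuild
    _PINNED_PARENTS = [GF(p) for p in (2, 3, 5, 7, 11)]
    _PINNED_PARENTS.append(MatrixSpace(ZZ, 2))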
CC: @nthiery @jpflori
Component: coercion
Issue created by migration from https://trac.sagemath.org/ticket/13400