Int/float numpy performance #100
Comments
Yes, let's include alternative type dot tests in test_library_performance.py. Then let's compare the results against the latest NumPy release (1.14.0). Am I misunderstanding the unittest overhead? The code measures and prints CPU and Wall times for a function that just calls the dot operation under test.
Sorry, I thought I was including the second dimension in the first test case and thought there had to be some overhead to account for the difference in time, but it's just because there are fewer operations being done in that case. Based on the results, I'll create a pull request for the changes.
If we can close it sometime soon (next week or so), then I can add on top of it, but I don't think there's that much of a rush to get it in there. It seems like it fits in pretty nicely where it is, so I don't really want to create a new test class.
Also, against the new release (1.14.0), the times are comparable within the variability from run to run.
In this case, sparse matrices may be the best solution. Setup:

```
import numpy as np
from scipy import sparse

S_int = np.zeros((100, 100), np.int64)

# Place a single '1' in each row and column
S_int[np.random.choice(100, 100, replace=False), np.random.choice(100, 100, replace=False)] = 1

S_float = S_int.astype(np.float64)
S_sparse = sparse.coo_matrix(S_int)  # "coo"rdinate matrix
```

Test via:

```
%timeit np.random.uniform(size = 100)                # approximately, the overhead: 2.05 us
%timeit S_int.dot(np.random.uniform(size = 100))     # integer matrix: 261 us
%timeit S_float.dot(np.random.uniform(size = 100))   # float matrix: 96 us
%timeit S_sparse.dot(np.random.uniform(size = 100))  # sparse matrix: 7.82 us
```

That's a massive speed-up, even over the float matrices. Granted, this is probably a best-case scenario for the coordinate matrix representation, but I'd hazard that our stoichiometric matrices are pretty sparse, too.

Caveat: these all give the cached-result warning. I even get this warning for the random number generation alone, suggesting to me that it's erroneous. Probably worth checking; I'd also consider checking the other sparse matrix representations. My tests suggest that …
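As a follow-up to the suggestion to check the other sparse representations, a sketch of that comparison might look like the following. It reuses the `S_int`/`S_sparse` setup above, must be run under IPython for `%timeit`, and the CSR/CSC choices and "typically fastest" note are general scipy knowledge rather than results from this thread.

```
import numpy as np
from scipy import sparse

# Build the other common sparse formats from the same S_int matrix.
S_csr = sparse.csr_matrix(S_int)   # compressed sparse row
S_csc = sparse.csc_matrix(S_int)   # compressed sparse column
v = np.random.uniform(size=100)

%timeit S_sparse.dot(v)   # COO, as in the setup above
%timeit S_csr.dot(v)      # CSR is typically the fastest format for matrix-vector products
%timeit S_csc.dot(v)
```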
Interestingly, I'm not seeing any improvement when using sparse matrices in an analogous circumstance (in my own project). In fact, performance is slightly reduced. In this project (…):

```
%timeit np.random.uniform(size = 100)                # approximately, the overhead: 2.09 us
%timeit S_int.dot(np.random.uniform(size = 100))     # integer matrix: 15.6 us
%timeit S_float.dot(np.random.uniform(size = 100))   # float matrix: 4.98 us
%timeit S_sparse.dot(np.random.uniform(size = 100))  # sparse matrix: 8.03 us
```

Those are huge improvements for integer and float multiplication when compared to the … environment:

```
%timeit np.random.uniform(size = 1000)                # approximately, the overhead: 12.4 us
%timeit S_int.dot(np.random.uniform(size = 1000))     # integer matrix: 1.13 ms
%timeit S_float.dot(np.random.uniform(size = 1000))   # float matrix: 215 us
%timeit S_sparse.dot(np.random.uniform(size = 1000))  # sparse matrix: 20.8 us
```

I'd guess the break-even point is somewhat less than 1% sparsity, but factors other than the percent-sparsity doubtless contribute. Regardless, it seems like some of your speed issues would be resolved by linking …
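To pin down that break-even point, a sweep over density along these lines could work. This is an illustrative sketch only: the matrix size, density values, repetition count, and CSR format choice are assumptions, not measurements from this thread.

```
# Compare dense float vs. sparse CSR matrix-vector products across densities.
import numpy as np
from scipy import sparse
import timeit

n = 1000
v = np.random.uniform(size=n)

for density in (0.001, 0.005, 0.01, 0.05, 0.1):
    S = sparse.random(n, n, density=density, format='csr', dtype=np.float64)
    D = S.toarray()  # dense copy of the same matrix
    t_dense = timeit.timeit(lambda: D.dot(v), number=1000)
    t_sparse = timeit.timeit(lambda: S.dot(v), number=1000)
    print("density %5.3f: dense %.4f s   sparse %.4f s" % (density, t_dense, t_sparse))
```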
One final thought, mostly for @1fish2: As you've probably seen, we do a lot of matrix-vector multiplications between sparse integer matrices (usually with values of +/- 1 or 2) and float vectors. What I wonder is whether it is possible to write a faster version (e.g. in Cython). For example, if your nonzero matrix coefficients are always 1, we could skip the casting from the integer 1 to the float 1, as well as the multiplication between the (now float) 1 and the float vector element's value. I don't know if this is enough of a bottleneck in simulation execution time to be worth pursuing, but it seems like it would be relatively straightforward to implement in Cython.
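Not Cython, but a rough NumPy prototype of the special case described above: when the nonzero coefficients are all ±1, the matrix-vector product reduces to gathering and summing vector elements, with no multiplies at all. The function name and toy matrix below are made up for illustration; a Cython version would push the per-row loop down into C.

```
import numpy as np

def make_pm1_dot(S):
    """Precompute, for each row of a +/-1-valued matrix, the column indices
    of its +1 and -1 entries, so the dot product becomes sums and a subtraction."""
    plus = [np.flatnonzero(row == 1) for row in S]
    minus = [np.flatnonzero(row == -1) for row in S]
    def dot(v):
        return np.array([v[p].sum() - v[m].sum() for p, m in zip(plus, minus)])
    return dot

S = np.random.choice([-1, 0, 0, 0, 1], size=(50, 50))  # toy sparse +/-1 matrix
v = np.random.uniform(size=50)
pm1_dot = make_pm1_dot(S)
assert np.allclose(pm1_dot(v), S.dot(v))  # same result as the ordinary dot
```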
Let's open a "performance" Issue and investigate the possibilities when we get to it, and profile to estimate the potential impact, as you mentioned. Using Cython to take advantage of the special case is straightforward. Getting parallelism there is more work and more fragile. Meanwhile, I'll add this case to the numpy/scipy performance test, then let's compare pyenvs that have different numpy & scipy installations. -- The pip-installed numpy was looking good over the one that uses …
Good point on the parallelism; I imagine doing that correctly might require going beyond Cython.
Cython can release the GIL (global interpreter lock) and use a threading library like OpenMP. It's a "small matter of figuring out" just how to do that: how to access those libraries, create a thread pool so the threads don't have to be recreated each time, see that it doesn't try to use more CPU cores than slurm actually made available, etc. This reminds me: the “cached result warning” could be general startup overhead like spawning a pool of threads. You could test that by using cell mode %%timeit.
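For reference, a sketch of the cell-mode idea (assumes an IPython/Jupyter session with numpy already imported as np): statements on the `%%timeit` line itself run as setup outside the timed loop, so any one-time startup cost they trigger stays out of the reported numbers.

```
%%timeit A = np.random.uniform(size=(1000, 1000))
A.dot(A)   # only this cell body is timed; the setup on the %%timeit line is not
```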
I added some integer and mixed dot products to the performance test. On macOS, integer × integer matrix multiply takes 18x the CPU time, 88x the Wall time, and gets no parallelization. Cutting to the chase, there's an easy workaround:
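The workaround snippet itself didn't survive in this copy of the thread; judging from the later comments about casting to float32/float64, it's presumably along these lines (a sketch with made-up matrix sizes, not the original code):

```
import numpy as np

# Cast the integer operands to float before the dot so the multiply runs
# through the optimized BLAS floating point path.
A_int = np.random.randint(-2, 3, size=(1000, 1000))
B_int = np.random.randint(-2, 3, size=(1000, 1000))

C = A_int.astype(np.float64).dot(B_int.astype(np.float64))
C_int = C.astype(np.int64)  # cast back if an integer result is needed
```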
Stack Overflow reveals that recent Intel CPUs are optimized to pump through lots of floating point multiplications, but not so for integer multiplications. Also BLAS has no integer data type.
Sherlock 1.0 with 16 cores gets a bit more parallelization than the Mac but more SYS overhead, with otherwise similar results.
I'll mark two of those as "skip" since the results are approximately symmetric.
Timing various ways to convert an int64 matrix to floats, the timings are about the same, except that float32 × float32 matrices go twice as fast. So the hardware is good at float32 math, and it has half the data to chew through.
... or use …
This is really interesting! Good to know float32 is even faster. Does …?
Including vs. excluding the conversion in the timing:
AFAIK, the reason is that the hardware devotes lots of transistors to floating point array math, the operation runs in optimized BLAS code with parallelization, and caching must be part of the story. Converting to and from floats seems to hardly matter to the timing and might preload the cache. (Does NumPy do array operations lazily to process the data in fewer passes?) This might not be the full story. It's such a large factor! The array size probably matters.

Explaining float32 vs. float64 is easy: there's half the data to load from RAM, churn through, and write back to RAM, and the floating point hardware must handle it natively. (float16 is even slower than int32. It's probably converting to float32 in software, and back. GPUs might possibly handle float16 or float8 natively.)

A spot check on whether it computes the same result even with integer overflow:
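The spot check itself was dropped from this copy of the thread; a sketch of the kind of comparison it describes might look like this (values chosen here to force int64 overflow; not the original test):

```
import numpy as np

# Compare the int64 dot against the float64 round-trip. float64 carries only
# 53 bits of mantissa, so exact agreement is only guaranteed while intermediate
# sums stay below 2**53; int64 arithmetic in practice wraps around past 2**63.
A = np.full((4, 4), 2**31, dtype=np.int64)
B = np.full((4, 4), 2**31, dtype=np.int64)

via_int = A.dot(B)                                          # overflows int64
via_float = A.astype(np.float64).dot(B.astype(np.float64))
print(via_int[0, 0], via_float[0, 0], int(via_float[0, 0]) == via_int[0, 0])
```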
Thanks for all this analysis, Jerry. Very illuminating, lots of little surprises.
Sorry, I misread Travis' question. The answer is: I must've goofed typing in those numbers. All these timings are the same, within measurement noise (call it 0.02 sec). What is different (takes 1/2 as long) is converting both operands to float32, and converting the result back to int64 doesn't seem to make a dent in the timing. Some runs with …
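For concreteness, a sketch of the variants being compared above (matrix sizes are made up and the thread's timings aren't reproduced; assumes an IPython session):

```
import numpy as np

A = np.random.randint(-2, 3, size=(1000, 1000)).astype(np.int64)
B = np.random.randint(-2, 3, size=(1000, 1000)).astype(np.int64)

%timeit A.astype(np.float64).dot(B.astype(np.float64))                    # float64 path
%timeit A.astype(np.float32).dot(B.astype(np.float32))                    # both operands float32
%timeit A.astype(np.float32).dot(B.astype(np.float32)).astype(np.int64)   # plus converting the result back
```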
I noticed that after I updated the stoich matrices to return ints instead of floats, the daily build went from ~15 hrs to ~18 hrs to complete. The majority of this is coming from the analysis plots that use the stoich matrices to do calculations. The changes were made to fix a future warning about implicitly casting to int since the calculations were being done with floats. Since the stoich matrix should only have ints, I thought the most direct way to address this would be to store and return it as an int matrix instead of explicitly converting in each of the analysis plots that use the matrix. I realized that numpy performance with ints is significantly different from performance with floats, and this was causing the increased time.
The int matrix changes should be rolled back to improve performance, and it might be worthwhile to explore other areas where ints might be used in numpy operations. I will push a branch with some test cases I wrote, but a summary of the comparison is shown below. Not sure if we should include alternative type dot tests in test_library_performance.py as well.

Without unittest overhead:

```
python wholecell/tests/utils/dot_tests.py
```

With unittest overhead:

```
python -m wholecell.tests.utils.test_library_performance
```