Int/float numpy performance #100

Open
tahorst opened this issue Feb 19, 2018 · 18 comments

tahorst commented Feb 19, 2018

I noticed that after I updated the stoich matrices to return ints instead of floats, the daily build went from ~15 hrs to ~18 hrs to complete. The majority of the slowdown comes from the analysis plots that use the stoich matrices to do calculations. The changes were made to fix a future warning about implicitly casting to int, since the calculations were being done with floats. Since the stoich matrix should only contain ints, I thought the most direct fix was to store and return it as an int matrix instead of explicitly converting in each of the analysis plots that use it. It turns out numpy performance with ints is significantly worse than with floats, and that was causing the increased time.

The int matrix changes should be rolled back to improve performance, and it might be worthwhile to explore other areas where ints are used in numpy operations. I will push a branch with the test cases I wrote; a summary of the comparison is shown below. I'm not sure if we should include alternative-type dot tests in test_library_performance.py as well.

Without unittests overhead:
python wholecell/tests/utils/dot_tests.py

int x int:
Measured CPU times: user 7 ms, sys: 0 ns, total: 7 ms; Wall time: 7.12 ms

float x float:
Measured CPU times: user 3 ms, sys: 0 ns, total: 3 ms; Wall time: 2.55 ms

int x float:
Measured CPU times: user 13 ms, sys: 0 ns, total: 13 ms; Wall time: 12.5 ms

float x float (converted):
Measured CPU times: user 3 ms, sys: 0 ns, total: 3 ms; Wall time: 3.41 ms

With unittests overhead:
python -m wholecell.tests.utils.test_library_performance

test_dot CPU times: user 99 ms, sys: 134 ms, total: 233 ms; Wall time: 232 ms
test_dot_ints CPU times: user 1.6 s, sys: 0 ns, total: 1.6 s; Wall time: 1.6 s
test_dot_mixed CPU times: user 95 ms, sys: 142 ms, total: 237 ms; Wall time: 237 ms
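
For reference, a standalone sketch of the kind of comparison dot_tests.py makes (the matrix size and helpers below are assumptions for illustration, not the actual test code):

import time
import numpy as np

def timed(fn):
    start = time.time()
    fn()
    return time.time() - start

def best_wall_time(fn, repeats=3):
    # Best-of-N wall-clock time, to damp run-to-run noise.
    return min(timed(fn) for _ in range(repeats))

M_int = np.random.randint(0, 10, size=(1000, 1000))
M_float = M_int.astype(np.float64)

print('int x int:     %.1f ms' % (1e3 * best_wall_time(lambda: M_int.dot(M_int))))
print('float x float: %.1f ms' % (1e3 * best_wall_time(lambda: M_float.dot(M_float))))
print('int x float:   %.1f ms' % (1e3 * best_wall_time(lambda: M_int.dot(M_float))))
print('converted:     %.1f ms' % (1e3 * best_wall_time(lambda: M_int.astype(np.float64).dot(M_float))))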
tahorst self-assigned this Feb 19, 2018
1fish2 commented Feb 20, 2018

Yes, let's include alternative type dot tests in test_library_performance.py. Slow integer performance is strange. Doesn't numpy use integer vector instructions?

Then let's compare the results against the latest NumPy release (1.14.0).

Am I misunderstanding the unittest overhead? The code measures and prints CPU and Wall times for a function that just calls M.dot(M) or another operation. That should include no test overhead -- unlike the @nose.tools.timed(secs) time limit.
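
For reference, user/sys/wall times like the ones above can be captured with the standard library alone; a minimal sketch on Unix (not the actual test helper):

import resource
import time
import numpy as np

def measure(fn):
    # user/sys CPU time across all threads in the process, plus wall time.
    r0 = resource.getrusage(resource.RUSAGE_SELF)
    w0 = time.time()
    fn()
    r1 = resource.getrusage(resource.RUSAGE_SELF)
    return r1.ru_utime - r0.ru_utime, r1.ru_stime - r0.ru_stime, time.time() - w0

M = np.random.random((1000, 1000))
user, sys_, wall = measure(lambda: M.dot(M))
print('CPU times: user %.0f ms, sys: %.0f ms; Wall time: %.0f ms'
      % (user * 1e3, sys_ * 1e3, wall * 1e3))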

tahorst commented Feb 20, 2018

Sorry, I thought I was including the second dimension in the first test case, so I assumed there had to be some overhead to account for the difference in time. In fact, it's just that fewer operations are being done in dot_tests.py.

Based on the test_library_performance results, it looks like numpy doesn't try to use multiprocessing with ints, since there's no sys overhead. Maybe it doesn't vectorize those operations either.

I'll create a pull request for the changes to test_library_performance. I'm not sure if we need to include dot_tests.py as well.

1fish2 commented Feb 20, 2018

@tahorst would it help if I merge #92 so you can add to it in master? It's also fine to make a separate test class if that's more convenient. Leaving it open makes a place for the experiment log, but there are other places.

tahorst commented Feb 20, 2018

If we can close it sometime soon (next week or so), then I can add on top of it, but I don't think there's that much of a rush to get it in there. It seems like it fits in pretty nicely where it is, so I don't really want to create a new test class.

tahorst commented Feb 20, 2018

Against the new release (1.14.0), the times are comparable, within run-to-run variability.

@jmason42

In this case, sparse matrices may be the best solution. Setup:

import numpy as np
from scipy import sparse

S_int = np.zeros((100, 100), np.int64)
# Place a single '1' in each row and column
S_int[np.random.choice(100, 100, replace=False),
      np.random.choice(100, 100, replace=False)] = 1
S_float = S_int.astype(np.float64)
S_sparse = sparse.coo_matrix(S_int)  # "coo"rdinate matrix

Test via ipython in the wcEcoli environment (1 CPU):

%timeit np.random.uniform(size = 100) # approximately, the overhead: 2.05 us
%timeit S_int.dot(np.random.uniform(size = 100)) # integer matrix: 261 us
%timeit S_float.dot(np.random.uniform(size = 100)) # float matrix: 96 us
%timeit S_sparse.dot(np.random.uniform(size = 100)) # sparse matrix: 7.82 us

That's a massive speed-up, even over the float matrices. Granted, this is probably a best case scenario for the coordinate matrix representation, but I'd hazard that our stoichiometric matrices are pretty sparse, too.

Caveat: these all give the cached-result warning, à la

The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached

I even get this warning for the random number generation alone, suggesting to me that it's erroneous. Probably worth checking; I'd also consider checking the other sparse matrix representations. My tests suggest that sparse.csr_matrix may be better than the sparse.coo_matrix but by no more than 10%. Documentation suggests using sparse.csr_matrix or sparse.csc_matrix for best algebra performance; the other sparse matrices are faster at construction. Since a matrix-vector product is just a dot product over the rows of the matrix, I'm not surprised that the sparse.csr_matrix (compressed sparse row matrix, facilitates fast row slicing) is superior.
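
A quick head-to-head of the formats, reusing the setup above (a sketch; per-call times will vary by machine and BLAS):

import timeit
import numpy as np
from scipy import sparse

S_int = np.zeros((100, 100), np.int64)
S_int[np.random.choice(100, 100, replace=False),
      np.random.choice(100, 100, replace=False)] = 1
v = np.random.uniform(size=100)

for fmt in (sparse.coo_matrix, sparse.csr_matrix, sparse.csc_matrix):
    S = fmt(S_int)
    per_call = timeit.timeit(lambda: S.dot(v), number=10000) / 10000
    print('%s: %.2f us' % (fmt.__name__, per_call * 1e6))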

jmason42 commented Feb 20, 2018

Interestingly, I'm not seeing any improvement when using sparse matrices in an analogous circumstance (in my own project); in fact, performance is slightly reduced. In this project (pyenv parest2, numpy 1.9.2 and scipy 1.0.0, both linked to /share/sw/free/openblas/0.2.15/lib), the above profiling gives

%timeit np.random.uniform(size = 100) # approximately, the overhead: 2.09 us
%timeit S_int.dot(np.random.uniform(size = 100)) # integer matrix: 15.6 us
%timeit S_float.dot(np.random.uniform(size = 100)) # float matrix: 4.98 us
%timeit S_sparse.dot(np.random.uniform(size = 100)) # sparse matrix: 8.03 us

Those are huge improvements for integer and float multiplication compared to the wcEcoli environment, and comparable performance for sparse multiplication (which is now inferior to float and only slightly better than int). If I bump the size of the matrix from 100x100 to 1000x1000 (and the vector from 100 to 1000), the sparse matrix clearly wins out:

%timeit np.random.uniform(size = 1000) # approximately, the overhead: 12.4 us
%timeit S_int.dot(np.random.uniform(size = 1000)) # integer matrix: 1.13 ms
%timeit S_float.dot(np.random.uniform(size = 1000)) # float matrix: 215 us
%timeit S_sparse.dot(np.random.uniform(size = 1000)) # sparse matrix: 20.8 us

I'd guess the break-even point is somewhat less than 1% sparsity, but factors other than the percent-sparsity doubtless contribute. Regardless, it seems like some of your speed issues would be resolved by linking numpy to /share/sw/free/openblas/0.2.15/lib.
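
One way to probe that break-even point is to scan over density directly (a sketch, assuming a uniformly random sparsity pattern is representative):

import timeit
import numpy as np
from scipy import sparse

n = 1000
v = np.random.uniform(size=n)
for density in (0.001, 0.01, 0.05, 0.1):
    M = sparse.random(n, n, density=density).toarray()  # dense float matrix
    M_sparse = sparse.csr_matrix(M)
    dense_t = timeit.timeit(lambda: M.dot(v), number=1000)
    sparse_t = timeit.timeit(lambda: M_sparse.dot(v), number=1000)
    print('density %.3f: dense/sparse time ratio = %.1fx' % (density, dense_t / sparse_t))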

@jmason42

One final thought, mostly for @1fish2: As you've probably seen, we do a lot of matrix-vector multiplications between sparse integer matrices (usually with values of +/- 1 or 2) and float vectors. What I wonder is whether it is possible to write a faster version (e.g. in Cython). For example, if your nonzero matrix coefficients are always 1, we could skip the casting from the integer 1 to the float 1, as well as the multiplication between the (now float) 1 and the float vector element's value. I don't know if this is enough of a bottleneck in simulation execution time to be worth pursuing, but it seems like it would be relatively straightforward to implement in Cython.
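
Even without Cython, the special case can be sketched in pure NumPy to gauge the potential win (a hypothetical helper, restricted here to ±1 coefficients):

import numpy as np

def signed_matvec(pos, neg, v, n_rows):
    # S.dot(v) for a matrix whose nonzero entries are all +1 or -1:
    # no int-to-float casts and no multiplies, just gathers, adds, and
    # subtracts at precomputed index pairs (the .at forms handle
    # repeated row indices correctly).
    out = np.zeros(n_rows)
    np.add.at(out, pos[0], v[pos[1]])       # +1 entries
    np.subtract.at(out, neg[0], v[neg[1]])  # -1 entries
    return out

# Precompute the sparsity pattern once from a stoichiometric-style matrix:
S = np.random.choice([-1, 0, 0, 0, 1], size=(200, 200))
pos, neg = np.nonzero(S == 1), np.nonzero(S == -1)
v = np.random.uniform(size=200)
assert np.allclose(signed_matvec(pos, neg, v, S.shape[0]), S.dot(v))

A Cython loop would likely beat np.add.at (which is not especially fast itself); the sketch only shows which arithmetic could be elided.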

1fish2 commented Feb 20, 2018

Let's open a "performance" Issue and investigate the possibilities when we get to it, and profile to estimate the potential impact, as you mentioned.

Using Cython to take advantage of the special case is straightforward. Getting parallelism there is more work and more fragile.

Meanwhile, I'll add this case to the numpy/scipy performance test; then let's compare pyenvs that have different numpy & scipy installations. (The pip-installed numpy was looking better than the one that uses /share/sw/free/openblas/0.2.15/lib.)

@jmason42

Good point on the parallelism; I imagine doing that correctly might require going beyond Cython.

1fish2 commented Feb 21, 2018

Cython can release the GIL (global interpreter lock) and use a threading library like OpenMP.

It's a "small matter of figuring out" just how to do that: how to access those libraries, create a thread pool so the threads don't have to be recreated each time, ensure it doesn't try to use more CPU cores than slurm actually made available, etc.

This reminds me: The “cached result warning” could be general startup overhead like spawning a pool of threads. You could test that by using cell mode %%timeit, where the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.
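
For example, reusing S_sparse from the earlier comment (a hypothetical cell):

%%timeit warmup = S_sparse.dot(np.random.uniform(size=100))  # setup: executed but not timed
S_sparse.dot(np.random.uniform(size=100))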

1fish2 commented Feb 21, 2018

I added some integer and mixed dot products to the performance test. On macOS:

test_dot CPU times: user 108 ms, sys: 6.57 ms, total: 114 ms; Wall time: 23.3 ms
test_float_dot_int CPU times: user 111 ms, sys: 6.05 ms, total: 117 ms; Wall time: 28.2 ms
test_int_dot_float CPU times: user 107 ms, sys: 1.85 ms, total: 109 ms; Wall time: 20.6 ms
test_int_dot_floated_int CPU times: user 106 ms, sys: 5.41 ms, total: 111 ms; Wall time: 23.3 ms
test_int_dot_int CPU times: user 2.04 s, sys: 8.79 ms, total: 2.05 s; Wall time: 2.06 s

Integer × integer matrix multiply takes 18x the CPU time, 88x the Wall time, and gets no parallelization.

Cutting to the chase, there's an easy workaround:

N = np.random.random_integers(0, 9, size=(1000, 1000))
result = N.dot(N * 1.0)

Stack Overflow reveals that recent Intel CPUs are optimized to pump through lots of floating point multiplications, but not so for integer multiplications. Also BLAS has no integer data type.

1fish2 commented Feb 21, 2018

Sherlock 1.0 with 16 cores gets a bit more parallelization than the Mac, more sys overhead, and otherwise similar results:

test_dot CPU times: user 112 ms, sys: 56 ms, total: 168 ms; Wall time: 15.6 ms
test_float_dot_int CPU times: user 161 ms, sys: 216 ms, total: 377 ms; Wall time: 23.5 ms
test_floated_int_dot_int CPU times: user 216 ms, sys: 287 ms, total: 503 ms; Wall time: 31.4 ms
test_int_dot_float CPU times: user 194 ms, sys: 189 ms, total: 383 ms; Wall time: 24 ms
test_int_dot_floated_int CPU times: user 198 ms, sys: 236 ms, total: 434 ms; Wall time: 27.1 ms
test_int_dot_int CPU times: user 1.83 s, sys: 1.02 s, total: 2.85 s; Wall time: 1.63 s

I'll mark two of those as "skip" since the results are approximately symmetric.

1fish2 commented Feb 21, 2018

Timing various ways to convert an int64 matrix to floats, the results are about the same, except float32 x float32 goes twice as fast. So the hardware is good at float32 math, and it has half the data to chew through.

import time
import numpy as np

def timeit(fn):
    start = time.time()
    fn()
    end = time.time()
    return end - start

N = np.random.random_integers(0, 9, size=(1000, 1000))

timeit(lambda: N.dot(N))  # ≈1.9 sec
timeit(lambda: N.dot(np.array(N * 1.0)))  # ≈ 0.025 sec
timeit(lambda: N.dot(np.array(N, dtype=np.float64)))  # ≈.02 sec [was: ≈0.22 sec]
timeit(lambda: N.dot(np.array(N, dtype=np.float32)))  # ≈.02 sec [was: ≈0.23 sec]
timeit(lambda: (np.array(N, dtype=np.float64)).dot(np.array(N, dtype=np.float64)))  # ≈0.022 sec
timeit(lambda: (np.array(N, dtype=np.float32)).dot(np.array(N, dtype=np.float32)))  # ≈0.010 sec

... or use N.astype(np.float32) to be more concise.

tahorst commented Feb 21, 2018

This is really interesting! Good to know float32 is even faster. Does timeit(lambda: N.dot(np.array(N * 1.0))) convert the first N to float while timeit(lambda: N.dot(np.array(N, dtype=np.float64))) doesn't? Or do you have another explanation for why there's a ~10x difference between the two? Seems like it's effectively doing the same thing.

1fish2 commented Feb 21, 2018

timeit() just times the function call, showing that computing the integer dot product took 190x as long as converting the entire array to float32, multiplying, and converting back!

Including vs. excluding the conversion in the timing:

a = np.random.random_integers(0, 16000, size=(1000, 1000)).astype(np.int32)
timeit(lambda: a.dot(a))  # ≈1.2 sec
af = a.astype(np.float32)
timeit(lambda: af.dot(af))  # ≈ 0.01 sec
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int32))  # ≈0.01 sec

AFAIK, the reason is that the hardware devotes lots of transistors to floating point array math, the operation runs in optimized BLAS code with parallelization, and caching must be part of the story. Converting to and from floats seems to hardly matter to the timing and might preload the cache. (Does NumPy do array operations lazily, to process the data in fewer passes?) This might not be the full story; it's such a large factor! The array size probably matters.

Explaining float32 vs float64 is easy: it has half the data to load from RAM, churn through, and write to RAM, and the floating point hardware must handle that natively. (float16 is even slower than int32; it's probably converting to float32 in software, and back. GPUs might possibly handle float16 or float8 natively.)
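
A quick check of the float16 claim (a sketch; expect float16 to lose badly if the hardware doesn't handle it natively):

import timeit
import numpy as np

M32 = np.random.random((1000, 1000)).astype(np.float32)
M16 = M32.astype(np.float16)
print('float32: %.3f s' % timeit.timeit(lambda: M32.dot(M32), number=3))
print('float16: %.3f s' % timeit.timeit(lambda: M16.dot(M16), number=3))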

A spot check on whether it computes the same result even with integer overflow:

>>> a = np.random.random_integers(0, 127, size=(1000, 1000)).astype(np.int8)
>>> aa = a.dot(a)
>>> aa
array([[ -72,   59,  119, ...,  -62,  -40,   53],
       [ 112,  -61,   59, ...,  -76,  -61, -106],
       [  84,  -31,   98, ...,   34,  120, -110],
       ...,
       [ -53,  126,  -90, ...,  105,   21,  -24],
       [-115,   98,   69, ...,   11,   23,  -31],
       [  93,  -38,   90, ...,    3,  -46,    2]], dtype=int8)
>>> aaf = a.astype(np.float32).dot(a.astype(np.float32))
>>> aaf
array([[ 4021944.,  4062523.,  3916407., ...,  3884482.,  4045784.,
         4058421.],
       [ 4023408.,  4212419.,  4073787., ...,  4028084.,  4194243.,
         4126102.],
       [ 3918420.,  4108513.,  3882338., ...,  3832610.,  4029304.,
         3882898.],
       ...,
       [ 3904715.,  4097406.,  3829158., ...,  3744361.,  4006421.,
         3985640.],
       [ 4010893.,  4155746.,  4053829., ...,  3949067.,  4157463.,
         4141025.],
       [ 4000861.,  3989210.,  3943258., ...,  3878147.,  4027858.,
         3978242.]], dtype=float32)
>>> aafi = aaf.astype(np.int8)
>>> aafi
array([[ -72,   59,  119, ...,  -62,  -40,   53],
       [ 112,  -61,   59, ...,  -76,  -61, -106],
       [  84,  -31,   98, ...,   34,  120, -110],
       ...,
       [ -53,  126,  -90, ...,  105,   21,  -24],
       [-115,   98,   69, ...,   11,   23,  -31],
       [  93,  -38,   90, ...,    3,  -46,    2]], dtype=int8)
>>> np.all(aa == aafi)
True

@jmason42

Thanks for all this analysis, Jerry. Very illuminating, lots of little surprises.

1fish2 commented Feb 22, 2018

Sorry, I misread Travis' question:

Does timeit(lambda: N.dot(np.array(N * 1.0))) convert the first N to float while timeit(lambda: N.dot(np.array(N, dtype=np.float64))) doesn't? Or do you have another explanation for why there's a ~10x difference between the two? Seems like it's effectively doing the same thing.

The answer is: I must've goofed typing in those numbers. All these timings are the same within measurement noise, call it 0.02 sec:

# given N = np.random.random_integers(0, 16000, size=(1000, 1000))
N.dot(N * 1.0)
N.dot(np.array(N, dtype=np.float64))
N.dot(np.array(N, dtype=np.float32))
N.dot(N.astype(np.float64))
N.dot(N.astype(np.float32))

What is different (takes 1/2 as long) is converting both operands to float32:

N.astype(np.float32).dot(N.astype(np.float32))

and converting the result back to int64 doesn't seem to make a dent in the timing.

Some runs with size=(10000, 10000) confirm these timing buckets.
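
In practice, the whole thread boils down to a small wrapper along these lines (a hypothetical helper, not something in the repo; float64 represents integers exactly up to 2**53):

import numpy as np

def float_dot(A, B, dtype=np.float64):
    # Work around slow integer matrix multiplication: convert both
    # operands to float, let BLAS do the multiply, and return the
    # float result (exact for stoich-matrix-scale integer inputs).
    return A.astype(dtype).dot(B.astype(dtype))

N = np.random.randint(0, 10, size=(1000, 1000))
result = float_dot(N, N)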
