NumPy optimization tips
Due to how arrays are typically stored in memory,
import numpy as np

array = np.array([[0, 1], [2, 3]])
for row in array:  # iterating over an array yields its first dimension
    pass  # do something with the row
is often faster than
array = np.array([[0, 1], [2, 3]])
for col in array.T:  # transpose so that iterating yields the columns of the original array
    pass  # do something with the column
This is because the data of a single row is contiguous in memory: NumPy stores arrays in C (row-major) order by default, and transposing an array just changes how the existing data is interpreted rather than creating a new array, so iterating over the transpose strides across memory. There are ways to change the memory order, and the rules can get complicated for higher-dimensional arrays. Generally you are best off transposing and then copying the array if you need to iterate over columns:
array = np.array([[0, 1], [2, 3]])
for col in array.T.copy():  # the copy is contiguous in the new orientation
    pass  # do something with the column
Naturally the copy takes up more memory. Typically, any operation that works along a dimension, including a matrix-vector product, will benefit from having your data in the right memory order.
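If your algorithm is column-oriented throughout, another option is to store the data in Fortran (column-major) order from the start rather than transposing and copying. A minimal sketch of this idea, using NumPy's order argument (np.asfortranarray converts an existing array the same way):

import numpy as np

# Store the data in Fortran (column-major) order so each column is contiguous.
array = np.array([[0, 1], [2, 3]], order="F")

for col in array.T:  # each column of the original array is now contiguous
    pass  # do something with the column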
Many numpy operations on floating-point vectors can be sped up by replacing them with dot products, which are accelerated by third-party linear algebra libraries. If any of these operations are part of your inner loops, you should consider using these alternative implementations.
First, the L1-norm (sum of absolute values):
def fast_sumabs_vector(x):
    # faster than np.sum(np.abs(x)) and np.linalg.norm(x, 1)
    return np.sign(x).dot(x)
You could probably speed up np.sum on its own using this strategy, and likewise computing a mean, and possibly other statistics (e.g. np.std) as well.
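A minimal sketch of that idea (the helper names here are my own, and whether it wins depends on whether the BLAS speedup outweighs allocating the vector of ones):

import numpy as np

def fast_sum_vector(x):
    # np.sum(x) expressed as a dot product with a vector of ones
    return np.ones(x.size).dot(x)

def fast_mean_vector(x):
    # likewise for np.mean(x)
    return np.ones(x.size).dot(x) / x.size

If you call these repeatedly on vectors of the same length, you could also hoist the np.ones allocation out of the loop.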
Similarly, the squared L2-norm (sum of squares):
def fast_sumsq_vector(x):
    # faster than np.sum(np.square(x)) and np.square(np.linalg.norm(x, 2))
    return x.dot(x)
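A quick way to sanity-check and time these replacements on your own machine (the vector length and iteration count here are arbitrary, and the speedup depends on your BLAS build):

import numpy as np
import timeit

x = np.random.randn(10_000)

assert np.isclose(np.sign(x).dot(x), np.sum(np.abs(x)))  # fast_sumabs_vector
assert np.isclose(x.dot(x), np.sum(np.square(x)))        # fast_sumsq_vector

print(timeit.timeit(lambda: np.sum(np.square(x)), number=10_000))
print(timeit.timeit(lambda: x.dot(x), number=10_000))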
I also found that I could speed up the calculation of the median for short vectors (less than a hundred elements):
def fast_median_shortvector(x):
    # faster than np.median on vectors with fewer than ~100 elements
    halfsize, odd_number_of_elements = divmod(x.size, 2)
    x_sorted = np.sort(x)
    if odd_number_of_elements:
        return x_sorted[halfsize]
    else:
        return (x_sorted[halfsize - 1] + x_sorted[halfsize]) / 2.0
I think this is faster because np.median has a lot of type and error checking. Also, np.median uses np.partition, which performs a partial sort, instead of np.sort. It seems that np.partition is no better, and possibly slower, than np.sort for very short vectors.
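If you want to verify this on your own machine, here is a minimal timing sketch using the fast_median_shortvector defined above (the vector length and iteration count are arbitrary choices):

import numpy as np
import timeit

x = np.random.randn(50)  # a "short" vector

assert np.isclose(fast_median_shortvector(x), np.median(x))

print(timeit.timeit(lambda: np.median(x), number=100_000))
print(timeit.timeit(lambda: fast_median_shortvector(x), number=100_000))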
Third-party linear algebra libraries like BLAS typically do not implement an integer type. Thus
result = np.dot(integer_array1, integer_array2)
will not be evaluated by BLAS and will consequently be much slower than
result = np.dot(float_array1, float_array2)
even if the float arrays contain the same numerical values. Additionally,
result = np.dot(float_array1, integer_array2)
will be faster than the int-int product but a bit slower than the float-float product, because the integer array is cast to float before being passed to BLAS. Thus, if you don't need the exactness of integer operations and you run a dot product several times (e.g. in an inner loop), you should pre-cast all your arrays to floats.
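A minimal sketch of the pre-casting idea (the array names and loop are placeholders):

import numpy as np

integer_array1 = np.arange(1000)
integer_array2 = np.arange(1000)

# Cast once, outside the hot loop, so every dot product hits the BLAS fast path.
float_array1 = integer_array1.astype(np.float64)
float_array2 = integer_array2.astype(np.float64)

for _ in range(10_000):
    result = np.dot(float_array1, float_array2)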
If you run BLAS in multiple processes (like the parallel code in the Fitter), each BLAS process can try to use all available CPUs. Running with the environment variable OPENBLAS_NUM_THREADS=1 prevents the spawned processes from stepping on each other's toes.
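You can set the variable from the shell, or from Python before numpy is first imported (the script name below is a placeholder; note that OPENBLAS_NUM_THREADS only applies to OpenBLAS builds, while other BLAS implementations read their own variables, such as MKL_NUM_THREADS):

# from the shell:
#   OPENBLAS_NUM_THREADS=1 python my_parallel_script.py

# or from Python, before numpy is imported anywhere in the process:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np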