NumPy optimization tips

Array order and memory contiguity

Due to how arrays are stored in memory by default (C order, so the last axis is contiguous),

import numpy as np

array = np.array([[0, 1], [2, 3]])
for row in array:  # iterating over an array yields the first dimension
    print(row.sum())  # do something with the row

is often faster than

array = np.array([[0, 1], [2, 3]])
for col in array.T:  # transpose so iteration yields the columns of the original array
    print(col.sum())  # do something with the column

This is because the data of a single row is contiguous in memory; transposing an array only changes how the same data is interpreted rather than creating a new array, so iterating over the transpose strides across memory. There are ways to change the memory order, and the rules can get complicated for higher-dimensional arrays. Generally you are best off transposing and then copying the array if you need to iterate over columns:

array = np.array([[0, 1], [2, 3]])
for col in array.T.copy():  # the copy is laid out with each column contiguous
    print(col.sum())  # do something with the column

Naturally the copy takes up more memory. Typically, any operation that works along a dimension, including a matrix-vector product, will benefit from having your data in the right memory order, as in the sketch below.
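
For example, here is a minimal sketch (the array names and sizes are illustrative) of paying once for a contiguous copy of a transpose that is then reused across many matrix-vector products:

import numpy as np

A = np.random.rand(1000, 1000)
v = np.random.rand(1000)

# A.T is just a view: no data is copied, but its rows stride across memory
assert np.shares_memory(A, A.T)

# if A.T.dot(v) runs many times, one contiguous copy can pay for itself
At = np.ascontiguousarray(A.T)
for _ in range(100):
    result = At.dot(v)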

Vector operations

Many numpy operations on floating-point vectors can be sped up by rewriting them as dot products, which third-party libraries such as BLAS can accelerate. If any of these operations are part of your inner loops, you should consider using these alternative implementations.

First, the L1-norm (sum of absolute values):

def fast_sumabs_vector(x):
    # faster than np.sum(np.abs(x)) and np.linalg.norm(x, 1)
    return np.sign(x).dot(x)

You could probably speed up np.sum on its own using the same strategy, and likewise a mean, and possibly other statistics (e.g. np.std); a sketch of the idea follows.
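
As a sketch of that idea (untested, so benchmark before relying on it): a sum becomes a dot product against a vector of ones, and a mean divides the same product by the length.

def fast_sum_vector(x):
    # sketch: np.sum(x) as a dot product against a ones vector;
    # building the ones vector costs time and memory, so reuse it
    # across calls if your vectors share a fixed length
    return np.ones_like(x).dot(x)

def fast_mean_vector(x):
    # sketch: np.mean(x) as the same dot product divided by the length
    return np.ones_like(x).dot(x) / x.size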

Similarly, the squared L2-norm (sum of squares):

def fast_sumsq_vector(x):
    # faster than np.sum(np.square(x)) and np.square(np.linalg.norm(x, 2))
    return x.dot(x)

I also found that I could speed up the calculation of the median for short vectors (fewer than a hundred elements):

def fast_median_shortvector(x):
    # faster than np.median on vectors with fewer than ~100 elements
    halfsize, odd_number_of_elements = divmod(x.size, 2)
    x_sorted = np.sort(x)

    if odd_number_of_elements:
        return x_sorted[halfsize]
    else:
        return (x_sorted[halfsize - 1] + x_sorted[halfsize]) / 2.0

I think this is faster because np.median does a lot of type and error checking. np.median also uses np.partition, which performs a partial sort, instead of np.sort; it seems that np.partition is no better, and possibly slower, than a full sort for very short vectors.
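
If you want to check this on your own machine, a quick timing comparison (purely illustrative) with timeit:

import timeit
import numpy as np

x = np.random.rand(50)
print(timeit.timeit(lambda: np.median(x), number=10000))
print(timeit.timeit(lambda: fast_median_shortvector(x), number=10000))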

Datatypes and BLAS

Third-party linear algebra libraries implementing BLAS (e.g. OpenBLAS or MKL) typically do not support integer types. Thus

result = np.dot(integer_array1, integer_array2)

will not be evaluated by BLAS and will consequently be much slower than

result = np.dot(float_array1, float_array2)

even if the float arrays contain the same numerical values. Additionally,

result = np.dot(float_array1, integer_array2)

will be faster than the int-int product but a bit slower than the float-float product, because the integer array is cast to float before being passed to BLAS. Thus, if you don't need the exactness of integer operations and you run a dot product several times (e.g. in an inner loop), you should pre-cast all your arrays to floats, as sketched below.
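
A minimal sketch of the pre-casting pattern (the array names and shapes here are made up for illustration):

import numpy as np

integer_matrix = np.arange(12, dtype=np.int64).reshape(3, 4)
integer_vector = np.arange(4, dtype=np.int64)

# cast once, outside the loop, so every product goes through BLAS
float_matrix = integer_matrix.astype(np.float64)
float_vector = integer_vector.astype(np.float64)

for _ in range(1000):
    result = float_matrix.dot(float_vector)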

BLAS can oversubscribe CPU threads

If you run BLAS in multiple processes (like the parallel code in the Fitter does), each process's BLAS can try to use all available CPUs. Running with the environment variable OPENBLAS_NUM_THREADS=1 prevents the spawned processes from stepping on each other's toes.
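
Assuming OpenBLAS is the backend, one way to set this from inside Python is to export the variable before NumPy is imported, since the thread count is typically read when the library loads:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must happen before importing numpy

import numpy as np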