
Remove debugging slow assert statement #7221

Merged · 8 commits merged into pydata:main on Oct 28, 2022

Conversation

@hmaarrfk
Contributor

We've been trying to understand why our code is slow. One part of the problem is that we use xarray.Datasets almost like dictionaries for our data. The following pattern is quite common for us:

import xarray as xr
dataset = xr.Dataset()
dataset['a'] = 1
dataset['b'] = 2

However, benchmarks made it obvious that xarray's merge_core method was causing a lot of the slowdown.
main branch:
[plot: time to add one variable (ms) vs. number of existing variables, main branch]

With this merge request:
[plot: the same benchmark with this merge request]

from tqdm import tqdm
import xarray as xr
from time import perf_counter
import numpy as np

N = 1000

# Everybody is lazy loading now, so let's force modules to get instantiated
dummy_dataset = xr.Dataset()
dummy_dataset['a'] = 1
dummy_dataset['b'] = 1
del dummy_dataset

time_elapsed = np.zeros(N)
dataset = xr.Dataset()

for i in tqdm(range(N)):
    time_start = perf_counter()
    dataset[f"var{i}"] = i
    time_end = perf_counter()
    time_elapsed[i] = time_end - time_start
    
    
# %%
from matplotlib import pyplot as plt

plt.plot(np.arange(N), time_elapsed * 1E3, label='Time to add one variable')
plt.xlabel("Number of existing variables")
plt.ylabel("Time to add a variable (ms)")
plt.ylim([0, 50])
plt.grid(True)
plt.legend()
  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@max-sixty
Collaborator

Gosh, that's quite dramatic! Impressive find, @hmaarrfk. (Out of interest, how did you find this?)

I can see how that's quadratic when looping like that. I wonder whether using .assign(var1=1, var2=2, ...) has the same behavior?
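A minimal sketch of that comparison (a hypothetical harness, not from this PR; the assumption is that building the mapping up front triggers a single merge instead of one merge per variable):

import xarray as xr
from time import perf_counter

N = 1000

# One .assign call with a pre-built mapping: a single merge for N variables,
# instead of N merges when assigning keys one at a time.
t0 = perf_counter()
dataset = xr.Dataset().assign({f"var{i}": i for i in range(N)})
print(f".assign of {N} variables took {perf_counter() - t0:.3f}s")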

Would be interesting to see whether this was covered by our existing asv benchmarks. Would be a good benchmark to add if we don't have one already.

@hmaarrfk
Contributor Author

Out of interest, how did you find this?

Spyder profiler

github-actions bot added the run-benchmark (run the ASV benchmark workflow) and topic-performance labels Oct 26, 2022
@hmaarrfk
Contributor Author

Would be interesting to see whether this was covered by our existing asv benchmarks.

I wasn't able to find something that really benchmarked "large" datasets.

Would be a good benchmark to add if we don't have one already.

Added one.
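A rough sketch of what such a benchmark can look like (approximate and hypothetical, not the exact code added here; the benchmark that landed is merge.DatasetAddVariable.time_variable_insertion, visible in the results further down):

import xarray as xr

class DatasetAddVariable:
    # parametrize over the number of variables already in the dataset
    param_names = ["existing_elements"]
    params = [[0, 10, 100, 1000]]

    def setup(self, existing_elements):
        # build a dataset that already holds `existing_elements` variables
        self.dataset = xr.Dataset({f"var{i}": i for i in range(existing_elements)})

    def time_variable_insertion(self, existing_elements):
        # the operation under test: adding one more variable
        self.dataset["new_var"] = 0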

@hmaarrfk
Contributor Author

:/ Not fun, the benchmark is failing. Not sure why.

@hmaarrfk
Contributor Author

I'm somewhat confused; I can run the benchmark locally:

[  1.80%] ··· dataset_creation.Creation.time_dataset_creation                                                    4.37±0s

@Illviljan
Contributor

Error: [ 75.90%] ··· dataset_creation.Creation.time_dataset_creation             failed
[ 75.90%] ···· asv: benchmark timed out (timeout 60.0s)

Maybe 1000 loops is too much. Start with 100 maybe? We still want these benchmarks to be decently fast in the CI.
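For reference, asv also lets a benchmark raise its own time limit via a class attribute, so keeping N=1000 is possible if the CI budget allows it (a sketch assuming the usual asv conventions):

class Creation:
    # per-benchmark timeout in seconds; asv's default is 60
    timeout = 120

    def time_dataset_creation(self):
        ...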

@Illviljan
Contributor

Illviljan commented Oct 26, 2022

I like large datasets as well. I seem to remember getting caught in similar places when creating my datasets. I think I solved it by using Variable instead; does doing something like this improve the performance for you?

import xarray as xr
dataset = xr.Dataset()
dataset['a'] = xr.Variable(dims="time", data=[1])
dataset['b'] = xr.Variable(dims="time", data=[2])
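A quick way to check both variants side by side (a hypothetical timing harness, names mine):

import xarray as xr
from time import perf_counter

def time_inserts(make_value, n=1000):
    # time n insertions into a fresh Dataset, building each value with make_value
    ds = xr.Dataset()
    t0 = perf_counter()
    for i in range(n):
        ds[f"var{i}"] = make_value(i)
    return perf_counter() - t0

print("scalar:  ", time_inserts(lambda i: i))
print("Variable:", time_inserts(lambda i: xr.Variable(dims="time", data=[i])))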

@Illviljan
Contributor

Now the asv finishes, at least! Could you make a separate PR for the asv? I don't think it gets run when comparing against the main branch.

@hmaarrfk
Contributor Author

Ok, I'll want to rethink them.

I know it looks like quadratic time, but I really would like to test N=1000, and I have an idea.

@hmaarrfk mentioned this pull request Oct 26, 2022
@hmaarrfk
Contributor Author

hmaarrfk commented Oct 26, 2022

I know it is not directly comparable, but I was really curious what plain dictionary insertion costs, to understand whether my comparison was fair:

from tqdm import tqdm
import xarray as xr
from time import perf_counter
import numpy as np

N = 1000

# Everybody is lazy loading now, so let's force modules to get instantiated
dummy_dataset = xr.Dataset()
dummy_dataset['a'] = 1
dummy_dataset['b'] = 1
del dummy_dataset

time_elapsed = np.zeros(N)
# dataset = xr.Dataset()
dataset = {}

for i in tqdm(range(N)):
# for i in range(N):
    time_start = perf_counter()
    dataset[f"var{i}"] = i
    time_end = perf_counter()
    time_elapsed[i] = time_end - time_start
    
    
# %%
from matplotlib import pyplot as plt

plt.plot(np.arange(N), time_elapsed * 1E6, label='Time to add one variable')
plt.xlabel("Number of existing variables")
plt.ylabel("Time to add a variable (µs)")
plt.ylim([0, 10])
plt.title("Dictionary insertion")
plt.grid(True)
plt.legend()

[plot: "Dictionary insertion", time to add one variable (µs) vs. number of existing keys]

I think xarray gives me three orders of magnitude of "thinking" benefit, so I'll take it!

python --version
Python 3.9.13

@dcherian requested a review from benbovy on October 27, 2022
github-actions bot removed the run-benchmark and topic-performance labels Oct 27, 2022
@Illviljan
Contributor

       before           after         ratio
     [c000690c]       [24753f1f]
-     3.17±0.02ms      1.94±0.01ms     0.61  merge.DatasetAddVariable.time_variable_insertion(100)
-        81.5±2ms       17.0±0.2ms     0.21  merge.DatasetAddVariable.time_variable_insertion(1000)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Nice improvements. :)

I haven't fully understood why we had that code, though.

@benbovy
Member

benbovy commented Oct 27, 2022

Thanks @hmaarrfk!

I haven't fully understood why we had that code, though.

Me neither. I don't remember ever seeing this assertion error raised while refactoring things. Any idea @shoyer?

@max-sixty added the plan to merge (final call for comments) label Oct 27, 2022
@shoyer
Member

shoyer commented Oct 28, 2022

I no longer remember why I added these checks, but I certainly did not expect to see this sort of performance penalty!
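For intuition (an illustrative sketch, not the actual assertion this PR removed): a sanity check that scans every existing variable on each insertion turns N insertions into O(N^2) total work, which matches the quadratic curve in the benchmark above.

# Illustrative only: a hypothetical O(n) debug assert inside an insertion
# path, not the code removed by this PR.
def insert(store, key, value):
    store[key] = value  # the insertion itself is O(1)
    # the sanity check walks all existing entries: O(n) per call,
    # so N insertions cost O(N**2) overall
    assert all(isinstance(k, str) for k in store), "non-string key"

store = {}
for i in range(1000):
    insert(store, f"var{i}", i)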

@max-sixty merged commit 040816a into pydata:main on Oct 28, 2022