
Dataset insertion benchmark #7223

Merged Oct 27, 2022 · 17 commits
asv_bench/benchmarks/merge.py (+18, −0)
import xarray as xr


class DatasetAddVariable:
Contributor Author:

TBH, I was going to expand this to other in-memory operations. Replace a variable? Drop a variable?

Contributor:

I thought the first version used was relevant as well:

ds = xr.Dataset(coords=coords)
for v in names:
    ds[v] = ("time", value)

And this fast one from #7224 (comment) would also be nice to benchmark:

data_vars = {v: ("time", value) for v in names}
ds = xr.Dataset(data_vars=data_vars, coords=coords)
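
As a rough illustration (not part of the PR), the two construction patterns can be compared directly with timeit; `names`, `value`, and `coords` below are stand-in values matching the snippets above, not from the benchmark itself:

```python
import timeit

import numpy as np
import xarray as xr

names = [f"var{i}" for i in range(100)]
value = np.zeros(10)
coords = {"time": np.arange(10)}


def insert_one_by_one():
    # First version: grow the Dataset one variable at a time
    ds = xr.Dataset(coords=coords)
    for v in names:
        ds[v] = ("time", value)
    return ds


def build_all_at_once():
    # Faster version from #7224: hand the full dict to the constructor
    data_vars = {v: ("time", value) for v in names}
    return xr.Dataset(data_vars=data_vars, coords=coords)


# Both strategies should produce identical Datasets
assert insert_one_by_one().identical(build_all_at_once())

for fn in (insert_one_by_one, build_all_at_once):
    ms = timeit.timeit(fn, number=10) / 10 * 1e3
    print(f"{fn.__name__}: {ms:.2f} ms per build")
```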

Contributor Author:

The first case is captured, because I am suggesting we time insertion for 0, 10, 100, and 1000 existing variables. One can interpolate the behavior.

I do believe that benchmarking all-at-once creation (the second case) is also interesting.

Contributor:

I renamed it thinking that we were really timing the merge operations. This is also true for replacing, but not true for dropping.

Maybe we add dropping to a new dataset.py. This is all quite minor though! Shall we merge?
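
If dropping does end up in a new dataset.py, a minimal asv class could look like the sketch below; the class and method names are hypothetical suggestions, not from this PR:

```python
import xarray as xr


class DatasetDropVariable:
    # Hypothetical companion to DatasetAddVariable; mirrors its setup
    param_names = ["existing_elements"]
    params = [[10, 100, 1000]]

    def setup(self, existing_elements):
        d = {f"var{i}": i for i in range(existing_elements)}
        self.dataset = xr.merge([d])

    def time_variable_drop(self, existing_elements):
        # drop_vars returns a new Dataset, so the setup state is untouched
        self.dataset.drop_vars("var0")
```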

Contributor Author:

> thinking that we were really timing the merge operations

It depends whether you want to think about it from the point of view of an "xarray developer" or an "xarray user".

As an "xarray user", I'm trying to "insert a totally new variable, no dims, no coords". I don't really care what is happening under the hood. This is what I originally had in mind.

My goal with the original name was to test operations on in-memory xarrays (as opposed to those backed by a file, which seem to be covered in dataset_io). I don't know if I have time to add other benchmarks at the moment. This seems good.

The name is the only thing that will survive, so I'll keep it if you are OK with it.

I ran the benchmarks against main, #7221, and #7222. The benchmark seems to show the effect locally even with a single insertion. I've thus simplified the benchmark to do a single insertion, and not 5.

main

[ 50.00%] ··· merge.DatasetAddVariable.time_variable_insertion   ok
[ 50.00%] ··· ======== ==========
               param1
              -------- ----------
                 0      636±0μs
                 10     511±0μs
                100     5.92±0ms
                1000    41.0±0ms
              ======== ==========

#7221

[ 50.00%] ··· merge.DatasetAddVariable.time_variable_insertion   ok
[ 50.00%] ··· ======== ==========
               param1
              -------- ----------
                 0      609±0μs
                 10     1.32±0ms
                100     4.01±0ms
                1000    8.65±0ms
              ======== ==========

#7222

[ 50.00%] ··· merge.DatasetAddVariable.time_variable_insertion   ok
[ 50.00%] ··· ======== ==========
               param1
              -------- ----------
                 0      578±0μs
                 10     1.03±0ms
                100     2.74±0ms
                1000    5.37±0ms
              ======== ==========

Contributor Author (@hmaarrfk, Oct 27, 2022):

Is there a way to rename param1? OK, I renamed it.

Contributor Author:

$ asv run -E existing --quick --bench merge.DatasetAddVariable.time_variable_insertion
· Discovering benchmarks
· Running 1 total benchmarks (1 commits * 1 environments * 1 benchmarks)
[  0.00%] ·· Benchmarking existing-py_home_mark_mambaforge_envs_mcam_dev_bin_python
[ 50.00%] ··· ...e.DatasetAddVariable.time_variable_insertion                 ok
[ 50.00%] ··· =================== ==========
               existing_elements
              ------------------- ----------
                       0           668±0μs
                       10          1.17±0ms
                      100          2.22±0ms
                      1000         5.38±0ms
              =================== ==========

    param_names = ["existing_elements"]
    params = [[0, 10, 100, 1000]]

    def setup(self, existing_elements):
        # Dictionary insertion is fast(er) than xarray.Dataset insertion
        d = {}
        for i in range(existing_elements):
            d[f"var{i}"] = i
        self.dataset = xr.merge([d])

    def time_variable_insertion(self, existing_elements):
        dataset = self.dataset
        dataset["new_var"] = 0
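
For readers who want to sanity-check the benchmark without installing asv, the class can be exercised directly; this is a self-contained copy of the class above driven by timeit, not part of the PR:

```python
import timeit

import xarray as xr


class DatasetAddVariable:
    param_names = ["existing_elements"]
    params = [[0, 10, 100, 1000]]

    def setup(self, existing_elements):
        # Build a Dataset with N pre-existing scalar variables
        d = {f"var{i}": i for i in range(existing_elements)}
        self.dataset = xr.merge([d])

    def time_variable_insertion(self, existing_elements):
        dataset = self.dataset
        dataset["new_var"] = 0


bench = DatasetAddVariable()
for n in bench.params[0]:
    bench.setup(n)  # asv re-runs setup before each timing, so we do too
    ms = timeit.timeit(lambda: bench.time_variable_insertion(n), number=5) / 5 * 1e3
    print(f"{n:>5} existing vars: {ms:.3f} ms per insertion")
```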