(requires design discussion) 114 add circular block bootstrapping #418

Draft · wants to merge 9 commits into base: develop
Conversation

reza-armuei (Collaborator):
Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args.
  • Typehints added
  • Docstrings complete and follow Napoleon (Google) style
  • Reference to paper/webpage is in docstring
  • Add error handling
  • Imported into the API

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that metric is compatible with dask.
  • Test that metrics work with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test both reduce and preserve dims arguments work
  • Test that errors are raised as expected
  • Test that it works with both xr.DataArrays and xr.Datasets

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Explanation.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

@reza-armuei reza-armuei linked an issue May 23, 2024 that may be closed by this pull request


def block_bootstrap(
    *objects: XarrayLike,
@nikeethr (Collaborator), May 23, 2024:

If these are necessarily array-like items and necessarily contained in a list, then it may be more explicit to:

  1. Call them "arrays", "da_list", or similar.
  2. Use an Iterable, e.g. a list, rather than variadic arguments, and use a forceful * afterwards to split out keyword-only arguments. (Furthermore, this is generally how I've seen it done in the code-base, so it'll be more consistent.)

e.g.

def block_bootstrap(
    array_list: List[XarrayLike],
    *,  # force remaining args to be keyword only
    blocks: Dict[str, int],
    ...,
    circular: bool = True,
) -> ...
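
For what it's worth, a call under that hypothetical signature would then look like this (names invented for illustration):

# hypothetical usage of the suggested keyword-only signature
result = block_bootstrap(
    [fcst, obs],              # arrays passed as a single list
    blocks={"time": 10},      # everything after * must be passed by keyword
    circular=True,
)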

objects: The data to bootstrap, which can be multiple datasets. In the case where
multiple datasets are passed, each dataset can have its own set of dimensions. However,
for successful bootstrapping, dimensions across all input objects must be nested.
For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
Collaborator:

Suggested change
For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
For instance, for ``block.keys=['d1', 'd2', 'd3']``, an object with dimension 'd1' and

Note: sphinx inline-code monospacing requires two backticks. There may be other places where this needs to be done as well.

"""
num = 1
for value in chunks:
    value = max(value) if not isinstance(value, int) else value
Collaborator:

Consider using numpy.max(value) here instead of the if/else; it handles scalars as well, in which case it just behaves like an identity function. I think it also guards against non-numerics (e.g. strings) by default.
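
For instance, a quick check of that behaviour (my sketch, not part of the PR):

import numpy as np

# np.max on a scalar acts as an identity function
assert np.max(5) == 5
# np.max on a sequence picks the largest element, matching max(value)
assert np.max([2, 7, 3]) == 7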

Get the max chunk size in a dataset
"""

def size_of_chunk(chunks, itemsize):
Collaborator:

[optional]

Consider replacing this whole thing with numpy e.g. num = np.prod([np.max(x) for x in chunks]) (not tested...)
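
A runnable version of that idea might look like this (a sketch only; assumes chunks is a sequence whose elements are ints or tuples of ints, as dask reports them):

import numpy as np

def size_of_chunk_mb(chunks, itemsize):
    # largest chunk along each dimension, multiplied together,
    # then bytes converted to MB
    num = np.prod([np.max(c) for c in chunks])
    return itemsize * num / (1024 ** 2)

# e.g. chunks ((100, 100), (50,)) with 8-byte floats -> 100 * 50 * 8 bytes
print(size_of_chunk_mb(((100, 100), (50,)), 8))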

for value in chunks:
    value = max(value) if not isinstance(value, int) else value
    num = num * value
return itemsize * num / 1024**2
Collaborator:

[optional]

In 1024**2, ** takes precedence, but it can be vague (to me anyway), so I generally prefer to add parentheses, i.e. itemsize * num / (1024 ** 2). If black automagically removes them then don't worry about it.


ds = ds.to_dataset(name="ds") if isinstance(ds, xr.DataArray) else ds
Collaborator:

I think it may be better to flip this:

ds = ds if isinstance(ds, xr.Dataset) else ds.to_dataset(name="ds")

That way, even if ds is not a DataArray it'll forcefully be duck-typed and will raise an error if an incorrect data type is input.
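
To illustrate the difference (my sketch): with the original ordering a wrong type slips through silently, while the flipped form fails loudly:

import xarray as xr

ds = [1, 2, 3]  # not an xarray object

# original form: the isinstance check is False, so the list passes through silently
out = ds.to_dataset(name="ds") if isinstance(ds, xr.DataArray) else ds

# flipped form: the list is not a Dataset, so .to_dataset() is attempted
# and raises AttributeError immediately
out = ds if isinstance(ds, xr.Dataset) else ds.to_dataset(name="ds")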

if exclude_dims is None:
    exclude_dims = [[] for _ in range(len(objects_list))]
if (
    not isinstance(exclude_dims, list)
@nikeethr (Collaborator), May 23, 2024:

Could split this up into different error branches for clarity:

  1. lengths not equal => length mismatch
  2. elements are not lists, or container is not a list => unexpected data structure
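
A sketch of what that split might look like (hypothetical messages, using the names from the surrounding code):

if not isinstance(exclude_dims, list) or any(
    not isinstance(e, list) for e in exclude_dims
):
    raise TypeError(
        "`exclude_dims` must be a list of lists of dimensions to exclude"
    )
if len(exclude_dims) != len(objects_list):
    raise ValueError(
        f"`exclude_dims` has {len(exclude_dims)} entries but "
        f"{len(objects_list)} objects were passed"
    )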

kwargs={"indices": [ind]},
input_core_dims=[core_dims],
output_core_dims=[core_dims + ["iteration"]],
dask="parallelized",
Collaborator:

Might be worth propagating this to an API argument in block_bootstrap so that the user can forcefully disable dask for the ufuncs for whatever reason.
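
For example (a sketch; the parameter name dask_mode and the surrounding names are placeholders), the setting could be surfaced as an argument and forwarded to xr.apply_ufunc, which accepts "parallelized", "allowed", or "forbidden" for its dask argument:

def block_bootstrap(*objects, blocks, n_iteration, dask_mode="parallelized"):
    ...
    return xr.apply_ufunc(
        bootstrap_func,  # placeholder for the module's ufunc
        *objects,
        kwargs={"indices": [ind]},
        input_core_dims=[core_dims],
        output_core_dims=[core_dims + ["iteration"]],
        dask=dask_mode,  # "forbidden" lets the user disable dask entirely
    )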

"objects containing lists of dimensions to exclude for each object"
)
renames = []
for i, (obj, exclude) in enumerate(zip(objects_list, exclude_dims)):
Collaborator:

What should be the expected behaviour if a dimension to exclude is also named "dim{x}" and x == ii?

Is there a reason for renaming as opposed to dropping/splitting and re-combining?

sizes = OrderedDict(
    {d: (obj.sizes[d], b) for d, b in blocks.items()},
)
break
@nikeethr (Collaborator), May 23, 2024:

break is a bit ambiguous. Maybe add a comment here saying that only one obj with all keys present is required.

You could also use itertools.dropwhile or similar, to avoid excessive control flow:

try:
    # find the first obj with all keys in `blocks`
    obj = next(itertools.dropwhile(
        lambda x: x.sizes.keys() != blocks.keys(),
        objects_list,
    ))
    sizes = OrderedDict(...)
except StopIteration:
    # no obj found that matches all the keys in `blocks`
    raise ValueError(...)

@nikeethr (Collaborator):

@reza-armuei Thanks for this. I've gone through about 40-50% of it. Looks good so far, no major functional issues - mostly structural comments/alternatives/suggestions. I work at flexible times, so don't feel the need to respond to comments straight away :).

for obj in objects_list:
    try:
        sizes = OrderedDict(
            {d: (obj.sizes[d], b) for d, b in blocks.items()},
Collaborator:

I believe you have to initialize an OrderedDict with a list of tuples, otherwise it doesn't maintain order on initialization. In more recent versions of Python, dict is already ordered, so this behaviour may have been obfuscated.

This is a safer way, since lists are by definition ordered (assuming the way I remember initializing ordered dicts is still accurate):

sizes = OrderedDict([
    (d, (obj.sizes[d], b))
    for d, b in blocks.items()
])

MAX_CHUNK_SIZE_MB = 200


def _get_blocked_random_indices(
Collaborator:

We may need to walk through this logic. Currently the logic feels a bit coupled (to me anyway).

For example, it seems that shape is mutable, so it's pretty tricky for me to evaluate what it's trying to do. shape[prev_ax] = math.ceil(shape[prev_ax] / b) implies that / b could happen multiple times depending on the number of axes. Doing it once implies shape / block_size = num_blocks, but doing it again, block_size / num_blocks, implies there's some iterative reduction in the sampling space for that axis. But this reduction is controlled by an iterative call in another function: _n_nested_blocked_random_indices.

There may be more explicit ways to rearrange the logic in this function for clarity. Maybe a tree structure could help with organizing the data? Example:

     axis_1 │       sample_1        sample_2    ...  N_1   
            │          │               │                   
            │     ┌────┼────┐      ┌───┼───┐               
            │     │    │    │      │   │   │               
            │     │    │    │      │   │   │               
     axis_2 │   s_1   s_2   s_n    ▼   ▼   ▼    ...  N_2   
            │    │     │     │     │   │   │               
       ...  │    │     │     │     │   │   │               
            │  ┌─┼─┐ ┌─┼─┐   │   ┌─┼─┬─┼─┬─┼─┐             
            │  │ │ │ │ │ │   │   │ │ │ │ │ │ │             
     axis_n │  ▼ ▼ ▼ ▼ ▼ ▼   ▼   ▼ ▼ ▼ ▼ ▼ ▼ ▼  ...  N_n   
   ─────────┴───────────────────────────────────────────   
              samples--->                                                                  
                                                                                                   
    - each leaf contains n-tuple representing the combined sample index pointing to a single entry.
    - repeat * n_iter and stack if we want multiple of these.
    - each axis level is randomly sampled based on its block size, for each parent sample/block.

But this also may not be right if there are restrictions on what the underlying algorithm should extract and how it needs to do it.

Happy to discuss. Maybe once @tennlee also has had a look...

indices = OrderedDict()
prev_blocks: List[int] = []
for ax, (key, (_, block)) in enumerate(sizes.items()):
    indices[key] = _get_blocked_random_indices(shape[: ax + 1] + [n_iteration], ax, block, prev_blocks, circular)
@nikeethr (Collaborator), May 24, 2024:

Ok so, the combination of slicing shape[: ax + 1] and appending + [n_iteration] creates a copy of shape (because it's a list). Therefore, even if it is manipulated in _get_blocked_random_indices (see previous comment), the slicing creates a unique object each time, and hence the function doesn't mutate the original shape.

However, this wasn't obvious to me initially. Again, it may be good to make this behaviour explicit by passing in a "frozen" list or similar, and/or by creating a .copy() in _get_blocked_random_indices and manipulating the copy instead of the passed-in shape.
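
A small demonstration of why the slice protects the caller (my sketch):

shape = [4, 5, 6]
sliced = shape[:2] + [10]   # slicing + concatenation builds a brand-new list
sliced[0] = 999             # mutating the new list...
print(shape)                # ...leaves the original untouched: [4, 5, 6]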

@@ -0,0 +1,410 @@
"""
Functions for performing block bootstrapping of arrays. This is inspired from
https://github.com/dougiesquire/xbootstrap with modifications to make the functions more
@nikeethr (Collaborator), May 24, 2024:

I just noticed this, and the implementation is similar enough to https://github.com/dougiesquire/xbootstrap/blob/main/xbootstrap/core.py that if we want to use it "as-is" we should probably include their license/permission notice.

Alternatively, we should consider re-implementing it as a unique solution, using xbootstrap as a test case (offline, not part of the unit tests) to check that the results are the same. (This would obviously take more time, but is definitely doable, and in a more efficient way in my opinion.)

This is one for @tennlee to make the call on (I would suggest holding off on addressing any major comments until then).

@tennlee (Collaborator), May 25, 2024:

I'll have to really double check this, but my understanding is that the MIT license used by xbootstrap and the Apache 2.0 license are compatible. If so, there should be a viable way forward here; we'll just need to make sure we attribute, reference, and include license statements correctly. If there is an unrelated reason to re-implement the code (e.g. efficiency, compatibility, etc.) then that makes sense to do regardless. We can have a discussion next week.

@nikeethr (Collaborator), May 25, 2024:

@reza-armuei I'm going to mark this as "draft" for now, just as a guard.

@tennlee I haven't gone through everything, but my 2c: the code uses a few different paradigms (itertools, math, list comprehensions, basic for loops, and numpy) rather arbitrarily. A lot of this could be pure numpy, with some exceptions for basic operations; this would improve clarity and, as a bonus, speed. The structure of the code seems overly complex for what I believe it's trying to do, and there are a lot of implicit assumptions. Type safety could also be increased and mutability reduced. Having said that, it may still be functionally correct in its implementation.

We can discuss more next week.

reza-armuei (Collaborator, Author):

@nikeethr Thank you so much for taking the time to review and for your comments. I agree with you that the current structure is complex and can be improved. I have started applying your comments (it may take time, as I am working on a few other tasks too).
@nikeethr and @tennlee, I am open to any suggestions to improve it.

@nikeethr (Collaborator):

I'll just share this summary of my notes on the algorithm here, for reference. It'd be easier if we go through it in a discussion over a call.


The way I understand the algorithm (feel free to correct me), it is trying to sample blocks (with replacement), essentially to re-arrange/randomize the domain without breaking any dependence relationship. This can be summarized as:

block_sample = {
    (x_1, x_2, ..., x_n)
    such that |x_i - x_i_hat| <= b_i,
    for all x_i, and fixed x_i_hat in each dimension i
}
where,
    i = 1,...,n
    n = number axes
    b_i = dependence distance for axis i (or block size)
    x_i = indices in dimension i
    x_i_hat = index of i'th axis in random sample

If we want to add cyclic operations then the norm/distance operator |.| will have to account for the space being cyclic (instead of linearly ordered). If there are partial blocks because the block size is not a multiple of the domain size, then we either: trim, expand, or provide partial-block sampling (not recommended, as it will sever dependence).

If this is accurate, then here are a few things to consider:

  • numpy.random.default_rng() -> rng.integers() is actually very performant, and it should be used to sample in bulk.
  • It is sufficient to reproduce the algorithm by generating a random point cloud (which is a one-liner), and then expand each point into a block.
  • This will result in an array of extents for each axis, which I believe is a perfectly reasonable way of indexing arrays in xarray/numpy format.
  • If we want sampling to be cyclic, we apply the modulo operator when expanding (suspected partial) blocks. Or a faster (but more memory consuming) way is to just pad the axes.
  • If we don't want them to be cyclic, we just restrict the sample domain.
  • In order to preserve the "nested" sampling, it is sufficient to preserve
    1. the order (with no jumps) of the dimensions for every input array
    2. for inputs of lower dimensionality, we simply ignore the dimensions not present and broadcast to conform to the other arrays.
  • Use dask & apply_ufunc optionally (based on user preference/compatibility) for the outer repetitions i.e. num_iterations

The above seems like a lot of considerations, but honestly the resulting code imo would be quite a bit simpler.
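
A minimal sketch of that point-cloud approach for a single axis (my code, not from the PR; circular wrapping via the modulo operator):

import numpy as np

def circular_block_indices(n, block_size, n_blocks, rng=None):
    # sample block start points in bulk with the performant default_rng,
    # then expand each start into a full block, wrapping cyclically
    rng = rng or np.random.default_rng()
    starts = rng.integers(0, n, size=n_blocks)       # the random point cloud
    offsets = np.arange(block_size)                  # expand each point into a block
    return (starts[:, None] + offsets[None, :]) % n  # modulo handles the wrap

# resample a length-10 axis with block size 3: ceil(10 / 3) = 4 blocks,
# then trim the flattened indices back to the axis length
idx = circular_block_indices(10, 3, 4).ravel()[:10]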

@nikeethr nikeethr marked this pull request as draft May 25, 2024 11:13
@nicholasloveday (Collaborator) left a comment:

There are also some spelling and grammatical issues with the notebook (e.g., "technic", "threats", etc.). The change is too big for me to provide direct feedback on the diffs. Perhaps do a first review pass yourself, and then I'll go through it in more detail.

I still need to work through the code. In a separate pull request, it might be nice to do a further update that explains how it works and gives some tips on how to use it (e.g., selecting the block length).

@@ -0,0 +1,306 @@
"""
This module contains unit tests for scores.stats.tests.block_bootstrap
Collaborator:

Can you add some tests where you have NaNs in your test data?

result = block_bootstrap(
    *objects, blocks=blocks, n_iteration=n_iteration, exclude_dims=exclude_dims, circular=circular
)
assert result.shape == expected_shape
Collaborator:

Can you test that result.data is a dask object, and that when you run result.compute() the underlying object is an np.ndarray? See https://github.com/nci/scores/blob/develop/tests/categorical/test_multicategorical.py#L283 for an example.
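
Something along these lines, perhaps (a sketch that reuses the fixtures from the quoted test above and assumes the inputs are dask-backed):

import dask.array as da
import numpy as np

def test_block_bootstrap_dask():
    result = block_bootstrap(
        *objects, blocks=blocks, n_iteration=n_iteration,
        exclude_dims=exclude_dims, circular=circular,
    )
    # lazily evaluated: the wrapped data should still be a dask array
    assert isinstance(result.data, da.Array)
    # after compute, the underlying data should be a plain numpy array
    assert isinstance(result.compute().data, np.ndarray)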

@nikeethr nikeethr changed the title 114 add circular block bootstrapping (requires design discussion) 114 add circular block bootstrapping May 28, 2024
@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 5191ad2 to 90bcf78 Compare May 28, 2024 05:10
@nikeethr (Collaborator) commented Jun 9, 2024:

@reza-armuei Ok, I've put in some more updates to clean up the sample code a bit. I think I'm reasonably happy with its current state, and in my eyes it's more aligned with the coding style in this package. It currently does a single iteration of bootstrapping on a single numpy array input:
src/scores/emerging/block_bootstrap

I've put some TODOs in the module docstring. If the core implementation is fine, the TODOs are relatively straightforward to expand on.

However, we should throw some real data at it to make sure that it works properly.

Note: pipeline is expected to fail due to code coverage

I've also kept the original implementation intact to compare against. I feel like this should go under emerging if we choose to start from my version.

@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 5591bd3 to 0ab4965 Compare June 9, 2024 23:56
@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 8429051 to d703d87 Compare June 11, 2024 00:39
@nikeethr (Collaborator) commented Jun 11, 2024:

Before implementing the xarray/dask portions, here are a couple of thoughts looking back at the original xbootstrap implementation:

  • I'm not certain if exclude_dims is needed.
  • I'm not certain if iterations actually need to have chunk size > 1.
  • The inference of chunk sizes may also not be necessary: inferring chunk sizes is mostly to determine how to chunk the iterations dimension, but the underlying ufunc is numpy-based (the dask setting is "parallelized"), which implies that all the core dims will likely be squashed in the compute.

I think we can start with a simpler ufunc call and re-test the above assumptions. In fact, there is no real harm in limiting the chunk size of iterations to 1 (except maybe speed, due to under-utilization of broadcasting; dask should be smart enough to manage memory, which is one of its selling points).

@nikeethr (Collaborator) commented Jun 12, 2024:

In the interest of keeping things neater, I'll pull all my changes out into a separate pull request in a fork, given that I'm utilizing the emerging space. That way this pull request only addresses comments from the xbootstrap port.

@nikeethr (Collaborator) commented Jun 12, 2024:

I've moved my sample implementation here: #523

My updated recommendation for this PR is to:

  • add xbootstrap as either a dependency or extra-dependency for now.
  • remove any ported code.
  • write wrapper functions (under tests) around the internal xbootstrap API calls we would like to test, as a sense check against incompatibilities.
  • write a wrapper function (with docstrings) around the main xbootstrap API call (under scores.processing.block_bootstrap), expose that to users, and use that in the tutorial instead.
  • if we plan to use an extra-dependency, we should do an "import check" at the top of the file and emit an incompatibility error if xbootstrap is not installed (see the sketch below).

The above will resolve most of my review comments, and then the rest of the review would just involve additional tests, checking the documentation and grammar, as well as a science review.
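
For the import check, a minimal sketch (hypothetical message text, assuming xbootstrap becomes an optional extra):

try:
    import xbootstrap
except ImportError as exc:  # xbootstrap is an optional extra
    raise ImportError(
        "block_bootstrap requires the optional dependency `xbootstrap`; "
        "install it to use this function"
    ) from exc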

Also commented in #114; split out an issue to track the alternative implementation here: #522

@nikeethr nikeethr removed the emerging label Jun 12, 2024