(requires design discussion) 114 add circular block bootstrapping #418

Draft · wants to merge 9 commits into base: develop
Conversation

reza-armuei (Collaborator):
Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args.
  • Typehints added
  • Docstrings complete and follow Napoleon (Google) style
  • Reference to paper/webpage is in docstring
  • Add error handling
  • Imported into the API

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that metric is compatible with dask.
  • Test that metrics work with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test both reduce and preserve dims arguments work
  • Test that errors are raised as expected
  • Test that it works with both xr.DataArrays and xr.Datasets

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Explanation.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

@reza-armuei reza-armuei linked an issue May 23, 2024 that may be closed by this pull request


def block_bootstrap(
    *objects: XarrayLike,
@nikeethr (Collaborator), May 23, 2024:

If these are necessarily array-like items and necessarily contained in a list, then it may be more explicit to:

  1. Call them "arrays", "da_list", or similar.
  2. Use an Iterable, e.g. a list, rather than variadic arguments, and use a forceful * afterwards to split out keyword-only arguments. (Furthermore, this is generally how I've seen it done in the code-base, so it'll be more consistent.)

e.g.

def block_bootstrap(
    array_list: List[XarrayLike],
    *,  # force remaining args to be keyword only
    blocks: Dict[str, int],
    ...,
    circular: bool = True,
) -> ...
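
For what it's worth, a call under that hypothetical signature would then look like this (names invented for illustration):

# hypothetical usage of the suggested keyword-only signature
result = block_bootstrap(
    [fcst, obs],              # arrays passed as a single list
    blocks={"time": 10},      # everything after * must be passed by keyword
    circular=True,
)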

objects: The data to bootstrap, which can be multiple datasets. In the case where
multiple datasets are passed, each dataset can have its own set of dimensions. However,
for successful bootstrapping, dimensions across all input objects must be nested.
For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
Collaborator:

Suggested change
For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
For instance, for ``block.keys=['d1', 'd2', 'd3']``, an object with dimension 'd1' and

Note: sphinx inline-code monospacing requires two backticks. There may be other places where this needs to be done as well.

"""
num = 1
for value in chunks:
    value = max(value) if not isinstance(value, int) else value
Collaborator:

Consider using numpy.max(value) here instead of the if/else; it handles scalars as well, in which case it just behaves like an identity function. I think it also guards against non-numerics (e.g. strings) by default.
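
For instance, a quick check of that behaviour (my sketch, not part of the PR):

import numpy as np

# np.max on a scalar acts as an identity function
assert np.max(5) == 5
# np.max on a sequence picks the largest element, matching max(value)
assert np.max([2, 7, 3]) == 7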

Get the max chunk size in a dataset
"""

def size_of_chunk(chunks, itemsize):
Collaborator:

[optional]

Consider replacing this whole thing with numpy e.g. num = np.prod([np.max(x) for x in chunks]) (not tested...)
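
A runnable version of that idea might look like this (a sketch only; assumes chunks is a sequence whose elements are ints or tuples of ints, as dask reports them):

import numpy as np

def size_of_chunk_mb(chunks, itemsize):
    # largest chunk along each dimension, multiplied together,
    # then bytes converted to MB
    num = np.prod([np.max(c) for c in chunks])
    return itemsize * num / (1024 ** 2)

# e.g. chunks ((100, 100), (50,)) with 8-byte floats -> 100 * 50 * 8 bytes
print(size_of_chunk_mb(((100, 100), (50,)), 8))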

for value in chunks:
    value = max(value) if not isinstance(value, int) else value
    num = num * value
return itemsize * num / 1024**2
Collaborator:

[optional]

In 1024**2, ** takes precedence, but it can be vague (to me anyway), so I generally prefer to add parentheses, i.e. itemsize * num / (1024 ** 2). If black automagically removes them then don't worry about it.


ds = ds.to_dataset(name="ds") if isinstance(ds, xr.DataArray) else ds
Collaborator:

I think it may be better to flip this:

ds = ds if isinstance(ds, xr.Dataset) else ds.to_dataset(name="ds")

That way, even if ds is not a DataArray it'll forcefully be duck-typed and will raise an error if an incorrect data type is input.
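
To illustrate the difference (my sketch): with the original ordering a wrong type slips through silently, while the flipped form fails loudly:

import xarray as xr

ds = [1, 2, 3]  # not an xarray object

# original form: the isinstance check is False, so the list passes through silently
out = ds.to_dataset(name="ds") if isinstance(ds, xr.DataArray) else ds

# flipped form: the list is not a Dataset, so .to_dataset() is attempted
# and raises AttributeError immediately
out = ds if isinstance(ds, xr.Dataset) else ds.to_dataset(name="ds")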

if exclude_dims is None:
    exclude_dims = [[] for _ in range(len(objects_list))]
if (
    not isinstance(exclude_dims, list)
@nikeethr (Collaborator), May 23, 2024:

Could split this up into different error branches for clarity:

  1. lengths not equal => length mismatch
  2. elements are not lists, or container is not a list => unexpected data structure
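
A sketch of what that split might look like (hypothetical messages, using the names from the surrounding code):

if not isinstance(exclude_dims, list) or any(
    not isinstance(e, list) for e in exclude_dims
):
    raise TypeError(
        "`exclude_dims` must be a list of lists of dimensions to exclude"
    )
if len(exclude_dims) != len(objects_list):
    raise ValueError(
        f"`exclude_dims` has {len(exclude_dims)} entries but "
        f"{len(objects_list)} objects were passed"
    )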

kwargs={"indices": [ind]},
input_core_dims=[core_dims],
output_core_dims=[core_dims + ["iteration"]],
dask="parallelized",
Collaborator:

Might be worth propagating this to an API argument in block_bootstrap so that the user can forcefully disable dask for the ufuncs for whatever reason.
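
For example (a sketch; the parameter name dask_mode and the surrounding names are placeholders), the setting could be surfaced as an argument and forwarded to xr.apply_ufunc, which accepts "parallelized", "allowed", or "forbidden" for its dask argument:

def block_bootstrap(*objects, blocks, n_iteration, dask_mode="parallelized"):
    ...
    return xr.apply_ufunc(
        bootstrap_func,  # placeholder for the module's ufunc
        *objects,
        kwargs={"indices": [ind]},
        input_core_dims=[core_dims],
        output_core_dims=[core_dims + ["iteration"]],
        dask=dask_mode,  # "forbidden" lets the user disable dask entirely
    )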

"objects containing lists of dimensions to exclude for each object"
)
renames = []
for i, (obj, exclude) in enumerate(zip(objects_list, exclude_dims)):
Collaborator:

What should be the expected behaviour if a dimension to exclude is also named "dim{x}" and x == ii?

Is there a reason for renaming as opposed to dropping/splitting and re-combining?

sizes = OrderedDict(
    {d: (obj.sizes[d], b) for d, b in blocks.items()},
)
break
@nikeethr (Collaborator), May 23, 2024:

break is a bit ambiguous. Maybe add a comment here saying that only one obj with all keys present is required.

You could also use itertools.dropwhile or similar, to avoid excessive control flow:

try:
    # find the first obj with all keys in `blocks`
    obj = next(itertools.dropwhile(
        lambda x: x.sizes.keys() != blocks.keys(),
        objects_list,
    ))
    sizes = OrderedDict(...)
except StopIteration:
    # no obj found that matches all the keys in `blocks`
    raise ValueError(...)

@nikeethr (Collaborator):

@reza-armuei Thanks for this. I've gone through about 40-50% of it. Looks good so far, no major functional issues - mostly structural comments/alternatives/suggestions. I work at flexible times, so don't feel the need to respond to comments straight away :).

for obj in objects_list:
    try:
        sizes = OrderedDict(
            {d: (obj.sizes[d], b) for d, b in blocks.items()},
Collaborator:

I believe you have to initialize an OrderedDict with a list of tuples, otherwise it doesn't maintain order on initialization. In more recent versions of Python, dict is already ordered, so this behaviour may have been obfuscated.

This is a safer way, since lists are by definition ordered (assuming the way I remember initializing ordered dicts is still accurate):

sizes = OrderedDict([
    (d, (obj.sizes[d], b))
    for d, b in blocks.items()
])

MAX_CHUNK_SIZE_MB = 200


def _get_blocked_random_indices(
Collaborator:

We may need to walk through this logic. Currently the logic feels a bit coupled (to me anyway).

For example, it seems that shape is mutable, so it's pretty tricky for me to evaluate what it's trying to do. shape[prev_ax] = math.ceil(shape[prev_ax] / b) implies that / b could happen multiple times depending on the number of axes. Doing it once implies shape / block_size = num_blocks, but doing it again, block_size / num_blocks, implies there's some iterative reduction in the sampling space for that axis. But this reduction is controlled by an iterative call in another function: _n_nested_blocked_random_indices.

There may be more explicit ways to rearrange the logic in this function for clarity. Maybe a tree structure could help with organizing the data? Example:

     axis_1 │       sample_1        sample_2    ...  N_1   
            │          │               │                   
            │     ┌────┼────┐      ┌───┼───┐               
            │     │    │    │      │   │   │               
            │     │    │    │      │   │   │               
     axis_2 │   s_1   s_2   s_n    ▼   ▼   ▼    ...  N_2   
            │    │     │     │     │   │   │               
       ...  │    │     │     │     │   │   │               
            │  ┌─┼─┐ ┌─┼─┐   │   ┌─┼─┬─┼─┬─┼─┐             
            │  │ │ │ │ │ │   │   │ │ │ │ │ │ │             
     axis_n │  ▼ ▼ ▼ ▼ ▼ ▼   ▼   ▼ ▼ ▼ ▼ ▼ ▼ ▼  ...  N_n   
   ─────────┴───────────────────────────────────────────   
              samples--->                                                                  
                                                                                                   
    - each leaf contains n-tuple representing the combined sample index pointing to a single entry.
    - repeat * n_iter and stack if we want multiple of these.
    - each axis level is randomly sampled based on its block size, for each parent sample/block.

But this also may not be right if there are restrictions on what the underlying algorithm should extract and how it needs to do it.

Happy to discuss. Maybe once @tennlee also has had a look...

indices = OrderedDict()
prev_blocks: List[int] = []
for ax, (key, (_, block)) in enumerate(sizes.items()):
    indices[key] = _get_blocked_random_indices(shape[: ax + 1] + [n_iteration], ax, block, prev_blocks, circular)
@nikeethr (Collaborator), May 24, 2024:

Ok so, the combination of slicing shape[: ax + 1] and appending + [n_iteration] creates a copy of shape (because it's a list). Therefore, even if it is manipulated in _get_blocked_random_indices (see previous comment), the slicing creates a unique object each time, and hence the function doesn't mutate the original shape.

However, this wasn't obvious to me initially. Again, it may be good to make this behaviour explicit by passing in a "frozen" list or similar, and/or by creating a .copy() in _get_blocked_random_indices and manipulating the copy instead of the passed-in shape.
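
A small demonstration of why the slice protects the caller (my sketch):

shape = [4, 5, 6]
sliced = shape[:2] + [10]   # slicing + concatenation builds a brand-new list
sliced[0] = 999             # mutating the new list...
print(shape)                # ...leaves the original untouched: [4, 5, 6]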

@@ -0,0 +1,410 @@
"""
Functions for performing block bootstrapping of arrays. This is inspired from
https://github.com/dougiesquire/xbootstrap with modifications to make the functions more
@nikeethr (Collaborator), May 24, 2024:

I just noticed this, and the implementation is similar enough to https://github.com/dougiesquire/xbootstrap/blob/main/xbootstrap/core.py that if we want to use it "as-is" we should probably include their license/permission notice.

Alternatively, we should consider re-implementing it as a unique solution, using xbootstrap as a test case (offline, not part of the unit tests) to check that the results are the same. (This would obviously take more time, but is definitely doable, and in a more efficient way in my opinion.)

This is one for @tennlee to make the call on (I would suggest holding off on addressing any major comments until then).

@tennlee (Collaborator), May 25, 2024:

I'll have to really double check this, but my understanding is that the MIT license used by xbootstrap and the Apache 2.0 license are compatible. If so, there should be a viable way forward here; we'll just need to make sure we attribute, reference, and include license statements correctly. If there is an unrelated reason to re-implement the code (e.g. efficiency, compatibility, etc.) then that makes sense to do regardless. We can have a discussion next week.

@nikeethr (Collaborator), May 25, 2024:

@reza-armuei I'm going to mark this as "draft" for now, just as a guard.

@tennlee I haven't gone through everything, but my 2c: the code uses a few different paradigms (itertools, math, list comprehensions, basic for loops, and numpy) rather arbitrarily. A lot of this could be pure numpy, with some exceptions for basic operations; this would improve clarity and, as a bonus, speed. The structure of the code seems overly complex for what I believe it's trying to do, and there are a lot of implicit assumptions. Type safety could also be increased and mutability reduced. Having said that, it may still be functionally correct in its implementation.

We can discuss more next week.

reza-armuei (Collaborator, Author):

@nikeethr Thank you so much for taking the time to review and for your comments. I agree with you that the current structure is complex and can be improved. I have started applying your comments (it may take time, as I am working on a few other tasks too).
@nikeethr and @tennlee, I am open to any suggestions to improve it.

@nikeethr (Collaborator):

I'll just share this summary of my notes on the algorithm here, for reference. It'd be easier if we go through it in a discussion over a call.


The way I understand the algorithm (feel free to correct me), it is trying to sample blocks (with replacement), essentially to re-arrange/randomize the domain without breaking any dependence relationship. This can be summarized as:

block_sample = {
    (x_1, x_2, ..., x_n)
    such that |x_i - x_i_hat| <= b_i,
    for all x_i, and fixed x_i_hat in each dimension i
}
where,
    i = 1,...,n
    n = number axes
    b_i = dependence distance for axis i (or block size)
    x_i = indices in dimension i
    x_i_hat = index of i'th axis in random sample

If we want to add cyclic operations then the norm/distance operator |.| will have to account for the space being cyclic (instead of linearly ordered). If there are partial blocks because the block size is not a multiple of the domain size, then we either: trim, expand, or provide partial-block sampling (not recommended, as it will sever dependence).

If this is accurate, then here are a few things to consider:

  • numpy.random.default_rng() -> rng.integers() is actually very performant, and it should be used to sample in bulk.
  • It is sufficient to reproduce the algorithm by generating a random point cloud (which is a one-liner), and then expand each point into a block.
  • This will result in an array of extents for each axis, which I believe is a perfectly reasonable way of indexing arrays in xarray/numpy format.
  • If we want sampling to be cyclic, we apply the modulo operator when expanding (suspected partial) blocks. Or a faster (but more memory consuming) way is to just pad the axes.
  • If we don't want them to be cyclic, we just restrict the sample domain.
  • In order to preserve the "nested" sampling, it is sufficient to preserve
    1. the order (with no jumps) of the dimensions for every input array
    2. for inputs of lower dimensionality, we simply ignore the dimensions not present and broadcast to conform to the other arrays.
  • Use dask & apply_ufunc optionally (based on user preference/compatibility) for the outer repetitions i.e. num_iterations

The above seems like a lot of considerations, but honestly the resulting code imo would be quite a bit simpler.
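
A minimal sketch of that point-cloud approach for a single axis (my code, not from the PR; circular wrapping via the modulo operator):

import numpy as np

def circular_block_indices(n, block_size, n_blocks, rng=None):
    # sample block start points in bulk with the performant default_rng,
    # then expand each start into a full block, wrapping cyclically
    rng = rng or np.random.default_rng()
    starts = rng.integers(0, n, size=n_blocks)       # the random point cloud
    offsets = np.arange(block_size)                  # expand each point into a block
    return (starts[:, None] + offsets[None, :]) % n  # modulo handles the wrap

# resample a length-10 axis with block size 3: ceil(10 / 3) = 4 blocks,
# then trim the flattened indices back to the axis length
idx = circular_block_indices(10, 3, 4).ravel()[:10]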

@nikeethr nikeethr marked this pull request as draft May 25, 2024 11:13
@nicholasloveday (Collaborator) left a comment:

There are also some spelling and grammatical issues with the notebook (e.g., "technic", "threats", etc.). The change is too big for me to provide direct feedback on the diffs. Perhaps do a first review pass yourself, and then I'll go through it in more detail.

I still need to work through the code. In a separate pull request, it might be nice to do a further update that explains how it works and gives some tips on how to use it (e.g., selecting the block length).

@@ -0,0 +1,306 @@
"""
This module contains unit tests for scores.stats.tests.block_bootstrap
Collaborator:

Can you add some tests where you have NaNs in your test data?

result = block_bootstrap(
    *objects, blocks=blocks, n_iteration=n_iteration, exclude_dims=exclude_dims, circular=circular
)
assert result.shape == expected_shape
Collaborator:

Can you test that result.data is a dask object, and that when you run result.compute() the underlying object is an np.ndarray? See https://github.com/nci/scores/blob/develop/tests/categorical/test_multicategorical.py#L283 for an example.
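
Something along these lines, perhaps (a sketch that reuses the fixtures from the quoted test above and assumes the inputs are dask-backed):

import dask.array as da
import numpy as np

def test_block_bootstrap_dask():
    result = block_bootstrap(
        *objects, blocks=blocks, n_iteration=n_iteration,
        exclude_dims=exclude_dims, circular=circular,
    )
    # lazily evaluated: the wrapped data should still be a dask array
    assert isinstance(result.data, da.Array)
    # after compute, the underlying data should be a plain numpy array
    assert isinstance(result.compute().data, np.ndarray)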

@nikeethr nikeethr changed the title 114 add circular block bootstrapping (requires design discussion) 114 add circular block bootstrapping May 28, 2024
@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 5191ad2 to 90bcf78 Compare May 28, 2024 05:10
@nikeethr (Collaborator) commented Jun 9, 2024:

@reza-armuei Ok, I've put in some more updates to clean up the sample code a bit. I think I'm reasonably happy with its current state, and in my eyes it's more aligned with the coding style in this package. It currently does a single iteration of bootstrapping on a single numpy array input:
src/scores/emerging/block_bootstrap

I've put some TODOs in the module docstring. If the core implementation is fine, the TODOs are relatively straightforward to expand on.

However, we should throw some real data at it to make sure that it works properly.

Note: pipeline is expected to fail due to code coverage

I've also kept the original implementation intact to compare against. I feel like this should go under emerging if we choose to start from my version.

@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 5591bd3 to 0ab4965 Compare June 9, 2024 23:56
@nikeethr nikeethr force-pushed the 114-add-circular-block-bootstrapping branch from 8429051 to d703d87 Compare June 11, 2024 00:39
@nikeethr (Collaborator) commented Jun 11, 2024:

Before implementing the xarray/dask portions, here are a couple of thoughts looking back at the original xbootstrap implementation:

  • I'm not certain if exclude_dims is needed.
  • I'm not certain if iterations actually need to have chunk size > 1.
  • The inference of chunk sizes may also not be necessary: inferring chunk sizes is mostly to determine how to chunk the iterations dimension, but the underlying ufunc is numpy-based (the dask setting is "parallelized"), which implies that all the core dims will likely be squashed in the compute.

I think we can start with a simpler ufunc call and re-test the above assumptions. In fact, there is no real harm in limiting the chunk size of iterations to 1 (except maybe speed, due to under-utilization of broadcasting; dask should be smart enough to manage memory, which is one of its selling points).

@nikeethr (Collaborator) commented Jun 12, 2024:

In the interest of keeping things neater, I'll pull all my changes out into a separate pull request in a fork, given that I'm utilizing the emerging space. That way this pull request only addresses comments from the xbootstrap port.

@nikeethr (Collaborator) commented Jun 12, 2024:

I've moved my sample implementation here: #523

My updated recommendation for this PR is to:

  • add xbootstrap as either a dependency or extra-dependency for now.
  • remove any ported code.
  • write wrapper functions (under tests) around the internal xbootstrap API calls we would like to test, as a sense check against incompatibilities.
  • write a wrapper function (with docstrings) around the main xbootstrap API call (under scores.processing.block_bootstrap), expose that to users, and use that in the tutorial instead.
  • if we plan to use an extra-dependency, we should do an "import check" at the top of the file and emit an incompatibility error if xbootstrap is not installed (see the sketch below).

The above will resolve most of my review comments, and then the rest of the review would just involve additional tests, checking the documentation and grammar, as well as a science review.
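
For the import check, a minimal sketch (hypothetical message text, assuming xbootstrap becomes an optional extra):

try:
    import xbootstrap
except ImportError as exc:  # xbootstrap is an optional extra
    raise ImportError(
        "block_bootstrap requires the optional dependency `xbootstrap`; "
        "install it to use this function"
    ) from exc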

Also commented in #114; split out an issue to track the alternative implementation here: #522

@nikeethr nikeethr removed the emerging label Jun 12, 2024