(requires design discussion) 114 add circular block bootstrapping #418
base: develop
Conversation
def block_bootstrap(
    *objects: XarrayLike,
If these are necessarily array-like items and necessarily contained in a list, then it may be more explicit to:
- call them "arrays", "da_list", or similar.
- use an `Iterable` (e.g. a list) rather than variadic arguments, and use a forceful `*` after it to split out keyword-only arguments. (Furthermore, this is generally how I've seen it done in the code-base, so it'll be more consistent.)
e.g.

def block_bootstrap(
    array_list: List[XarrayLike],
    *,  # force remaining args to be keyword only
    blocks: Dict[str, int],
    ...,
    circular: bool = True,
) -> ...
    objects: The data to bootstrap, which can be multiple datasets. In the case where
        multiple datasets are passed, each dataset can have its own set of dimension. However,
        for successful bootstrapping, dimensions across all input objects must be nested.
        For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
-        For instance, for `block.keys=['d1', 'd2', 'd3'], an object with dimension 'd1' and
+        For instance, for ``block.keys=['d1', 'd2', 'd3']``, an object with dimension 'd1' and
Note: Sphinx inline-code monospacing requires two backticks. There may be other places where this needs to be done as well.
""" | ||
num = 1 | ||
for value in chunks: | ||
value = max(value) if not isinstance(value, int) else value |
Consider using `numpy.max(value)` here instead of the if/else; it handles scalars as well, in which case it just behaves like an identity func. It also guards against non-numerics (e.g. strings) by default, I think.
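A minimal sketch of the suggestion, assuming `chunks` iterates over dask-style chunk specs that may be plain ints or tuples of ints (the sample values are made up):

import numpy as np

# hypothetical chunk spec mixing scalars and tuples
chunks = [4, (2, 3), (5,)]

num = 1
for value in chunks:
    # np.max reduces sequences and acts as an identity on scalars,
    # so the isinstance if/else branch is no longer needed
    num = num * np.max(value)

print(num)  # 4 * 3 * 5 = 60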
    Get the max chunk size in a dataset
    """


def size_of_chunk(chunks, itemsize):
[optional] Consider replacing this whole thing with numpy, e.g. `num = np.prod([np.max(x) for x in chunks])` (not tested...)
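For instance, a sketch of what `size_of_chunk` might collapse to under that suggestion (assuming `chunks` is an iterable of ints or tuples of ints, as above):

import numpy as np

def size_of_chunk(chunks, itemsize):
    # product of the max extent along each axis; np.max handles both
    # scalars and tuples, and np.prod collapses the explicit loop
    num = np.prod([np.max(x) for x in chunks])
    return itemsize * num / (1024 ** 2)

# e.g. float64 (itemsize=8) with a mixed chunk spec:
print(size_of_chunk([256, (128, 64), 4], itemsize=8))  # 1.0 (MiB)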
    for value in chunks:
        value = max(value) if not isinstance(value, int) else value
        num = num * value
    return itemsize * num / 1024**2
[optional] In `1024**2`, `**` is prioritised, but it can be vague (to me anyway). So I generally prefer to add parentheses, i.e. `itemsize * num / (1024 ** 2)`. If `black` automagically removes them then don't worry about it.
        num = num * value
    return itemsize * num / 1024**2


ds = ds.to_dataset(name="ds") if isinstance(ds, xr.DataArray) else ds
I think it may be better to flip this:

ds = ds if isinstance(ds, xr.Dataset) else ds.to_dataset(name="ds")

That way, even if `ds` is not a `DataArray`, it'll forcefully be duck-typed and will raise an error if an incorrect data type is input.
    if exclude_dims is None:
        exclude_dims = [[] for _ in range(len(objects_list))]
    if (
        not isinstance(exclude_dims, list)
Could split this up into different error branches for clarity (a sketch follows this list):
- lengths not equal => length mismatch
- elements are not lists, or container is not a list => unexpected data structure
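A sketch of what the split might look like (names taken from the surrounding diff; the exact messages are placeholders):

if not isinstance(exclude_dims, list) or any(
    not isinstance(x, list) for x in exclude_dims
):
    # unexpected data structure
    raise TypeError(
        "'exclude_dims' must be a list of lists of dimensions to exclude"
    )
if len(exclude_dims) != len(objects_list):
    # length mismatch
    raise ValueError(
        f"'exclude_dims' has {len(exclude_dims)} entries but there are "
        f"{len(objects_list)} objects"
    )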
kwargs={"indices": [ind]}, | ||
input_core_dims=[core_dims], | ||
output_core_dims=[core_dims + ["iteration"]], | ||
dask="parallelized", |
Might be worth propagating this to an API argument in `block_bootstrap` so that the user can forcefully disable `dask` for the ufuncs for whatever reason.
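A sketch of what that plumbing could look like, with a hypothetical `dask` keyword (the parameter name and default are assumptions, not the PR's actual API; `_bootstrap_func`, `obj`, `ind`, and `core_dims` are taken from the surrounding diff):

import xarray as xr

def block_bootstrap(*objects, blocks, n_iteration, dask="parallelized"):
    ...
    result = xr.apply_ufunc(
        _bootstrap_func,
        obj,
        kwargs={"indices": [ind]},
        input_core_dims=[core_dims],
        output_core_dims=[core_dims + ["iteration"]],
        dask=dask,  # user can pass "forbidden" to disable dask entirely
    )
    ...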
"objects containing lists of dimensions to exclude for each object" | ||
) | ||
renames = [] | ||
for i, (obj, exclude) in enumerate(zip(objects_list, exclude_dims)): |
What should be the expected behaviour if a dimension to exclude is also named `"dim{x}"` and `x == ii`?

Is there a reason for renaming as opposed to dropping/splitting and re-combining?
            sizes = OrderedDict(
                {d: (obj.sizes[d], b) for d, b in blocks.items()},
            )
            break
`break` is a bit ambiguous. Maybe add a comment here saying that only one obj with all keys present is required.

You could also use `itertools.dropwhile` or similar, to avoid excessive control flow:
try:
    # find the first obj with all keys in `blocks`
    obj = next(itertools.dropwhile(
        lambda x: x.sizes.keys() != blocks.keys(),
        objects_list,
    ))
    sizes = OrderedDict(...)
except StopIteration:
    # no obj found that matches all the keys in `blocks`
    raise ValueError(...)
@reza-armuei Thanks for this. I've gone through about 40-50% of it. Looks good so far; no major functional issues thus far, mostly structural comments/alternatives/suggestions. I work at flexible times, so don't feel the need to respond to comments straight away :).
    for obj in objects_list:
        try:
            sizes = OrderedDict(
                {d: (obj.sizes[d], b) for d, b in blocks.items()},
I believe you have to initialize an `OrderedDict` as a list of tuples, otherwise it doesn't maintain order on initialization. In more recent versions of Python, `dict` is already ordered I believe, so this behaviour may have been obfuscated.

This is a safer way, since lists by definition are ordered (assuming the way I remember initializing ordered dicts is still accurate):

sizes = OrderedDict([
    (d, (obj.sizes[d], b))
    for d, b in blocks.items()
])
MAX_CHUNK_SIZE_MB = 200


def _get_blocked_random_indices(
We may need to walk through this logic. Currently the logic feels a bit coupled (to me anyway).

For example, it seems that `shape` is mutable, so it's pretty tricky for me to evaluate what it's trying to do. `shape[prev_ax] = math.ceil(shape[prev_ax] / b)` implies that `/ b` could happen multiple times depending on the number of axes. Doing it once implies `shape / block_size = num_blocks`, but doing it again, `block_size / num_blocks`, implies there's some iterative reduction in the sampling space for that axis. But this reduction is controlled by an iterative call in another function: `_n_nested_blocked_random_indices`.

There may be more explicit ways for you to rearrange the logic in this function for clarity? Maybe a tree structure can help with organizing the data? Example:
axis_1 │ sample_1 sample_2 ... N_1
│ │ │
│ ┌────┼────┐ ┌───┼───┐
│ │ │ │ │ │ │
│ │ │ │ │ │ │
axis_2 │ s_1 s_2 s_n ▼ ▼ ▼ ... N_2
│ │ │ │ │ │ │
... │ │ │ │ │ │ │
│ ┌─┼─┐ ┌─┼─┐ │ ┌─┼─┬─┼─┬─┼─┐
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │
axis_n │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ... N_n
─────────┴───────────────────────────────────────────
samples--->
- each leaf contains an n-tuple representing the combined sample index pointing to a single entry.
- repeat * n_iter and stack if we want multiple of these.
- each axis level is randomly sampled based on its block size, for each parent sample/block.

But this also may not be right if there are restrictions on what the underlying algorithm should extract and how it needs to do it.
Happy to discuss. Maybe once @tennlee also has had a look...
    indices = OrderedDict()
    prev_blocks: List[int] = []
    for ax, (key, (_, block)) in enumerate(sizes.items()):
        indices[key] = _get_blocked_random_indices(shape[: ax + 1] + [n_iteration], ax, block, prev_blocks, circular)
Ok so, a combination of indexing `shape[: ax + 1]` and maybe `+ [n_iteration]` seems to create a copy of `shape` (because it's a list). Therefore, even if it is manipulated in `_get_blocked_random_indices` (see previous comment), the slicing creates a unique obj each time, and hence the function doesn't mutate the original `shape`.

However, this wasn't obvious to me initially. Again, it may be good to make this behaviour explicit by passing in a "frozen" list or similar, and/or by creating a `.copy()` in `_get_blocked_random_indices` and, instead of manipulating the passed-in `shape`, manipulating the copy instead.
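For example, a sketch where the callee owns the copy explicitly (the signature is inferred from the call site quoted above):

def _get_blocked_random_indices(shape, ax, block, prev_blocks, circular):
    # copy up front so any reshaping below cannot mutate the caller's list
    shape = list(shape)
    ...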
@@ -0,0 +1,410 @@
"""
Functions for performing block bootstrapping of arrays. This is inspired from
https://github.com/dougiesquire/xbootstrap with modifications to make the functions more
I just noticed this, and the implementation is similar enough to https://github.com/dougiesquire/xbootstrap/blob/main/xbootstrap/core.py that if we want to use it "as-is" we should probably include their license/permission notice.

Alternatively, we should consider re-implementing using a unique solution, but using `xbootstrap` as a test case (offline, not part of the unit tests) to check if the results are the same. (This obviously would take more time, but is definitely do-able in a more efficient way in my opinion.)

This is one for @tennlee to make the call on (I would suggest holding off on addressing any major comments before that).
I'll have to really double check this, but my understanding is that the MIT license which was used by xbootstrap and the Apache 2.0 license are compatible. If so, there should be a viable way forward here; we'll just need to make sure of how to attribute, reference and include license statements correctly. If there is an unrelated reason to re-implement the code (e.g. efficiency, compatibility etc.) then that makes sense to do regardless. We can have a discussion next week.
@reza-armuei I'm going to mark this as "draft" for now, just as a guard.
@tennlee I haven't gone through everything, but my 2c is: it seems there are a few different coding paradigms used in the code (`itertools`, `math`, list comprehensions, basic for loops, and `numpy`), rather arbitrarily. A lot of this can just be purely `numpy`, with some exceptions for basic operations (this would improve clarity and, as a bonus, speed). The structure of the code seems overly complex for what I believe it's trying to do, and there are a lot of implicit assumptions. The type safety (increase) and mutability (reduce) can also be improved. Having said that, it may still be functionally correct in its implementation.

We can discuss more next week.
@nikeethr Thank you so much for taking the time to review and for your comments. I agree with you that the current structure is complex and it can be improved. I have started to apply your comments (it may take time as I am working on a few other tasks too).

@nikeethr and @tennlee, I am open to any suggestions to improve it.
I'll just share this here from summarizing my notes of the algorithm, for reference. It'd be easier if we go through a discussion over a call.

The way I understand the algorithm (feel free to correct me), it is trying to sample blocks (with replacement), essentially to re-arrange/randomize the domain without breaking any dependence relationship. This can be summarized as:

block_sample = {
    (x_1, x_2, ..., x_n)
    such that |x_i - x_i_hat| <= b_i,
    for all x_i, and fixed x_i_hat in each dimension i
}

where,
    i = 1, ..., n
    n = number of axes
    b_i = dependence distance for axis i (or block size)
    x_i = indices in dimension i
    x_i_hat = index of the i'th axis in a random sample

If we want to add cyclic operations, then basically the norm/distance |.| operator will have to account for the space being cyclic (instead of linearly ordered). If there are partial blocks due to the block size not being a multiple of the domain size, then we either: trim, expand, or provide partial block sampling (not recommended, as it will sever dependence).
If this is accurate, then here are a few things to consider:

- `numpy.random.default_rng() -> rng.integers()` is actually very performant, and it should be used to sample in bulk.
- It is sufficient to reproduce the algorithm by generating a random point cloud (which is a one-liner), and then expanding each point into a block (see the sketch after this list).
- This will result in an array of extents for each axis, which I believe is a perfectly reasonable way of indexing arrays in `xarray`/`numpy` format.
- If we want sampling to be cyclic, we apply the modulo operator when expanding (suspected partial) blocks. Or a faster (but more memory consuming) way is to just pad the axes.
- If we don't want them to be cyclic, we just restrict the sample domain.
- In order to preserve the "nested" sampling, it is sufficient to preserve the order (with no jumps) of the dimensions for every input array; for a lower dimensionality we simply ignore the dimensions not present and broadcast to conform to the other arrays.
- Use `dask` & `apply_ufunc` optionally (based on user preference/compatibility) for the outer repetitions, i.e. `num_iterations`.

The above seems like a lot of considerations, but honestly the resulting code imo would be quite a bit simpler.
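A minimal, runnable sketch of the first two points plus the cyclic expansion, for a single axis (the sizes are made up for illustration; this is not the PR's implementation):

import numpy as np

rng = np.random.default_rng(seed=42)

axis_len = 10    # length of the resampled dimension (assumed)
block_size = 3   # dependence distance b_i for this axis (assumed)
n_blocks = -(-axis_len // block_size)  # ceil division: blocks needed to cover the axis

# 1. random point cloud of block start indices, sampled in bulk
starts = rng.integers(0, axis_len, size=n_blocks)

# 2. expand each start into a block of consecutive indices, with modulo
#    wrap-around so partial blocks stay cyclic rather than being severed
indices = (starts[:, None] + np.arange(block_size)) % axis_len
indices = indices.ravel()[:axis_len]  # trim the final partial block

print(indices)  # a circularly block-resampled index vector of length axis_len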
There are also some spelling and grammatical issues with the notebook (e.g., "technic", "threats", etc.). The change is too big for me to provide direct feedback on the diffs. Perhaps have a first review yourself and then I'll go through it in more detail.

I still need to work through the code. In a separate pull request, it might be nice to do a further update that explains how it works and gives some tips on how to use it (e.g., selecting the block length).
@@ -0,0 +1,306 @@
"""
This module contains unit tests for scores.stats.tests.block_bootstrap
Can you add some tests where you have NaNs in your test data?
    result = block_bootstrap(
        *objects, blocks=blocks, n_iteration=n_iteration, exclude_dims=exclude_dims, circular=circular
    )
    assert result.shape == expected_shape
Can you test that `result.data` is a dask object? And then, when you run `result.compute()`, that the underlying object is an `np.ndarray`? See https://github.com/nci/scores/blob/develop/tests/categorical/test_multicategorical.py#L283 for an example.
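A sketch of what that test could look like (the fixture/parametrization names `objects`, `blocks`, etc. are assumed from the surrounding test):

import dask.array
import numpy as np

def test_block_bootstrap_keeps_dask_backing():
    result = block_bootstrap(
        *objects, blocks=blocks, n_iteration=n_iteration,
        exclude_dims=exclude_dims, circular=circular,
    )
    # lazily evaluated: the wrapped data should still be a dask array
    assert isinstance(result.data, dask.array.Array)
    # after compute, the underlying data should be a plain numpy array
    assert isinstance(result.compute().data, np.ndarray)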
@reza-armuei Ok, I've put in some more updates to clean up the sample code a bit. I think I'm reasonably happy with its current state, and in my eyes it's more aligned to the coding style in this package. It currently does a single iteration of bootstrapping on a single numpy array input. I've put some TODOs in the module docstring. If the core implementation is fine, the TODOs are relatively straightforward to expand on. However, we should throw some real data at it to make sure that it works properly.

I've also kept the original implementation intact to compare against. I feel like this should go under `emerging` if we choose to start from my version.
Before implementing the

I think we can start with a simpler
In the interest of keeping things neater, I'll actually pull out all my changes into a separate pull request in a fork, given that I'm utilizing the `emerging` space. That way this pull request only addresses comments from the
I've moved my sample implementation here: #523

My updated recommendation for this PR is to:

The above will resolve most of my review comments, and then the rest of the review would just involve additional tests, checking the documentation and grammar, as well as a science review.

Also commented in #114; split out an issue to track the alternative implementation here: #522
Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics
- `reduce_dims`, `preserve_dims`, and `weights` args.

Testing of new xarray-based metrics
- `xr.Dataarrays` and `xr.Datasets`

Tutorial notebook

Documentation