Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge output variables with input dataset #217

Merged
merged 4 commits into from
Sep 2, 2020

Conversation

tomwhite
Copy link
Collaborator

This is an initial attempt to implement #103 for count allele functions. Does this look like the right direction?

@eric-czech
Copy link
Collaborator

eric-czech commented Aug 31, 2020

Thanks @tomwhite, that squares well with what I took away from the issue discussion.

One problem here though is that I can see this being a frustrating experience for users:

import xarray as xr
import dask.array as da

# I load a dataset from somewhere
ds = xr.Dataset(dict(x=xr.DataArray(da.random.random(100))))

# This is what sgkit functions do (adds a `y` variable in this case)
def fn(ds):
    new_ds = xr.Dataset(dict(y=xr.DataArray(da.random.random(100))))
    return ds.merge(new_ds) 

# First run
ds = fn(ds)  # No eager evaluation happens here

# Second run
ds = fn(ds)  # Because `y` already exists, Xarray will force a compute and compare the values
> MergeError: conflicting values for variable 'y' on objects to be combined. You can skip this check by specifying compat='override'.

I think we should make every function either do ds.merge(new_ds, compat='override') (which is curiously undocumented) or have some additional conditional logic for overwriting existing variables, perhaps by defaulting to deleting the original ones and throwing a warning like @timothymillar suggested in https://github.com/pystatgen/sgkit/issues/103#issuecomment-673720012.

@jeromekelleher
Copy link
Collaborator

LGTM also. I think there'll be refinements we need to make as we get more experience, but this is the right basic "shape" of how things are done. As such, I'd say we merge ASAP (modulo addressing @eric-czech's points) and start building on it.

@tomwhite
Copy link
Collaborator Author

tomwhite commented Sep 1, 2020

Thanks @eric-czech, that's a useful case to consider. I have extracted a merge_datasets function to encapsulate this behaviour (new variables overwrite old ones, and issue a warning). How does this version look?

@tomwhite
Copy link
Collaborator Author

tomwhite commented Sep 1, 2020

@jeromekelleher I agree we want to get the general API approach established sooner rather than later. This change will impact #100 and #102 for example.

@eric-czech
Copy link
Collaborator

I have extracted a merge_datasets function to encapsulate this behaviour (new variables overwrite old ones, and issue a warning). How does this version look?

Perfect, thank you.

@tomwhite tomwhite marked this pull request as ready for review September 1, 2020 13:50
@tomwhite tomwhite merged commit ff5a0a0 into sgkit-dev:master Sep 2, 2020
@tomwhite tomwhite deleted the return-dataset branch September 2, 2020 08:13
@tomwhite tomwhite mentioned this pull request Sep 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants