Add "on"-parameter to "merge" method #3224

Closed
Hoeze opened this issue Aug 17, 2019 · 2 comments

Hoeze commented Aug 17, 2019

I'd like to propose a change to the merge method.

I often run into cases where I'd like to merge subsets of the same dataset.
However, this currently requires renaming all dimensions, changing indices, and merging them by hand.

As an example, please consider the following dataset:

Dimensions:          (genes: 8787, observations: 8166)
Coordinates:
  * observations     (observations) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-0005-SM-57WCN'
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
    individual       (observations) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    subtissue        (observations) object 'Adipose_Subcutaneous' ... 'Whole_Blood'
Data variables:
    cdf              (observations, genes) float32 0.18883839 ... 0.4876754
    l2fc             (observations, genes) float32 -0.21032093 ... -0.032540113
    padj             (observations, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0

There is at most one observation for each combination of subtissue and individual.

Now, I'd like to plot all values in subtissue == "Whole_Blood" against subtissue == "Adipose_Subcutaneous". Therefore, I have to join all "Whole_Blood" observations with all "Adipose_Subcutaneous" observations by the "individual" coordinate.

To simplify this task, I'd like to have the following abstraction:

# select tissues
tissue_1 = ds.sel(observations = (ds.subtissue == "Whole_Blood"))
tissue_2 = ds.sel(observations = (ds.subtissue == "Adipose_Subcutaneous"))

# inner join by individual
merged = tissue_1.merge(tissue_2, on="individual", newdim="merge_dim", join="inner")

print(merged)

The result should look like this:

Dimensions:          (genes: 8787, merge_dim: 286)
Coordinates:
  * genes            (genes) object 'ENSG00000227232' ... 'ENSG00000198727'
  * merge_dim        (merge_dim) object 'GTEX-111CU' ... 'GTEX-ZXG5'
    observations:1   (merge_dim) object 'GTEX-111CU-1826-SM-5GZYN' ... 'GTEX-ZXG5-1826-SM-5GZYN'
    observations:2   (merge_dim) object 'GTEX-111CU-0005-SM-57WCN' ... 'GTEX-ZXG5-0005-SM-57WCN'
    subtissue:1      (merge_dim) object 'Whole_Blood' ... 'Whole_Blood'
    subtissue:2      (merge_dim) object 'Adipose_Subcutaneous' ... 'Adipose_Subcutaneous'
Data variables:
    cdf:1            (merge_dim, genes) float32 0.18883839 ... 0.4876754
    cdf:2            (merge_dim, genes) float32 ...
    l2fc:1           (merge_dim, genes) float32 -0.21032093 ... -0.032540113
    l2fc:2           (merge_dim, genes) float32 ...
    padj:1           (merge_dim, genes) float32 1.0 1.0 1.0 ... 1.0 1.0 1.0
    padj:2           (merge_dim, genes) float32 ...

To summarize, I'd propose the following changes:

  • Add parameter on: Union[str, List[str], Tuple[str], Dict[str, str]]
    This should specify one or multiple coordinates which should be merged.
    • Simple merge: string
      => merge by left[str] and right[str]
    • Merge of multiple coords: list or tuple of strings
      => merge by left[str1, str2, ...] and right[str1, str2, ...]
    • To merge differently named coords: dict, e.g. {"str_left": "str_right"}
      => merge by left[str_left] and right[str_right]
  • Add some parameter like newdim to specify the newly created index dimension.
    If on specifies multiple coords, this new index dimension should be a multi-index of these coords.
  • Rename all duplicate coordinates not specified in on to some unique name
    e.g. left["cdf"] => merged["cdf:1"] and right["cdf"] => merged["cdf:2"]
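
To make the three accepted shapes of `on` concrete, here is a minimal sketch of how such an argument could be normalized into (left, right) name pairs. The helper `normalize_on` and the alias `OnSpec` are hypothetical names used only for illustration; nothing like this exists in xarray.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical type of the proposed `on` argument.
OnSpec = Union[str, List[str], Tuple[str, ...], Dict[str, str]]


def normalize_on(on: OnSpec) -> List[Tuple[str, str]]:
    """Return (left_name, right_name) pairs for a proposed `on` argument."""
    if isinstance(on, str):
        # Simple merge: the same coordinate name on both sides.
        return [(on, on)]
    if isinstance(on, dict):
        # Differently named coordinates: left name -> right name.
        return list(on.items())
    # List or tuple: several coordinates, same name on both sides.
    return [(name, name) for name in on]


print(normalize_on("individual"))
print(normalize_on(["individual", "subtissue"]))
print(normalize_on({"individual": "donor"}))
```

Reducing every form to a list of pairs would let the rest of a merge implementation handle a single case.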

If the coordinates given in on do not uniquely identify each data point, the matching entries should be combined as a cross product. However, since this could require quadratic runtime and memory, I am not sure how to handle this safely.
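
The size concern can be illustrated with plain Python. The sketch below counts how many rows an inner join would produce when keys repeat on both sides; `joined_size` and `require_unique` are hypothetical helpers, and the uniqueness check is only one possible safeguard, not something the proposal specifies.

```python
from collections import Counter


def joined_size(left_keys, right_keys):
    """Number of rows an inner join yields: each key contributes m * n rows."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Counter returns 0 for keys missing on the right, so they contribute nothing.
    return sum(m * right_counts[key] for key, m in left_counts.items())


def require_unique(keys):
    """Possible safeguard: refuse to join when a key occurs more than once."""
    duplicates = sorted(key for key, n in Counter(keys).items() if n > 1)
    if duplicates:
        raise ValueError(f"join keys are not unique: {duplicates}")


# Unique keys: the result is no larger than either input.
print(joined_size(["A", "B"], ["A", "C"]))
# 1000 repeats of one key on each side: the result grows quadratically.
print(joined_size(["A"] * 1000, ["A"] * 1000))
```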

What do you think about this addition?

shoyer (Member) commented Aug 20, 2019

I appreciate how this could be convenient, but I am concerned about adding more complexity to xarray's merge code, which is already pretty complex and hard to maintain. My refactor in #3234 is the first time that code has been touched in quite a while and I don't think anyone (other than myself) has made contributions to that part of xarray.

To solve your use-case, what about either:

  1. Converting observations into a MultiIndex over individual and subtissue, or
  2. Creating separate individual/subtissue dimensions and storing the data in the form of a sparse array

Then you could do this sort of data munging with normal indexing/alignment/merging, e.g.,

tissue_1 = ds.sel(subtissue="Whole_Blood").rename({k: k + ':1' for k in ds})
tissue_2 = ds.sel(subtissue="Adipose_Subcutaneous").rename({k: k + ':2' for k in ds})
merged = tissue_1.merge(tissue_2)  # would have dimensions [gene, individual]
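
A runnable variation on this idea, on a toy dataset, might look like the sketch below. It is an assumption-laden illustration rather than the approach described above: it uses swap_dims to make individual the index instead of building a MultiIndex, and the helper `select_tissue`, the toy labels, and the values are all made up for the example.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the dataset in the issue: 4 observations, 2 genes.
ds = xr.Dataset(
    {"cdf": (("observations", "genes"), np.arange(8, dtype="float32").reshape(4, 2))},
    coords={
        "observations": ["A-blood", "A-fat", "B-blood", "C-fat"],
        "genes": ["g1", "g2"],
        "individual": ("observations", ["A", "A", "B", "C"]),
        "subtissue": (
            "observations",
            ["Whole_Blood", "Adipose_Subcutaneous", "Whole_Blood", "Adipose_Subcutaneous"],
        ),
    },
)


def select_tissue(ds, tissue, suffix):
    """Hypothetical helper: one tissue, indexed by individual, names suffixed."""
    mask = (ds["subtissue"] == tissue).values
    sub = ds.isel(observations=mask)
    # Make "individual" the dimension; this assumes it is unique within a tissue.
    sub = sub.swap_dims({"observations": "individual"})
    # Turn the leftover coordinates into data variables so they get renamed too.
    sub = sub.reset_coords(["observations", "subtissue"])
    return sub.rename({name: f"{name}:{suffix}" for name in sub.data_vars})


tissue_1 = select_tissue(ds, "Whole_Blood", 1)
tissue_2 = select_tissue(ds, "Adipose_Subcutaneous", 2)

# Inner join on the shared "individual" index.
merged = xr.merge([tissue_1, tissue_2], join="inner")
print(merged)
```

In this toy data only individual "A" has both tissues, so it is the only entry that survives the inner join.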

stale bot commented Jul 21, 2021

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

stale bot added the stale label Jul 21, 2021