This repository has been archived by the owner on Oct 3, 2023. It is now read-only.

How to retain metadata #27

Closed
erogluorhan opened this issue Dec 9, 2020 · 11 comments


@erogluorhan
Collaborator

erogluorhan commented Dec 9, 2020

NCL functions have counterparts with a "_Wrap" suffix in which the metadata is retained, e.g. "rcm2rgrid" and "rcm2rgrid_Wrap". We need to check whether our code retains xarray metadata. Checking how geocat-ncomp handles it would also help.

@pilotchute
Contributor

The f2py functions written based on the boilerplate in linint2 should all retain metadata.

@erogluorhan
Collaborator Author

erogluorhan commented Dec 10, 2020

@pilotchute thanks for looking into this! I did not fully lay out my thinking, since I opened this issue as a note to myself so that I would not forget to bring it up for team discussion. Let me elaborate a bit, also cc'ing @anissa111 and @michaelavs:

  • I am aware, and agree, that metadata is retained in linint2. For everybody's reference, line 190:

    fo = xr.DataArray(fo.compute(), attrs=fi.attrs, dims=fi.dims, coords=fo_coords)

  • I'd like to discuss (1) whether we should always retain the input's metadata in the output, or only do so at the user's explicit request (as in NCL and geocat-ncomp); and (2) whether we should transfer all of the input's metadata "as-is" into the output, or adjust it to reflect the output data structure. I try to illustrate both below:

    • (1) Could always retaining metadata cause excessive memory use (I am not sure at the moment how xarray handles the metadata of a DataArray, but could retaining dims and coords be memory-expensive for big data sets?) or meaningless metadata in the output? NCL only retains metadata when the user calls the "_Wrap" version of the original function (the two are functionally the same, but the _Wrap function retains metadata). geocat-ncomp had a boolean meta variable to handle it. FYI: retaining it only if the user requests it makes sense to me.

    • (2) Retaining the input's metadata as-is in the output might be meaningless and confusing. For example, in linint2 the description of the output is as follows: "The returned value will have the same dimensions as fi, except for the rightmost two dimensions which will have the same dimension sizes as the lengths of yo and xo." So simply copying all dims and coords into the output does not seem correct.

    • All that being said, the following code block from geocat-ncomp makes more sense to me:

      if meta:
          coords = {
              k: v if k not in fi.dims[-2:] else (xo if k == fi.dims[-1] else yo)
              for (k, v) in fi.coords.items()
          }
          fo = xr.DataArray(fo, attrs=fi.attrs, dims=fi.dims, coords=coords)
      else:
          fo = xr.DataArray(fo)

Different functions will surely need slightly different versions of such a code block, depending on how the output dimensions compare with the input dimensions.
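For reference, the coordinate update described above can be sketched as follows. This is a minimal illustration, not the linint2 or geocat-ncomp implementation: `wrap_regridded` is a hypothetical helper, it assumes the rightmost two dims of `fi` are the regridded (y, x) axes, and the zero-filled `fo_values` array is only a placeholder for interpolated data. Unlike the snippet above, it also drops any coordinate that depends on the regridded dims, which is a design choice rather than something the thread specifies.

```python
import numpy as np
import xarray as xr


def wrap_regridded(fi, fo_values, xo, yo):
    """Hypothetical helper: wrap regridded values, keeping fi's metadata.

    Assumes the rightmost two dims of ``fi`` are (y, x) and that
    ``fo_values`` has shape fi.shape[:-2] + (len(yo), len(xo)).
    """
    y_dim, x_dim = fi.dims[-2:]
    # Keep only coordinates that do not depend on the regridded dims.
    fo_coords = {
        name: coord
        for name, coord in fi.coords.items()
        if y_dim not in coord.dims and x_dim not in coord.dims
    }
    # Replace the two regridded axes with the output grid.
    fo_coords[y_dim] = yo
    fo_coords[x_dim] = xo
    return xr.DataArray(fo_values, attrs=fi.attrs, dims=fi.dims, coords=fo_coords)


fi = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"time": [0, 1], "lat": [0.0, 1.0, 2.0], "lon": [0.0, 1.0, 2.0, 3.0]},
    attrs={"units": "K"},
)
yo = np.array([0.5, 1.5])
xo = np.array([0.5, 1.5, 2.5])
fo_values = np.zeros((2, len(yo), len(xo)))  # placeholder for interpolated data
fo = wrap_regridded(fi, fo_values, xo, yo)
```

Here `fo` keeps the `time` coordinate and the `units` attribute from `fi`, while `lat` and `lon` take their values from `yo` and `xo`.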

Thoughts?

@pilotchute
Contributor

We do not use fi.coords in the linint2 output; rather, we use fo_coords, built to reflect the changed geometry of the data. In fact, the linint2 implementation of fo_coords is functionally the same as the one in geocat-ncomp, just rewritten to be more human-readable.

@erogluorhan
Collaborator Author

erogluorhan commented Dec 10, 2020

Ah, my bad, thanks @pilotchute! I did not notice fo_coords in that line; more precisely, I thought it was fi.coords (turns out you made it more human-readable, and I still insisted on not reading it :) ).

Do you have any thoughts on part (1) of the discussion: whether to always retain metadata (and whether that has any cost), or to introduce a user-facing input argument such as meta?

@pilotchute
Contributor

I think we should always retain fi's metadata for anything that isn't modified by the function. And in cases where we understand the changes made to an axis, we should make an effort to update the information for that axis.

Carrying metadata from fi to fo doesn't significantly increase memory usage, since we are only copying attributes from one xarray object to another.

@pilotchute
Contributor

We could add an optional meta=True argument to allow it to be turned off in specific situations, though that would increase our implementation effort, since we would need an execution path that doesn't interact with the metadata (probably turned off for speed).

@erogluorhan
Collaborator Author

We could add an optional meta=True argument to allow it to be turned off in specific situations, though that would increase our implementation effort, since we would need an execution path that doesn't interact with the metadata (probably turned off for speed).

I agree with you. Once we know that retaining metadata has no meaningful memory cost, we can just retain it in the code and not deal with an additional meta variable. If we run into specific situations in the future, we can revisit this. Thanks!

@pilotchute
Contributor

Any ideas on how we can test metadata retention memory cost?

@erogluorhan
Collaborator Author

erogluorhan commented Dec 10, 2020

Any ideas on how we can test metadata retention memory cost?

I am not sure how to test it, but there is a chance it would be memory-costly in a scenario like the following (please correct me if I am getting it wrong):

We are copying fi's attributes to fo with attrs=fi.attrs when creating fo as a new xarray.DataArray. xarray.DataArray.attrs is a dictionary. If one of fi's attributes (i.e. one of the dict entries) has a huge list as its value, then it might result in large memory use. What I am not sure about is whether we will ever face such a scenario. That's why I said we could go with the current structure and revisit it if we ever hit such a case.
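One way to probe this scenario is sketched below. It rests on an assumption worth verifying against the installed xarray version: that `xr.DataArray(..., attrs=fi.attrs)` stores a shallow copy of the dictionary, so a large attribute value is shared by reference rather than duplicated. The million-element list is a hypothetical worst case, and `tracemalloc` only reports allocations made through Python's allocator hooks, so the numbers are indicative rather than exact.

```python
import tracemalloc

import numpy as np
import xarray as xr

# An input whose attrs hold a deliberately large list (hypothetical worst case).
big_list = list(range(1_000_000))
fi = xr.DataArray(np.zeros(10), dims=("x",), attrs={"units": "K", "big": big_list})

# Measure how much new memory is allocated while building fo from fi's attrs.
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
fo = xr.DataArray(np.ones(10), dims=fi.dims, attrs=fi.attrs)
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Check whether the large value is the very same object in both attrs dicts.
shares_value = fo.attrs["big"] is fi.attrs["big"]
extra_bytes = after - before
print(f"value shared by reference: {shares_value}, extra bytes: {extra_bytes}")
```

If the value is shared, the extra allocation should be far smaller than the list itself, which would support retaining attributes unconditionally. The flip side is that mutating such a value through `fo.attrs` would also be visible through `fi.attrs`.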

@andersy005

Could always retaining metadata cause excessive memory use (I am not sure at the moment how xarray handles the metadata of a DataArray, but could retaining dims and coords be memory-expensive for big data sets?) or meaningless metadata in the output?

This issue: pydata/xarray#1614 has relevant discussions about metadata preservation in xarray that may be of interest to you.
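For context, xarray itself exposes a `keep_attrs` setting that controls whether built-in operations carry `attrs` through to their results. The sketch below assumes a reasonably recent xarray release; the default behaviour when `keep_attrs` is not given has varied between operations and versions, so it is not shown here.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6.0).reshape(2, 3), dims=("y", "x"), attrs={"units": "K"}
)

# Per call: ask a reduction to carry the input's attrs through.
mean_kept = da.mean("x", keep_attrs=True)

# Within a context: keep attrs for operations that support the option.
with xr.set_options(keep_attrs=True):
    mean_ctx = da.mean("x")

print(mean_kept.attrs, mean_ctx.attrs)
```

This covers only `attrs`; updating dims and coords after a change of grid still has to be handled by the function itself.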

@erogluorhan
Collaborator Author

Thanks @andersy005
