xCDAT/xarray/numpy and 'silent upcasting' question #575
We found that cdms2 incorrectly type casts weights. Some related comments:

I'm not 100% sure, but I'm pretty confident most NumPy functions/methods maintain the original dtype. From the team's experience developing xCDAT, we found that Xarray and xCDAT correctly maintain the original dtype.

I'm currently experimenting with Dask and getting performance metrics in PR #489. I'm comparing xCDAT's serial and parallel performance against CDAT (serial-only). The preliminary results are extremely promising for xCDAT so far, so stay tuned.
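As a quick sanity check on the dtype-preservation claim, here is a minimal NumPy sketch (illustrative only, not code from the discussion or from PR #489):

```python
import numpy as np

# Pure-float32 inputs: elementwise operations and reductions keep float32
a = np.arange(4, dtype=np.float32)
w = np.ones(4, dtype=np.float32)

weighted = a * w        # elementwise op stays float32
mean = weighted.mean()  # reduction also stays float32

assert weighted.dtype == np.float32
assert mean.dtype == np.float32
```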
Question criteria
Describe your question
When reading CDAT/cdms#449 this morning, I remembered why we had `Float`, `Float32`, and `Float64` in the first place. The idea was to accurately represent our data (with good enough numerical precision), have a good mapping between NetCDF types and numpy, and only use the RAM and disk space that we actually need.

There used to be a time when you performed a numerical operation on a Float32 cdms2 variable and ended up with a Float64 variable, thus doubling the required space. The `savespace` option was then introduced to fix this: `savespace` is an integer flag; if set to 1, internal Numpy operations will attempt to avoid silent upcasting, as described for instance in "2.10.2. Variable Constructors". This theoretically told cdms2 (or MV2, or numpy?) not to use more memory than actually required.
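The "silent upcasting" that `savespace` tried to prevent is easy to reproduce in plain NumPy (a minimal sketch, not cdms2 code):

```python
import numpy as np

x32 = np.zeros(3, dtype=np.float32)
x64 = np.zeros(3, dtype=np.float64)

# Mixing precisions silently upcasts the result to the wider type,
# doubling the memory needed for the output:
y = x32 + x64
assert y.dtype == np.float64

# Explicitly downcasting one operand keeps everything in float32:
y32 = x32 + x64.astype(np.float32)
assert y32.dtype == np.float32
```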
Does somebody know what governs this behavior these days (or what the rules are) in numpy, xarray, or xcdat, and in the way these packages interact?
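In modern NumPy, what governs this is the type-promotion rules, which NumPy exposes directly through `np.result_type` and `np.promote_types` (standard NumPy API; the specific dtype pairs below are just illustrative):

```python
import numpy as np

# Same-precision operands: no upcast
assert np.result_type(np.float32, np.float32) == np.float32

# Mixed precision promotes to the wider float
assert np.result_type(np.float32, np.float64) == np.float64

# int32 values cannot all be represented exactly in float32,
# so the promotion rules jump to float64
assert np.promote_types("float32", "int32") == np.dtype("float64")
```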
This may seem a bit philosophical, as we now have plenty of RAM available. But the size of our datasets has grown, and our end users have not grown more careful (I'd say many of them are lazier now...), so a conservative use of memory by default is still important.
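The memory argument is easy to quantify: a float32 array occupies exactly half the bytes of the equivalent float64 array, which `ndarray.nbytes` makes visible (illustrative sketch; the array size is made up):

```python
import numpy as np

big = np.zeros((1000, 1000), dtype=np.float64)
small = big.astype(np.float32)  # same shape, half the precision

assert big.nbytes == 8_000_000    # 8 bytes per element
assert small.nbytes == 4_000_000  # 4 bytes per element
```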
Loosely related use case: a PhD student asked me yesterday how to run his Python script on our cluster, because he did not have enough memory on his computer and had no time to split his Antarctica data. He later asked me about the

PBS: job killed: walltime 43304 exceeded limit 43200

error message he got. I have not seen his script yet, but reducing the memory used might allow the job to finish within the allowed time.

Are there any possible answers you came across?
No response
Minimal Complete Verifiable Example (MVCE)
No response
Relevant log output
No response
Environment
No response
Anything else we need to know?
No response