
Refactoring data handling #335

Closed
znicholls opened this issue Feb 18, 2020 · 23 comments

@znicholls
Collaborator

As brought up in https://twitter.com/daniel_huppmann/status/1229689895649710080, as we move to bigger datasets it's not clear that holding the data in 'long form' will be the easiest way to do things. @danielhuppmann we've started playing with holding the data in different forms in memory in https://github.com/lewisjared/scmdata, particularly openscm/scmdata#21 and openscm/scmdata#26. I'm not sure what the best way to proceed is, perhaps we continue to fiddle in scmdata and once we've found a moderately stable solution we work out how to do something similar in pyam?

@danielhuppmann
Member

Thanks @znicholls for raising this issue, fine to test it in ScmDataFrame first. Still, can you outline the options (ideally in the descriptions above and delete my comment) or add a reference if that list already exists - I guess the simplest option is using pd.Series with index instead of pd.DataFrame (should be smaller in memory and faster), more advanced is moving to xarray, other options?

@khaeru
Contributor

khaeru commented Feb 19, 2020

@danielhuppmann asked me to comment here.

  • Using pandas.Series with pandas.MultiIndex is usually faster than pandas.DataFrame.
    • Internally, pandas uses similar data storage for the contents of Index and its subclasses as it does for the data contents of a DataFrame. See the section “The Internal Representation of a DataFrame” in this post, which gives a good explanation (h/t @zikolach).
    • Because much of this code is carefully optimized by the pandas developers, and implemented in C, it will be more complete and performant than any attempt to emulate it in (pure) Python.
    • pandas should be as performant for a 'wide' DataFrame with (e.g.) 10⁴ rows and 10 columns for a 'year' dimension as it is for a 'long' Series with 10⁴×10=10⁵ elements and 'year' as one level in a MultiIndex. However, you should write tests and check this for your typical use cases in order to choose one. See pytest-benchmark (a minimal sketch follows this list).
    • Using base pandas.Series instead of a complex derived type also allows optimized calculation methods that are designed to operate on Series to work on your data.
  • In ixmp.reporting, we want to use xarray.Dataset / .DataArray for multidimensional data.
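
A minimal pytest-benchmark sketch of such a comparison could look like the following (illustrative only - the shapes, names and operations are made up, not taken from pyam):

import numpy as np
import pandas as pd

def make_wide(n_rows=10_000, n_years=10):
    # 'wide' layout: one row per timeseries, one column per year
    years = range(2000, 2000 + n_years)
    return pd.DataFrame(
        np.random.rand(n_rows, n_years),
        index=pd.RangeIndex(n_rows, name="timeseries"),
        columns=pd.Index(years, name="year"),
    )

def make_long(n_rows=10_000, n_years=10):
    # 'long' layout: a Series with 'year' as one MultiIndex level
    return make_wide(n_rows, n_years).stack()

def test_sum_wide(benchmark):
    df = make_wide()
    benchmark(lambda: df.sum(axis=1))

def test_sum_long(benchmark):
    s = make_long()
    benchmark(lambda: s.groupby(level="timeseries").sum())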

Hope that helps.

@znicholls
Collaborator Author

I would add a reference but I've only thought about this for 30 seconds, so here's a brain dump instead. As I understand it, the major use case for pyam is handling model output, that is, lots of timeseries data where sparsity isn't really an issue. If that understanding is wrong, stop reading, because the rest of my thoughts are based on a bad assumption.

Assuming I've got the major use case right, then xarray's sparsity issues disappear and it seems like the most obvious way forward if you really want to go to big data, where lazy loading can be super useful (unless pandas has a lazy-loading solution which I haven't considered).

The long/wide discussion isn't quite as simple as discussed in the comment above. The problem for pyam's handling at the moment is that if you have data in wide format, say 10 rows (each of which is a timeseries) and 10 columns (each of which is a year), then you have 10 values in the 'unit' column. If you pivot to long form, you still have the same number of data values (10x10), but now you have 10x more values in the unit column (because you have 100 rows, not 10). Given that the unit column is also normally strings, and by default not categorical, you pay a huge memory and performance penalty by using long form (especially when filtering). Once your data gets big enough, the difference between long and wide can be the difference between having enough memory and not. You could help by using categorical columns, but they have other issues.
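
To make the unit-column point concrete, here is a rough sketch (made-up shapes; plain object-dtype columns as in pyam's current long format, no MultiIndex):

import numpy as np
import pandas as pd

n_ts, n_years = 10, 10
years = list(range(2010, 2010 + n_years))

# wide: one row per timeseries, so the unit string appears once per timeseries
wide = pd.DataFrame(np.random.rand(n_ts, n_years), columns=years)
wide["unit"] = "Mt CO2/yr"

# long: one row per (timeseries, year), so the unit string is repeated n_years times
long = wide.melt(id_vars="unit", var_name="year", value_name="value")

print(wide["unit"].memory_usage(deep=True))  # 10 strings
print(long["unit"].memory_usage(deep=True))  # 100 strings, roughly 10x the memory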

We’re trying a new option in scmdata which gives about an order of magnitude faster filtering in exchange for handling things a bit differently. I think the answer really depends on the main use pattern.

@gidden
Member

gidden commented Feb 20, 2020

Hi all - my two cents:

On long/wide:

  • we currently store data as non-indexed long-form data
  • if we were to store it using a pd.MultiIndex as a pd.Series with name='value', I think we would address the short-term size/computation issues (e.g., filtering would become data.loc[slice] rather than data.col[data.col.isin(filter)])
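
For instance, filtering on such a MultiIndexed Series could look like this (a sketch with made-up data, not current pyam code):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr", 2020),
        ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr", 2030),
        ("model_b", "scen_2", "World", "Emissions|CH4", "Mt CH4/yr", 2020),
    ],
    names=["model", "scenario", "region", "variable", "unit", "year"],
)
data = pd.Series([35.0, 30.0, 400.0], index=idx, name="value")

# label-based selection instead of boolean masks on columns
subset = data.loc[("model_a", "scen_1")]          # slice on the leading index levels
co2 = data.xs("Emissions|CO2", level="variable")  # select on an inner level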

On xarray:

  • this is my strongly preferred option if our data model is appropriate
    • lazy loading
    • native write/read to disk for heterogeneous data
    • native dask support for computations
  • the ixmp data model was not appropriate because our indexes (dimensions) are too heterogeneous
  • for pyam our indices are standard (IAMC_IDX) and pervasive for the whole dataset
  • xarray further has native metadata support

So, in short - agreed with all said by @khaeru (thanks, btw!) and @znicholls. xarray didn't fit ixmp, but I believe it would fit pyam perfectly.

@khaeru
Contributor

khaeru commented Feb 20, 2020

The long/wide discussion isn’t quite as simple as discussed in the comment above. […] Given that the unit column is also normally strings, and by default not categorical, you pay a huge memory and performance penalty by using long form […] Once your data gets big enough…

This assumes pandas' C storage internals are dumb, which, again, they are not. That's why I recommended testing. For example:

>>> import pandas as pd
>>> units = ['kg', 'GJ', 'USD'] * 10**6
>>> pd.Index(units).memory_usage()                           
24000000
>>> pd.MultiIndex.from_tuples(('a', 'b', u) for u in units).memory_usage()
9000088

That is, once one starts using a MultiIndex, pandas stores each level, such as 'unit', as integer codes into an internal list of labels. This is a 3-level index with three million rows in 9 MB: 3 levels × 3·10⁶ rows × 1 byte (note it automatically chooses the smallest sufficient integer type, int8) = 9·10⁶ bytes, plus 88 bytes for the labels. It is not "huge".
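
The per-level storage can also be inspected directly (a small addition to the example above; .levels and .codes are standard MultiIndex attributes):

# continuing the example: each level stores its unique labels once,
# and per-row integer codes point into them
mi = pd.MultiIndex.from_tuples(('a', 'b', u) for u in units)
mi.levels[2]       # the three unique unit strings, stored once
mi.codes[2].dtype  # int8 - one byte per row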

This is why I specifically referred to using pd.Series with pd.MultiIndex for long-form data, and did not suggest storing index information as columns in a pd.DataFrame.

On xarray […] the ixmp data model was not appropriate because our indexes (dimensions) are too heterogeneous

I think xarray will be great for ixmp once the sparse support matures. Its approach to handling arbitrarily many coordinate variables is more robust than ixmp's.

@gidden
Member

gidden commented Feb 20, 2020

Ok, so synthesizing this - it sounds like xarray is an ideal next solution here, right?

@znicholls
Collaborator Author

(As a brief comment on the pandas storage thread: I think @khaeru and I were discussing different things. I wasn’t considering using a MultiIndex because this isn’t how pyam is currently implemented and I hadn’t realised it was on the table. Without a MultiIndex, pandas doesn’t do the clever automatic conversion to integer codes, hence long/wide does matter. As @khaeru points out, you can take that headache away if you store everything as a Series with a MultiIndex and refactor pyam accordingly.)

@byersiiasa
Collaborator

This is interesting. I had a think about how it would be structured in xarray. Initially I thought that ['scenario','region','model','year','unit'] would all go in as coordinates, with each 'variable' becoming an individual data variable within an xr.Dataset.

But this gives ValueError: cannot handle a non-unique multi-index!, which means 'variable' should also be sent to the coordinates, leaving only one 'value' data variable...

df.set_index(['scenario','region','model','year','unit','variable']).to_xarray() # a long df


<xarray.Dataset>
Dimensions:   (model: 2, region: 1, scenario: 2, unit: 28, variable: 59, year: 321)
Coordinates:
  * scenario  (scenario) object '1point5' '2point0'
  * region    (region) object 'World'
  * model     (model) object 'model1' 'model2'
  * year      (year) int64 1800 1801 1802 1803 1804 ... 2116 2117 2118 2119 2120
  * unit      (unit) object 'Gt C / yr' 'K' 'Mt BC / yr' ... 'ppb' 'ppm'
  * variable  (variable) object 'AR6 climate diagnostics|Concentration|CH4|MAGICC6|Median' ... 'AR6 climate diagnostics|Temperature|Global Mean|MAGICC6|Median'
Data variables:
    value     (scenario, region, model, year, unit, variable) float64 nan ... nan

Perhaps others have better ideas about how to handle this - I'd find it useful to visualize what this will look like - because, as feared above, this results in 16 MB of memory/filesize from a 300 kB csv.

@gidden
Member

gidden commented Feb 26, 2020

Thanks for taking a look @byersiiasa =)

I think the issue here is that the unit column is bringing in non-uniqueness. Otherwise ['scenario','region','model','year'] is fine. I think if we went the xarray route, we would also have to think about how we handle units anyway (and this would not be a bad time to do so).

Out of curiosity, what size do you get if you do

df.set_index(['scenario','region','model','year'])['variable'].to_xarray()

I would assume that in the future we keep some sort of side-car metadata which maps variables to units (and is interrogated on computation and file output, populated on file input, etc.).
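
A rough sketch of that side-car idea (hypothetical, not existing pyam/scmdata API; assumes each variable has a single unit):

# keep 'unit' out of the xarray dimensions and carry a variable -> unit mapping alongside
unit_map = (
    df[["variable", "unit"]]
    .drop_duplicates()
    .set_index("variable")["unit"]
    .to_dict()
)
ds = (
    df.set_index(["scenario", "region", "model", "year", "variable"])["value"]
    .to_xarray()
)
# unit_map would then be consulted on computation and written out on file output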

@byersiiasa
Collaborator

Out of curiosity, what size do you get if you do

df.set_index(['scenario','region','model','year'])['variable'].to_xarray()

You get ValueError: cannot handle a non-unique multi-index!

@gidden
Member

gidden commented Feb 26, 2020 via email

@znicholls
Collaborator Author

znicholls commented Mar 3, 2020

I started having a play with this too (xarray-illustration.txt). Fiddling around with the netCDF file, it looks like we're not choosing the right index before converting to a dataset, which is why it's so big.

It may be that using to_xarray() is not the right approach and, as @gidden says, we need to hold things internally with some sort of mapping and then have a custom writer that can handle the fact that our metadata isn't really coordinates in the normal sense of the word. We've started fiddling with that in scmdata, but haven't finished yet.

@znicholls
Collaborator Author

znicholls commented Apr 17, 2020

I think this might be solved by openscm/scmdata@59568a6 - at least, when I run the example script I get an output file which is 10x smaller than saving a csv, and you can load the output via xarray with

>>> xrds = xr.load_dataset("scmdata_dataset.nc")
>>> xrds
<xarray.Dataset>
Dimensions:            (model: 10, region: 2, scenario: 3584, time: 12)
Coordinates:
  * time               (time) float64 9.467e+08 1.105e+09 ... 4.102e+09
  * region             (region) object 'R5LAM' 'World'
  * model              (model) object 'MESSAGEix-GLOBIOM 1.0' 'a' ... 'h' 'i'
  * scenario           (scenario) object 'CD-LINKS_INDCi' ... 'LowEnergyDemandi'
Data variables:
    emissions__ch4     (region, model, scenario, time) float64 43.32 ... 94.27
    emissions__co2     (region, model, scenario, time) float64 3.772e+03 ... -3.524e+03
    emissions__n2o     (region, model, scenario, time) float64 1.142e+03 ... 7.029e+03
    emissions__sulfur  (region, model, scenario, time) float64 4.672 ... 12.45
Attributes:
    created_at:        2020-04-17T22:52:05.931706
    _scmdata_version:  0.4.0+130.g59568a6

Example script here: netcdf-size.txt

@danielhuppmann
Member

Thanks @znicholls! If I understand this correctly, this is basically an exporter to xarray - but would this also be a starting point for refactoring the IamDataFrame.data attribute?

@znicholls
Collaborator Author

If I understand this correctly, this is basically an exporter to xarray

More or less - you can also read back into an ScmDataFrame object, so really it's just allowing netCDF serialisation/deserialisation.

would this also be a starting point for refactoring the IamDataFrame.data attribute?

In principle yes, but I see a slightly tricky choice needing to be made: either (a) duplicate a lot of code from scmdata into pyam, or (b) make scmdata a dependency of pyam, i.e. refactor so that .data uses scmdata for its internal data handling.

@byersiiasa
Collaborator

@znicholls you can also play with the compression options, e.g.:

comp = dict(zlib=True, complevel=5)  # compression level: 1 (fastest) to 9 (smallest)
encoding = {vari: comp for vari in scmrun.data_vars}  # same settings for every data variable
scmrun.to_nc(scm_ds_fn, dimensions=("region", "model", "scenario"), encoding=encoding)

@znicholls
Collaborator Author

@znicholls you can also play with the compression options

Cool! PRs very welcome :D

@byersiiasa
Collaborator

Regarding the metadata: not sure if you're aware, but you can add a dict of attrs to every variable (DataArray) in the same way that the Dataset also has attrs.

Have you thought about what if scenario was the data variable, instead of "variable"? The variables are more consistent, like the regions and time. That way, also, the metadata could belong to that scenario in the attributes.

@znicholls
Collaborator Author

Regarding the metadata: not sure if you're aware, but you can add a dict of attrs to every variable (DataArray) in the same way that the Dataset also has attrs.

Yep that's how scmdata's internal handling works (e.g. here)

Have you thought about what if scenario was the data variable, instead of "variable"? The variables are more consistent, like the regions and time. That way, also, the metadata could belong to that scenario in the attributes.

Not yet, but it sounds sensible and I don't think it would be that difficult to do as a refactoring. Probably just don't hardcode "variable" here and add some keyword argument like top_level to _write_nc.

@danielhuppmann
Member

FYI, @macflo8 will work at IIASA for the next two months, and I hope that he and I can make some progress here. We'll do a refactoring to a pandas.Series first (because this seems a much easier step), but ideally lay the groundwork for a multi-backend implementation (so that a user can choose whether to have pandas or xarray).

@khaeru
Contributor

khaeru commented Aug 26, 2020

A few more recent bits of info:

  • Look at this section of the pint docs, especially the diagram and links. tl;dr: in NumPy and everything related to it, there are ongoing efforts to formalize the wrapping of low-level types by other types that bring additional features, e.g. sparse.COO, dask.array, and pint.Quantity.
  • This will imply a fixed hierarchy of classes, and it will probably make sense for smaller-scope projects to build on top of the higher items in the hierarchy.
  • "Ongoing" means "not yet finished", but it's still worthwhile to have a long-term aim to take advantage of this upstream improvement. Eventually, relying on it for various functionality can slim code and reduce workload.
  • I'm trying to do this for ixmp.reporting, e.g. in iiasa/ixmp#191 (Use sparse xarray for reporting.Quantity).
  • For pyam, #420 (Refactor internal data object to pandas.Series) is a good step, as it makes clear the separation between pyam features and the internal data representation.

@znicholls
Collaborator Author

One extra comment on top of the great summary from @khaeru: the pint docs make clear that xarray and pandas are "upcast types" of pint arrays. So, it may be possible in the future to hold everything internally as either xarray or pandas, with pint within that, so all the units are held 'tightly' with the data. This would mean we wouldn't have to carry around units information with each timeseries like we do right now (although, having said that, it's a pretty minor change so not a big deal either way really).
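
As a tiny illustration of the idea (plain pint + numpy, not pyam/scmdata API; the unit strings here are just pint's built-in ones):

import numpy as np
import pint

ureg = pint.UnitRegistry()

# the array carries its unit, so no separate 'unit' column is needed
emissions = ureg.Quantity(np.array([35.0, 36.5, 38.0]), "Gt / yr")
per_month = emissions.to("Gt / month")  # conversions handled by pint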

@danielhuppmann
Member

The internal timeseries data handling was refactored to a pd.Series. Adding an option to use xarray in the backend or using pint directly is currently not planned. Please re-open this issue or start a new one if the situation changes.
