
Refactoring data handling #335

Closed
znicholls opened this issue Feb 18, 2020 · 23 comments

@znicholls
Collaborator

As brought up in https://twitter.com/daniel_huppmann/status/1229689895649710080, as we move to bigger datasets it's not clear that holding the data in 'long form' will be the easiest way to do things. @danielhuppmann we've started playing with holding the data in different forms in memory in https://github.com/lewisjared/scmdata, particularly openscm/scmdata#21 and openscm/scmdata#26. I'm not sure what the best way to proceed is, perhaps we continue to fiddle in scmdata and once we've found a moderately stable solution we work out how to do something similar in pyam?

@danielhuppmann
Member

Thanks @znicholls for raising this issue, fine to test it in ScmDataFrame first. Still, can you outline the options (ideally in the descriptions above and delete my comment) or add a reference if that list already exists - I guess the simplest option is using pd.Series with index instead of pd.DataFrame (should be smaller in memory and faster), more advanced is moving to xarray, other options?

@khaeru
Contributor

khaeru commented Feb 19, 2020

@danielhuppmann asked me to comment here.

  • Using pandas.Series with pandas.MultiIndex is usually faster than pandas.DataFrame.
    • Internally, pandas uses similar data storage for the contents of Index and its subclasses as it does for the data contents of a DataFrame. See the section “The Internal Representation of a DataFrame” in this post, which gives a good explanation (h/t @zikolach).
    • Because much of this code is carefully optimized by the pandas developers, and implemented in C, it will be more complete and performant than any attempt to emulate it in (pure) Python.
    • pandas should be as performant for a 'wide' DataFrame with (e.g.) 10⁴ rows and 10 columns for a 'year' dimension as it is for a 'long' Series with 10⁴×10=10⁵ elements and 'year' as one level in a MultiIndex. However, you should write tests and check this for your typical use cases in order to choose one. See pytest-benchmark (a minimal sketch follows this list).
    • Using base pandas.Series instead of a complex derived type also allows optimized calculation methods that are designed to operate on Series to work on your data.
  • In ixmp.reporting, we want to use xarray.Dataset / .DataArray for multidimensional data.
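
A minimal pytest-benchmark sketch of such a comparison could look like the following (illustrative only - the shapes, names and operations are made up, not taken from pyam):

import numpy as np
import pandas as pd

def make_wide(n_rows=10_000, n_years=10):
    # 'wide' layout: one row per timeseries, one column per year
    years = range(2000, 2000 + n_years)
    return pd.DataFrame(
        np.random.rand(n_rows, n_years),
        index=pd.RangeIndex(n_rows, name="timeseries"),
        columns=pd.Index(years, name="year"),
    )

def make_long(n_rows=10_000, n_years=10):
    # 'long' layout: a Series with 'year' as one MultiIndex level
    return make_wide(n_rows, n_years).stack()

def test_sum_wide(benchmark):
    df = make_wide()
    benchmark(lambda: df.sum(axis=1))

def test_sum_long(benchmark):
    s = make_long()
    benchmark(lambda: s.groupby(level="timeseries").sum())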

Hope that helps.

@znicholls
Collaborator Author

I would add a reference but I've only thought about this for 30 seconds, so here's a brain dump instead. As I understand it, the major use case for pyam is handling model output, that is, lots of timeseries data where sparsity isn't really an issue. If that understanding is wrong, stop reading, because the rest of my thoughts are based on a bad assumption.

Assuming I've got the major use case right, then xarray's sparsity issues disappear and it seems like the most obvious way forward if you really want to go to big data, where lazy loading can be super useful (unless pandas has a lazy-loading solution which I haven't considered).

The long/wide discussion isn't quite as simple as discussed in the comment above. The problem for pyam's handling at the moment is that if you have data in wide format, say 10 rows (each of which is a timeseries) and 10 columns (each of which is a year), then you have 10 values in the 'unit' column. If you pivot to long form, you still have the same number of data values (10x10), but now you have 10x more values in the unit column (because you have 100 rows, not 10). Given that the unit column is also normally strings, and by default not categorical, you pay a huge memory and performance penalty by using long form (especially when filtering). Once your data gets big enough, the difference between long and wide can be the difference between having enough memory and not. You could help by using categorical columns, but they have other issues.
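
To make the unit-column point concrete, here is a rough sketch (made-up shapes; plain object-dtype columns as in pyam's current long format, no MultiIndex):

import numpy as np
import pandas as pd

n_ts, n_years = 10, 10
years = list(range(2010, 2010 + n_years))

# wide: one row per timeseries, so the unit string appears once per timeseries
wide = pd.DataFrame(np.random.rand(n_ts, n_years), columns=years)
wide["unit"] = "Mt CO2/yr"

# long: one row per (timeseries, year), so the unit string is repeated n_years times
long = wide.melt(id_vars="unit", var_name="year", value_name="value")

print(wide["unit"].memory_usage(deep=True))  # 10 strings
print(long["unit"].memory_usage(deep=True))  # 100 strings, roughly 10x the memory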

We’re trying a new option in scmdata which gives about an order of magnitude faster filtering in exchange for handling things a bit differently. I think the answer really depends on the main use pattern.

@gidden
Member

gidden commented Feb 20, 2020

Hi all - my two cents:

On long/wide:

  • we currently store data as non-indexed long-form data
  • if we were to store it using a pd.MultiIndex as a pd.Series with name='value', I think we would address the short-term size/computation issues (e.g., filtering would become data.loc[slice] rather than data.col[data.col.isin(filter)])
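
For instance, filtering on such a MultiIndexed Series could look like this (a sketch with made-up data, not current pyam code):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr", 2020),
        ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr", 2030),
        ("model_b", "scen_2", "World", "Emissions|CH4", "Mt CH4/yr", 2020),
    ],
    names=["model", "scenario", "region", "variable", "unit", "year"],
)
data = pd.Series([35.0, 30.0, 400.0], index=idx, name="value")

# label-based selection instead of boolean masks on columns
subset = data.loc[("model_a", "scen_1")]          # slice on the leading index levels
co2 = data.xs("Emissions|CO2", level="variable")  # select on an inner level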

On xarray:

  • this is my strongly preferred option if our data model is appropriate
    • lazy loading
    • native write/read to disk for heterogeneous data
    • native dask support for computations
  • the ixmp data model was not appropriate because our indexes (dimensions) are too heterogeneous
  • for pyam our indices are standard (IAMC_IDX) and pervasive for the whole dataset
  • xarray further has native metadata support

So, in short - agreed with all said by @khaeru (thanks, btw!) and @znicholls. xarray didn't fit ixmp, but I believe it would fit pyam perfectly.

@khaeru
Contributor

khaeru commented Feb 20, 2020

The long/wide discussion isn’t quite as simple as discussed in the comment above. […] Given that the unit column is also normally strings, and by default not categorical, you pay a huge memory and performance penalty by using long form […] Once your data gets big enough…

This assumes pandas' C storage internals are dumb, which, again, they are not. That's why I recommended testing. For example:

>>> import pandas as pd
>>> units = ['kg', 'GJ', 'USD'] * 10**6
>>> pd.Index(units).memory_usage()                           
24000000
>>> pd.MultiIndex.from_tuples(('a', 'b', u) for u in units).memory_usage()
9000088

That is, once one starts using a MultiIndex, pandas stores each level, such as 'unit', as integer codes into an internal list of labels. This is a 3-level index with three million rows in 9 MB: 3 levels × 3·10⁶ rows × 1 byte (note it automatically chooses the smallest sufficient integer type, int8) = 9·10⁶ bytes, plus 88 bytes for the labels. It is not "huge".
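
The per-level storage can also be inspected directly (a small addition to the example above; .levels and .codes are standard MultiIndex attributes):

# continuing the example: each level stores its unique labels once,
# and per-row integer codes point into them
mi = pd.MultiIndex.from_tuples(('a', 'b', u) for u in units)
mi.levels[2]       # the three unique unit strings, stored once
mi.codes[2].dtype  # int8 - one byte per row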

This is why I specifically referred to using pd.Series with pd.MultiIndex for long-form data, and did not suggest storing index information as columns in a pd.DataFrame.

On xarray […] the ixmp data model was not appropriate because our indexes (dimensions) are too heterogeneous

I think xarray will be great for ixmp once the sparse support matures. Its approach to handling arbitrarily many coordinate variables is more robust than ixmp's.

@gidden
Member

gidden commented Feb 20, 2020

Ok, so synthesizing this - it sounds like xarray is an ideal next solution here, right?

@znicholls
Collaborator Author

(As a brief comment on the pandas storage thread: I think @khaeru and I were discussing different things. I wasn’t considering using a MultiIndex because this isn’t how pyam is currently implemented and I hadn’t realised it was on the table. Without a MultiIndex, pandas doesn’t do the clever automatic conversion to integer codes, hence long/wide does matter. As @khaeru points out, you can take that headache away if you store everything as a Series with a MultiIndex and refactor pyam accordingly.)

@byersiiasa
Collaborator

This is interesting. I had a think about how it would be structured in xarray. Initially I thought that ['scenario','region','model','year','unit'] would all go in as coordinates, with each 'variable' becoming an individual data variable within an xr.Dataset.

But this gives ValueError: cannot handle a non-unique multi-index!, which means 'variable' should also be sent to the coordinates, leaving only one 'value' data variable...

df.set_index(['scenario','region','model','year','unit','variable']).to_xarray() # a long df


<xarray.Dataset>
Dimensions:   (model: 2, region: 1, scenario: 2, unit: 28, variable: 59, year: 321)
Coordinates:
  * scenario  (scenario) object '1point5' '2point0'
  * region    (region) object 'World'
  * model     (model) object 'model1' 'model2'
  * year      (year) int64 1800 1801 1802 1803 1804 ... 2116 2117 2118 2119 2120
  * unit      (unit) object 'Gt C / yr' 'K' 'Mt BC / yr' ... 'ppb' 'ppm'
  * variable  (variable) object 'AR6 climate diagnostics|Concentration|CH4|MAGICC6|Median' ... 'AR6 climate diagnostics|Temperature|Global Mean|MAGICC6|Median'
Data variables:
    value     (scenario, region, model, year, unit, variable) float64 nan ... nan

Perhaps others have better ideas about how to handle this - I'd find it useful to visualize what this will look like - because, as feared above, this results in 16 MB of memory/filesize from a 300 kB csv.

@gidden
Member

gidden commented Feb 26, 2020

Thanks for taking a look @byersiiasa =)

I think the issue here is that the unit column is bringing in non-uniqueness. Otherwise ['scenario','region','model','year'] is fine. I think if we went the xarray route, we would also have to think about how we handle units anyway (and this would not be a bad time to do so).

Out of curiosity, what size do you get if you do

df.set_index(['scenario','region','model','year'])['variable'].to_xarray()

I would assume that in the future we keep some sort of side-car metadata which maps variables to units (and is interrogated on computation and file output, populated on file input, etc.).
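
A rough sketch of that side-car idea (hypothetical, not existing pyam/scmdata API; assumes each variable has a single unit):

# keep 'unit' out of the xarray dimensions and carry a variable -> unit mapping alongside
unit_map = (
    df[["variable", "unit"]]
    .drop_duplicates()
    .set_index("variable")["unit"]
    .to_dict()
)
ds = (
    df.set_index(["scenario", "region", "model", "year", "variable"])["value"]
    .to_xarray()
)
# unit_map would then be consulted on computation and written out on file output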

@byersiiasa
Collaborator

Out of curiosity, what size do you get if you do

df.set_index(['scenario','region','model','year'])['variable'].to_xarray()

You get ValueError: cannot handle a non-unique multi-index!

@gidden
Member

gidden commented Feb 26, 2020 via email

@znicholls
Collaborator Author

znicholls commented Mar 3, 2020

I started having a play with this too (xarray-illustration.txt). Fiddling around with the netCDF file, it looks like we're not choosing the right index before converting to a dataset, which is why it's so big.

It may be that using to_xarray() is not the right approach and, as @gidden says, we need to hold things internally with some sort of mapping and then have a custom writer that can handle the fact that our metadata isn't really coordinates in the normal sense of the word. We've started fiddling with that in scmdata, but haven't finished yet.

@znicholls
Collaborator Author

znicholls commented Apr 17, 2020

I think this might be solved by openscm/scmdata@59568a6 - at least, when I run the example script I get an output file which is 10x smaller than saving a csv, and you can load the output via xarray with

>>> xrds = xr.load_dataset("scmdata_dataset.nc")
>>> xrds
<xarray.Dataset>
Dimensions:            (model: 10, region: 2, scenario: 3584, time: 12)
Coordinates:
  * time               (time) float64 9.467e+08 1.105e+09 ... 4.102e+09
  * region             (region) object 'R5LAM' 'World'
  * model              (model) object 'MESSAGEix-GLOBIOM 1.0' 'a' ... 'h' 'i'
  * scenario           (scenario) object 'CD-LINKS_INDCi' ... 'LowEnergyDemandi'
Data variables:
    emissions__ch4     (region, model, scenario, time) float64 43.32 ... 94.27
    emissions__co2     (region, model, scenario, time) float64 3.772e+03 ... -3.524e+03
    emissions__n2o     (region, model, scenario, time) float64 1.142e+03 ... 7.029e+03
    emissions__sulfur  (region, model, scenario, time) float64 4.672 ... 12.45
Attributes:
    created_at:        2020-04-17T22:52:05.931706
    _scmdata_version:  0.4.0+130.g59568a6

Example script here: netcdf-size.txt

@danielhuppmann
Member

Thanks @znicholls! If I understand this correctly, this is basically an exporter to xarray - but would this also be a starting point for refactoring the IamDataFrame.data attribute?

@znicholls
Collaborator Author

If I understand this correctly, this is basically an exporter to xarray

More or less - you can also read back into an ScmDataFrame object, so really it's just allowing netCDF serialisation/deserialisation.

would this also be a starting point for refactoring the IamDataFrame.data attribute?

In principle yes, but I see a slightly tricky choice needing to be made: either (a) duplicate a lot of code from scmdata into pyam, or (b) make scmdata a dependency of pyam, i.e. refactor so that .data uses scmdata for its internal data handling.

@byersiiasa
Collaborator

@znicholls you can also play with the compression options, e.g.:

comp = dict(zlib=True, complevel=5)  # compression level: 1 (fastest) to 9 (smallest)
encoding = {vari: comp for vari in scmrun.data_vars}  # same settings for every data variable
scmrun.to_nc(scm_ds_fn, dimensions=("region", "model", "scenario"), encoding=encoding)

@znicholls
Collaborator Author

@znicholls you can also play with the compression options

Cool! PRs very welcome :D

@byersiiasa
Collaborator

Regarding the metadata: not sure if you're aware, but you can add a dict of attrs to every variable (DataArray) in the same way that the Dataset also has attrs.

Have you thought about what if scenario was the data variable, instead of "variable"? The variables are more consistent, like the regions and time. That way, also, the metadata could belong to that scenario in the attributes.

@znicholls
Collaborator Author

Regarding the metadata: not sure if you're aware, but you can add a dict of attrs to every variable (DataArray) in the same way that the Dataset also has attrs.

Yep that's how scmdata's internal handling works (e.g. here)

Have you thought about what if scenario was the data variable, instead of "variable"? The variables are more consistent, like the regions and time. That way, also, the metadata could belong to that scenario in the attributes.

Not yet, but it sounds sensible and I don't think it would be that difficult to do as a refactoring. Probably just don't hardcode "variable" here and add some keyword argument like top_level to _write_nc.

@danielhuppmann
Member

FYI, @macflo8 will work at IIASA for the next two months, and I hope that he and I can make some progress here. We'll do a refactoring to a pandas.Series first (because this seems a much easier step), but ideally lay the groundwork for a multi-backend implementation (so that a user can choose whether to have pandas or xarray).

@khaeru
Contributor

khaeru commented Aug 26, 2020

A few more recent bits of info:

  • Look at this section of the pint docs, especially the diagram and links. tl;dr: in NumPy and everything related to it, there are ongoing efforts to formalize the wrapping of low-level types by other types that bring additional features, e.g. sparse.COO, dask.array, and pint.Quantity.
  • This will imply a fixed hierarchy of classes, and it will probably make sense for smaller-scope projects to build on top of the higher items in the hierarchy.
  • "Ongoing" means "not yet finished", but it's still worthwhile to have a long-term aim to take advantage of this upstream improvement. Eventually, relying on it for various functionality can slim code and reduce workload.
  • I'm trying to do this for ixmp.reporting, e.g. in iiasa/ixmp#191 (Use sparse xarray for reporting.Quantity).
  • For pyam, #420 (Refactor internal data object to pandas.Series) is a good step, as it makes clear the separation between pyam features and the internal data representation.

@znicholls
Collaborator Author

One extra comment on top of the great summary from @khaeru: the pint docs make clear that xarray and pandas are "upcast types" of pint arrays. So, it may be possible in the future to hold everything internally as either xarray or pandas, with pint within that, so all the units are held 'tightly' with the data. This would mean we wouldn't have to carry around units information with each timeseries like we do right now (although, having said that, it's a pretty minor change so not a big deal either way really).
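
As a tiny illustration of the idea (plain pint + numpy, not pyam/scmdata API; the unit strings here are just pint's built-in ones):

import numpy as np
import pint

ureg = pint.UnitRegistry()

# the array carries its unit, so no separate 'unit' column is needed
emissions = ureg.Quantity(np.array([35.0, 36.5, 38.0]), "Gt / yr")
per_month = emissions.to("Gt / month")  # conversions handled by pint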

@danielhuppmann
Member

The internal timeseries data handling was refactored to a pd.Series. Adding an option to use xarray in the backend or using pint directly is currently not planned. Please re-open this issue or start a new one if the situation changes.
