Refactoring data handling #335
Thanks @znicholls for raising this issue, fine to test it in
@danielhuppmann asked me to comment here.
Hope that helps.
I would add a reference but I've only thought about this for 30 seconds, so here's a brain dump instead. As I understand it, the major use case for pyam is handling model output, that is lots of timeseries data where sparsity isn't really an issue. If that understanding is wrong, stop reading, because the rest of my thoughts are based on a bad assumption. Assuming I've got the major use case right, then xarray's sparsity issues disappear and it seems like the most obvious way forward if you really want to go to big data, where lazy loading can be super useful (unless pandas has a lazy loading solution which I haven't considered).

The long/wide discussion isn't quite as simple as suggested in the comment above. The problem for pyam's handling at the moment is that if you had data in wide format, say 10 rows (each of which is a timeseries) and 10 columns (each of which is a year), then you'll have 10 values in the 'unit' column. If you pivot to long form, you still have the same number of data values (10x10), but now you have 10x more values in the unit column (because you have 100 rows, not 10). Given that the unit column is also normally strings, and by default not categorical, you pay a huge memory and performance penalty by using long form (especially when filtering). Once your data gets big enough, the difference between long and wide can be the difference between having enough memory and not. Categorical columns would help, but they have other issues. We're trying a new option in scmdata which gives about an order of magnitude faster filtering in exchange for handling things a bit differently.

I think the answer really depends on the main use pattern.
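To make the memory point concrete, here is a rough sketch added for illustration (not a benchmark from this thread) comparing the repeated string 'unit' column in long form with a categorical version:

```python
# Rough, illustrative sketch of the long-form memory point above: the string
# 'unit' value is repeated once per (timeseries, year) pair unless the column
# is made categorical.
import numpy as np
import pandas as pd

n_series, n_years = 10_000, 10
long = pd.DataFrame(
    {
        "unit": ["Mt CO2/yr"] * (n_series * n_years),  # object dtype by default
        "year": np.tile(np.arange(2020, 2020 + n_years), n_series),
        "value": np.random.rand(n_series * n_years),
    }
)

as_object = long["unit"].memory_usage(deep=True)
as_category = long["unit"].astype("category").memory_usage(deep=True)
print(as_object, as_category)  # the categorical column is a small fraction of the object one
```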
Hi all - my two cents on this: On long/wide:
On
So, in short - agreed with all said by @khaeru (thanks, btw!) and @znicholls.
This assumes pandas' C storage internals are dumb, which, again, they are not. That's why I recommended testing. For example:
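The snippet that originally followed is not preserved in this copy of the thread; as a hedged stand-in, here is the kind of check being described, building a 3-level MultiIndex over three million rows and inspecting how the level codes are stored:

```python
# Hedged stand-in for the stripped example: a 3-level MultiIndex over three
# million rows, whose level codes pandas stores as small integers.
import numpy as np
import pandas as pd

n = 3_000_000
idx = pd.MultiIndex.from_arrays(
    [
        np.random.choice(["model_a", "model_b"], n),
        np.random.choice(["World", "R5LAM"], n),
        np.random.choice(["Mt CO2/yr", "EJ/yr"], n),
    ],
    names=["model", "region", "unit"],
)

print(idx.codes[0].dtype)           # int8: the smallest integer type that fits the labels
print(idx.memory_usage(deep=True))  # roughly 3 levels x 3e6 rows x 1 byte, plus the label arrays
```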
i.e. once one starts using MultiIndex, pandas stores each level, such as 'unit', as integer indices into an internal list of labels. This is a 3-level index with three million rows in 9 MB: 3 levels × 3·10⁶ rows × 1 byte (note it is automatically choosing the smallest necessary int, int8) = 9·10⁶ bytes, plus 88 for the labels. It is not "huge". This is why I specifically referred to using pd.Series with pd.MultiIndex for long-form data, and did not suggest storing index information as columns in a pd.DataFrame.
I think xarray will be great for ixmp once the sparse support matures. Its approach to handling arbitrarily many coordinate variables is more robust than ixmp's.
Ok, so synthesizing this - it sounds like
(As a brief comment on the pandas storage thread: I think @khaeru and I were discussing different things. I wasn't considering using a MultiIndex because this isn't how pyam is currently implemented, and I hadn't realised it was on the table. Without a MultiIndex pandas doesn't do clever automatic conversion to integer indices, hence long/wide does matter. As @khaeru points out, you can take that headache away if you store everything as a Series with a MultiIndex and refactor pyam accordingly.)
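As an illustration of that refactor (a hedged sketch, not how pyam is currently implemented), wide IAMC-style data can be stacked into a pd.Series with a MultiIndex:

```python
# Hedged sketch of the MultiIndex refactor being discussed (not pyam's
# current implementation): stack a wide table into a pd.Series whose index
# levels (including 'unit') are stored as integer codes.
import pandas as pd

wide = pd.DataFrame(
    {
        "model": ["MESSAGEix-GLOBIOM 1.0"],
        "scenario": ["CD-LINKS_INDCi"],
        "region": ["World"],
        "variable": ["Emissions|CO2"],
        "unit": ["Mt CO2/yr"],
        2020: [35.0],
        2030: [33.0],
    }
)

series = (
    wide.set_index(["model", "scenario", "region", "variable", "unit"])
    .rename_axis(columns="year")
    .stack()  # long form, but the metadata lives in the (coded) index, not in string columns
)
print(series)
```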
This is interesting. I had a think about how it would be structured in xarray. Initially I thought that But this gives
Perhaps others have better ideas of how this could be handled - I'd find it useful to visualize what this will look like, because as feared above this results in a 16 MB memory/file size from a 300 kB CSV.
Thanks for taking a look @byersiiasa =) I think the issue here is that the
Out of curiosity, what size do you get if you do `df.set_index(['scenario','region','model','year'])['variable'].to_xarray()`?
I would assume that in the future we keep some sort of sidecar metadata which maps variables to units (and is interrogated on computation and file output, populated on file input, etc.).
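A hedged illustration of that "sidecar metadata" idea follows; the class and attribute names are made up for the sketch and are not an existing pyam or scmdata API:

```python
# Hedged illustration of the sidecar-metadata idea above; names are invented
# for this sketch, not part of pyam or scmdata.
from dataclasses import dataclass, field
from typing import Dict

import pandas as pd


@dataclass
class TimeseriesStore:
    data: pd.DataFrame                                    # numbers only, no repeated 'unit' column
    units: Dict[str, str] = field(default_factory=dict)   # variable -> unit mapping

    def unit_of(self, variable: str) -> str:
        # interrogated on computation and file output
        return self.units[variable]


store = TimeseriesStore(
    data=pd.DataFrame({"variable": ["Emissions|CO2"], 2030: [33.0]}),
    units={"Emissions|CO2": "Mt CO2/yr"},                 # populated on file input
)
print(store.unit_of("Emissions|CO2"))
```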
You get `ValueError: cannot handle a non-unique multi-index!`
Ah of course... bad suggestion. Sorry!
I started having a play with this too (xarray-illustration.txt). Fiddling around with the netCDF file, it looks like we're not choosing the right index before converting to a dataset, which is why it's so big. It may be that using
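The rest of that sentence is cut off above. As a hedged guess at the general approach (not the contents of xarray-illustration.txt), choosing a full, unique index and pivoting 'variable' into columns before converting gives one data variable per variable name, much like the scmdata-based output shown a few comments below:

```python
# Hedged sketch (not the contents of xarray-illustration.txt): pick a full,
# unique index and pivot 'variable' into columns before converting, so each
# variable becomes its own data variable rather than one huge dense array.
import pandas as pd
import xarray as xr


def long_to_dataset(df: pd.DataFrame) -> xr.Dataset:
    """Assumes long-form columns: model, scenario, region, variable, year, value."""
    wide = (
        df.set_index(["model", "scenario", "region", "year", "variable"])["value"]
        .unstack("variable")  # one column per variable
    )
    return wide.to_xarray()   # Dataset with dims (model, scenario, region, year)
```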
I think this might be solved by openscm/scmdata@59568a6; at least when I run the example script I get an output file which is 10x smaller than saving a CSV, and you can load the output via xarray with

>>> xrds = xr.load_dataset("scmdata_dataset.nc")
>>> xrds
<xarray.Dataset>
Dimensions: (model: 10, region: 2, scenario: 3584, time: 12)
Coordinates:
* time (time) float64 9.467e+08 1.105e+09 ... 4.102e+09
* region (region) object 'R5LAM' 'World'
* model (model) object 'MESSAGEix-GLOBIOM 1.0' 'a' ... 'h' 'i'
* scenario (scenario) object 'CD-LINKS_INDCi' ... 'LowEnergyDemandi'
Data variables:
emissions__ch4 (region, model, scenario, time) float64 43.32 ... 94.27
emissions__co2 (region, model, scenario, time) float64 3.772e+03 ... -3.524e+03
emissions__n2o (region, model, scenario, time) float64 1.142e+03 ... 7.029e+03
emissions__sulfur (region, model, scenario, time) float64 4.672 ... 12.45
Attributes:
created_at: 2020-04-17T22:52:05.931706
    _scmdata_version: 0.4.0+130.g59568a6

Example script here: netcdf-size.txt
Thanks @znicholls! If I understand this correctly, this is basically an exporter to
More or less, you can also read back into an
In principle yes, but I see a slightly tricky choice needing to be made: either a) duplicate a lot of code from scmdata into pyam, or b) make scmdata a dependency of pyam, i.e. refactor so that .data uses scmdata for its internal data handling.
@znicholls you can also play with the compression options, e.g.:
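The example that followed is not preserved in this copy; a hedged sketch of what netCDF compression via xarray's `encoding` argument looks like (it requires the netCDF4 backend):

```python
# Hedged sketch of netCDF compression via xarray's `encoding` argument
# (the original example from this comment isn't preserved here).
import xarray as xr

ds = xr.load_dataset("scmdata_dataset.nc")
encoding = {name: {"zlib": True, "complevel": 5} for name in ds.data_vars}
ds.to_netcdf("scmdata_dataset_compressed.nc", encoding=encoding)
```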
Cool! PRs very welcome :D
Regarding the metadata, not sure if you're aware but you can add a
Have you thought about what if scenario was the variable, instead of the "variable"? Because the variables are more consistent, like the regions and time. That way, also, the meta_data could belong to that scenario in the attributes.
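Presumably the stripped reference above is to netCDF/xarray attributes, which already appear in the Dataset output earlier in the thread; a minimal sketch under that assumption:

```python
# Minimal sketch, assuming the comment refers to netCDF/xarray attributes:
# metadata can be attached per data variable and at the dataset level, so it
# travels with the file.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"emissions__co2": (("scenario", "time"), np.random.rand(3, 12))},
    coords={"scenario": ["a", "b", "c"], "time": np.arange(12)},
)
ds["emissions__co2"].attrs["unit"] = "Mt CO2/yr"  # per-variable metadata
ds.attrs["created_at"] = "2020-04-17T22:52:05"    # dataset-level metadata
ds.to_netcdf("with_metadata.nc")
```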
Yep, that's how scmdata's internal handling works (e.g. here)
Not yet but sounds sensible and I don't think it would be that difficult to do as a refactoring. Probably just don't hardcode "variable" here and add some keyword argument like
FYI, @macflo8 will work at IIASA for the next two months, and I hope that he and I can make some progress here. We'll do a refactoring to a pandas.Series first (because this seems a much easier step), but ideally lay the groundwork for a multi-backend implementation (so that a user can choose whether to have pandas or xarray).
A few more recent bits of info:
One extra comment on top of the great summary from @khaeru: the pint docs make clear that xarray and pandas are "upcast types" of pint arrays. So it may be possible in future to hold everything internally as either xarray or pandas, with pint within that so that all the units are held 'tightly' with the data. This would mean we wouldn't have to carry around units information with each timeseries like we do right now (although, having said that, it's a pretty minor change so not a big deal either way).
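For context, a hedged sketch of what that looks like, relying on the optional pint-pandas package (not anything pyam currently does):

```python
# Hedged sketch using the optional pint-pandas package (not current pyam
# behaviour): the unit lives in the column's dtype instead of a separate
# 'unit' column carried alongside each timeseries.
import pandas as pd
import pint_pandas  # noqa: F401  (registers the "pint[...]" extension dtype)

s = pd.Series([1.2, 3.4, 5.6], dtype="pint[EJ/year]")
print(s.dtype)                  # pint[EJ/year]
print(s.pint.to("TWh/year"))    # unit-aware conversion, no sidecar unit string needed
```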
The internal timeseries data handling was refactored to a
As brought up in https://twitter.com/daniel_huppmann/status/1229689895649710080, as we move to bigger datasets it's not clear that holding the data in 'long form' will be the easiest way to do things. @danielhuppmann we've started playing with holding the data in different forms in memory in https://github.com/lewisjared/scmdata, particularly openscm/scmdata#21 and openscm/scmdata#26. I'm not sure what the best way to proceed is, perhaps we continue to fiddle in scmdata and once we've found a moderately stable solution we work out how to do something similar in pyam?