
From pandas to xarray without blowing up memory #2139

Closed · ghost opened this issue May 16, 2018 · 15 comments · Fixed by #3210

ghost commented May 16, 2018

I have a billion rows of data, but really it's just two categorical variables, time, lat, lon, and some data variables.

Thinking it would somehow help me get the data into xarray, I created a five-level pandas MultiIndex out of the data, but so far this has not been successful: xarray tries to create the full Cartesian product of the index levels, and that's just not going to work.

Trying to write a NetCDF file has presented its own issues, and I'm left wondering if there isn't a much simpler way to go about this.
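To make the blow-up concrete, here is a minimal sketch with made-up column names and sizes (the commented-out call is the one that explodes):

```python
import numpy as np
import pandas as pd

# Toy frame: only 1,000 rows, but a five-level index whose levels
# each have many distinct values.
n = 1_000
df = pd.DataFrame({
    "epoch": np.arange(n),
    "lat": np.random.rand(n),
    "lon": np.random.rand(n),
    "cat1": np.random.randint(0, 10, n),
    "cat2": np.random.randint(0, 10, n),
    "var1": np.random.rand(n),
}).set_index(["epoch", "lat", "lon", "cat1", "cat2"])

# to_xarray() unstacks the MultiIndex into the full Cartesian product
# of its levels: up to 1000 * 1000 * 1000 * 10 * 10 cells here, almost
# all of them NaN fill. At a billion rows this is hopeless.
# ds = df.to_xarray()
```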

jhamman (Member) commented May 16, 2018

@brianmingus - any chance you can provide a reproducible example with some dummy data?

ghost commented May 16, 2018

Hi @jhamman. The original data is literally just a flat CSV file with columns like lat,lon,epoch,cat1,cat2,var1,var2,...,var50, with 1 billion rows.

I'm looking to xarray for GeoViews, which I think would benefit from having the data properly grouped/indexed by its categories.

ghost commented May 16, 2018

PS: I started with Dask but haven't found a way to go from Dask to xarray.

ghost commented May 16, 2018

This looks potentially helpful: http://metacsv.readthedocs.io/en/latest/

shoyer (Member) commented May 16, 2018

If you don't want the full Cartesian product, you need to ensure that the index only contains the variables you want to expand into a grid, e.g., time, lat and lon.

If the problem is only running out of memory (which is indeed likely with 1e9 rows), then you'll need to think about a more clever way to convert the data. One good option might be to group over subsets of the data (using dask or another parallel processing library like Spark or Beam), write a bunch of smaller netCDF files, and then open them with xarray's open_mfdataset(). It's probably most convenient to split over time, e.g., into files for each day or month.
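A rough sketch of that approach, assuming the column names from the CSV described above and plain chunked pandas reading (dask.dataframe would work similarly):

```python
import pandas as pd
import xarray as xr

# Stream the CSV in manageable chunks and write one netCDF file per
# (chunk, month) group; only time/lat/lon go into the index, so each
# piece densifies to a small grid.
reader = pd.read_csv("data.csv", parse_dates=["epoch"], chunksize=10_000_000)
for i, chunk in enumerate(reader):
    for month, group in chunk.groupby(chunk["epoch"].dt.to_period("M")):
        ds = group.set_index(["epoch", "lat", "lon"]).to_xarray()
        ds.to_netcdf(f"part-{month}-{i:04d}.nc")

# Lazily recombine the pieces along their coordinates.
ds = xr.open_mfdataset("part-*.nc", combine="by_coords")
```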

ghost commented May 16, 2018

@shoyer Thank you. Does metacsv look likely to work to you? It has attracted almost no attention, so I wonder if it will exhaust memory. I'm kind of surprised this path (csv -> xarray) isn't better fleshed out, as I would have expected it to be very common, perhaps the most common, especially for "found data."

shoyer (Member) commented May 16, 2018

MetaCSV looks interesting but I haven't used it myself. My guess would be that it just wraps pandas/xarray for processing data, so I think it's unlikely to give a performance boost. It's more about a declarative way to specify how to load a CSV into pandas/xarray.

ghost commented May 16, 2018

Ok. Looks like the way forward is a netCDF file for each level of my categorical variables. Will give it a shot.

ghost commented May 16, 2018

Does that sound like it will play well with GeoViews if I want widgets for the categorical vars?

mankoff (Contributor) commented Oct 14, 2020

Late reply, but in case anyone else finds this issue: I was filling memory with ds = df.to_xarray(), but if I build the dataset more manually, I have no memory issues:

```python
import xarray as xr

# Seed the Dataset with the first column, then attach the rest along the
# shared 1-D 'index' dimension; unlike df.to_xarray(), nothing here
# unstacks a MultiIndex into a dense grid.
ds = xr.Dataset({df.columns[0]: xr.DataArray(data=df[df.columns[0]], dims=['index'], coords={'index': df.index})})
for c in df.columns[1:]:
    ds[c] = (('index'), df[c])
```

max-sixty (Collaborator) commented

@mankoff Thanks for the issue; do you have a fuller reproduction? I'm happy to take a look at this.

mankoff (Contributor) commented Oct 14, 2020

@max-sixty Sorry for posting this here. This memory blow-up was a byproduct of another bug that took me a few more hours to track down. That other bug is in pandas, not xarray.

max-sixty (Collaborator) commented

Great! Post here or open a new issue if something does come up!

mankoff (Contributor) commented Oct 14, 2020

The issue is that if you pass names=['a','b','c'] to pd.read_csv and there are more columns than names, it takes all the columns without a name and turns them into a MultiIndex. The bug in my code was that I had more columns than names, didn't want a MultiIndex, and didn't make use of usecols.

This MultiIndex came from a small 12 MB file: 5,000 rows and 40 variables. When I then did df.to_xarray(), it filled up my RAM. If I ran the code I provided above, it worked.
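To illustrate the read_csv behavior described above, a toy sketch (column values and names are made up):

```python
import io
import pandas as pd

# Four columns in the data, but only two names: the unnamed leading
# columns silently become the row index (a two-level MultiIndex here).
csv = io.StringIO("1,2,3,4\n5,6,7,8\n")
df = pd.read_csv(csv, names=["a", "b"])
print(df.index)    # MultiIndex with entries (1, 2) and (5, 6)
print(df.columns)  # Index(['a', 'b'], dtype='object')
```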

Now that I've figured all this out, I don't think any bugs exist in xarray or pandas, just my code. As usual :). But if the fact that I can fill RAM with df.to_xarray() but not with the three lines shown above sounds like an issue you want to explore, I'm happy to provide an MWE on a new ticket and tag you there. Let me know...

max-sixty (Collaborator) commented

As you wish. If there's a motivating example then that carries more weight, and big issues should have an ample supply of motivating examples. That said, if you have something ready to go, I'm happy to take a look at it.
