Investigate (Sparse) XArray Sizes depending on Dimensions #8

michaelweinold · 2024-01-20T07:17:32Z

Try:

4 dimensions (e.g. parameter, time, size, prop)
8 dimensions (e.g. parameter, time, size, prop1, prop2, prop3, prop4, prop5)

and determine the size (in memory) and the performance of operations such as slicing the array.
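
As a rough baseline for the comparison, the memory of a dense array is the product of its dimension lengths times the item size. The sketch below uses hypothetical dimension lengths (they are not taken from the model) and assumes 8-byte float64 values:

```python
from math import prod


def dense_nbytes(dim_lengths, itemsize=8):
    """Bytes needed for a dense array with the given dimension lengths."""
    return prod(dim_lengths) * itemsize


# Hypothetical dimension lengths, chosen only for illustration.
four_dims = {"parameter": 500, "time": 6, "size": 9, "prop": 5}
eight_dims = {
    "parameter": 500, "time": 6, "size": 9,
    "prop1": 5, "prop2": 5, "prop3": 5, "prop4": 5, "prop5": 5,
}

bytes_4d = dense_nbytes(four_dims.values())
bytes_8d = dense_nbytes(eight_dims.values())

print(f"4 dimensions: {bytes_4d / 1e6:.2f} MB")
print(f"8 dimensions: {bytes_8d / 1e6:.2f} MB")
print(f"growth factor: {bytes_8d // bytes_4d}")
```

Under these assumptions, each additional dimension of length 5 multiplies the dense size by 5, which is why the number of dimensions matters more than the number of stored records.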

See also:

michaelweinold · 2024-01-22T12:25:07Z

@iamsiddhantsahu, by tomorrow (23.01.2024), please investigate:

Currently, the carculator ingests data like this:

The input data is formatted in JSON (cf. the default_parameters.json):

"7-2000-converter mass": {
    "name": "converter mass",
    "year": 2000,
    "powertrain": [
        "PHEV-c-d",
        "PHEV-e",
        "BEV",
        "FCEV",
        "PHEV-c-p"
    ],
    "sizes": [
        "Mini",
        "Small",
        "Lower medium",
        "Medium",
        "Large",
        "Van",
        "Medium SUV",
        "Large SUV",
        "Micro"
    ],
    "amount": 4.5,
    "loc": 4.5,
    "minimum": 4,
    "maximum": 6,
    "kind": "distribution",
    "uncertainty_type": 5,
    "category": "Powertrain",
    "source": "Del Duce et al (2016)",
    "comment": ""
},

Of course, we could represent this in tabulated format also:

name	year	powertrain	sizes	amount	...
converter mass	2000	PHEV-c-d	Mini	4.5	...
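
A minimal sketch of how such a record could be flattened into the tabulated form, assuming one row per (powertrain, size) combination. The helper `records_to_rows` is hypothetical and the record is abbreviated; this is not how carculator itself reads the file:

```python
import json
from itertools import product

import pandas as pd

# Abbreviated copy of the record quoted above (lists shortened for brevity).
raw = """
{
  "7-2000-converter mass": {
    "name": "converter mass",
    "year": 2000,
    "powertrain": ["PHEV-c-d", "BEV"],
    "sizes": ["Mini", "Small", "Van"],
    "amount": 4.5
  }
}
"""


def records_to_rows(records):
    """Return one flat row per (powertrain, size) combination of each record."""
    rows = []
    for record in records.values():
        for powertrain, size in product(record["powertrain"], record["sizes"]):
            rows.append({
                "name": record["name"],
                "year": record["year"],
                "powertrain": powertrain,
                "sizes": size,
                "amount": record["amount"],
            })
    return rows


df = pd.DataFrame(records_to_rows(json.loads(raw)))
print(df)
```

The shortened record expands to 2 x 3 = 6 rows, so the long table grows with the number of combinations that actually occur in the input.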

In the instantiation of the class VehicleInputParameters(NamedParameters), this data is then converted into an xarray.DataArray with dimensions:

  * size        (size) <U12 'Large' 'Large SUV' 'Lower medium' ... 'Small' 'Van'
  * powertrain  (powertrain) <U8 'BEV' 'FCEV' 'HEV-d' ... 'PHEV-e' 'PHEV-p'
  * parameter   (parameter) <U64 '1-Pentene direct emissions, rural' ... 'tra...
  * year        (year) int64 2000 2010 2020 2030 2040 2050
  * value       (value) int64 0

(where value is relevant to stochastic calculations only).

This array is then used further on in the model library to perform calculations.

From what I can see, data is extracted from the DataArray using the slicing methods.
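
For reference, label-based extraction from a DataArray with these dimensions could look like the sketch below. The coordinates are a small stand-in ("other parameter" is a placeholder name, not a real model parameter), and the snippet does not reproduce the model library's own code:

```python
import numpy as np
import xarray as xr

# Small stand-in coordinates; the real array has many more entries.
coords = {
    "size": ["Large", "Medium", "Small"],
    "powertrain": ["BEV", "FCEV", "PHEV-e"],
    "parameter": ["converter mass", "other parameter"],
    "year": [2000, 2010, 2020],
    "value": [0],
}
shape = tuple(len(labels) for labels in coords.values())
array = xr.DataArray(np.zeros(shape), coords=coords, dims=list(coords))

# Label-based assignment: set one parameter for one year across all
# sizes and powertrains.
array.loc[dict(parameter="converter mass", year=2000)] = 4.5

# Label-based slicing: scalar labels drop their dimension, lists keep it.
subset = array.sel(
    parameter="converter mass",
    year=2000,
    powertrain=["BEV", "FCEV"],
)
print(subset.dims)
print(subset.shape)
```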

...so why can't we do this with a Pandas DataFrame?

iamsiddhantsahu · 2024-01-23T06:03:27Z

I wrote a script (commit b218c47) to test memory usage and slicing times, comparing Pandas and XArray with sparse arrays.

Here is the plot.

Findings:

Memory -- Pandas is more memory efficient for sparse arrays than XArray, possibly because of its built-in sparse data structures (formerly the SparseDataFrame class; since pandas 1.0, columns with a SparseDtype).
Slicing time -- XArray and Pandas seem to have about the same slicing time.
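
A small sketch of how the memory effect of pandas' sparse support can be checked in isolation. The data is synthetic (1% non-zero values, an assumption made only for this example) and it is not the script from the commit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n_rows = 100_000

# Mostly zeros, with 1% strictly non-zero entries at random positions.
values = np.zeros(n_rows)
nonzero = rng.choice(n_rows, size=n_rows // 100, replace=False)
values[nonzero] = rng.random(nonzero.size) + 1.0

dense = pd.DataFrame({"amount": values})
# Store only the entries that differ from the fill value 0.0.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())

print(f"dense:   {dense_bytes} bytes")
print(f"sparse:  {sparse_bytes} bytes")
print(f"density: {sparse.sparse.density:.2%}")
```

The saving depends entirely on the density, so a comparison with XArray is only meaningful when both sides hold the same data at the same density.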

michaelweinold · 2024-01-23T06:10:43Z

Dataset size is a poor metric. The question was primarily about the number of dimensions, so you should have made that explicit in your plot. Just comparing the dataset size is not helpful at all.

Also, you still have to think about the utility of using an xr.DataArray vs. a pd.DataFrame. Is there any operation (e.g. slicing) that we could not do with a DataFrame?
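
As one data point for that question, label-based slicing across several dimensions can be expressed on a DataFrame with a MultiIndex. The sketch below uses stand-in labels ("other parameter" is a placeholder) and only covers slicing, not every operation the model needs:

```python
import pandas as pd

levels = {
    "size": ["Large", "Medium", "Small"],
    "powertrain": ["BEV", "FCEV", "PHEV-e"],
    "parameter": ["converter mass", "other parameter"],
    "year": [2000, 2010, 2020],
}
index = pd.MultiIndex.from_product(list(levels.values()), names=list(levels))

# One row per combination of labels; slicers require a sorted index.
df = pd.DataFrame({"amount": 0.0}, index=index).sort_index()

idx = pd.IndexSlice

# Label-based assignment for one parameter and one year.
df.loc[idx[:, :, "converter mass", 2000], "amount"] = 4.5

# Label-based slicing on several index levels at once.
subset = df.loc[idx[:, ["BEV", "FCEV"], "converter mass", 2000], "amount"]

# Cross-section on a single level, which drops that level from the index.
per_parameter = df.xs("converter mass", level="parameter")

print(subset)
print(per_parameter.index.names)
```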

iamsiddhantsahu · 2024-01-23T07:29:52Z

The pandas.DataFrame.to_xarray() function that you are using here converts a pandas.DataFrame object to an xarray.Dataset object. The xarray.Dataset object is a multi-dimensional, in-memory array database: a dict-like container of labeled arrays (DataArray objects) with aligned dimensions.

⚠️ The important note here is that it is designed as an in-memory representation of the data model -- this might be causing memory issues.
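
One possible contributor, sketched here as an assumption rather than a confirmed diagnosis: when the DataFrame has a MultiIndex, to_xarray() turns each index level into a dimension and allocates the full grid of label combinations, filling absent combinations with NaN. The example requires xarray to be installed and uses made-up values:

```python
import pandas as pd

# Long table holding only 3 of the 3 x 3 x 3 = 27 possible combinations.
df = pd.DataFrame({
    "size": ["Mini", "Small", "Van"],
    "powertrain": ["BEV", "FCEV", "PHEV-e"],
    "year": [2000, 2010, 2020],
    "amount": [4.5, 5.0, 6.0],
}).set_index(["size", "powertrain", "year"])

# Each index level becomes a dimension of the resulting Dataset.
ds = df.to_xarray()

n_rows = len(df)
n_cells = ds["amount"].values.size
n_missing = int(ds["amount"].isnull().sum())

print(f"rows in DataFrame: {n_rows}")
print(f"cells in Dataset:  {n_cells}")
print(f"NaN-filled cells:  {n_missing}")
```

In this toy case the Dataset stores 27 cells for 3 actual records, and the gap widens with every additional dimension.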

iamsiddhantsahu · 2024-01-23T08:19:34Z

Here is a bar plot comparing the memory usage of a pandas.DataFrame object and an xarray.Dataset object.

Dataset 1 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 1)
Dataset 2 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 2)
Dataset 3 = create_sample_dataframe(df_size = 500, num_propulsion_classifications = 2)

Findings:

The pandas.DataFrame object is much more memory efficient than the xarray.Dataset object -- ⚠️ The y-axis is a log axis. The pandas.DataFrame objects are so small that a log scale is required.

michaelweinold added the enhancement label Jan 20, 2024

michaelweinold self-assigned this Jan 20, 2024

michaelweinold assigned iamsiddhantsahu Jan 22, 2024

michaelweinold unassigned iamsiddhantsahu Jan 30, 2024

michaelweinold closed this as completed Feb 5, 2024
