Investigate (Sparse) XArray Sizes depending on Dimensions #8

Closed
michaelweinold opened this issue Jan 20, 2024 · 5 comments
@michaelweinold

michaelweinold commented Jan 20, 2024

Try:

4 dimensions (e.g. parameter, time, size, prop)
8 dimensions (e.g. parameter, time, size, prop1, prop2, prop3, prop4, prop5)

and determine the size in memory and the performance of typical operations, e.g. slicing the array.
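As a rough back-of-the-envelope sketch of why the number of dimensions matters: a dense labelled array stores one cell per combination of coordinates, so its size grows with the product of the dimension lengths, while a long-format table grows only with the number of populated rows. The dimension lengths, the row count, and the cost model for the long table below are made-up assumptions for illustration, not values taken from carculator.

```python
from math import prod

BYTES_PER_FLOAT64 = 8


def dense_bytes(dim_lengths):
    """Bytes for a dense float64 array with the given dimension lengths."""
    return prod(dim_lengths) * BYTES_PER_FLOAT64


def long_format_bytes(n_rows, n_dims):
    """Rough estimate for a long table (assumed cost model): one float64
    value plus one 8-byte code per dimension for every populated row."""
    return n_rows * (1 + n_dims) * BYTES_PER_FLOAT64


# Hypothetical dimension lengths (placeholders, not carculator's real ones).
dims_4 = {"parameter": 500, "time": 6, "size": 9, "prop": 5}
dims_8 = {"parameter": 500, "time": 6, "size": 9,
          "prop1": 5, "prop2": 5, "prop3": 5, "prop4": 5, "prop5": 5}

n_populated_rows = 10_000  # assumed number of non-empty entries

dense_4 = dense_bytes(dims_4.values())
dense_8 = dense_bytes(dims_8.values())
long_4 = long_format_bytes(n_populated_rows, len(dims_4))
long_8 = long_format_bytes(n_populated_rows, len(dims_8))

for label, dense, long_ in [("4 dims", dense_4, long_4), ("8 dims", dense_8, long_8)]:
    print(f"{label}: dense = {dense:,} bytes, long format = {long_:,} bytes")
```

Under these assumptions the dense footprint is multiplied by the length of every added dimension, while the long-format estimate grows only linearly with the number of dimensions. A real measurement should still be run on the actual data.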

See also:

michaelweinold added the enhancement (New feature or request) label Jan 20, 2024
michaelweinold self-assigned this Jan 20, 2024
@michaelweinold

@iamsiddhantsahu, by tomorrow (23.01.2024) please investigate:

Currently, carculator ingests data like this:

The input data is formatted as JSON (cf. default_parameters.json):

"7-2000-converter mass": {
        "name": "converter mass",
        "year": 2000,
        "powertrain": [
            "PHEV-c-d",
            "PHEV-e",
            "BEV",
            "FCEV",
            "PHEV-c-p"
        ],
        "sizes": [
            "Mini",
            "Small",
            "Lower medium",
            "Medium",
            "Large",
            "Van",
            "Medium SUV",
            "Large SUV",
            "Micro"
        ],
        "amount": 4.5,
        "loc": 4.5,
        "minimum": 4,
        "maximum": 6,
        "kind": "distribution",
        "uncertainty_type": 5,
        "category": "Powertrain",
        "source": "Del Duce et al (2016)",
        "comment": ""
    },

Of course, we could represent this in tabulated format also:

| name | year | powertrain | sizes | amount | ... |
| --- | --- | --- | --- | --- | --- |
| converter mass | 2000 | PHEV-c-d | Mini | 4.5 | ... |
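A minimal sketch of how the tabulated form could be derived from one JSON record, assuming each record applies to every combination of its powertrain and sizes lists. The `explode` helper is hypothetical and not part of carculator; only a subset of the record's fields is carried over.

```python
from itertools import product

# One record, abbreviated from the JSON excerpt above.
record = {
    "name": "converter mass",
    "year": 2000,
    "powertrain": ["PHEV-c-d", "PHEV-e", "BEV", "FCEV", "PHEV-c-p"],
    "sizes": ["Mini", "Small", "Lower medium", "Medium", "Large",
              "Van", "Medium SUV", "Large SUV", "Micro"],
    "amount": 4.5,
}


def explode(rec):
    """Hypothetical helper: one flat row per (powertrain, size) combination."""
    return [
        {
            "name": rec["name"],
            "year": rec["year"],
            "powertrain": powertrain,
            "sizes": size,
            "amount": rec["amount"],
        }
        for powertrain, size in product(rec["powertrain"], rec["sizes"])
    ]


rows = explode(record)
print(len(rows))
print(rows[0])
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` to obtain the table.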

When the class VehicleInputParameters(NamedParameters) is instantiated, this data is converted into an xarray.DataArray with dimensions:

  * size        (size) <U12 'Large' 'Large SUV' 'Lower medium' ... 'Small' 'Van'
  * powertrain  (powertrain) <U8 'BEV' 'FCEV' 'HEV-d' ... 'PHEV-e' 'PHEV-p'
  * parameter   (parameter) <U64 '1-Pentene direct emissions, rural' ... 'tra...
  * year        (year) int64 2000 2010 2020 2030 2040 2050
  * value       (value) int64 0

(where value is relevant to stochastic calculations only).

This array is then used further on in the model library to perform calculations.

From what I can see, data is extracted from the DataArray using the slicing methods.

...so why can't we do this with a Pandas DataFrame?
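As a starting point for that comparison, here is a minimal sketch of label-based slicing on a pandas Series with a MultiIndex, which plays the role of `DataArray.sel(powertrain="BEV", year=2000)` in xarray. The coordinate values are a small subset of the dimensions listed above and the numbers are placeholders, not model data.

```python
import pandas as pd

# Long-format stand-in for the DataArray: one index level per dimension.
index = pd.MultiIndex.from_product(
    [["Mini", "Van"], ["BEV", "FCEV"], ["converter mass"], [2000, 2010]],
    names=["size", "powertrain", "parameter", "year"],
)
# Placeholder values 0.0 .. 7.0, in the order of the index above.
values = pd.Series(range(len(index)), index=index, dtype="float64", name="value")

# Fix two dimensions by label; the remaining levels (size, parameter) are kept.
bev_2000 = values.xs(("BEV", 2000), level=["powertrain", "year"])

# Fix every dimension to obtain a single scalar.
single = values.loc[("Mini", "BEV", "converter mass", 2010)]

print(bev_2000)
print(single)
```

Selection by label therefore works in both libraries. What this sketch does not cover is arithmetic that broadcasts across named dimensions, which is the part that would need a closer look before replacing the DataArray.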

@iamsiddhantsahu

iamsiddhantsahu commented Jan 23, 2024

I wrote a script (b218c47) to compare the memory usage and slicing times of Pandas and XArray with sparse arrays.

Here is the plot:
(image: pandas-vs-xarray)

Findings:

  1. Memory -- Pandas is more memory-efficient for sparse arrays than XArray, possibly because of pandas' built-in sparse data structures.
  2. Slicing time -- XArray and Pandas seem to have roughly the same slicing time.
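For reference, a minimal sketch of how sparse storage can be checked in current pandas, where it is provided through `SparseDtype` columns and the `.sparse` accessor (the `SparseDataFrame` class was removed in pandas 1.0). The table shape and the roughly 1% fill rate are arbitrary assumptions, not the settings used in the script.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed test data: 1000 x 20 table in which about 1% of the cells are populated.
data = np.full((1000, 20), np.nan)
mask = rng.random(data.shape) < 0.01
data[mask] = rng.random(int(mask.sum()))

dense_df = pd.DataFrame(data)
# Store only the non-NaN cells; NaN is the implicit fill value.
sparse_df = dense_df.astype(pd.SparseDtype("float64", np.nan))

dense_bytes = int(dense_df.memory_usage(deep=True).sum())
sparse_bytes = int(sparse_df.memory_usage(deep=True).sum())
density = sparse_df.sparse.density  # fraction of cells actually stored

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes (density {density:.3f})")
```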

@michaelweinold

Dataset size is a poor metric. The question was primarily about the number of dimensions, so you should have made that explicit in your plot. Comparing only the dataset size is not helpful.

Also, you still have to think about the utility of using an xr.DataArray vs a pd.DataFrame. Is there any operation (e.g. slicing) that we could not do with a DataFrame?

@iamsiddhantsahu

The pandas.DataFrame.to_xarray() method that you are using here converts a pandas.DataFrame object to an xarray.Dataset object. The xarray.Dataset object is a multi-dimensional, in-memory array database: a dict-like container of labeled arrays (DataArray objects) with aligned dimensions.

⚠️ The important point is that it is designed as an in-memory representation of the data model, which might be causing the memory issues.
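One plausible contributor to the memory gap (an assumption, not something measured in this thread) is densification: when the DataFrame has a MultiIndex, `to_xarray()` builds one array cell for every combination of index levels and fills the missing combinations with NaN. A minimal sketch with three made-up rows, which requires xarray to be installed:

```python
import pandas as pd

# Three populated rows out of 3 x 3 x 3 possible combinations (made-up values).
index = pd.MultiIndex.from_tuples(
    [("Mini", "BEV", 2000), ("Van", "FCEV", 2010), ("Large", "PHEV-e", 2020)],
    names=["size", "powertrain", "year"],
)
df = pd.DataFrame({"amount": [4.5, 5.0, 6.0]}, index=index)

ds = df.to_xarray()  # each index level becomes a dimension

n_rows = len(df)
n_cells = ds["amount"].size                      # cells held by the dense array
n_filled = int(ds["amount"].notnull().sum())     # cells that carry real data

print(f"DataFrame rows: {n_rows}")
print(f"Dataset cells:  {n_cells} ({n_filled} populated, rest NaN)")
```

With more dimensions, the number of empty cells grows with the product of the dimension lengths, so the gap widens quickly for sparse inputs.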

@iamsiddhantsahu

iamsiddhantsahu commented Jan 23, 2024

Here is a bar plot comparing the memory usage of a pandas.DataFrame object and an xarray.Dataset object.

Dataset 1 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 1)
Dataset 2 = create_sample_dataframe(df_size = 100, num_propulsion_classifications = 2)
Dataset 3 = create_sample_dataframe(df_size = 500, num_propulsion_classifications = 2)

Findings:

  • The pandas.DataFrame object is much more memory-efficient than the xarray.Dataset object. ⚠️ The y-axis is logarithmic: the pandas.DataFrame objects are so small that a log scale is required.
    (image: pandas-vs-xarray)
