
Version 0.7.0 breaks pyam test #50

Closed
phackstock opened this issue Feb 28, 2024 · 14 comments · Fixed by #53
@phackstock
Contributor

As I was working on pyam yesterday (IAMconsortium/pyam#818) I noticed that ixmp4 0.7.0 broke the test test_ixmp4_integration[test_df0] from tests/test_ixmp4.py (https://github.com/IAMconsortium/pyam/actions/runs/8067457546/job/22037916444) with the following error:

...
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (string, double)

pyarrow/error.pxi:91: ArrowNotImplementedError

Looks like it's got something to do with pyarrow.
Reverting to ixmp4 version 0.6.0 fixed the test.
I just talked to @meksor and he said that you, @glatterf42, would be the best person to take a look.

FYI @danielhuppmann.

@glatterf42
Member

I'm not sure this is ixmp4's fault. My interpretation is that

/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/ixmp4/data/db/meta/repository.py:194: in bulk_upsert
    super().bulk_upsert(type_df)
/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/ixmp4/data/db/base.py:376: in bulk_upsert
    self.bulk_upsert_chunk(df)
/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/site-packages/ixmp4/data/db/base.py:394: in bulk_upsert_chunk
    cond.append(df[col] != df[updated_col])

and

 E   pyarrow.lib.ArrowNotImplementedError: Function 'not_equal' has no kernel matching input types (string, double)

pyarrow/error.pxi:91: ArrowNotImplementedError

indicate that df[col] and df[updated_col] don't have compatible types. This could be on ixmp4 for trying to update the wrong column (which seems odd) or it could be on the pyam test setup for trying to pass an incompatible type.
I'll have to install pyam and run the tests myself to inspect test_df more closely, I think.

@glatterf42
Member

I've tracked the error down for the most part, but I'm not sure how to resolve it yet. As far as I can tell, the following is happening:
In self.backend.meta.bulk_upsert(df), we have this dataframe:

      key value  run__id
0  number     1        1
1  string   foo        1

This survives until null_cols = set(RunMetaEntry._column_map.values()) - set([col]):

       key value  run__id      type
0  number     1        1  Type.INT
1  string   foo        1  Type.STR

But then, we call bulk_upsert() individually for each type. For the first type, this is not an issue, but for the second type, we already have an existing_df looking like this:

    run__id     key type  value_int value_str value_float value_bool  id
0        1  number  INT          1      None        None       None   1

here:

df = self.merge_existing(df, existing_df)

And for some reason, the comparison then fails: we want to insert a value of type string (according to pandas and pyarrow), but existing_df already holds a value of type float64 in that column, presumably because None was converted to that type.
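That float64 plausibly comes from the merge itself: when a left/outer merge leaves some rows unmatched, pandas fills int64 columns with NaN and silently upcasts them to float64. A minimal sketch with hypothetical column names mirroring the ones above:

```python
import pandas as pd

# The unmatched "string" row gets NaN in value_int and id, so both int64
# columns are upcast to float64 in the merge result.
new = pd.DataFrame({"key": ["number", "string"]})
existing = pd.DataFrame({"key": ["number"], "value_int": [1], "id": [1]})
merged = new.merge(existing, on="key", how="left")
print(merged["value_int"].dtype)  # float64
```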

@meksor, @danielhuppmann, if you have experience with this or immediately know what to do, please jump in here. Otherwise, I'll find a fix tomorrow.

@phackstock
Contributor Author

Thanks a lot for the detective work @glatterf42.

@meksor
Contributor

meksor commented Feb 28, 2024

What /exactly/ is pyam trying to pass to ixmp4 as meta values?

@danielhuppmann
Member

A dataframe like this converted to a dict like this.

@meksor
Contributor

meksor commented Feb 28, 2024

OK so that becomes a dict like: {"model": "model_a", "scenario": "scen_a", "number": 1, "string": "foo"}
Inserting that into the ixmp4 tests, everything passes... ???

@danielhuppmann
Member

And the pyam-ixmp4-test passed last week - so it must be either the pandas-update yesterday or some ixmp4 change since v0.6...

@meksor
Contributor

meksor commented Feb 28, 2024

I'm running it locally with pandas 2.2.1 and the newest ixmp4 version...

@meksor
Contributor

meksor commented Feb 28, 2024

Ok, update: if I install pyarrow /alongside/ ixmp4, the tests in ixmp4 also fail... Seems pandas uses pyarrow if it's available, breaking this test....

@meksor
Contributor

meksor commented Feb 28, 2024

OK, so pandas version 1.5.3 still works. It seems pandas version >2 changes its behaviour if pyarrow is installed. Converting the columns to each other's types just yields another pyarrow error.
I would suggest reverting the update to pandas 2; it seems a bunch of stuff broke...

@danielhuppmann
Member

We bumped pyam to depend on pandas >= 2.0 a while ago to take advantage of the improved speed and API, so pinning pandas<2 isn't really an option...

@glatterf42
Member

I've added some auxiliary output to bulk_upsert_chunk() like this:

    def bulk_upsert_chunk(self, df: pd.DataFrame) -> None:
        columns = db.utils.get_columns(self.model_class)
        df = df[list(set(columns.keys()) & set(df.columns))]
        existing_df = self.tabulate_existing(df)
        print(f"df's dtypes: \n {df.dtypes}")
        if existing_df.empty:
            self.bulk_insert(df)
        else:
            df = self.merge_existing(df, existing_df)
            df["exists"] = np.where(pd.notnull(df["id"]), True, False)
            print(f"existing df \n {existing_df}")
            print(f"existing df's dtypes: \n {existing_df.dtypes}")
            print(f"new df's dtypes: \n {df.dtypes}")

And the corresponding output shows this:

df's dtypes: 
 value_bool     object
value_float    object
value_str      object
key            object
run__id         int64
value_int      object
type           object
dtype: object
existing df 
    run__id     key type  value_int value_str value_float value_bool  id
0        1  number  INT          1      None        None       None   1
existing df's dtypes: 
 run__id         int64
key            object
type           object
value_int       int64
value_str      object
value_float    object
value_bool     object
id              int64
dtype: object
new df's dtypes: 
 run__id                    int64
value_bool       string[pyarrow]
value_float      string[pyarrow]
value_str        string[pyarrow]
key              string[pyarrow]
value_int        string[pyarrow]
type             string[pyarrow]
type_y           string[pyarrow]
value_int_y              float64
value_str_y      string[pyarrow]
value_float_y    string[pyarrow]
value_bool_y     string[pyarrow]
id                       float64
exists                      bool
dtype: object

So it looks like one of these

            df = self.merge_existing(df, existing_df)
            df["exists"] = np.where(pd.notnull(df["id"]), True, False)

is responsible for the data conversion to unexpected types.

@glatterf42
Member

It's happening in df = self.merge_existing(df, existing_df) already, which makes me think this could be related to dask instead of pandas.

dask/dask#10631 might be related.
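If dask is indeed the culprit, one possible workaround (an assumption, not a confirmed fix: recent dask versions convert object columns to string[pyarrow] automatically when pyarrow is installed, controlled by a config flag) would be to disable that conversion:

```python
import dask

# Assumption: the "dataframe.convert-string" flag exists in the installed
# dask version; setting it to False keeps object columns as plain object
# dtype instead of converting them to string[pyarrow].
dask.config.set({"dataframe.convert-string": False})
```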

@glatterf42 glatterf42 changed the title ixmp 0.7.0 breaks pyam test Version 0.7.0 breaks pyam test Feb 29, 2024
@meksor
Contributor

meksor commented Feb 29, 2024
