Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example 08_join_aggregation broken #1005

Closed
Vincent-Maladiere opened this issue Jul 16, 2024 · 4 comments · Fixed by #1013
Closed

Example 08_join_aggregation broken #1005

Vincent-Maladiere opened this issue Jul 16, 2024 · 4 comments · Fixed by #1013
Labels
bug Something isn't working

Comments

@Vincent-Maladiere
Copy link
Member

Describe the issue linked to the documentation

On the stable and dev version of the doc, in the "Hyper-parameters tuning and cross validation" section of the 08_join_aggregation example, the GridSearchCV outputs scores are all nan.

This is due to the estimator failing during the CV.

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- index__skrub_f3c63e8c__
Feature names seen at fit time, yet now missing:
- index__skrub_634644a2__

It appears that a column name change during one of the preprocessing steps of the pipeline between train and predict breaks the _check_feature_names .

Suggest a potential alternative/fix

I need more time to understand which preprocessor led to this column name change.

@Vincent-Maladiere Vincent-Maladiere added documentation Add or improve the documentation no changelog needed labels Jul 16, 2024
@jeromedockes
Copy link
Member

Here is a minimal reproducer:

>>> import pandas as pd

>>> from skrub import AggTarget

>>> df = pd.DataFrame(dict(a=[1, 1, 2, 2], b=[10, 20, 30, 40]))
>>> y = pd.Series([1, 2, 3, 4], name="b")

>>> transformer_1 = AggTarget(main_key='a', operation='mean')
>>> transformer_2 = AggTarget(main_key='a', operation='mean')
>>> out = transformer_1.fit_transform(df, y)
>>> out = transformer_2.fit_transform(out, y)
>>> out
   a   b  b_mean_target  b_mean_target__skrub_38870b21__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

Note the name of the last column changes

>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_6f32724b__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5


>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_585a8372__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

@jeromedockes jeromedockes added bug Something isn't working and removed documentation Add or improve the documentation no changelog needed labels Jul 16, 2024
@jeromedockes
Copy link
Member

jeromedockes commented Jul 16, 2024

This is due to AggTarget calling left_join in transform with duplicate column names in the dataframe it joins:

return _join_utils.left_join(

left_join is a stateless low-level function and it appends a random string to duplicated column names to avoid errors. To avoid the name changing at each transform, AggTarget should either forbid duplicate column names and raise an error (as the Joiner does) or store the column names during fit_transform and apply them at the end of transform.

we should also check if the AggJoiner has the same issue

@jeromedockes
Copy link
Member

jeromedockes commented Jul 16, 2024

skrub._dataframe._pandas.aggregate seems to be the function that inserts the "index" column in what becomes AggTarget.y_, probably due to drop=False here

@TheooJ
Copy link
Contributor

TheooJ commented Jul 17, 2024

Thanks for pointing this out @Vincent-Maladiere. Just checked -- we have the same issue in the AggJoiner, as both are using _join_utils.left_join but dont forbid duplicate column names or store deduplicated names

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "airportId": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
... })
>>> agg_joiner_1 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> agg_joiner_2 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> out = agg_joiner_1.fit_transform(main)
>>> out = agg_joiner_2.fit_transform(out)
>>> out
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_3c2fc647__  
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000

>>> agg_joiner_2.transform(agg_joiner_1.transform(main))
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_18f488d3__
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants