Example 08_join_aggregation broken #1005

Vincent-Maladiere · 2024-07-16T16:17:22Z

Describe the issue linked to the documentation

On both the stable and dev versions of the doc, in the "Hyper-parameters tuning and cross validation" section of the 08_join_aggregation example, the scores reported by GridSearchCV are all nan.

This is due to the estimator failing during the CV.

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- index__skrub_f3c63e8c__
Feature names seen at fit time, yet now missing:
- index__skrub_634644a2__

It appears that a column name changes between fit and predict in one of the pipeline's preprocessing steps, which makes the _check_feature_names validation fail.
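To make the error message easier to read, here is a minimal sketch of the kind of comparison that produces it. The function `check_feature_names` below is a hypothetical, simplified stand-in written for illustration, not scikit-learn's actual implementation, and the columns `a` and `b` are made up; only the two `index__skrub_...` names come from the traceback above.

```python
def check_feature_names(fitted_names, incoming_names):
    """Simplified stand-in for a fit/predict feature-name check (hypothetical)."""
    unseen = sorted(set(incoming_names) - set(fitted_names))
    missing = sorted(set(fitted_names) - set(incoming_names))
    if unseen or missing:
        lines = ["The feature names should match those that were passed during fit."]
        if unseen:
            lines.append("Feature names unseen at fit time:")
            lines += [f"- {name}" for name in unseen]
        if missing:
            lines.append("Feature names seen at fit time, yet now missing:")
            lines += [f"- {name}" for name in missing]
        raise ValueError("\n".join(lines))


# Column names as the final estimator might see them (a and b are illustrative).
fit_columns = ["a", "b", "index__skrub_634644a2__"]
predict_columns = ["a", "b", "index__skrub_f3c63e8c__"]

try:
    check_feature_names(fit_columns, predict_columns)
except ValueError as exc:
    message = str(exc)
    print(message)
```

Because the random suffix differs between the two calls, the same logical column is reported once as "unseen" and once as "missing".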

Suggest a potential alternative/fix

I need more time to understand which preprocessor led to this column name change.


jeromedockes · 2024-07-16T16:45:13Z

Here is a minimal reproducer:

>>> import pandas as pd

>>> from skrub import AggTarget

>>> df = pd.DataFrame(dict(a=[1, 1, 2, 2], b=[10, 20, 30, 40]))
>>> y = pd.Series([1, 2, 3, 4], name="b")

>>> transformer_1 = AggTarget(main_key='a', operation='mean')
>>> transformer_2 = AggTarget(main_key='a', operation='mean')
>>> out = transformer_1.fit_transform(df, y)
>>> out = transformer_2.fit_transform(out, y)
>>> out
   a   b  b_mean_target  b_mean_target__skrub_38870b21__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

Note that the name of the last column changes on every call to transform:

>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_6f32724b__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5


>>> transformer_2.transform(transformer_1.transform(df))
   a   b  b_mean_target  b_mean_target__skrub_585a8372__
0  1  10            1.5                              1.5
1  1  20            1.5                              1.5
2  2  30            3.5                              3.5
3  2  40            3.5                              3.5

jeromedockes · 2024-07-16T16:51:07Z

This is due to AggTarget calling left_join in transform with duplicate column names in the dataframe it joins:

skrub/skrub/_agg_joiner.py

Line 442 in 7341c66

return _join_utils.left_join(

left_join is a stateless low-level function and it appends a random string to duplicated column names to avoid errors. To avoid the name changing at each transform, AggTarget should either forbid duplicate column names and raise an error (as the Joiner does) or store the column names during fit_transform and apply them at the end of transform.

We should also check whether the AggJoiner has the same issue.
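The two options above can be sketched on plain lists of column names. This is only an illustration under stated assumptions: `dedup_stateless`, `StoredNames` and `forbid_duplicates` are hypothetical names, and the suffix format merely imitates the `__skrub_<hex>__` pattern seen in the outputs; none of this is skrub's actual code.

```python
import uuid


def dedup_stateless(names):
    """Rename repeated names with a fresh random suffix on every call."""
    seen, out = set(), []
    for name in names:
        if name in seen:
            # A new random string each time, so the result is not reproducible.
            name = f"{name}__skrub_{uuid.uuid4().hex[:8]}__"
        seen.add(name)
        out.append(name)
    return out


def forbid_duplicates(names):
    """Option 1: refuse duplicated names instead of renaming them."""
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")
    return list(names)


class StoredNames:
    """Option 2: remember the names produced during fit_transform."""

    def fit_transform(self, names):
        self.output_names_ = dedup_stateless(names)
        return list(self.output_names_)

    def transform(self, names):
        if len(names) != len(self.output_names_):
            raise ValueError("unexpected number of columns")
        # Reuse the names chosen at fit time instead of drawing new suffixes.
        return list(self.output_names_)


columns = ["a", "b", "b_mean_target", "b_mean_target"]

# Stateless deduplication: the last name very likely differs between calls.
first = dedup_stateless(columns)
second = dedup_stateless(columns)
print(first[-1], second[-1])

# Stored names: transform returns exactly what fit_transform produced.
stored = StoredNames()
fitted = stored.fit_transform(columns)
print(stored.transform(columns) == fitted)
```

Option 1 fails early and leaves renaming to the user; option 2 keeps the current behavior but makes the output names stable across calls to transform.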

jeromedockes · 2024-07-16T17:06:20Z

skrub._dataframe._pandas.aggregate seems to be the function that inserts the "index" column in what becomes AggTarget.y_, probably due to drop=False here
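For context, here is how an "index" column can appear in pandas. This sketch assumes the drop=False mentioned above is the argument of `DataFrame.reset_index`; the thread does not show the actual call, so treat this as an illustration of the general pandas behavior rather than of the skrub code. The column `b` and its values are made up.

```python
import pandas as pd

# A frame with a default, unnamed index.
df = pd.DataFrame({"b": [1.5, 3.5]})

# drop=False keeps the old index as a regular column, named "index" by default.
kept = df.reset_index(drop=False)

# drop=True discards the old index instead.
dropped = df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

If such an "index" column ends up in both tables being joined, it is a duplicate name and gets a random suffix like the one in the traceback.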

TheooJ · 2024-07-17T11:14:50Z

Thanks for pointing this out @Vincent-Maladiere. Just checked: we have the same issue in the AggJoiner, as both use _join_utils.left_join but neither forbids duplicate column names nor stores the deduplicated names.

>>> import pandas as pd
>>> from skrub import AggJoiner
>>> main = pd.DataFrame({
...     "airportId": [1, 2],
...     "airportName": ["Paris CDG", "NY JFK"],
... })
>>> aux = pd.DataFrame({
...     "flightId": range(1, 7),
...     "airportId": [1, 1, 1, 2, 2, 2],
...     "total_passengers": [90, 120, 100, 70, 80, 90],
... })
>>> agg_joiner_1 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> agg_joiner_2 = AggJoiner(
...     aux_table=aux,
...     key="airportId",
...     cols=["total_passengers"],
...     operations=["mean"],
... )

>>> out = agg_joiner_1.fit_transform(main)
>>> out = agg_joiner_2.fit_transform(out)
>>> out
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_3c2fc647__  
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000

>>> agg_joiner_2.transform(agg_joiner_1.transform(main))
   airportId  airportName  total_passengers_mean  total_passengers_mean__skrub_18f488d3__
0          1    Paris CDG             103.333333                               103.333333  
1          2       NY JFK              80.000000                                80.000000

Vincent-Maladiere added the "documentation" and "no changelog needed" labels Jul 16, 2024

jeromedockes added the "bug" label and removed the "documentation" and "no changelog needed" labels Jul 16, 2024

jeromedockes mentioned this issue Jul 24, 2024

ensure aggjoiner & aggtarget have consistent output names #1013

Merged

TheooJ closed this as completed in #1013 Jul 25, 2024

jeromedockes mentioned this issue Jul 29, 2024

aggtarget with small dataframe #1018

Open
