
SLEP 8: Propagating feature names #18

Closed

Conversation

jorisvandenbossche (Member)

With much delay, I did a quick clean-up of the draft I wrote at the beginning of March, at the end of the sprint. So here is an initial version of the SLEP on propagating feature names through pipelines.

The PR implementing it is scikit-learn/scikit-learn#13307
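A minimal sketch of the proposed usage (the transformative `get_feature_names` taking `input_features`, and the `input_feature_names_` attribute, are this SLEP's proposal, not a released API):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(10, 3)
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

# Proposed: recursively compute each step's output names from the
# user-supplied input names ...
names = pipe.get_feature_names(input_features=["a", "b", "c"])
# ... and set input_feature_names_ on every step along the way.
pipe.named_steps["pca"].input_feature_names_
```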

gets recursively called on each step of a ``Pipeline`` so that the feature
names get propagated throughout the full ``Pipeline``. This will allow to
inspect the input and output feature names in each step of a ``Pipeline``.

Member

We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names. That's slightly different than recursively calling it to get the names once the pipeline has been fit.

Member

Yes, we should mention that, but maybe you can provide a suggestion for the motivation and implementation?

Member Author

We talked about having it propagated through the pipeline as the pipeline goes, so that in each step of the pipeline the model could potentially use those names.

That's maybe partly related to what I mentioned below in one of the questions about standalone estimators (not in a pipeline). If we want those to behave similarly, the fit method of the estimator needs to do something (at least, with the current proposal, calling the "update feature names" method). But if we actually let fit handle the feature name logic (needed for the above suggestion), that directly solves the issue of standalone vs. within-pipeline consistency.

potentially removing the need to have an explicit output feature names *getter
method*. The "update feature names" method would then mainly be used for
setting the input features and making sure they get propagated.

Member

+1 for having them [almost] everywhere.

standing estimators and Pipelines. However, the clear downside of this
consistency is that this would add one line to each ``fit`` method throughout
scikit-learn.

Member

there's also the option of having a fit which does some common tasks such as setting the feature names, and letting the child classes only implement _fit. It kinda goes along the lines of what's being done in scikit-learn/scikit-learn#13603
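A rough sketch of that pattern (illustrative only; the `_fit` hook here is hypothetical, in the spirit of scikit-learn/scikit-learn#13603):

```python
class BaseEstimatorSketch:
    def fit(self, X, y=None):
        # common bookkeeping done once for all estimators,
        # e.g. recording input feature names from a DataFrame
        if hasattr(X, "columns"):
            self.input_feature_names_ = list(X.columns)
        return self._fit(X, y)


class MyTransformer(BaseEstimatorSketch):
    def _fit(self, X, y=None):
        # child classes only implement the estimator-specific part
        return self
```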

@adrinjalali (Member)

There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).

@amueller (Member)

There was also the concern that the user may want to disable this propagation. (I think this SLEP hasn't addressed that case yet).

can you elaborate? I don't remember that part.

@adrinjalali (Member)

can you elaborate? I don't remember that part.

I think it was specifically in the context of NLP-related use cases where the whole "dictionary" becomes the features and it may be very memory-intensive to store them. IIRC @jnothman raised the concern.

@jnothman (Member)

jnothman commented Jun 3, 2019 via email

@amueller (Member) left a comment

I think the SLEP needs a list of use cases, in particular comparing the pandas and non-pandas ones and checking whether there are other relevant cases.
Do we ever actually change feature names that have been set? Maybe to simplify them?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The core idea of this proposal is that all transformers get a transformative
*"update feature names"* method that can determine the output feature names,
Member

Maybe say that the name is up for discussion?

The ``Pipeline`` and all meta-estimators implement this method by calling it
recursively on all its child steps / sub-estimators, and in this way the input
feature names get propagated through the full pipeline. In addition, it sets
the ``input_feature_names_`` attribute on each step of the pipeline.
Member

Maybe explain why that is necessary?

feature names get propagated through the full pipeline. In addition, it sets
the ``input_feature_names_`` attribute on each step of the pipeline.

A Pipeline calls the method at the end of ``fit()`` (using the DataFrame column
Member

As @adrinjalali says, there is also the possibility of setting them during fit.

- Transformers based on arbitrary functions


Should all estimators (so including regressors, classifiers, ...) have a "update feature names" method?
Member

I think this method is not properly motivated in the SLEP.
The use case is that X has no column names, right? We fitted a pipeline on a numpy array, we also have feature names, and now we want to get the output feature names.

It might make sense to distinguish the cases where X contains the feature names in the column and where it doesn't because in the first case everything can be automatic.


For consistency, we could also add them to *all* estimators.

For a regressor or classifier, the method could set the ``input_feature_names_``
Member

Why? You mean the output feature names, right?

Member Author

Regressors don't have output features?

But in general, I am not fully sure anymore what I was thinking here for this section. It all depends on what we decide about where the responsibility for setting the attributes lies (does the parent pipeline set the attributes and the "update feature names" method then look first at the attribute, or does the parent pipeline pass the names to the "update feature names" method, which then sets the attribute, or ...).

----------------------

This SLEP does not affect backward compatibility, as all described attributes
and methods would be new ones, not affecting existing ones.
Member

Well, if we reuse get_feature_names then we add a new parameter in some cases, but the old behavior still works.

with a set of custom input feature names that are not identical to the original
DataFrame column names, the stored column names to do validation and the stored
column names to propagate the feature names would get out of sync. Or should
calling ``get_feature_names`` also affect future validation in a ``predict()``
Member

I would vote for this option.

@amueller (Member)

There's no discussion of what the vectorizers do with their input feature names, if anything. Is that even allowed?

transformative "update feature names" method and calling it recursively in the
Pipeline setting the ``input_feature_names_`` attribute on each step.

1. Only implement the "update feature names" method and require the user to
Member

I'm not sure if this is the correct distinction but the main point is to always just operate on output feature names, never on input feature names, right?

Member Author

I don't fully understand this comment. What do you mean by "operating on output feature names"?

(This alternative of course depends on the idea of having such an "update feature names" method that does the work, but if we decide that it should actually happen in fit, that would change things.)

@amueller (Member)

amueller commented Jul 2, 2019

Coming back to this, I feel like I now favor a more invasive approach.
What I'm mostly thinking about right now is how feature names can enter an estimator. I feel that if they enter any way other than through fit, the user might call fit with a mismatching X and names, and the results will be inconsistent.
So there would be a benefit in providing feature names only via fit. If X is a dataframe, that's easy. If X is not a dataframe, I can see two options, both of which are very invasive:
a) add a feature_names_in parameter to fit (everywhere)
b) create an ndarray subclass that has a feature_names attribute that stores the feature names.

a) requires a lot of code changes but is pretty nice otherwise, while b) requires no code changes to the fit methods, but creates a new custom class, which could be pretty confusing.

If we always have the feature names in fit, we can also do the transformation in fit, and so the user never has to call any method.

I feel that it would be good if we could eliminate having a public method. That means if you need to change your feature names, you have to refit. An expert might use private methods to avoid this, but I think it's not that important a use case.
I think it's more important to ensure that feature names and actual computation are in sync.

Therefore my preferred solution right now:

  1. Ensure feature_names_in is available during fit
  2. Set feature_names_out at the end of fit
  3. have the pipeline pass the out from the previous step to the in from the next step
  4. profit

The main question is then how to implement 1) in the case of numpy arrays, and I think the three options are setting it beforehand, passing it in as an argument, and passing it in as a custom class.

Given that 2) requires touching every fit (of a transformer) anyway, maybe passing it in as an argument is easiest? I'm unsure about adding a custom class, but I don't really like the "set it before fit" that is currently implemented because it's very implicit and side-effect-y.
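A toy illustration of steps 1–3 (`feature_names_in` / `feature_names_out_` are the names from this comment, not an existing API):

```python
import numpy as np

class NamingTransformer:
    """Toy transformer following steps 1 and 2 above."""

    def fit(self, X, y=None, feature_names_in=None):
        # 1. feature names are available during fit
        if feature_names_in is None and hasattr(X, "columns"):
            feature_names_in = list(X.columns)
        self.feature_names_in_ = feature_names_in
        # ... estimator-specific fitting would happen here ...
        # 2. output names are set at the end of fit
        self.feature_names_out_ = feature_names_in  # identity for this toy
        return self

    def transform(self, X):
        return X

# 3. a pipeline would thread out -> in between consecutive steps
X, names = np.ones((5, 2)), ["a", "b"]
for step in [NamingTransformer(), NamingTransformer()]:
    X = step.fit(X, feature_names_in=names).transform(X)
    names = step.feature_names_out_
```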

@amueller (Member)

amueller commented Jul 2, 2019

The downside of passing the feature names into fit as a separate argument is that it requires special-casing passthrough in the pipeline, which is a mess :-/ but well...

@jnothman (Member)

jnothman commented Jul 2, 2019 via email

@amueller (Member)

amueller commented Jul 3, 2019

@jnothman I thought there was a concern that pandas might become a column store and wrapping and unwrapping become non-trivial operations? @jorisvandenbossche surely knows more.

That would be kind of "pandas in"-"pandas out" then, but only applied to X - which would allow a lot of things already.

Yes the passthrough is not actually a big deal, I was just annoyed I actually have to think about the code ;)

Not having sample-aligned fit parameters certainly breaks with tradition. If non-sklearn meta-estimators handle them as kwargs they should assume they are sample aligned, so this will break backward-compatibility.

Which might show that allowing kwargs wasn't / isn't a good way to do sample props?
In a way passing things through fit is basically "attaching properties to features/columns", only it looks to me as if the routing questions are much easier (or just different) than with sample props.

We could go even a step further and do feature_props={'names': feature_names} but for now that seems like unnecessary indirections.

I haven't finished writing the subclassing thing, but I think it's pretty trivial. We add a function make_named_array(X, feature_names) that creates an object of type NamedNdarray with a feature_names attribute; we expect the user to do that if they don't want to use pandas as input, and we wrap the output of transformers with it.
It's basically the same as pandas-in, pandas-out, only that we ensure there's really zero copy and it's future-proof as long as we rely on numpy.
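A minimal sketch of what that could look like (`NamedNdarray` / `make_named_array` are the names floated in this comment, not actual scikit-learn code):

```python
import numpy as np

class NamedNdarray(np.ndarray):
    def __array_finalize__(self, obj):
        # carry feature_names through views and slices
        self.feature_names = getattr(obj, "feature_names", None)

def make_named_array(X, feature_names):
    named = np.asarray(X).view(NamedNdarray)  # a view: zero copy
    named.feature_names = list(feature_names)
    return named

X = make_named_array(np.eye(3), ["a", "b", "c"])
X.feature_names  # ['a', 'b', 'c']
```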

BTW any of these designs get rid of the weird subestimator discovery I had to do, because now everything is done in fit, as it was meant to be ;)

@amueller (Member)

amueller commented Jul 3, 2019

Asked pandas: pandas-dev/pandas#27211

@amueller (Member)

amueller commented Jul 3, 2019

Also asked xarray:
pydata/xarray#3077

I think the answer from pandas is as I remembered: they might change to using 1D slices, and while conversion to pandas might be possible, coming back is not possible without a copy.

It seems a bit unnatural to me to produce DataArrays when someone passes in a DataFrame, but if DataFrame is not an option then we should consider it.

@amueller (Member)

amueller commented Jul 3, 2019

@jorisvandenbossche also brought up the option of natively supporting pandas in some estimators and not requiring conversion at all. That might be possible for many of the preprocessing transformers, but not for all. It's certainly desirable to avoid copies, but I don't think it'll provide a full solution to the feature names issue.
Also, I assume it will require a lot of code for many of the estimators as pandas dataframes and numpy arrays likely require different codepaths even for simple things like scaling.

tldr; not casting pandas to numpy when we don't have to would be nice, but it probably won't solve the feature name issue.

@adrinjalali (Member)

If we go down the path of having a NamedNdarray, then why not keep the sample props attached to the array as well? If we do that, it starts getting closer to xarray's implementation, so to me it seems like a better solution to just support/use xarray.DataArray. (They even have a Dataset object, BTW.)

@jnothman do you see major drawbacks to using an xarray.DataArray-like object to pass around sample props? I know it'd be tricky to handle the case where we want a scorer or an estimator to ignore a sample property that is nonetheless attached, but that can be handled by a meta-estimator removing those attributes before feeding the data to the estimator.

@amueller are you already working on a NamedNdarray like solution?

@amueller (Member)

amueller commented Jul 5, 2019

Stephan said in pydata/xarray#3077 that it's a core property of xarray to do zero copy for numeric dtypes, so I think it would be a feasible candidate.

@adrinjalali attaching sample props to X wouldn't solve the routing issues, right, as in where to use sample_weights etc? So there's definitely some benefit but the hard problems are not actually addressed :-/

I haven't started implementing a NamedNdarray solution, and I think right now I would prefer xarray.DataArray. I'm still not convinced that fit arguments are bad, though they have backward-compatibility issues as @jnothman pointed out.

Right now, ColumnTransformer works on xarray.DataArray if we call the column dimension columns. Is that something we would want to enforce? That would allow us to write similar code to support pandas and xarray, but it might be a bit odd from an xarray perspective? From an sklearn view it would probably make the most sense to call the two dimensions samples and features, but then we would need to do more duck-typing for xarray vs pandas.
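For concreteness, this is the naming convention in question (plain xarray, no scikit-learn assumptions; whether ColumnTransformer accepts it is the claim in the comment above, tied to the version at the time):

```python
import numpy as np
import xarray as xr

# call the second dimension "columns" so the DataArray duck-types
# a DataFrame where it matters
X = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("samples", "columns"),
    coords={"columns": ["a", "b", "c"]},
)
X.columns        # the column labels, like df.columns
X.loc[:, ["a"]]  # label-based column selection, like df.loc
```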

@GaelVaroquaux (Member)

GaelVaroquaux commented Oct 27, 2019 via email

@shoyer

shoyer commented Oct 27, 2019

Alternative: create a scikit-learn 1.0 beta with feature names using duck arrays relying on numpy 1.7 ;)
Late to the party (catching up): no problem on the numpy 1.7 requirement.

I think the requirement is actually NumPy 1.17 for __array_function__?

@GaelVaroquaux (Member)

GaelVaroquaux commented Oct 27, 2019 via email

@adrinjalali (Member)

I'm not sure how using __array_function__ would hamper people's access to the numerics, @GaelVaroquaux. People can still rely on an older sklearn if they really want to use an old numpy. Besides, if we're talking about v1.0, we should be able to think about such changes.

@jorisvandenbossche (Member Author)

What's the status of this SLEP? Is it blocked by #22?

I don't think it is directly blocked by #22 (n_features_in_); there are a few overlapping aspects, e.g. regarding naming, but I think those don't touch the essential discussion here.

I think the status is that we should update the SLEP with the alternative being discussed above (or, in case that is preferred, write an alternative SLEP for it, but personally I think it would be good to describe the possible options in this SLEP).

@adrinjalali (Member)

I think the alternative which we mostly agree with is the one proposed in scikit-learn/scikit-learn#14315. We either need a new SLEP, or to update this one to reflect the ideas there.

@GaelVaroquaux (Member)

GaelVaroquaux commented Oct 28, 2019 via email

@amueller (Member)

amueller commented Nov 6, 2019

@GaelVaroquaux My idea was to have a config option that explicitly enables the new behavior and fails with older numpy (i.e. the config option would have a soft dependency on numpy 1.17).
I would definitely not trigger an update.

If you have an idea for how to implement something like named arrays without numpy 1.17, I think we're all ears. The alternative would be to implement feature names via a different mechanism.

@amueller (Member)

amueller commented Nov 6, 2019

@hermidalc I agree that more metadata would be nice, but actually we haven't even solved this problem with **fit_params (see #16). Adding additional metadata later will probably be relatively straightforward if/when we agree on a mechanism.
What do you do with the metadata for PolynomialFeatures or PCA?

@amueller (Member)

amueller commented Nov 6, 2019

I would suggest we write a separate new SLEP on the NamedArray. @adrinjalali do you want to do that? You could reuse parts of this one if appropriate.

@GaelVaroquaux (Member)

GaelVaroquaux commented Nov 6, 2019 via email

@adrinjalali (Member)

To be clear, the current implementation of NamedArray only requires numpy 1.13, which is what we'll have once we bump our Python support to 3.6 anyway. So that's not a big issue. However, if we want to include more features in the NamedArray, as @thomasjpfan has also suggested, then we'll need 1.17.

@amueller yes I'll write up a new SLEP.

@lorentzenchr (Member)

Superseded by the acceptance of SLEP007 in #59. (Correct me if I'm wrong.)

@adrinjalali (Member)

This SLEP is about how to propagate feature names, whereas SLEP007 is about how to generate them. @thomasjpfan might have something which would supersede this.

@adrinjalali adrinjalali reopened this May 6, 2022
@thomasjpfan (Member)

I think SLEP007 covers both generation and propagation and supersedes this SLEP. In SLEP007's abstract it states:

We here discuss the generation of such attributes and their propagation through pipelines.

Also the feature name propagation described in SLEP007 is consistent with the implementation on main.

@adrinjalali (Member)

We still don't really have a way to propagate feature names during fit and transform, and SLEP007 does not talk about that.

@thomasjpfan (Member)

This SLEP does not propose propagating names during fit or transform. This SLEP's abstract states:

This SLEP proposes to add a transformative get_feature_names() method that
gets recursively called on each step of a Pipeline so that the feature
names get propagated throughout the full Pipeline. This will allow to
inspect the input and output feature names in each step of a Pipeline.

@adrinjalali (Member)

It's been discussed in the conversation, though maybe not addressed, e.g.: #18 (comment)

@lorentzenchr (Member)

I thought I'd do some spring cleaning 🧹. From a quick read over the text, I did not notice a difference from the now (v1.1) implemented get_feature_names_out. If this SLEP is about making feature names available during fit time, that would be great, but that should be better reflected in the actual text (not only in GitHub comments).

Maybe we can discuss our SLEP plans during the next dev meeting?

@ogrisel (Member)

ogrisel commented May 30, 2022

SLEP 18 with pandas out will be a partial solution to the problem of propagating feature names in pipelines at fit time, when all transformers output dense values that naturally fit in a pandas dataframe.

How to propagate feature names for transformers that typically output sparse matrices (e.g. OneHotEncoder, KBinsDiscretizer, PolynomialFeatures, maybe SplineTransformer in the future...) was not fully resolved at the meeting and was left as a next step.
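For context, the kind of usage SLEP 18 proposes (`set_output` is the API named in that SLEP; it was not yet released at the time of this comment):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
scaler = StandardScaler().set_output(transform="pandas")
scaler.fit_transform(df)  # a DataFrame whose columns are still "a", "b"
```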

@adrinjalali (Member)

Happy to have SLEP18 instead of this one then.

@adrinjalali adrinjalali closed this Jun 1, 2022
@hermidalc

hermidalc commented Jun 4, 2022

SLEP 18 with pandas out will be a partial solution to the problem of propagating feature names in pipelines at fit time, when all transformers output dense values that naturally fit in a pandas dataframe.

Apologies, I've been out of the loop for a while regarding SLEP enhancements; I thought this was implemented in the current release? So you cannot get feature names out at the end of a Pipeline and access them at each step during fit?

@adrinjalali (Member)

Right now you can only access the feature names after you fit a Pipeline, outside the fit/transform methods. We're working towards making them available during fit as well.
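Concretely, with scikit-learn >= 1.1 something like this works after fitting, but the names are not visible to the steps while `fit` is running:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X = np.arange(10.0).reshape(5, 2)
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2)).fit(X)
pipe.get_feature_names_out(["a", "b"])
# e.g. ['1', 'a', 'b', 'a^2', 'a b', 'b^2']
```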
