Add column-wise transforms & refactor TableVectorizer #902
Conversation
Force-pushed from f8636a4 to 9f16bff
This reverts commit 87e5a3c.
Third pass, thank you @jeromedockes !
Co-authored-by: Théo Jolivet <57430673+TheooJ@users.noreply.github.com>
thanks @TheooJ
I'll make a pass now.
Just ignore my comments about the example style. I think we should address that in another PR.
Suggested change:
```diff
 #
-# Let's first retrieve the dataset:
+# Let's first retrieve the dataset, using one of the downloaders from the :mod:`skrub.datasets` module.
```
Do you want to make the example black compliant now (less than 88 characters per line), or make an automatic pass of the tool in another PR?
```python
###############################################################################
# A simple prediction pipeline
# ----------------------------
# Easily encoding a dataframe
```
If we change the example, I would probably use the `# %%` delimiter nowadays.
Suggested change:
```diff
-###############################################################################
-# A simple prediction pipeline
-# ----------------------------
-# Easily encoding a dataframe
+# %%
+# Easily encoding a dataframe
```
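(Background, for readers of this thread: `# %%` is the cell-delimiter syntax that sphinx-gallery supports as an alternative to the long `#####` comment separators, and editors such as VS Code and Spyder also recognize it as a runnable code cell, which is presumably why it is preferred "nowadays".)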
```python
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y
employees

###############################################################################
```
Suggested change:
```diff
-###############################################################################
+# %%
```
```python
X = dataset.X
y = dataset.y
###############################################################################
```
Suggested change:
```diff
-###############################################################################
+# %%
```
```python
###############################################################################
# We observe diverse columns in the dataset:
# - binary (``'gender'``),
# - numerical (``'employee_annual_salary'``),
# - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
# - datetime (``'date_first_hired'``)
# - dirty categorical (``'employee_position_title'``, ``'division'``).
#
# Using skrub's |TableVectorizer|, we can now already build a machine-learning
# pipeline and train it:
# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
# features. Most of them are one-hot encoded representations of the categorical
# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
# ``'gender_nan'`` were created to encode the ``'gender'`` column.

###############################################################################
# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:

from sklearn.ensemble import HistGradientBoostingRegressor
```
Suggested change:
```diff
-###############################################################################
-# We observe diverse columns in the dataset:
-# - binary (``'gender'``),
-# - numerical (``'employee_annual_salary'``),
-# - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
-# - datetime (``'date_first_hired'``)
-# - dirty categorical (``'employee_position_title'``, ``'division'``).
-#
-# Using skrub's |TableVectorizer|, we can now already build a machine-learning
-# pipeline and train it:
-# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
-# features. Most of them are one-hot encoded representations of the categorical
-# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
-# ``'gender_nan'`` were created to encode the ``'gender'`` column.
-###############################################################################
-# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
-from sklearn.ensemble import HistGradientBoostingRegressor
+# %%
+# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
+# features. Most of them are one-hot encoded representations of the categorical
+# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
+# ``'gender_nan'`` were created to encode the ``'gender'`` column.
+#
+# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
+from sklearn.ensemble import HistGradientBoostingRegressor
```
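For readers following the thread, here is a minimal, self-contained sketch of the pipeline this example text walks through, assuming skrub's public API (`TableVectorizer`, `fetch_employee_salaries`); the feature count is illustrative and depends on the dataset and library version:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

from skrub import TableVectorizer
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
employees, salaries = dataset.X, dataset.y

# TableVectorizer turns the heterogeneous dataframe (binary, numeric,
# categorical, datetime, dirty-categorical columns) into numeric features.
vectorizer = TableVectorizer()
features = vectorizer.fit_transform(employees)
print(features.shape)  # e.g. 143 features extracted from the 8 columns
print(vectorizer.get_feature_names_out()[:5])

# The same vectorizer drops straight into a prediction pipeline.
pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(employees, salaries)
```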
```python
###############################################################################
# The simple pipeline applied on this complex dataset gave us very good results.
# We can see that this new pipeline achieves a similar score but is fitted much faster.
# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).

###############################################################################
```
Suggested change:
```diff
-###############################################################################
-# The simple pipeline applied on this complex dataset gave us very good results.
-# We can see that this new pipeline achieves a similar score but is fitted much faster.
-# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
+# %%
+# We can see that this new pipeline achieves a similar score but is fitted much faster.
+# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
+#
```
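The encoder swap the example text describes can be sketched as follows; note that the keyword for the high-cardinality slot has changed across skrub versions (`high_cardinality` in recent releases), so treat the argument name as an assumption:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

from skrub import MinHashEncoder, TableVectorizer

# MinHashEncoder fits much faster than the default GapEncoder on dirty
# categorical columns, at the cost of less interpretable features.
vectorizer = TableVectorizer(high_cardinality=MinHashEncoder())
pipeline = make_pipeline(vectorizer, HistGradientBoostingRegressor())
```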
```diff
-pipeline = make_pipeline(TableVectorizer(), regressor)
-pipeline.fit(X, y)
+pipeline = make_pipeline(vectorizer, regressor)
+pipeline.fit(employees, salaries)
 
 ###############################################################################
```
Suggested change:
```diff
-###############################################################################
+# %%
```
```diff
 remainder="drop",
 )
 
 X_enc = encoder.fit_transform(X)
-pprint(encoder.get_feature_names_out())
+# pprint(encoder.get_feature_names_out())
```
We should remove it then.
```diff
@@ -85,6 +85,7 @@ def cols(*columns):
 >>> s.all() & ['kind', 'ID']
 (all() & cols('kind', 'ID'))
 
+# noqa
```
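For context, the selector algebra exercised by this doctest works like the following; a small sketch assuming the `skrub.selectors` module introduced by this refactor:

```python
from skrub import selectors as s

# Combining a selector with a plain list of column names promotes the
# list to cols(...), so the two spellings below are equivalent:
sel = s.all() & ["kind", "ID"]
print(sel)  # (all() & cols('kind', 'ID'))

sel2 = s.all() & s.cols("kind", "ID")
```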
What is the reason for the `noqa`?
```diff
 Here we can see the input to ``transform`` has been converted back to the
 timezone used during ``fit`` and that we get the same result for "hour".
 
+# noqa
```
OK, so this is to avoid the check on the docstring. I assume that we can clean it up afterwards.
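The timezone behaviour that docstring describes can be illustrated with plain pandas (this shows only the underlying idea, not skrub's API):

```python
import pandas as pd

fit_tz = "Europe/Paris"
s_fit = pd.to_datetime(pd.Series(["2024-01-01 10:00"])).dt.tz_localize(fit_tz)

# Input to transform arrives expressed in a different timezone...
s_transform = s_fit.dt.tz_convert("America/New_York")

# ...but converting it back to the timezone used during fit recovers the
# same wall-clock values, so the extracted "hour" feature is unchanged.
assert (s_transform.dt.tz_convert(fit_tz).dt.hour == s_fit.dt.hour).all()
```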
So this is actually looking good.
Oops, so sorry @glemaitre, I should have said so, but I think @GaelVaroquaux was planning to review it as well... @GaelVaroquaux, LMK if you want to review; maybe the easiest way would be to revert the merge commit and open a new PR to un-revert.
Thanks a lot for the review, @glemaitre!
No, no, it's good to have merged. I can give feedback via issues. Hurray for merge. Thanks a lot to everyone involved!!
Thanks @GaelVaroquaux. We will address the subsequent issues. Let's roll ;)
OK, thanks. There will be a few follow-up PRs in any case; @TheooJ and I are going to open a couple of issues.
closes #874, #886, #894, #877, #848, #904, #905, #830, #626, #870
This is the last part of the changes outlined in #877 (the first two parts have been merged in #895 and #888).
The main addition is `OnEachColumn`, a transformer that applies a transformation independently to each column in a dataframe. It is used to refactor the TableVectorizer and to ensure it performs consistent operations across calls to `transform`.
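A minimal sketch of the idea behind `OnEachColumn` (the class itself lives in skrub's internals; the names and signatures below are illustrative, not skrub's public API): fit one clone of a single-column transformer per column, and reuse the fitted clones in `transform`, which is what keeps the operations consistent across calls:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ApplyPerColumn(TransformerMixin, BaseEstimator):
    """Apply a single-column transformer independently to each column.

    Assumes ``transformer.fit_transform`` / ``transform`` map a Series to
    a DataFrame of output columns.
    """

    def __init__(self, transformer):
        self.transformer = transformer

    def fit_transform(self, X, y=None):
        self.transformers_ = {}  # one fitted clone per input column
        outputs = []
        for name in X.columns:
            t = clone(self.transformer)
            outputs.append(t.fit_transform(X[name]))
            self.transformers_[name] = t
        return pd.concat(outputs, axis=1)

    def fit(self, X, y=None):
        self.fit_transform(X, y)
        return self

    def transform(self, X):
        # Reuse the clone fitted on each column, so transform performs the
        # same operations as fit_transform did, call after call.
        return pd.concat(
            [self.transformers_[name].transform(X[name]) for name in X.columns],
            axis=1,
        )
```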