Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] pca: n_features_ attribute of decomposition.PCA is deprecated in favor of n_features_in_ #6249

Merged
merged 1 commit into from
Apr 21, 2023

Conversation

JakaKokosar
Copy link
Member

@JakaKokosar JakaKokosar commented Dec 9, 2022

Issue

Fixes #6248

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

@@ -201,7 +201,7 @@ def _fit_truncated(self, X, n_components, svd_solver):
random_state=random_state,
)

self.n_samples_, self.n_features_ = n_samples, n_features
self.n_samples_ = n_samples
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works because:

Most of the code in ImprovedPCA is copied directly from sklearn.decomposition.PCA. This was done because it wasn't possible to add the randomized algorithm that enables PCA on sparse data without modifying these functions.
Until recently, sklearn's PCA used self.n_features_. However, they recently switched over to using self.n_features_in, which is more generally used in sklearn, and is part of their BaseEstimator class. The self.n_features_in attribute is set in BaseEstimator._check_n_features, which is, in turn, called by BaseEstimator._validate_data. In scikit-learn's PCA, this method is called in PCA._fit, which ensures that the self.n_features_in attribute is set on sklearn's implementation.

They also deprecated self.n_features and replaced it with a readonly property which just returns self.n_featuers_in anyways. This is still the same functionality as before, but you have to trace it through these different methods. So, this is still backwards compatible.

So, we can do the same thing in our ImprovedPCA class. We can get rid of self.n_features_ = features, since this will be set when we call self._validate_data, which is inherited from sklearn.decomposition.PCA <- sklearn.BaseEstimator. So, the other change we need is to replace our previous call to check_array to self._validate_data. And I believe this should ensure the exact same functionality as before.

Copy link
Collaborator

@pavlin-policar pavlin-policar Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transform method should also probably be updated to reflect how sklearn is doing things now.

So, this would mean changing

        X = check_array(
            X,
            accept_sparse=["csr", "csc"],
            dtype=[np.float64, np.float32],
            ensure_2d=True,
            copy=self.copy,
        )

on lines 224-230 to

X = self._validate_data(
    X,
    dtype=[np.float64, np.float32],
    reset=False,
    accept_sparse=["csr", "csc"],
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavlin-policar, thanks for the explanation.

@JakaKokosar JakaKokosar force-pushed the scikit-learn-api-change branch 2 times, most recently from 09939d3 to faba3cd Compare December 9, 2022 15:12
@JakaKokosar JakaKokosar marked this pull request as ready for review December 9, 2022 15:12
@markotoplak
Copy link
Member

Before merging this please ensure that new code also properly runs with the oldest supported scikit-learn. The -oldest tests do not fail, so it likely works, but double-check, please.

For now, I would avoid raising the dependency to 1.2. I am OK with raising it to 1.1 if needed.

@markotoplak
Copy link
Member

I couldn't wait. :) Please, add a commit that reverts #6255 when merging.

@JakaKokosar JakaKokosar marked this pull request as draft December 9, 2022 22:13
@JakaKokosar JakaKokosar marked this pull request as ready for review March 17, 2023 10:04
@markotoplak markotoplak added this to the 3.35.0 milestone Mar 31, 2023
@markotoplak
Copy link
Member

/rebase

n_features_ attribute of decomposition.PCA is deprecated in favor of n_features_in_
@pavlin-policar
Copy link
Collaborator

The PCA changes in this PR look good to me, but tests related to AdaBoost are failing somewhat consistently. Could this be due to a change in the newer version of scikit-learn? If so, we probably want to include that here as well.

@markotoplak
Copy link
Member

The remaining tests are failing because of scikit-learn/scikit-learn/issues/26241. I also intend to do a PR that fixes the issue.

I'll remove the last commit (allowing scikit-learn 1.2) and merge this as is and then we can fix adaboost separately.

@codecov
Copy link

codecov bot commented Apr 21, 2023

Codecov Report

Merging #6249 (a823c86) into master (d248136) will increase coverage by 0.03%.
The diff coverage is 94.91%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6249      +/-   ##
==========================================
+ Coverage   87.61%   87.65%   +0.03%     
==========================================
  Files         321      321              
  Lines       69024    69148     +124     
==========================================
+ Hits        60475    60610     +135     
+ Misses       8549     8538      -11     

@markotoplak markotoplak merged commit a45fce1 into biolab:master Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API change in scikit-learn latest release
3 participants