scikit-learn · adrinjalali · Nov 7, 2019 · Sep 23, 2019
diff --git a/slep010/proposal.rst b/slep010/proposal.rst
@@ -0,0 +1,112 @@
+.. _slep_010:
+
+=====================================
+SLEP010: ``n_features_in_`` attribute
+=====================================
+
+:Author: Nicolas Hug
+:Status: Under review
+:Type: Standards Track
+:Created: 2019-11-23
+
+Abstract
+########
+
+This SLEP proposes the introduction of a public ``n_features_in_`` attribute
+for most estimators (where relevant). This attribute is automatically set
+when calling a new method ``BaseEstimator._validate_data(X, y=None)`` which
+is meant to replace ``check_array`` and ``check_X_y`` in most cases, calling
+those under the hood.
+
+Motivation
+##########
+
+Knowing the number of features that an estimator expects is useful for
+inspection purposes, as well as for input validation.
+
+Solution
+########
+
+The proposed solution is to replace most calls to ``check_array()`` or
+``check_X_y()`` by calls to a newly created private method::
+
+    def _validate_data(self, X, y=None, reset=True, **check_array_params):
+        ...
+
+The ``_validate_data()`` method will call ``check_array()`` when ``y`` is
+``None``, and ``check_X_y()`` otherwise.
+
+If the ``reset`` parameter is True (default), the method will set the
+``n_features_in_`` attribute of the estimator, regardless of its potential
+previous value. This should typically be used in ``fit()``, or in the first
+``partial_fit()`` call. Passing ``reset=False`` will not set the attribute but
+instead check the input against it, raising an error if the number of features
+does not match. This should typically be used in ``predict()`` or
+``transform()``, or on subsequent calls to ``partial_fit()``.
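
Inline note: a minimal sketch of the ``reset`` semantics described above, not the actual scikit-learn implementation. To stay self-contained it uses a hypothetical ``_check_2d`` helper in place of ``check_array()`` / ``check_X_y()`` and a toy ``SketchEstimator`` class; both names, and the error messages, are assumptions made for illustration.

```python
def _check_2d(X):
    """Hypothetical stand-in for check_array: require a rectangular 2D input."""
    rows = [list(row) for row in X]
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("Expected a non-empty rectangular 2D input.")
    return rows


class SketchEstimator:
    """Toy estimator, used only to illustrate the proposed reset logic."""

    def _validate_data(self, X, y=None, reset=True):
        X = _check_2d(X)
        if y is not None and len(y) != len(X):
            # Rough analogue of the consistency check done by check_X_y.
            raise ValueError("X and y have inconsistent numbers of samples.")
        n_features = len(X[0])
        if reset:
            # fit() or first partial_fit(): overwrite any previous value.
            self.n_features_in_ = n_features
        elif n_features != self.n_features_in_:
            # predict(), transform(), later partial_fit(): check only.
            raise ValueError(
                f"X has {n_features} features, but this estimator "
                f"expects {self.n_features_in_} features."
            )
        return X if y is None else (X, list(y))

    def fit(self, X, y=None):
        self._validate_data(X, y, reset=True)
        return self

    def predict(self, X):
        X = self._validate_data(X, reset=False)
        return [0 for _ in X]
```

With this sketch, fitting on three columns and then predicting on two raises a ``ValueError``, while calling ``fit`` again on two columns simply overwrites ``n_features_in_``.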
+
+In most cases, the ``n_features_in_`` attribute exists only once ``fit`` has
+been called, but there are exceptions (see below).
+
+A new common check is added: it makes sure that for most estimators, the
+``n_features_in_`` attribute does not exist until ``fit`` is called, and
+that its value is correct. Instead of raising an exception, this check will
+raise a warning for the next two releases. This will give downstream
+packages some time to adjust (see considerations below).
+
+The logic that is proposed here (calling a stateful method instead of a
+stateless function) is a prerequisite to fixing the dataframe column
+ordering issue: with a stateless ``check_array``, there is no way to raise
+an error if the column ordering of a dataframe was changed between ``fit``
+and ``predict``.
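
Inline note: the column-ordering argument can be illustrated with a small sketch. This is not part of the proposal itself; it only shows why state on the estimator is needed. A plain dict mapping column names to values stands in for a dataframe, and the ``ColumnOrderSketch`` class, its ``_validate_columns`` method and the ``_fit_columns`` attribute are all hypothetical names.

```python
class ColumnOrderSketch:
    """Toy estimator that remembers column names between fit and predict."""

    def _validate_columns(self, columns, reset=True):
        columns = list(columns)
        if reset:
            # State kept on the estimator: a stateless function has
            # nowhere to store what was seen during fit.
            self._fit_columns = columns
        elif columns != self._fit_columns:
            raise ValueError(
                f"Column order changed: fitted on {self._fit_columns}, "
                f"got {columns}."
            )

    def fit(self, frame):
        self._validate_columns(frame.keys(), reset=True)
        return self

    def predict(self, frame):
        self._validate_columns(frame.keys(), reset=False)
        n_samples = len(next(iter(frame.values())))
        return [0] * n_samples
```

Here ``fit`` on columns ``a, b`` followed by ``predict`` on columns ``b, a`` raises a ``ValueError``, which a stateless validation function could not detect.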
+
+Considerations
+##############
+
+The main consideration is that the addition of the common test means that
+existing estimators in downstream libraries will not pass our test suite,
+unless the estimators also have the ``n_features_in_`` attribute (which can be
+done by replacing calls to ``check_XXX()`` with calls to ``_validate_data()``).
+
+The newly introduced checks will only raise a warning instead of an exception
+for the next two releases, so this will give more time for downstream packages
+to adjust.
+
+Note that we have never guaranteed any kind of backward compatibility
+regarding the test suite: see e.g. `#12328
+<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `#14680
+<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `#9270
+<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
+checks.
+
+There are other minor considerations:
+
+- In most meta-estimators, the input validation is handled by the
+  sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator
+  is thus explicitly set to that of the sub-estimator, either via a
+  ``@property``, or directly in ``fit()``.
+- Some estimators like the dummy estimators do not validate the input
+  (the ``no_validation`` tag should be True). The ``n_features_in_`` attribute
+  should be set to ``None``, though this is not enforced in the common tests.
+- Some estimators, namely the vectorizers, expect a non-rectangular input:
+  dicts or lists, not an ``n_samples * n_features`` matrix.
+  ``n_features_in_`` makes no sense here and these estimators simply do not
+  have the attribute.
+- Some estimators may know the number of input features before ``fit`` is
+  called: typically the ``SparseCoder``, where ``n_features_in_`` is known at
+  ``__init__`` from the ``dictionary`` parameter. In this case the attribute
+  is a property and is available right after object instantiation.
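
Inline note: two of the cases in the list above can be sketched with properties. This is a simplified illustration under stated assumptions, not the code in the PR: ``LeafSketch``, ``MetaSketch`` and ``FixedDictionarySketch`` are hypothetical toy classes, the meta-estimator fits its sub-estimator in place without cloning, and the dictionary is assumed to have one row per atom and one column per input feature.

```python
class LeafSketch:
    """Toy sub-estimator that records the number of columns seen in fit."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self


class MetaSketch:
    """Toy meta-estimator: input validation is left to the sub-estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Delegate; raises AttributeError until the sub-estimator is fitted,
        # so hasattr(meta, "n_features_in_") is False before fit.
        return self.estimator.n_features_in_


class FixedDictionarySketch:
    """Toy analogue of an estimator whose input width is known at __init__."""

    def __init__(self, dictionary):
        # Assumed layout: one row per atom, one column per input feature.
        self.dictionary = dictionary

    @property
    def n_features_in_(self):
        # Available right after instantiation, without calling fit.
        return len(self.dictionary[0])
```

Using a property in the meta-estimator keeps the attribute absent before ``fit``, which is the behaviour the new common check expects for most estimators.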
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_
diff --git a/under_review.rst b/under_review.rst
@@ -1,4 +1,7 @@
 SLEPs under review
 ==================
 
-Nothing here
+.. toctree::
+    :maxdepth: 1
+
+    slep010/proposal