
Feature Selection #63

Open
redshiftzero opened this issue Oct 12, 2016 · 2 comments

@redshiftzero
Contributor

Many of our features are not very useful. We should add a feature selection step before passing the feature matrix to the classifier. This could be something simple, e.g. a variance threshold, or something more complex. See the scikit-learn feature selection documentation for how we can do this (no wheel invention necessary).
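
A minimal sketch of the simple variance-threshold option, using scikit-learn's `VarianceThreshold` (the toy matrix and the 0.05 cutoff here are placeholders for illustration, not values from our pipeline):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: the first and last columns are constant,
# so they carry no information for the classifier.
X = np.array([[0, 2.0, 0, 3],
              [0, 1.0, 4, 3],
              [0, 1.1, 1, 3]])

# Drop any feature whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # boolean mask of retained features
print(X_reduced)               # matrix with low-variance columns removed
```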

@psivesely
Contributor

The goal of this is to reduce computational time, not increase accuracy, correct? I would assume removing any features would have a negative effect on accuracy (although for features with extremely low variance, to use one metric of usefulness, the effect is presumably negligible).

@redshiftzero
Contributor Author

You're certainly right that it will make things faster, but feature selection is primarily about improving our classifier's results (on our metrics: AUC, etc.). If we add a lot of not-very-useful features, we are adding a lot of noise, which makes the learning problem significantly harder: we'll need more data to explore a much larger feature space, and we also run the risk of the classifier picking up on noise and overfitting.
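
To make that concrete, here is a hedged sketch of selection wired into a scikit-learn `Pipeline` and scored by cross-validated AUC; the synthetic data, `SelectKBest`, and `LogisticRegression` are stand-ins for illustration, not our actual features or classifier:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: only 5 of 50 features are informative; the rest are noise.
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Selection happens inside the pipeline, so each CV fold fits the
# selector on its own training split only (no leakage into the test split).
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("mean AUC: %.3f" % scores.mean())
```

Keeping the selector inside the pipeline matters: fitting it on the full dataset before cross-validation would leak information and inflate the AUC estimate.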
