This repository has been archived by the owner on Sep 10, 2020. It is now read-only.
We have a very imbalanced machine learning problem, where we have far fewer SecureDrop users than non-SecureDrop users. There are many ways of handling this situation - including oversampling the minority class or undersampling the majority class. Some of the techniques used for machine learning with very skewed classes are implemented in this library: https://github.com/scikit-learn-contrib/imbalanced-learn, so we could give some of these a try.
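To make the oversampling idea concrete, here is a minimal standard-library sketch of random oversampling of the minority class. It is not the imbalanced-learn implementation; the function name `random_oversample`, the toy data, and the labels (1 for SecureDrop, 0 for non-SecureDrop) are assumptions made only for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw (with replacement) as many extra copies as the class is short.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: 2 minority samples (label 1) vs 10 majority (label 0).
samples = [[i] for i in range(12)]
labels = [1, 1] + [0] * 10

X_res, y_res = random_oversample(samples, labels)
print(Counter(y_res))
```

Note that every added minority sample is an exact copy of an existing one, so this balances the class counts without adding new information.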
@redshiftzero and I briefly discussed in person whether we should increase monitored_nonmonitored_ratio in fpsd/config.ini. We decided to leave it for now, but if we later find that we want more SD data, it might be better to bump it from 10 to 100, which would give us roughly a 50:50 class split in terms of frontpage_traces. That's not to say the linked library has nothing to offer; we should still see what we can get out of its functionality. Our conclusion was that collecting more raw data will give more accurate results than oversampling from the same dataset, where you are essentially replicating traces. Let me know if I missed anything here, @redshiftzero.
Matthews correlation coefficient (sklearn.metrics.matthews_corrcoef) "is used in machine learning as a measure of the quality of binary (two-class) classifications... generally regarded as a balanced measure which can be used even if the classes are of very different sizes."
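As a rough illustration of why this metric helps with skewed classes, the sketch below computes the coefficient by hand from the confusion-matrix counts and compares it with plain accuracy. The helper names `mcc` and `accuracy` and the 95:5 toy labels are assumptions for illustration; in practice one would call `sklearn.metrics.matthews_corrcoef` directly.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical skewed labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a classifier that never predicts the minority class

print(accuracy(y_true, always_negative))  # high, despite missing every positive
print(mcc(y_true, always_negative))       # 0.0: no better than chance
print(mcc(y_true, y_true))                # 1.0: perfect prediction
```

A classifier that ignores the minority class entirely still scores 95% accuracy on this toy data, while its coefficient is 0, which is the property that makes the metric useful here.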