

Best choice of options for classification of a small and unbalanced dataset #46

Open
mattvan83 opened this issue Oct 28, 2019 · 5 comments

@mattvan83

Hi Pradeep,

For a small and unbalanced dataset, do you recommend using -t 0.8 or -t 0.9?

Is it possible to deactivate feature selection in the implemented pipeline? If not, what is the advantage of always using feature selection when dealing with a dataset with few features?

Best,
Matthieu

raamana (Owner) commented Oct 28, 2019

-k all is equivalent to no feature selection.

There is no way to tell in advance which training percentage (80% or 90%) is best. Depending on the sample size, you want to ensure there is enough training data (which helps improve performance), while also keeping the test set reasonably large. If the test set is too small, the violin plots will have large variance. So pick accordingly.
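
As a rough illustration of that trade-off, here is a minimal sketch of how many test samples each choice of -t leaves per class under stratified splitting (the class counts below are placeholders, not taken from any particular dataset):

```python
# Rough check of per-class test-set sizes for different -t (training fraction) choices.
# The class counts are placeholders -- substitute your own.
n_majority, n_minority = 75, 15  # hypothetical counts for the two classes

for train_frac in (0.8, 0.9):
    # With stratified splitting, roughly this many samples per class remain for testing
    test_majority = n_majority - int(n_majority * train_frac)
    test_minority = n_minority - int(n_minority * train_frac)
    print(f"-t {train_frac}: about {test_majority} and {test_minority} test samples per class")
```

With only a handful of minority-class samples left for testing, each cross-validation repetition can swing widely, which is what shows up as wide violin plots.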

@mattvan83 (Author)

Hi Pradeep,

I tried both, and indeed the violin plots have a large variance compared to those shown in the neuropredict documentation (I have 75 CN and 15 AD).

Below with -t 0.9:
balanced_accuracy.pdf

and below with -t 0.8:
balanced_accuracy.pdf

  1. Based on these violin plots, isn't the 80% training split better (less variation)?
  2. How could I determine the best set of features? Just by comparing the medians of the 3 violin plots in the figures above? Or are there other metrics to look at?
  3. Where could I find the mean balanced accuracy, sensitivity, and specificity?
  4. In these binary classification cases, aren't ROC curves plotted?

raamana (Owner) commented Oct 28, 2019

  1. No clear answer there - I'd report both (one in the main text, the other in the supplementary material?)

  2. You can run significance tests on the data saved in the CSV files - look in the exported_results folder (see the sketch after this list).

  3. They are not exported by default - I will add them to the exported results soon.

  4. Not all predictive models have a natural ROC curve associated with them, hence it is not produced by default. I'll implement it soon. The current results should be enough to include in your paper?
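
A minimal sketch of item 2, assuming the per-repetition balanced-accuracy values for two feature sets have been saved as single-column CSV files under exported_results (the file names and layout below are assumptions, not the actual export format):

```python
# Sketch: compare two feature sets using per-repetition balanced-accuracy values
# exported by neuropredict as CSV. File names and layout are assumed -- adapt them.
import numpy as np
from scipy.stats import wilcoxon

# hypothetical paths: one single-column CSV of balanced-accuracy values per feature set
acc_featureset_A = np.loadtxt("exported_results/balanced_accuracy_featuresetA.csv", delimiter=",")
acc_featureset_B = np.loadtxt("exported_results/balanced_accuracy_featuresetB.csv", delimiter=",")

# Paired test across CV repetitions (both feature sets evaluated on the same splits)
stat, p_value = wilcoxon(acc_featureset_A, acc_featureset_B)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, p={p_value:.4f}")

# Also report the medians, since the violin plots summarize these same distributions
print(f"median A = {np.median(acc_featureset_A):.3f}, median B = {np.median(acc_featureset_B):.3f}")
```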

@mattvan83 (Author)

  1. Shouldn't we favor the violin plot with lower accuracy, i.e., the 80% one?
  2. What kind of significance tests should I run, and on which .csv files?
  3. OK, thanks. Is there any way I can, for the moment, deduce sensitivity and specificity from the results that are currently produced?
  4. Yes.

raamana (Owner) commented Oct 29, 2019

Use the confusion matrices and the misclassification-rate plots to deduce alternative performance metrics.
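
A minimal sketch of that, assuming a 2x2 confusion matrix with rows as true classes and AD treated as the positive class (the counts below are placeholders; read yours off the exported confusion-matrix figure):

```python
# Deriving sensitivity, specificity, and balanced accuracy from a 2x2 confusion matrix.
# The matrix values are placeholders -- substitute the counts from your own results.
import numpy as np

# rows = true class, columns = predicted class; order assumed to be [CN, AD]
conf_mat = np.array([[70, 5],    # true CN: 70 predicted CN, 5 predicted AD
                     [ 4, 11]])  # true AD:  4 predicted CN, 11 predicted AD

tn, fp = conf_mat[0]   # treating AD as the "positive" class
fn, tp = conf_mat[1]

sensitivity = tp / (tp + fn)          # true positive rate (recall for AD)
specificity = tn / (tn + fp)          # true negative rate (recall for CN)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, "
      f"balanced accuracy={balanced_accuracy:.3f}")
```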
