
Use grid_search in notebook and add visualization #18

Merged
merged 1 commit into cognoma:master on Aug 2, 2016

Conversation

@dhimmel (Member) commented Jul 28, 2016

Addresses issues with the example notebook brought up at the July 26 meetup:

  1. Standardize training and testing separately
  2. Use AUROC on continuous rather than binary predictions
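A minimal sketch of these two changes (synthetic data, current scikit-learn module layout; the notebook's actual code may differ):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix X and mutation-status vector y
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 50))
y = rng.randint(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# 1. Fit the scaler on training data only, then apply it to both partitions,
#    so the transformation never sees the testing data
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# 2. Compute AUROC on continuous decision values rather than binarized predictions
clf = SGDClassifier(penalty='elasticnet', random_state=0).fit(X_train_std, y_train)
auroc = roc_auc_score(y_test, clf.decision_function(X_test_std))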

Clean up variable names. Simplify to testing/training terminology. No more "hold out".

Use sklearn.grid_search.GridSearchCV to optimize hyperparameters. Expand the ranges of l1_ratio and alpha. Specify random_state in GridSearchCV, which should remove the need to set the seed manually with the random module. Grid search should also enable a more modular architecture, since different algorithms can be swapped in as long as their param_grid is defined.
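Roughly, and continuing the sketch above (grid values are illustrative, not the notebook's exact ones; n_jobs=-1 is the joblib parallelism mentioned below):

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in the 2016 API

# Illustrative hyperparameter grid for the elastic-net SGD classifier
param_grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 1e-1],
    'l1_ratio': [0.0, 0.15, 0.5, 1.0],
}
estimator = SGDClassifier(penalty='elasticnet', random_state=0)  # random_state makes refits reproducible
grid = GridSearchCV(estimator, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
grid.fit(X_train_std, y_train)
best_clf = grid.best_estimator_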

Add exploratory analysis of predictions.

Add parallel processing using joblib to speed up cross validation.

Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection.

# Download the expression matrix from figshare if it is not already cached locally
url = 'https://ndownloader.figshare.com/files/5514386'
if not os.path.exists('data/expression.tsv.bz2'):
    urllib.request.urlretrieve(url, os.path.join('data', 'expression.tsv.bz2'))
# nbconvert renders the notebook's %%time cell as run_cell_magic when exporting to a script
get_ipython().run_cell_magic('time', '', "path = os.path.join('data', 'expression.tsv.bz2')\nX = pd.read_table(path, index_col=0)")
A Member commented on this code:

What's up with these run_cell_magic things?

@dhimmel (Member, Author) commented Jul 28, 2016

See Cell 7. I used the time magic because I thought it would be helpful to report runtime. The downside is that it doesn't convert nicely to scripts.

FYI, here's the code of cell 7 that created the line you commented on:

%%time
path = os.path.join('data', 'expression.tsv.bz2')
X = pd.read_table(path, index_col=0)

@dhimmel (Member, Author) commented Jul 28, 2016

Suggesting @RenasonceGent as one reviewer. @RenasonceGent, see whether you can follow the code, since the changes address some of the modularity goals we have in mind (#12 (comment)).

@cgreene (Member) commented Jul 28, 2016

"Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection."

Was the concern around this overfitting or something else? It should not be an issue for overfitting since it is independent of whatever the end goal of prediction is.

@dhimmel (Member, Author) commented Jul 28, 2016

@cgreene, I agree that overfitting is only an issue when y_test is included in any step other than testing. Since the MAD feature selection @gwaygenomics implemented uses only X (X_train + X_test), there shouldn't be overfitting.

However, @gwaygenomics mentioned that at the meetup someone suggested not using X_test before testing. This would more realistically estimate performance on new samples. I interpreted this to mean that all feature selection should only use X_train and that feature transformation should be performed independently for X_test. I'm not sure this is the best approach; any thoughts?
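A minimal sketch of that reading, assuming X_train and X_test are pandas DataFrames of samples by genes (the helper below is hypothetical, not code from this repository): compute MAD on X_train only and apply the resulting gene subset to both partitions.

def mad_gene_subset(X_train, X_test, n_genes=500):
    # Per-gene median absolute deviation computed from training data only
    mad = (X_train - X_train.median()).abs().median()
    # Keep the n_genes most variable genes by MAD
    keep = mad.sort_values(ascending=False).index[:n_genes]
    # Apply the same gene subset to both partitions
    return X_train[keep], X_test[keep]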

@htcai (Member) commented Jul 29, 2016

@dhimmel maybe we can compare the test results of the two approaches (using X_test before testing vs. not using it) and see whether the former is likely to lead to over-fitting.

@gwaybio (Member) commented Jul 29, 2016

@dhimmel - it was @yl565 who suggested not z-scoring the holdout set together with the training/testing set, which is the right thing to do and something I saw you amended in this PR.

Simplify to testing/training terminology. No more "hold out".

I like using hold_out terminology here (and in general), especially when cross validation is being used to select parameters. The CV folds are training and testing sets and are built into GridSearchCV.

Remove median absolute deviation feature selection. This step had to be removed or modified because it used testing data for feature selection.

I am OK with removing this step here, but it's important to keep in mind that pancan classifier performance does not scale linearly with the number of genes included. Performance generally plateaus surprisingly early in the number of MAD genes included, which also keeps the algorithms speedy. Here is a plot describing this phenomenon with RAS predictions.

[Figure: number of genes vs. prediction accuracy, all RAS cross-validation results]

On another note - I really like the plots and analysis you added to the notebook! I particularly like the 'potential hidden responders' table. What's really nice about these samples is that they are really well characterized and several resources exist to visualize what's going on in them. For example TCGA-E2-A1LI-01 is a breast tumor and can be visualized really nicely in the COSMIC Browser. I was looking for other potential reasons why this sample may "look like" a TP53 mutant based on gene expression signatures (like copy number loss, structural variants, etc.) but it does not look like there is anything obvious popping out to me. Cool results!

@dhimmel (Member, Author) commented Jul 29, 2016

Terminology

I like using hold_out terminology here (and in general), especially when cross validation is being used to select parameters. The CV folds are training and testing sets and are built into GridSearchCV.

@gwaygenomics, I agree that the terminology for these concepts is muddled, and hold_out has an intuitive meaning. However, the best online advice I could find on the matter (1 and 2) seems to support the following terminology:

  1. training: comparable to the training fold in cross-validation
  2. validation: comparable to the evaluation fold in cross-validation
  3. testing: comparable to the hold out set.

Since a good implementation of cross-validation makes it so you never actually have to touch the validation data (2 above), I thought it would be simplest to just use training/testing terminology. Let me know if you still disagree -- I'd like to find out the optimal terminology.

@gwaybio (Member) commented Jul 29, 2016

@dhimmel the terminology is often confusing and I definitely agree that we need to keep it consistent!

One key difference in biological sciences as compared to AI (as in those links you sent over) is that a validation set usually means something slightly different. Using this classifier as an example, we perform the following scenario:

  1. "hold out" - this is split out from the original data at the very beginning and not touched until hyperparamaters are selected to make a good estimate of how the classifier will perform with data it has never seen before.
    1. A contentious point that is also domain specific - do we combine the hold_out set with the cross_validation/ training/testing set to build the final classifier? In most cases when we use a gene mutation status as our Y matrix, we will not have the luxury of defining a hold_out set because of very low positive samples. I've adopted a different strategy here that I've been calling bootstrap holdout but it can get pretty computationally intensive and then isn't a true holdout.
  2. "cross validation" - usually k-fold cross validation where in each iteration there are k-1 folds used as training and the k-i fold is used as testing (i'm not sure what you mean by "a good implementation of cross-validation makes it so that you never actually have to touch the validation data (2 above)"?)
  3. "validation" - apply the optimal classifier on a completely different cancer dataset. In cognoma's case, a user may want to apply a cognoma derived classifier to their own data to "validate".

@dhimmel (Member, Author) commented Aug 1, 2016

So going through the discussion so far, here are the remaining unresolved issues:

  1. MAD feature selection, which now has a dedicated issue Median absolute deviation feature selection #22
  2. feature transformation/selection on X versus X_train, which now has a dedicated issue Should testing data be used for unsupervised feature transformation or selection #23
  3. terminology for dataset partitions.

Items 1 & 2 are not intrinsic to this pull request -- we can address them in future pull requests.

For 3, we perhaps should settle on a terminology for the example notebook. I used:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

I believe @gwaygenomics advocates for (Y/N?):

X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.1, random_state=0)

@cgreene, what's your opinion?

@yl565 (Contributor) commented Aug 1, 2016

Regarding using X_test to improve the classifier: this is sometimes done in the field of EEG classification with adaptive/semi-supervised algorithms, especially when the training sample size is small, e.g. combining COV(X_test) and COV(X_train) in real time to update an LDA whenever a new instance of x_test becomes available. Here are some examples:

I think using X instead of X_train for feature transformation/selection can be considered a special case of semi-supervised learning.


Note by @dhimmel: I modified this comment to use DOI links for persistence and unique identification.

@gwaybio (Member) commented Aug 1, 2016

I believe @gwaygenomics advocates for (Y/N?):

Yep! That is how I think about Train/Test/Holdout for cross validation and hyperparameter selection.

If we decide a simpler Train/Test structure is preferred, we should use "evaluation set" rather than "validation set" for the cross-validation folds. Using "validation" here is not how biologists would view "validation".

@RenasonceGent commented

@dhimmel Everything looks good and seems to run fine. Just to check, how many plots should I see in the output? It took a few days to get all of the packages working on my system, and this is my first time using IPython.

Also, I'm still not sure I understand what you want. I've been writing a class with each section of the process broken up as functions. I thought it was mentioned before that the idea is to receive a JSON from one of the other groups that will tell us what to do. I'm leaving the parsing for later, but it is set up with that expectation.

Do we have details on what we will receive from the data group? Right now I'm only incorporating grid search. I assume they will tell us the location of the data, which classifier algorithm to use, the parameters that go with that classifier, which metrics to use, and what type of plots to do.

@dhimmel (Member, Author) commented Aug 2, 2016

@RenasonceGent glad you got things running. You should see four plots, like you see here. Regarding installation, #15 should make it considerably easier.

Right now I'm only incorporating grid search.

I think that's a safe assumption. Grid search is really versatile and, even with just one hyperparameter combination, it's an easy way to perform cross validation.
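For instance, even a single-point grid still yields k-fold cross-validated scores (a sketch assuming scikit-learn; the alpha value is arbitrary):

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# One hyperparameter combination, but cross-validated scoring comes for free
single_point = GridSearchCV(SGDClassifier(random_state=0), {'alpha': [1e-3]},
                            scoring='roc_auc', cv=5)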

Do we have details on what we will receive from the data group?

Let's assume right now that we get a list of sample_ids and gene_ids for subsetting the feature matrix, as well as a mutation status vector for the outcome. We will also get an algorithm. For each algorithm we should have a default hyperparameter grid, since users aren't going to want to deal with this. The goal for the next week should be for individuals to pick an algorithm, assess whether it's appropriate for our application, and identify a default hyperparameter grid.
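For example, a registry of default grids might look something like this (hypothetical names and placeholder values, not an agreed-upon spec):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# Hypothetical per-algorithm defaults; values are placeholders, not tuned recommendations
DEFAULT_GRIDS = {
    'elastic_net_sgd': (
        SGDClassifier(penalty='elasticnet', random_state=0),
        {'alpha': [1e-4, 1e-3, 1e-2], 'l1_ratio': [0.0, 0.15, 0.5, 1.0]},
    ),
    'random_forest': (
        RandomForestClassifier(random_state=0),
        {'n_estimators': [100, 500], 'max_features': ['sqrt', 'log2']},
    ),
}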

I've been writing a class with each section of the process broken up as functions.

This sounds useful. Make sure not to duplicate functionality already provided by the sklearn API. For next week's task, I think people can just modify 1.TCGA-MLexample.ipynb to swap in their algorithm. What do you think?

@dhimmel (Member, Author) commented Aug 2, 2016

I'm merging despite there being some unresolved issues. Let's move discussion elsewhere and submit additional pull requests to modify 1.TCGA-MLexample.ipynb in an incremental fashion. We need to get these updates in for next week's activities.

@dhimmel dhimmel merged commit ae27311 into cognoma:master Aug 2, 2016
@dhimmel dhimmel deleted the example-notebook branch August 8, 2016 14:55