This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
Evaluation step TP53 classifier #385
Merged: jaclyn-taroni merged 22 commits into AlexsLemonade:master from kgaonkar6:validation_step on Jan 10, 2020
Changes from 8 commits
Commits (22):
5a65632 evaluation script and plots (kgaonkar6)
6dc9e41 evaluation plots for classifier (kgaonkar6)
8245189 added plots TP53 (kgaonkar6)
dbb48be added plots TP53 (kgaonkar6)
1be0689 evaluation script (kgaonkar6)
8f06982 evaluation script (kgaonkar6)
d93f481 add NF1 plots (kgaonkar6)
5a35444 updated TP53 NF1 gencode cds snv alt (kgaonkar6)
6b8df5c Merge branch 'master' into validation_step (jaclyn-taroni)
610224e adding clinical to status file; rerun (kgaonkar6)
66fec2b Merge branch 'validation_step' of https://github.com/kgaonkar6/OpenPB… (kgaonkar6)
c4eaf74 keep only ids from clinical (kgaonkar6)
7b7d5f2 Merge branch 'master' into validation_step (jaclyn-taroni)
266e54a Run black (jaclyn-taroni)
805911f Change names to reflect current study (jaclyn-taroni)
f8cdedd Add documentation (jaclyn-taroni)
bfd8935 Merge branch 'master' into validation_step (jaclyn-taroni)
d5666d9 @gwaygenomics suggested doc changes (jaclyn-taroni)
740161b Merge branch 'master' into validation_step (jaclyn-taroni)
d40b6e6 Merge branch 'master' into validation_step (jaclyn-taroni)
1bed156 Update README.md (jaclyn-taroni)
9bd6281 Merge branch 'master' into validation_step (jaclyn-taroni)
@@ -0,0 +1,176 @@
import os
import random
from decimal import Decimal
from scipy.stats import ttest_ind
import numpy as np
import pandas as pd

from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.metrics import roc_curve, precision_recall_curve

import seaborn as sns
import matplotlib.pyplot as plt
from optparse import OptionParser

# Command-line options: the input tables and the basename for the output plots
parser = OptionParser(usage="usage: %prog [options] arguments")
parser.add_option(
    "-s", "--statusfile", dest="status_file", help="TP53 and NF1 status file"
)
parser.add_option(
    "-f", "--file", dest="filename", help="scores output file"
)
parser.add_option(
    "-c", "--clinical", dest="clinical", help="pbta-histologies.tsv clinical file"
)
parser.add_option(
    "-o", "--output_basename", dest="outputfile",
    help="output plots basename for TP53 and NF1 ROC curves"
)

(options, args) = parser.parse_args()
status_file = options.status_file
scores_file = options.filename
clinical = options.clinical
outputfilename = options.outputfile

np.random.seed(123)

status_df = pd.read_table(status_file, low_memory=False)

# Value count of variant classification
print(status_df.Variant_Classification.value_counts())

# Obtain a binary status matrix (samples x genes); counts above 1 are capped at 1
full_status_df = pd.crosstab(
    status_df['sample_id'], status_df.Hugo_Symbol, dropna=False
)
full_status_df[full_status_df > 1] = 1
full_status_df = full_status_df.reset_index()
full_status_df = full_status_df.drop(['No_TP53_NF1_alt'], axis=1)

# read in clinical file
clinical_df = pd.read_table(clinical)

# add clinical info to TP53 and NF1 binary status df
full_status_df = full_status_df.assign(
    tp53_status=full_status_df.loc[:, 'TP53'],
    nf1_status=full_status_df.loc[:, 'NF1'],
)

full_status_df = full_status_df.merge(clinical_df, how='left', on='sample_id')

# read in scores from 01
scores_df = pd.read_table(scores_file)
scores_df = scores_df.rename(str.upper, axis='columns')

# attach the status and clinical columns to each scored biospecimen
scores_df = scores_df.merge(
    full_status_df,
    how='left', left_on='SAMPLE_ID', right_on='Kids_First_Biospecimen_ID'
)

print("scores df shape")
print(scores_df.shape)
print(scores_df.tp53_status.value_counts())

# overwrite SAMPLE_ID with the sample_id column brought in by the merge
scores_df = scores_df.assign(SAMPLE_ID=scores_df.loc[:, 'sample_id'])

# rows without a match in the status table get NaN from the left merge; treat as 0
gene_status = ['tp53_status', 'nf1_status']
scores_df.loc[:, gene_status] = scores_df.loc[:, gene_status].fillna(0)

scores_df.loc[scores_df['tp53_status'] != 0, 'tp53_status'] = 1
scores_df.loc[scores_df['nf1_status'] != 0, 'nf1_status'] = 1

scores_df['tp53_status'] = scores_df['tp53_status'].astype(int)
scores_df['nf1_status'] = scores_df['nf1_status'].astype(int)

# binary counts for tp53 and nf1 loss status
print("TP53 status")
print(scores_df.tp53_status.value_counts())
print("NF1 status")
print(scores_df.nf1_status.value_counts())


def get_roc_plot(scores_df, gene, outputfilename, color):
    """
    Save a ROC plot of classifier scores for one gene

    Arguments:
    scores_df - the dataframe of scores
    gene - the name of the gene to input
    outputfilename - the plot basename; saved as <outputfilename>_<gene>.png
    color - the line color used for both the real and the shuffled curve
    """
    lower_gene = gene.lower()
    scores_df = scores_df.rename(str.lower, axis='columns')
    # Obtain Metrics
    sample_status = scores_df.loc[:, '{}_status'.format(lower_gene)]
    sample_score = scores_df.loc[:, '{}_score'.format(lower_gene)]
    shuffle_score = scores_df.loc[:, '{}_shuffle'.format(lower_gene)]
    fpr_pdx, tpr_pdx, thresh_pdx = roc_curve(
        sample_status, sample_score, drop_intermediate=False
    )
    precision_pdx, recall_pdx, _ = precision_recall_curve(sample_status, sample_score)
    auroc_pdx = roc_auc_score(sample_status, sample_score)
    aupr_pdx = average_precision_score(sample_status, sample_score)

    # Obtain Shuffled Metrics
    fpr_shuff, tpr_shuff, thresh_shuff = roc_curve(
        sample_status, shuffle_score, drop_intermediate=False
    )
    precision_shuff, recall_shuff, _ = precision_recall_curve(
        sample_status, shuffle_score
    )
    auroc_shuff = roc_auc_score(sample_status, shuffle_score)
    aupr_shuff = average_precision_score(sample_status, shuffle_score)

    # Table of ROC points (not used by the plotting code below)
    roc_df = (
        pd.DataFrame([fpr_pdx, tpr_pdx, thresh_pdx], index=['fpr', 'tpr', 'threshold'])
        .transpose()
        .assign(gene=gene,
                shuffled=False)
    )
    plt.subplots(figsize=(5, 5))
    plt.axis('equal')
    # Diagonal reference line: the expected ROC curve of a random classifier
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.plot(fpr_pdx, tpr_pdx,
             label='{} (AUROC = {})'.format(gene, round(auroc_pdx, 2)),
             linestyle='solid',
             color=color)

    # Shuffled Data
    plt.plot(fpr_shuff, tpr_shuff,
             label='{} Shuffle (AUROC = {})'.format(gene, round(auroc_shuff, 2)),
             linestyle='dotted',
             color=color)

    plt.xlabel('False Positive Rate', fontsize=12)
    plt.ylabel('True Positive Rate', fontsize=12)
    plt.tick_params(labelsize=10)

    lgd = plt.legend(bbox_to_anchor=(0.3, 0.15),
                     loc=2,
                     borderaxespad=0.,
                     fontsize=10)
    plt.savefig(outputfilename + '_' + gene + '.png')
    # Close the figure so the second call starts from a clean canvas
    plt.close()


outputfilename = os.path.join("analyses", "tp53_nf1_score", "results", outputfilename)

get_roc_plot(scores_df, gene="TP53", outputfilename=outputfilename, color='#7570b3')

get_roc_plot(scores_df, gene="NF1", outputfilename=outputfilename, color='#d95f02')
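
The AUROC reported in each plot legend can be read as the probability that a randomly chosen altered sample receives a higher classifier score than a randomly chosen unaltered one. The sketch below illustrates that reading in plain Python. `pairwise_auroc` and the toy values are hypothetical and not part of this PR; the script itself relies on `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auroc(status, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    positives = [s for s, y in zip(scores, status) if y == 1]
    negatives = [s for s, y in zip(scores, status) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration
toy_status = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = pairwise_auroc(toy_status, toy_scores)
```

Under this reading, a shuffled score column that carries no information about status should land near 0.5, which is why the dotted shuffled curve is drawn next to the diagonal reference line.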
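
The status-matrix step in the script (cross-tabulate samples against genes, cap every count at 1, then drop the `No_TP53_NF1_alt` column) can be pictured with a small pandas-free sketch. `binary_status` and the toy records are hypothetical and only mirror the logic under that assumption; the script does this with `pd.crosstab`.

```python
from collections import defaultdict


def binary_status(records, genes=("TP53", "NF1")):
    """Map each sample_id to {gene: 0 or 1} from (sample_id, gene) records."""
    status = defaultdict(lambda: {g: 0 for g in genes})
    for sample_id, gene in records:
        row = status[sample_id]  # creates an all-zero row on first sight
        if gene in row:
            row[gene] = 1  # repeated alterations still yield 1
    return dict(status)


# Toy (sample_id, Hugo_Symbol) pairs, made up for illustration
toy_records = [
    ("S1", "TP53"),
    ("S1", "TP53"),
    ("S2", "NF1"),
    ("S3", "No_TP53_NF1_alt"),
]
toy_matrix = binary_status(toy_records)
```

A sample that only appears with the `No_TP53_NF1_alt` placeholder keeps a row of zeros, which matches what remains after the script drops that column from the cross-tabulation.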