-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filtering RU alleles #49
Conversation
@standage I've added additional functionalities since you've looked at this PR. I've added in the ability to create STRmix output files for NGS data (i.e. using the sequence strings from lusSTR) and perform filtering using detection thresholds, analytical thresholds and same size thresholds. Same size thresholds are when two sequences differ but are the same CE allele (sequence with highest number of reads are compared to other sequences, those with reads < threshold are dropped). Stutter filters are not yet implemented for sequence strings (much more complicated). |
lusSTR/filter_settings.py
Outdated
def filters(data_order, locus, total_reads, datatype): | ||
metadata = filter_marker_data[locus] | ||
if len(data_order) == 1: | ||
if thresholds('Detection', metadata, total_reads, data_order['Reads'][0])[1] is False: | ||
data_order = pd.DataFrame() | ||
elif thresholds('Analytical', metadata, total_reads, data_order['Reads'][0])[1] is False: | ||
data_order[['allele_type', 'perc_noise']] = ['noise', 1.0] | ||
elif thresholds('Analytical', metadata, total_reads, data_order['Reads'][0])[1] is True: | ||
data_order['allele_type'] = 'real_allele' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function is my biggest remaining source of concern. It's doing a lot of things, and I'm wondering if it can be decomposed into 2-4 smaller functions: one that handles len(data_order) == 1
, one that handles datatype == "ce"
, and so on.
Also, data_order
is the allele read count information, right? Another pass over this code to revisit variable names would probably improve legibility quite a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I'll take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some refactoring and renaming variables to hopefully make things a little more clear. Let me know when you get a chance to look @standage and if it's up to your standard 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, just so we're clear, I don't think I code up to "my standard" very frequently. I hope this review has been helpful, or at least that it increases the likelihood that someone (possibly one of us) looking at this code 3 years will make sense of it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kidding! Hence the laughing face. The Standage standard is great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And much needed for my code 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is much clearer now. Thanks for humoring me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your help!!
This PR incorporates filtering using detection thresholds, analytical thresholds and stutter thresholds. This PR will only address the RU allele; LUSPlus allele filtering will be included in a later PR.
This PR also produces output files for use in either EuroForMix or STRmix.