Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Penalize far apart pairs not more than as if they were two single ends #332

Merged
merged 2 commits into from
Aug 28, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Aug 27, 2023

Closes #321.

The accuracy increases indeed a little bit.

Comparing accuracy

dataset main (4893484) this PR (d723bb4) difference
drosophila-50 90.189 90.1903 +0.0013
drosophila-75 91.6441 91.64485 +0.0007
drosophila-100 92.38785 92.38855 +0.0007
drosophila-150 93.2112 93.21115 -0.0001
drosophila-200 93.52085 93.52115 +0.0003
drosophila-300 95.36085 95.36125 +0.0004
drosophila-500 95.6936 95.6938 +0.0002
maize-50 71.4703 71.47195 +0.0016
maize-75 82.1255 82.12625 +0.0007
maize-100 87.13175 87.1324 +0.0007
maize-150 91.6731 91.67345 +0.0003
maize-200 92.92045 92.9206 +0.0002
maize-300 96.70835 96.70855 +0.0002
maize-500 97.2899 97.29025 +0.0003
CHM13-50 90.635 90.63505 +0.0001
CHM13-75 92.5158 92.5161 +0.0003
CHM13-100 93.21985 93.22015 +0.0003
CHM13-150 94.14035 94.1402 -0.0002
CHM13-200 94.43405 94.43425 +0.0002
CHM13-300 95.62665 95.62645 -0.0002
CHM13-500 95.9505 95.9505 +0.0000
rye-50 69.1402 69.14095 +0.0007
rye-75 80.53445 80.5352 +0.0007
rye-100 85.6483 85.6488 +0.0005
rye-150 90.20375 90.2038 +0.0001
rye-200 91.4773 91.4774 +0.0001
rye-300 94.5574 94.55795 +0.0005
rye-500 95.1326 95.1331 +0.0005

Average difference: +0.0004

I tried a couple of other things. In the formula you suggested in #321, you had an epsilon_pair term. I tested with epsilon_pair=2 and epsilon_pair=5 and while individual accuracies change a little bit for each dataset, the average difference remains at +0.0004 in both cases. I guess the term would have an effect for other datasets, so if you want, we can include it, but we canot measure its effect with the data we have.

Also, as you suggested, I tried changing the three occurrences of the hard-coded -20 in the function to either -10 or -30 (this was in addition to introducing the std::max()). This made the average difference worse (-0.1860 and -0.0056, respectively).

@ksahlin
Copy link
Owner

ksahlin commented Aug 27, 2023

Great to have this fix!

if you want, we can include it, but we canot measure its effect with the data we have.

With epsilon_pair I meant that it should be the deciding factor only in cases where 'all else equal'. An epsilon_pair 2 or 5 sounds a bit high and may interfere with disjoint alignments that have better alignment score. How about epsilon_pair=0.001? I would be ok with adding such term without having a proper SV dataset.

@marcelm
Copy link
Collaborator Author

marcelm commented Aug 28, 2023

Alright, I added an epsilon of 0.001. I also think it’s fine to have this without more extensive testing.

@ksahlin
Copy link
Owner

ksahlin commented Aug 28, 2023

Ok, approved

@marcelm marcelm merged commit f6b6208 into main Aug 28, 2023
9 checks passed
@marcelm marcelm deleted the pair-score branch August 28, 2023 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scoring of read pairs on same vs. different chromosome
2 participants