Penalize far apart pairs not more than as if they were two single ends #332

marcelm · 2023-08-27T06:44:52Z

Closes #321.

The accuracy increases indeed a little bit.

Comparing accuracy

dataset	main (`4893484`)	this PR (`d723bb4`)	difference
drosophila-50	90.189	90.1903	+0.0013
drosophila-75	91.6441	91.64485	+0.0007
drosophila-100	92.38785	92.38855	+0.0007
drosophila-150	93.2112	93.21115	-0.0001
drosophila-200	93.52085	93.52115	+0.0003
drosophila-300	95.36085	95.36125	+0.0004
drosophila-500	95.6936	95.6938	+0.0002
maize-50	71.4703	71.47195	+0.0016
maize-75	82.1255	82.12625	+0.0007
maize-100	87.13175	87.1324	+0.0007
maize-150	91.6731	91.67345	+0.0003
maize-200	92.92045	92.9206	+0.0002
maize-300	96.70835	96.70855	+0.0002
maize-500	97.2899	97.29025	+0.0003
CHM13-50	90.635	90.63505	+0.0001
CHM13-75	92.5158	92.5161	+0.0003
CHM13-100	93.21985	93.22015	+0.0003
CHM13-150	94.14035	94.1402	-0.0002
CHM13-200	94.43405	94.43425	+0.0002
CHM13-300	95.62665	95.62645	-0.0002
CHM13-500	95.9505	95.9505	+0.0000
rye-50	69.1402	69.14095	+0.0007
rye-75	80.53445	80.5352	+0.0007
rye-100	85.6483	85.6488	+0.0005
rye-150	90.20375	90.2038	+0.0001
rye-200	91.4773	91.4774	+0.0001
rye-300	94.5574	94.55795	+0.0005
rye-500	95.1326	95.1331	+0.0005

Average difference: +0.0004

I tried a couple of other things. In the formula you suggested in #321, you had an epsilon_pair term. I tested with epsilon_pair=2 and epsilon_pair=5 and while individual accuracies change a little bit for each dataset, the average difference remains at +0.0004 in both cases. I guess the term would have an effect for other datasets, so if you want, we can include it, but we canot measure its effect with the data we have.

Also, as you suggested, I tried changing the three occurrences of the hard-coded -20 in the function to either -10 or -30 (this was in addition to introducing the std::max()). This made the average difference worse (-0.1860 and -0.0056, respectively).

ksahlin · 2023-08-27T08:14:37Z

Great to have this fix!

if you want, we can include it, but we canot measure its effect with the data we have.

With epsilon_pair I meant that it should be the deciding factor only in cases where 'all else equal'. An epsilon_pair 2 or 5 sounds a bit high and may interfere with disjoint alignments that have better alignment score. How about epsilon_pair=0.001? I would be ok with adding such term without having a proper SV dataset.

See #321

marcelm · 2023-08-28T08:39:38Z

Alright, I added an epsilon of 0.001. I also think it’s fine to have this without more extensive testing.

ksahlin · 2023-08-28T13:56:22Z

Ok, approved

marcelm force-pushed the pair-score branch from a1a5098 to bc8baa0 Compare August 27, 2023 20:06

Penalize far apart pairs not more than as if they were two single ends

19d5d20

See #321

marcelm force-pushed the pair-score branch from bc8baa0 to 0984eed Compare August 27, 2023 20:15

Update baseline commit

bf1df6b

marcelm force-pushed the pair-score branch from 0984eed to bf1df6b Compare August 27, 2023 20:40

marcelm merged commit f6b6208 into main Aug 28, 2023
9 checks passed

marcelm deleted the pair-score branch August 28, 2023 14:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Penalize far apart pairs not more than as if they were two single ends #332

Penalize far apart pairs not more than as if they were two single ends #332

marcelm commented Aug 27, 2023

ksahlin commented Aug 27, 2023

marcelm commented Aug 28, 2023

ksahlin commented Aug 28, 2023

Penalize far apart pairs not more than as if they were two single ends #332

Penalize far apart pairs not more than as if they were two single ends #332

Conversation

marcelm commented Aug 27, 2023

Comparing accuracy

ksahlin commented Aug 27, 2023

marcelm commented Aug 28, 2023

ksahlin commented Aug 28, 2023