Fix GFA1 path overlaps #8

jurikuronen · 2021-11-03T13:22:07Z

Hi and thanks a lot for your work on Cuttlefish!

I've been working on a GFA1 file parser and have used Cuttlefish to generate GFA1 files. In doing so, I noticed that the Path lines output by Cuttlefish are ambiguous and break the GFA 1.0 Format Specification. Quoting https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md#p-path-line:

If specified, the Overlaps field must have one fewer values than the number of segment names and orientations in the SegmentNames field.

However, Cuttlefish currently outputs an extra overlap value.
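
For reference, the quoted rule can be checked mechanically. Below is a minimal sketch of such a check, not code from Cuttlefish or from any existing GFA parser; the function name `check_path_line` is hypothetical, and the sketch assumes tab-separated fields with `*` meaning an unspecified Overlaps field.

```python
def check_path_line(line):
    """Return (segment_count, overlap_count) for one GFA1 P line.

    Raises ValueError if the Overlaps field is specified but does not hold
    exactly one fewer value than the SegmentNames field.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "P":
        raise ValueError("not a GFA1 path line")

    segments = fields[2].split(",")

    # Assumption: a missing column or "*" means the overlaps are unspecified.
    if len(fields) < 4 or fields[3] == "*":
        return len(segments), 0

    overlaps = fields[3].split(",")
    if len(overlaps) != len(segments) - 1:
        raise ValueError(
            f"path {fields[1]!r}: {len(segments)} segments "
            f"but {len(overlaps)} overlaps"
        )
    return len(segments), len(overlaps)
```

A path line with two segments and two overlaps is rejected by this check, while the same line with a single overlap passes.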

Here are two small reproducible examples which highlight the issue. First, consider the FASTA file with contents:

>Example1.fasta
CTANAAC

I used the symbol 'N' to induce a zero-overlap link, so that we can track which overlaps correspond to which links. Running Cuttlefish (and KMC) on this file with -k 3 produces a GFA1 file with contents:

H	VN:Z:1.0
P	Reference:1_Sequence:Example1.fasta	1+,0+	0M,0M
S	1	CTA	LN:i:3
S	0	AAC	LN:i:3
L	1	+	0	+	0M

Here, "0M" is given twice. The correct overlap output for the path would be a single "0M". Next, consider the FASTA file with contents:

>Example2.fasta
ACTANAACT

and the produced GFA1 file with contents:

H	VN:Z:1.0
P	Reference:1_Sequence:Example2.fasta	2+,1+,0+,2+	2M,2M,0M,2M
S	2	ACT	LN:i:3
S	1	CTA	LN:i:3
L	2	+	1	+	2M
S	0	AAC	LN:i:3
L	1	+	0	+	0M
L	0	+	2	+	2M

Here, the correct overlap output would be "2M,0M,2M".
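
The expected value can be derived from the L lines: each consecutive pair of path steps should carry the overlap of the link that joins them. The following is a rough sketch of that lookup, assuming every step pair has a link recorded in exactly the same direction and orientation (reverse-complement equivalents are ignored); the function name `expected_path_overlaps` is hypothetical.

```python
def expected_path_overlaps(gfa_text, path_name):
    """Rebuild a path's Overlaps field from the L lines of a GFA1 file."""
    links = {}
    steps = None
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "L":
            # L  from  from_orient  to  to_orient  overlap
            key = (fields[1] + fields[2], fields[3] + fields[4])
            links[key] = fields[5]
        elif fields[0] == "P" and fields[1] == path_name:
            steps = fields[2].split(",")
    if steps is None:
        raise KeyError(path_name)
    # One overlap per consecutive pair of steps, so len(steps) - 1 in total.
    return ",".join(links[pair] for pair in zip(steps, steps[1:]))


example2 = "\n".join([
    "H\tVN:Z:1.0",
    "P\tReference:1_Sequence:Example2.fasta\t2+,1+,0+,2+\t2M,2M,0M,2M",
    "S\t2\tACT\tLN:i:3",
    "S\t1\tCTA\tLN:i:3",
    "L\t2\t+\t1\t+\t2M",
    "S\t0\tAAC\tLN:i:3",
    "L\t1\t+\t0\t+\t0M",
    "L\t0\t+\t2\t+\t2M",
])
print(expected_path_overlaps(example2, "Reference:1_Sequence:Example2.fasta"))
# prints 2M,0M,2M
```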

It appears that each thread's overlap buffer already contains the correct overlaps, and that the first overlap of the path, which is output manually, is a duplicate. My proposed fix is to simply output the overlap buffer (without its leading comma). After applying the fix, Cuttlefish produces the GFA1 files

H	VN:Z:1.0
P	Reference:1_Sequence:Example1.fasta	1+,0+	0M
S	1	CTA	LN:i:3
S	0	AAC	LN:i:3
L	1	+	0	+	0M

and

H	VN:Z:1.0
P	Reference:1_Sequence:Example2.fasta	2+,1+,0+,2+	2M,0M,2M
S	2	ACT	LN:i:3
S	1	CTA	LN:i:3
L	2	+	1	+	2M
S	0	AAC	LN:i:3
L	1	+	0	+	0M
L	0	+	2	+	2M

which are correct.
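
The idea behind the fix can be illustrated with a small model. This is only a sketch of the behaviour described above, not Cuttlefish's actual C++ code: the function names are hypothetical, and the buffer is assumed to hold each overlap preceded by a comma.

```python
def overlaps_before_fix(first_overlap, overlap_buffer):
    """Model of the old output: a manually written first overlap,
    followed by a buffer that already contains every overlap."""
    return first_overlap + overlap_buffer


def overlaps_after_fix(overlap_buffer):
    """Model of the proposed fix: write only the buffer,
    dropping its leading comma."""
    if overlap_buffer.startswith(","):
        return overlap_buffer[1:]
    return overlap_buffer


# Assumed buffer contents for the Example2 path 2+,1+,0+,2+.
buffer = ",2M,0M,2M"
print(overlaps_before_fix("2M", buffer))  # prints 2M,2M,0M,2M
print(overlaps_after_fix(buffer))         # prints 2M,0M,2M
```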

jamshed · 2021-11-12T20:15:29Z

Hello @jurikuronen,

Glad to know that Cuttlefish has been useful to you. Thanks a lot for noticing the bug, tracking down its source, and fixing it! I'm pulling it into develop for now, and will merge it into master and bump the version with this bug-fix patch ASAP!

Regards.

Fix GFA1 path overlaps

1941018

jamshed changed the base branch from master to develop November 12, 2021 20:12

jamshed merged commit f442101 into COMBINE-lab:develop Nov 12, 2021

jurikuronen mentioned this pull request Dec 1, 2021

Fix GFA1 path overlaps, revisited #9

Closed
