Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GFA1 path overlaps #8

Merged
merged 1 commit into from
Nov 12, 2021
Merged

Fix GFA1 path overlaps #8

merged 1 commit into from
Nov 12, 2021

Conversation

jurikuronen
Copy link
Contributor

@jurikuronen jurikuronen commented Nov 3, 2021

Hi and thanks a lot for your work on Cuttlefish!

I've been working on a GFA1 file parser and have used Cuttlefish to generate GFA1 files. In doing so, I noticed that the Path lines output by Cuttlefish are ambiguous and break the GFA 1.0 Format Specification. Quoting https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md#p-path-line:

If specified, the Overlaps field must have one fewer values than the number of segment names and orientations in the SegmentNames field.

However, Cuttlefish currently outputs an extra overlap value.

Here are two small reproducible examples which highlight the issue. First, consider the FASTA file with contents:

>Example1.fasta
CTANAAC

I used the symbol 'N' to induce a zero overlap link so that we can track which overlaps correspond to which links. Running Cuttlefish (and KMC) on this file with -k 3 produces the GFA1 file with contents:

H	VN:Z:1.0
P	Reference:1_Sequence:Example1.fasta	1+,0+	0M,0M
S	1	CTA	LN:i:3
S	0	AAC	LN:i:3
L	1	+	0	+	0M

Here, "0M" is given twice. The correct overlap output for the path would be a single "0M". Next, also consider the FASTA file with contents:

>Example2.fasta
ACTANAACT

and the produced GFA1 file with contents:

H	VN:Z:1.0
P	Reference:1_Sequence:Example2.fasta	2+,1+,0+,2+	2M,2M,0M,2M
S	2	ACT	LN:i:3
S	1	CTA	LN:i:3
L	2	+	1	+	2M
S	0	AAC	LN:i:3
L	1	+	0	+	0M
L	0	+	2	+	2M

Here, the correct overlap output would be "2M,0M,2M".

It appears that the overlap buffer for each thread already contains the correct overlaps and the manually output first overlap of the path is a duplicate. My proposed fix is to simply output the overlap buffer (without the first comma). After applying the fix, Cuttlefish produces the GFA1 files

H	VN:Z:1.0
P	Reference:1_Sequence:Example1.fasta	1+,0+	0M
S	1	CTA	LN:i:3
S	0	AAC	LN:i:3
L	1	+	0	+	0M

and

H	VN:Z:1.0
P	Reference:1_Sequence:Example1.fasta	2+,1+,0+,2+	2M,0M,2M
S	2	ACT	LN:i:3
S	1	CTA	LN:i:3
L	2	+	1	+	2M
S	0	AAC	LN:i:3
L	1	+	0	+	0M
L	0	+	2	+	2M

which are correct.

@jamshed jamshed changed the base branch from master to develop November 12, 2021 20:12
@jamshed
Copy link
Member

jamshed commented Nov 12, 2021

Hello @jurikuronen,

Glad to know that Cuttlefish has been useful to you. And also thanks a lot for noticing the bug, as well as tracking the source and fixing it! I'm pulling it in to develop for now, and will merge it to master and bump the version including this bug-fix patch ASAP!

Regards.

@jamshed jamshed merged commit f442101 into COMBINE-lab:develop Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants