Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix get_end VCF parser - ENSINT-2005 #171

Merged
merged 4 commits into from
Nov 8, 2024
Merged

Conversation

sgiorgetti
Copy link
Contributor

Changes so far

  • patched get_end (not for SV)
  • fixed test suite and re-arranged test data

NOTES - This is still draft

  • These changes are on top of release/114
  • @nakib103 to confirm it is OK, as now get_end and get_raw_end are the same unless $info->{SVLEN}
  • add tests to check get_end Vs get_raw_end esp. when the two differ ... when?

- patched get_end (not for SV)
- fixed test suite and re-arranged test data
@nakib103
Copy link

nakib103 commented Oct 20, 2024

@sgiorgetti,

About the question that get_end and get_raw_end becoming the same functionality, I think that makes sense. In this wiki we can see that in VCF with correct normalization, the alleles can never end in same nucleotide. Hence we should not need to do trimming on the right (and the parser do not do any such thing). As there is no operation done on on the right, it should be correct if both of the function return the same value.

PS: with the latest change request I made the get_end checks an extra case. The get_raw_end function should also get that case (although it is not affecting us too much as I saw get_raw_end used for non-SV cases in our code, but it would be more correct to do so).

or, better, if we add all the changes to get_end, from the arguments made above - we can call call get_end from get_raw_end function and return the same result.

@sgiorgetti sgiorgetti marked this pull request as ready for review October 29, 2024 09:53
@sgiorgetti
Copy link
Contributor Author

@nakib103 - I have updated the code, basically as per your suggestions (thanks!). Would you mind to double check the logic of the changes?

@TamaraNaboulsi - would you mind to have a look at these changes, should I have missed anything?

Thank you both!

@@ -310,17 +310,20 @@ sub get_end {
my $self = shift;
my $info = $self->get_info;
my $end;
my $alternatives = join(",", @{$self->get_alternatives});
if (defined($info->{END})) {
$end = $info->{END};
}
elsif(defined($info->{SVLEN})) {
Copy link

@nakib103 nakib103 Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiorgetti,

Regarding - https://github.com/Ensembl/ensembl-io/pull/171/files#r1814776276

  1. The expected change would be to change the order in the if-elsif like so,

currently we have -

    if (defined($info->{END})) {
      $end = $info->{END};
    }
    elsif(defined($info->{SVLEN})) {
      return unless $self->get_start;
      my $svlen = (split(',',$info->{SVLEN}))[0];
      $end = $self->get_start + abs($svlen) - 1;
    }

change it to -

    if (defined($info->{SVLEN})) {
      return unless $self->get_start;
      my $svlen = (split(',',$info->{SVLEN}))[0];
      $end = $self->get_start + abs($svlen) - 1;
    }
    elsif (defined($info->{END})) {
      $end = $info->{END};
    }

So we calculate end using SVLEN first if available.

  1. Also, for sub get_raw_end, it is better to change it to call get_end as all these changes should be added in get_raw_end too (but we do not want to duplicate effort) -
sub get_raw_end {
    my $self = shift;
    return $self->get_end;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ... and you mentioned that before ... bear with me

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sgiorgetti !

@nakib103
Copy link

Hi @sgiorgetti , thanks a lot for the changes! There still one change request that needs to be addressed. I have added that above.

Also tagging @likhitha-surapaneni as she will also be reviewing the changes.

@sgiorgetti
Copy link
Contributor Author

Re-requested review to confirm - also added @likhitha-surapaneni explicitly :-)

Copy link
Member

@TamaraNaboulsi TamaraNaboulsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change includes:

  • Updated the get_raw_end method to be equivalent to get_end
  • Updated get_end method to include alternative sequences, checks for additional info (SVTYPE), and changing the order of checking SVLEN vs END
  • Updated the test file with some clarity in assigning the test_sample and added a new test for loading the 7th record

Looks good to me and all requests were resolved.

Copy link

@likhitha-surapaneni likhitha-surapaneni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix @sgiorgetti, the code works as expected for the examples tested. Currently, testing on a larger SV VCF file to check that there are no breaking changes.

Copy link

@likhitha-surapaneni likhitha-surapaneni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with larger file and looks good to me

@sgiorgetti
Copy link
Contributor Author

Thanks @likhitha-surapaneni ! Shall go ahead and merge this in.

@sgiorgetti sgiorgetti merged commit ec3b610 into release/114 Nov 8, 2024
2 checks passed
@sgiorgetti sgiorgetti deleted the fix/vcf_parser branch November 8, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants