Ingest INRB data with permission #242
From slack channel: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1714159380201999 it was requested that the new INRB data on mpox clade 1b be added to our Nextstrain analysis from
* https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2

INRB is working to add to NCBI, so this is a temporary solution which is similar to:
* fb871ef#diff-2b15577b072066f9c4c63eeb20343e6dc4f1e40ed43239d702743648ef35325eR2

After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset.
* Mainly followed instructions from [Adding new sequences not from GenBank](https://github.com/nextstrain/mpox/tree/59eaf472e7ca870567f21d83e082942fd31a3646/ingest#static-files)
* Assigned records temporary IDs `TMP0000` to `TMP0046`
* Set `authors` to "INRB"
Thanks @j23414! How did you populate the metadata? From the FASTA headers? We might need to

Have you done a test run of ingest to see whether the output looks right? Would be good to do that and link to the results!

I'll see whether I can do that now. This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132
Hi @corneliusroemer! I hacked a fix for the FASTA file headers using the following Perl script (add_ids.pl):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Pool of temporary IDs; Perl's string range yields TMP0000, TMP0001, ...
    my @TMPIDS = ("TMP0000" .. "TMP0099");

    my $i = 0;
    while (<>) {
        if (/>(.*)/) {
            # Prepend ID, authors, region, and country to the original header.
            my $header = $1;
            print ">$TMPIDS[$i++]";
            print "|INRB";
            print "|Africa";
            print "|Democratic Republic of the Congo";
            print "|$header\n";
        } else {
            # Sequence lines pass through unchanged.
            print;
        }
    }

Then ran:

    perl add_ids.pl ingest/submission01_mpox47_2024.fasta > fixedheaders.fasta
    ./ingest/bin/fasta-to-ndjson \
      --fasta fixedheaders.fasta \
      --fields genbank_accession authors region country strain host ocountry division collected \
      --exclude ocountry \
      > ingest/data/inrb.ndjson

And then kept checking
Ohh, thanks for submitting the GitHub Action check! 🙌 We should be able to grep "TMP" from the final sequences.fasta and metadata.tsv files.
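A minimal sketch of that grep check, not taken from the PR itself. It assumes the FASTA headers begin with the temporary ID and that the accession is the first tab-separated column of the metadata; the function names and the example paths are hypothetical.

```shell
#!/bin/sh
# Count FASTA records whose header starts with a temporary ID (">TMP<digits>").
count_tmp_sequences() {
    grep -c '^>TMP[0-9]' "$1"
}

# Count metadata rows whose first tab-separated column is a temporary ID.
# Assumption: the accession is the first column; adjust if the layout differs.
count_tmp_metadata() {
    awk -F '\t' '$1 ~ /^TMP[0-9]+$/ { n++ } END { print n + 0 }' "$1"
}

# Example usage (paths are assumptions, not confirmed by the thread):
# count_tmp_sequences ingest/results/sequences.fasta
# count_tmp_metadata  ingest/results/metadata.tsv
```

With IDs `TMP0000` through `TMP0046` assigned, both counts would be 47 if every record passes through ingest unchanged; a lower number would suggest that some records were dropped or renamed along the way.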
Great, thanks for filling me in on the details! There might be a typo in one of your commands
Thanks for pointing this out! This was on purpose (
While
Test run seems to have worked!
I'll merge then as it simplifies including the sequences in our builds. If there are outliers/issues, we can always simply exclude the accessions post-ingest, in the phylogenetic/nextclade workflows.
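As a rough sketch of how such a post-ingest exclusion could be prepared, the temporary accessions can be listed one per line, which is the shape an exclude list usually takes (for example a file passed to `augur filter --exclude`; whether the downstream workflows are configured that way is an assumption). The function name and file paths are hypothetical.

```shell
#!/bin/sh
# Print the temporary accessions found in a FASTA file, one per line, sorted
# and de-duplicated, in a form that could be appended to an exclude list.
list_tmp_accessions() {
    # Keep only the leading TMP<digits> token of each matching header.
    sed -n 's/^>\(TMP[0-9][0-9]*\).*/\1/p' "$1" | sort -u
}

# Example usage (paths are assumptions):
# list_tmp_accessions ingest/results/sequences.fasta >> exclude_accessions.txt
```

In practice only the problematic accessions would be kept in the list, so the output is a starting point to trim rather than something to apply wholesale.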
Nice work @j23414! I hope the old instructions were somewhat helpful? Please feel free to update it with any extra steps you had to take here.
FYI, since all sources go through the ingest pipeline, you could have added this to the geolocation-rules.tsv as
Description of proposed changes
From slack, it was requested that the new INRB data on mpox clade I from https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2 be added to our Nextstrain analysis. INRB is working to add to NCBI, so this is a temporary solution similar to what has been done previously.
After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset.
* Assigned records temporary IDs `TMP0000` to `TMP0046`
* Set `authors` to "INRB"

Please feel free to push further commits to this branch or suggest changes.