Ingest INRB data with permission #242

j23414 · 2024-04-26T23:04:22Z

Description of proposed changes

A request came in on Slack (https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1714159380201999) to add the new INRB data on mpox clade I from https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2 to our Nextstrain analysis. INRB is working on adding the records to NCBI, so this is a temporary solution, similar to what was done previously in fb871ef#diff-2b15577b072066f9c4c63eeb20343e6dc4f1e40ed43239d702743648ef35325eR2.

Now that permission has been obtained, this PR temporarily adds the records here so they are included in the curated dataset.

Mainly followed the instructions in "Adding new sequences not from GenBank" (https://github.com/nextstrain/mpox/tree/59eaf472e7ca870567f21d83e082942fd31a3646/ingest#static-files)
Assigned records temporary IDs TMP0000 to TMP0046
Set authors to "INRB"

Please feel free to push further commits to this branch or suggest changes.

Related issue(s)

Checklist

Checks pass

corneliusroemer · 2024-04-29T18:51:40Z

Thanks @j23414! How did you populate the metadata? From the FASTA headers? We might need to s/find/replace some of the fields to conform with what ingest expects them to be called - unless you've already done so manually!

Have you done a test run of ingest to see whether the output looks right? Would be good to do that and link to the results! I'll see whether I can do that now.

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132
Will have to check once workflow is done.

j23414 · 2024-04-29T19:02:20Z

How did you populate the metadata? From the FASTA headers?

Hi @corneliusroemer! I hacked together a fix for the FASTA headers using the following Perl script (add_ids.pl):

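#!/usr/bin/env perl
# add_ids.pl: prefix each FASTA header with a temporary ID and fixed
# authors/region/country fields; sequence lines pass through unchanged.

use strict;
use warnings;

# Pool of temporary IDs; Perl's string range yields TMP0000 .. TMP0099.
my @TMPIDS = ("TMP0000" .. "TMP0099");

my $i = 0;
while (<>) {
  if (/^>(.*)/) {
    my $header = $1;
    die "Ran out of temporary IDs at record $i\n" if $i > $#TMPIDS;
    # New header: ID|authors|region|country|original header
    print join("|", ">$TMPIDS[$i++]", "INRB", "Africa",
               "Democratic Republic of the Congo", $header), "\n";
  } else {
    print;
  }
}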

Then ran

perl add_ids.pl ingest/submission01_mpox47_2024.fasta > fixedheaders.fasta
./ingest/bin/fasta-to-ndjson \
 --fasta fixedheaders.fasta \
 --fields genbank_accession authors region country strain host ocountry division collected \
 --exclude ocountry \
 > ingest/data/inrb.ndjson

I then reran nextstrain build ingest repeatedly, adjusting the field names as needed until the run succeeded.
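To make the header layout concrete, here is a minimal sketch of how a pipe-delimited header in the shape add_ids.pl emits could line up, position by position, with the --fields list above. It assumes the original INRB header was itself pipe-delimited (strain|host|ocountry|division|collected), which this thread does not show, and the sample record is made up. The `header_to_fields` function is a hypothetical helper that only mimics the positional mapping; it is not how fasta-to-ndjson is implemented.

```shell
#!/usr/bin/env bash
# Field names copied from the --fields / --exclude arguments in the thread.
FIELDS="genbank_accession authors region country strain host ocountry division collected"
EXCLUDE="ocountry"

# Hypothetical helper: read FASTA on stdin and print "field=value" for each
# header part, matched by position, skipping the excluded field.
header_to_fields() {
  awk -v fields="$FIELDS" -v exclude="$EXCLUDE" '
    BEGIN { n = split(fields, name, " ") }
    /^>/ {
      sub(/^>/, "")
      m = split($0, value, "|")
      for (i = 1; i <= n && i <= m; i++)
        if (name[i] != exclude) print name[i] "=" value[i]
    }'
}

# Made-up record; only the first four parts mirror what add_ids.pl writes.
printf '>TMP0000|INRB|Africa|Democratic Republic of the Congo|sampleA|human|DRC|SomeDivision|2024-01-01\nACGT\n' \
  | header_to_fields
```

In this sketch the seventh part (the original "DRC" country value) is dropped, while the fourth part supplies the country field.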

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132
Will have to check once workflow is done.

Ohh, thanks for submitting the GitHub Actions check! 🙌 We should be able to grep "TMP" in the final sequences.fasta and metadata.tsv files.
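A minimal sketch of that check, written as two hypothetical helper functions that read from stdin so the caller chooses how to decompress. It assumes the temporary ID is the FASTA record name and the first column of metadata.tsv, which is not confirmed in this thread; the sample records are made up.

```shell
#!/usr/bin/env bash
# Count FASTA records whose ID starts with "TMP" (header lines only).
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_tmp_fasta() {
  grep -c '^>TMP' || true
}

# Count TSV rows (skipping the header row) whose first column starts with "TMP".
# Assumption: the accession is the first column.
count_tmp_tsv() {
  awk -F'\t' 'NR > 1 && $1 ~ /^TMP/ { n++ } END { print n + 0 }'
}

# Made-up sample data.
printf '>TMP0000\nACGT\n>OTHER01\nACGT\n>TMP0001\nACGT\n' | count_tmp_fasta
printf 'accession\tcountry\nTMP0000\tX\nOTHER01\tY\n' | count_tmp_tsv
```

Against the real outputs this could be run as `xz -dc sequences.fasta.xz | count_tmp_fasta` and `gzip -dc metadata.tsv.gz | count_tmp_tsv`. If every record from TMP0000 to TMP0046 made it through ingest, both counts would be expected to be 47.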

corneliusroemer · 2024-04-29T19:06:19Z

Great, thanks for filling me in on the details! There might be a typo in one of your commands: ocountry rather than country.

j23414 · 2024-04-29T19:10:59Z

ocountry

Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.
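For comparison, a sketch of how the substitution could be limited to header lines with a sed address, so sequence lines are never touched. This is an alternative to the approach taken in this PR, not what was actually run; `expand_drc_in_headers` is a hypothetical name and the sample record is made up. D and R are valid IUPAC ambiguity codes, so the string "DRC" is not strictly impossible in nucleotide data.

```shell
#!/usr/bin/env bash
# Expand "DRC" only on FASTA header lines (those starting with ">").
expand_drc_in_headers() {
  sed '/^>/ s/DRC/Democratic Republic of the Congo/g'
}

# Made-up record: the header is rewritten, the sequence line is left alone.
printf '>TMP0000|INRB|Africa|DRC\nACGTDRCACGT\n' | expand_drc_in_headers
```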

corneliusroemer · 2024-04-30T15:03:24Z

Test run seems to have worked!

wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/metadata.tsv.gz
wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/sequences.fasta.xz

I'll merge, then, since that simplifies including the sequences in our builds. If there are outliers or other issues, we can always exclude the accessions post-ingest, in the phylogenetic/nextclade workflows.

joverlee521 · 2024-04-30T17:06:12Z

Nice work @j23414! I hope the old instructions were somewhat helpful? Please feel free to update them with any extra steps you had to take here.

Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.

FYI, since all sources go through the ingest pipeline, you could have added this to the geolocation-rules.tsv as

Africa/DRC/*/*    Africa/Democratic Republic of the Congo/*/*
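To illustrate what such a rule expresses, here is a small sketch that applies this one rule to tab-separated region, country, division and location columns. It only demonstrates the wildcard semantics as I read them from the rule above ("*" matches anything on the left and keeps the existing value on the right). `apply_rule` is a hypothetical helper, not the script the ingest pipeline uses, and the sample row is made up.

```shell
#!/usr/bin/env bash
# The rule from the comment above, split into its two sides.
RULE_RAW="Africa/DRC/*/*"
RULE_NEW="Africa/Democratic Republic of the Congo/*/*"

# Hypothetical helper: rewrite rows on stdin that match RULE_RAW.
apply_rule() {
  awk -F'\t' -v OFS='\t' -v raw="$RULE_RAW" -v new="$RULE_NEW" '
    BEGIN { split(raw, r, "/"); split(new, a, "/") }
    {
      match_all = 1
      # A row matches when every non-wildcard level equals its column.
      for (i = 1; i <= 4; i++)
        if (r[i] != "*" && r[i] != $i) match_all = 0
      # On a match, overwrite only the levels the rule spells out.
      if (match_all)
        for (i = 1; i <= 4; i++)
          if (a[i] != "*") $i = a[i]
      print
    }'
}

# Made-up row: only the country column changes.
printf 'Africa\tDRC\tSomeDivision\tSomeLocation\n' | apply_rule
```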

j23414 requested review from trvrb and a team April 26, 2024 23:04

j23414 added 2 commits April 26, 2024 16:06

Include inrb data in ingest

613d9ab

j23414 force-pushed the add-inrb-with-permission branch from ee7e06b to 613d9ab Compare April 26, 2024 23:08

corneliusroemer merged commit 56fb8cb into master Apr 30, 2024
26 checks passed

corneliusroemer deleted the add-inrb-with-permission branch April 30, 2024 15:04

j23414 mentioned this pull request Apr 30, 2024

Add submission date to the INRB sequences #243

Merged

1 task
