Download fails after several hours, 'processData' not found #3

Open
zbendiks opened this issue Mar 4, 2020 · 14 comments
Labels
enhancement New feature or request

Comments

@zbendiks

zbendiks commented Mar 4, 2020

Hi,

I'm having trouble downloading annotations from MG-RAST. Here's an example of the terminal output I got when trying to download MG-RAST sample ID 'mgm4824985.3'

> chordomics::launchApp()
Loading required package: shiny

Listening on http://127.0.0.1:6428
Warning: Error in if: argument is of length zero
  [No stack trace available]
trying URL 'http://api.metagenomics.anl.gov/annotation/sequence/mgm4824985.3?evalue=10&type=ontology&source=COG'
Content type 'text/plain; charset=ISO-8859-1' length unknown
downloaded 491.0 MB

trying URL 'http://api.metagenomics.anl.gov/annotation/sequence/mgm4824985.3?evalue=10&type=organism&source=RefSeq'
Content type 'text/plain; charset=ISO-8859-1' length unknown
downloaded 1868.5 MB

after running for several hours

trying URL 'https://api.metagenomics.anl.gov/download/mgm4824985.3?file=050.1'
Warning in download.file(url = raw_input_path, destfile = input_dest_file) :
  cannot open URL 'https://api.metagenomics.anl.gov/download/mgm4824985.3?file=050.1': HTTP status was '404 Not Found'
Warning: Error in value[[3L]]: object 'processData' not found
  [No stack trace available]

Chordomics is giving errors when it is trying to locate the MG-RAST FASTA files, and this has happened for several of my samples now. When the error occurs, the entire Chordomics browser goes grey and I can no longer interact with it. I'm not sure how to move forward with my analysis and I was hoping to get some advice

@zbendiks
Author

zbendiks commented Mar 4, 2020

Follow-up post:

I just tried to download annotations for sample mgm4824992.3, and I received a similar error when downloading the RefSeq annotation file, except this time it was caused by a timeout rather than a 404 error.

> chordomics::launchApp()
Loading required package: shiny

Listening on http://127.0.0.1:6521
trying URL 'http://api.metagenomics.anl.gov/annotation/sequence/mgm4824992.3?evalue=10&type=ontology&source=COG'
Content type 'text/plain; charset=ISO-8859-1' length unknown
downloaded 29.7 MB

trying URL 'http://api.metagenomics.anl.gov/annotation/sequence/mgm4824992.3?evalue=10&type=organism&source=RefSeq'
Warning in download.file(url = full_organism_path, destfile = organism_dest_file) :
  InternetOpenUrl failed: 'The operation timed out'
Warning: Error in value[[3L]]: object 'processData' not found
  [No stack trace available]

Are these issues related to the MG-RAST server? Every download I've tried so far has failed like this and I'm not sure how to address it

@nickp60
Collaborator

nickp60 commented Mar 5, 2020

Hi @zbendiks ,
Thanks for bearing with these issues! Downloading large amounts of data from MG-RAST is really time-consuming, especially for such a large metagenome. The Shiny App tries to automate this, but it looks like it's just taking too long, and either Shiny or your network or MG-RAST's network gives up.

Chordomics creates a chordomics/<mgrast_id> directory to store the downloaded data. Can you manually download the files to that folder and try again? For your data, you will need the following files:

# $HOME is wherever ~ evaluates to on Windows, usually something like C:\Users\<username>\
$HOME\chordomics\mgm4824985.3\ontology # download http://api.metagenomics.anl.gov/annotation/sequence/mgm4824985.3?evalue=10&type=ontology&source=COG
$HOME\chordomics\mgm4824985.3\organism # download http://api.metagenomics.anl.gov/annotation/sequence/mgm4824985.3?evalue=10&type=organism&source=RefSeq
$HOME\chordomics\mgm4824985.3\input_data # download https://api.metagenomics.anl.gov/download/mgm4824985.3?file=050.1

It looks like those first two files will already be there; if input_data is there, delete it and download a fresh copy -- it's likely incomplete.
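The manual download described above could be scripted roughly as follows. This is a minimal sketch, not part of Chordomics: the helper names `mgrast_targets` and `fetch` are hypothetical, the URLs are the ones that appear in the logs in this thread, and the timeout and chunk size are arbitrary assumptions. `base_dir` should be whatever directory R resolves `~` to on your machine.

```python
import shutil
import urllib.request
from pathlib import Path

API = "http://api.metagenomics.anl.gov"


def mgrast_targets(mgrast_id, base_dir):
    """Map each file Chordomics looks for to the URL seen in the logs (hypothetical helper)."""
    out_dir = Path(base_dir) / "chordomics" / mgrast_id
    annotation = f"{API}/annotation/sequence/{mgrast_id}?evalue=10"
    return {
        out_dir / "ontology": f"{annotation}&type=ontology&source=COG",
        out_dir / "organism": f"{annotation}&type=organism&source=RefSeq",
        out_dir / "input_data": f"https://api.metagenomics.anl.gov/download/{mgrast_id}?file=050.1",
    }


def fetch(url, dest, timeout=600):
    """Stream url to dest through a '.part' file, so an interrupted
    download is never left under the final file name."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(tmp, "wb") as fh:
        shutil.copyfileobj(resp, fh, length=1024 * 1024)
    tmp.replace(dest)


# Example (performs network requests, so it is left commented out):
# for dest, url in mgrast_targets("mgm4824985.3", Path.home()).items():
#     fetch(url, dest)
```

Writing to a temporary name first means a timeout leaves only a `.part` file behind, which avoids the "likely incomplete" input_data situation mentioned above.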

Thanks again for letting us know about the issues; let me know how this goes!

@zbendiks
Author

zbendiks commented Mar 6, 2020

Hi @nickp60 ,

Thanks for the prompt response! I downloaded the COG, RefSeq, and FASTA files for sample 'mgm4824993.3' via the MG-RAST API with the following commands:

mkdir ~/PATH/chordomics/mgm4824993.3
cd ~/PATH/chordomics/mgm4824993.3

curl "http://api.metagenomics.anl.gov/annotation/sequence/mgm4824993.3?evalue=10&type=ontology&source=COG" > ontology
curl "http://api.metagenomics.anl.gov/annotation/sequence/mgm4824993.3?evalue=10&type=organism&source=RefSeq" > organism
curl "https://api.metagenomics.anl.gov/download/mgm4824993.3?file=050.1" > input_data

I then ran Chordomics in 'Automatic' mode with the MG-RAST ID#. Chordomics correctly recognized the folder and input files. It did spit out some warnings regarding 'single-line footers' but everything kept running so I figure it wasn't a big deal:

Loading required package: shiny

Listening on http://127.0.0.1:7533
Warning in dir.create(THIS_TMP_DIR) :
  'C:\Users\zbendiks\Documents\chordomics\mgm4824993.3' already exists
Warning in data.table::fread(ontology_dest_file, drop = 3, col.names = names) :
  Discarded single-line footer: <<Download complete. 577686 rows retrieved>>
|--------------------------------------------------|
|==================================================|
Warning in data.table::fread(organism_dest_file, drop = 3, col.names = names) :
  Discarded single-line footer: <<Download complete. 4210913 rows retrieved>>
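The "single-line footer" that fread discards can double as a completeness check for a manual download. The sketch below is an assumption-laden illustration, not something Chordomics does: `declared_rows` is a hypothetical helper, and the footer pattern is taken only from the two warnings shown above.

```python
import re
from pathlib import Path

# Footer text as it appears in the fread warnings above.
FOOTER = re.compile(r"^Download complete\. (\d+) rows retrieved\s*$")


def declared_rows(path, tail_bytes=4096):
    """Return the row count stated in the trailing footer, or None if
    the file does not end with that footer (e.g. a truncated download)."""
    path = Path(path)
    size = path.stat().st_size
    with open(path, "rb") as fh:
        # Only read the end of the file; these downloads can be gigabytes.
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read().decode("latin-1")
    lines = [ln for ln in tail.splitlines() if ln.strip()]
    if not lines:
        return None
    match = FOOTER.match(lines[-1])
    return int(match.group(1)) if match else None
```

A `None` result for `ontology` or `organism` would suggest the transfer stopped early and the file should be fetched again.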

Chordomics then matched the taxids and now it is trying to merge the data

matching scientific name at the species level to taxid; this can take some time
complete 
merging data

But it's been stuck here for ~ 4 hours now. I'll let it run overnight and see if it works

@zbendiks
Author

zbendiks commented Mar 6, 2020

Hi @nickp60

It's been ~18 hours since the data merging step began, but it doesn't seem like there's been any progress. Realistically, how long should I expect this step to take?

@nickp60
Collaborator

nickp60 commented Mar 7, 2020

Hmm, I will give it a go on my machine and try to see what the story is.

@nickp60
Collaborator

nickp60 commented Mar 9, 2020

... still downloading ...

@nickp60
Collaborator

nickp60 commented Mar 12, 2020

Hey, so sorry for the delay; I wasn't paying attention, and didn't realize this wasn't an assembled metagenome. Right now, Chordomics is really only geared to deal with assembled metagenomes rather than raw reads. The code just isn't built to handle raw reads. Your best bet would probably be to assemble the reads yourself and upload them to MG-RAST as a companion project. I'll keep trying to process it here on my end, but I don't know how far I'll get.

@zbendiks
Author

zbendiks commented Mar 12, 2020

Hi,

These are RNA-Seq metatranscriptome samples. Previously I assembled them with Trinity but ran into problems with chimeric sequences, and only a small percentage of reads were successfully annotated. An MG-RAST developer suggested that I skip the upstream assembly + abundance estimation and just submit my metatranscriptomes as short reads rather than assembled contigs, and this improved my data quality quite a bit.

To my knowledge, MG-RAST doesn't provide any means to map short reads back to assembled contigs to estimate contig abundance. How is Chordomics able to describe changes in function across time, experimental group, etc. with just contig annotations but no abundance information?

@nickp60
Collaborator

nickp60 commented Mar 13, 2020

Hi @zbendiks ,
Yes, it's a known limitation of this way of displaying the data. Currently, Chordomics's abundances refer to the proportion of sequences jointly belonging to a given taxonomic and functional group (e.g., how does the protease diversity of the community change over time at the genus level, rather than how does the relative abundance of each genus's proteases change over time). This decision was based more on the type of data you get from metaproteomes than from sequencing studies -- coverage is frequently very low, so the functional proportional change is often of more interest. Granted, as you mention, it only tells the diversity story.
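To make the "proportion of sequences jointly belonging to a taxonomic and functional group" concrete, here is a small illustration. It is a sketch under assumptions, not Chordomics code: `joint_proportions` is a hypothetical name and the genus and function labels are made up.

```python
from collections import Counter


def joint_proportions(links):
    """Fraction of annotated sequences falling in each (taxon, function) pair."""
    counts = Counter(links)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Made-up example: one (genus, function) pair per annotated sequence.
links = [
    ("Thermococcus", "Protease"),
    ("Thermococcus", "Protease"),
    ("Pyrococcus", "Protease"),
    ("Pyrococcus", "No COG"),
]
proportions = joint_proportions(links)
```

Each sequence counts once regardless of how many reads map to it, which is why this view describes diversity rather than abundance.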

@KevinMcDonnell6, how hard would it be to add an option to display coverage as color, based on an additional column in the input data? I may be forgetting some of the details, but say we have a dataset with 100 taxa-function links, and 50 of those taxa link to "No COG". I believe currently we would end up with a big arc for that link, representing that this is 50% of the data. Say, however, that all those 50 links had low values in a "coverage" column. Could we use either color alpha or an intensifying color scale to display that, rather than assigning colors just based on COG?

I have attached a small dataset of 20, where I have added a column for "Coverage". The Thermococcaceae have much higher coverage than the others, despite only representing 15%. Could we (optionally) display this as color intensity?

example_w_coverage.csv.zip

Screenshot 2020-03-13 at 13 29 43
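One way the proposed coverage-as-intensity option might work is a linear rescale of per-link coverage to an alpha value. This is only a sketch of the idea raised above, not an existing Chordomics feature: `coverage_to_alpha`, the `min_alpha` floor, and the example numbers are all assumptions.

```python
def coverage_to_alpha(coverages, min_alpha=0.2):
    """Linearly rescale coverage values to opacities in [min_alpha, 1.0].

    coverages maps a link label to its (e.g. mean) coverage.
    """
    if not coverages:
        return {}
    lo, hi = min(coverages.values()), max(coverages.values())
    span = hi - lo
    if span == 0:
        # All links equally covered: draw everything fully opaque.
        return {k: 1.0 for k in coverages}
    return {
        k: min_alpha + (1.0 - min_alpha) * (v - lo) / span
        for k, v in coverages.items()
    }


# Made-up coverage values; the best-covered link gets alpha 1.0.
alphas = coverage_to_alpha({"Thermococcaceae": 110, "Other A": 60, "Other B": 10})
```

Keeping a non-zero floor means low-coverage arcs stay visible instead of disappearing from the plot.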

@nickp60 nickp60 added the enhancement New feature or request label Mar 13, 2020
@zbendiks
Author

Thanks @nickp60 , that cleared things up for me. It's unfortunate that my data won't work with Chordomics as is, but I'll look into submitting contig assemblies to MG-RAST.

It looks like my data is getting hung up around here (from MG-RAST_preprocess.R)

#  we need to get rid of the bit they put after the seqname
  ont$id <- gsub("(.*)\\|(.*)\\|.*", "\\1|\\2", ont$id)
  org$id <- gsub("(.*)\\|(.*)\\|.*", "\\1|\\2", org$id)

  ont_names <- unique(ont$id)
  org_names <- unique(org$id)
  #  Most sequences with ontology have organism
  table(ont_names %in% org_names)
  #  But only half sequences with organism have ontology
  table(org_names %in% ont_names)


  # dplyr to the rescue
  # we need to unnest the possible taxids, and then, for each sequence,
  # merge them to a single column so we can join with the ontology data later
  min_org <- org %>%
    transform(taxids = strsplit(taxids, ";")) %>%
    tidyr::unnest(taxids) %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(taxids_by_seq = paste0(taxids,collapse =  ";")) %>%
    dplyr::select(-"annotations", -"taxids", -"m5nr")  %>%
    dplyr::distinct() %>%
    dplyr::ungroup()


  # process the ontology data
  # 1) extract the COG (I didn't see any that had more than 1 per row)
  ont$annotations <- gsub(".*COG(.*?)\\].*", "COG\\1", ont$annotations)
  # 2) as we did before, summarize by sequence
  min_ont <- ont %>%
    dplyr::ungroup() %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(COGs_by_seq = paste0(unique(annotations), collapse =  ";")) %>%
    dplyr::select(-"annotations", -"m5nr") %>%
    dplyr::distinct() %>%
    dplyr::ungroup()


  # full outer join: keep every id found in either dataset
  combined <- merge(min_org, min_ont, by="id", all = TRUE)

I haven't played around with your code directly, but R has a bunch of different aggregation methods (https://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations) and I'm wondering if we could speed up the merging step.
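For comparison, the same grouping and outer join can be done in a single pass with plain dictionaries. This is a rough sketch and has not been benchmarked against the R code: the function names are hypothetical, the inputs are assumed to be already reduced to `(id, value)` pairs, and unlike the R snippet it drops duplicate taxids as well as duplicate COGs.

```python
import re
from collections import defaultdict

# Same pattern as the gsub() call: strip the last '|'-separated field.
SUFFIX = re.compile(r"(.*)\|(.*)\|.*")


def trim_id(seq_id):
    """Drop the trailing field after the sequence name, if there is one."""
    return SUFFIX.sub(r"\1|\2", seq_id)


def collapse(rows, split_values=False):
    """Group (id, value) rows by trimmed id, keeping each value once
    in first-seen order, joined with ';'."""
    grouped = defaultdict(dict)  # inner dict is used as an ordered set
    for seq_id, value in rows:
        parts = value.split(";") if split_values else [value]
        for part in parts:
            grouped[trim_id(seq_id)][part] = None
    return {key: ";".join(values) for key, values in grouped.items()}


def outer_join(org_rows, ont_rows):
    """Full outer join on id; a missing side is None (NA in the R version)."""
    taxids = collapse(org_rows, split_values=True)
    cogs = collapse(ont_rows)
    ids = list(dict.fromkeys([*taxids, *cogs]))
    return {i: (taxids.get(i), cogs.get(i)) for i in ids}
```

Each input row is visited once and lookups are hash-based, so the cost grows roughly linearly with the number of rows.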

@nickp60
Collaborator

nickp60 commented Mar 17, 2020

Hi @zbendiks
For sure, the tidy functions are not known for their speed. I am working on a script (join-mgrast.py) that should help with this. It works on test data, but I still have a fair amount of error handling to do for it. But in the meantime, feel free to give it a try on your data!

conda create -n joinmg ete3 requests
conda activate joinmg
python join-mgrast.py --organism ~/chordomics/mgm4762935.3/organism --ontology ~/chordomics/mgm4762935.3/ontology -o mgm4762935.3.csv

@KevinMcDonnell6
Owner

Hi @zbendiks
Thanks for the feedback!
I will look into implementing what @nickp60 suggested. Hopefully this will be of benefit to you. I'll check back when I have it working.

@nickp60
Collaborator

nickp60 commented Mar 19, 2020

Hi @zbendiks,
I got the program working with your data -- it's SLOW, but it works! The merged file is about 4GB, so we are going to have to do some optimization to get things peppier, but it should still be useful.

I'll submit a pull request updating some of the documentation.

Attached are the first 250k lines of the data.

mgm4824993_joined_first250k.csv.zip
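For anyone wanting to produce a similar excerpt from their own multi-gigabyte merged file, a sketch like the following avoids loading the whole file into memory. `write_head` is a hypothetical helper, not part of join-mgrast.py.

```python
from itertools import islice


def write_head(src, dest, n_lines=250_000):
    """Copy the first n_lines lines of src to dest, reading lazily."""
    # Binary mode keeps the bytes unchanged whatever the file's encoding is.
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        fout.writelines(islice(fin, n_lines))
```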

@zbendiks
Author

@KevinMcDonnell6 @nickp60

Thank you so much! I'll use the updated script to create a merged object for mgm4824993.3 and continue with the Chordomics pipeline. Assuming that all goes well, I'll continue with my remaining 13 samples. I'll update you guys once I do.
