Access individual DOI archive files for better performance #25

tomgoddard · 2017-03-25T05:48:37Z

Currently, fetching the DOI archive for the exosome from Zenodo (1.3 Gbytes) takes about 30 minutes, so the exosome IHM file is not viewable in ChimeraX for 30 minutes after trying to open it. This is far too slow to be usable, given that ChimeraX only needs some small localization density maps from the archive. I believe the bulk of the archive is ensemble models (which are currently not referenced by the IHM file).

If Zenodo allows access to individual files within a DOI, the IHM file (ihm_external_files table) should reference those files individually to improve performance, so that only the data files actually being viewed get downloaded.

The current slow performance will deter most users of these files. If the files are only available as a single Gbyte-scale download, this is a poor design, and other archiving methods that allow access to individual files should be investigated.

tomgoddard · 2017-03-25T05:53:02Z

Not only is the access slow, but my current attempt to have ChimeraX download the exosome DOI archive (1.4 Gbytes) failed after 30 minutes, with only 0.7 Gbytes received. The total download time would likely be about 1 hour if it succeeded.

benmwebb · 2017-03-27T18:01:58Z

I don't think we can currently access individual files, since Zenodo simply archives a zip file of the entire GitHub repository. So we have three options:

1. Ask Zenodo to unzip the file at their end and allow downloading individual files.
2. Follow the link from Zenodo to GitHub (e.g. for exosome, https://zenodo.org/record/60731 links to https://github.com/integrativemodeling/exosome/tree/v1.0). GitHub then allows downloading individual files.
3. Break the archive up into several smaller zip files (e.g. input data, bulk ensemble, density localizations, cluster representatives).

Obviously (1) is more work for Zenodo, (2) is more work for Chimera, (3) is more work for depositors.
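
For illustration only, the second option could be sketched roughly as follows: given the GitHub tree URL that the Zenodo record links to, build a URL for one file in that tagged release. This is a minimal sketch, not what ChimeraX does. The function name `github_raw_url` and the file path in the usage note are hypothetical, and the sketch assumes the tag name contains no slash and that the file is stored directly in the repository.

```python
from urllib.parse import quote, urlparse


def github_raw_url(tree_url: str, file_path: str) -> str:
    """Map a GitHub tree URL plus a repository-relative path to a raw-file URL.

    Hypothetical helper: assumes a URL of the form
    https://github.com/<owner>/<repo>/tree/<ref>, where <ref> has no slash.
    """
    parts = urlparse(tree_url).path.strip("/").split("/")
    if len(parts) != 4 or parts[2] != "tree":
        raise ValueError(f"not a GitHub tree URL: {tree_url}")
    owner, repo, _, ref = parts
    # quote() leaves "/" intact, so directory separators survive.
    path = quote(file_path.lstrip("/"))
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"
```

Calling `github_raw_url("https://github.com/integrativemodeling/exosome/tree/v1.0", "some/dir/map.mrc")` with a made-up path yields a `raw.githubusercontent.com` URL for that one file, so a viewer would only need to fetch the files it actually displays.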

tomgoddard · 2017-03-28T01:21:50Z

The above timing of 30-60 minutes for downloading the exosome archive was on a home network (~5 Mbits / sec). From UCSF on a fast network the download took 5 minutes for the 1.3 Gbytes.
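
As a rough sanity check on these numbers, the arithmetic can be sketched as below. It assumes decimal units (1 Gbyte = 10^9 bytes, 1 Mbit = 10^6 bits) and ignores protocol overhead; the function names are illustrative only.

```python
def download_minutes(size_gbytes: float, rate_mbits_per_s: float) -> float:
    """Ideal transfer time in minutes for a given size and link rate."""
    bits = size_gbytes * 1e9 * 8
    return bits / (rate_mbits_per_s * 1e6) / 60


def implied_rate_mbits(size_gbytes: float, minutes: float) -> float:
    """Average link rate in Mbits/sec implied by an observed download time."""
    bits = size_gbytes * 1e9 * 8
    return bits / (minutes * 60) / 1e6
```

Under these assumptions, `download_minutes(1.3, 5)` gives about 35 minutes, in line with the home-network timing, and `implied_rate_mbits(1.3, 5)` gives about 35 Mbits/sec for the 5-minute download.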

A single zipfile of the entire repository is very large (1.4GB) and most of this is occupied by output trajectories, which most users won't need to access anyway. Split these larger files out into their own zipfiles. Relates ihmwg/IHMCIF#25.
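
A split like the one described above could be done along these lines. This is a minimal sketch using Python's standard `zipfile` module, not the script actually used for these repositories; the function name `split_archive`, the group names, and the path prefixes are all assumptions chosen for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def split_archive(src_zip, out_dir, groups, default="other"):
    """Copy the members of src_zip into one smaller zip per group.

    groups maps an output archive name to a path prefix (hypothetical
    layout); members matching no prefix go into the default archive.
    Returns the sorted names of the archives that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    with zipfile.ZipFile(src_zip) as src:
        try:
            for info in src.infolist():
                if info.is_dir():
                    continue
                # The first matching prefix wins.
                name = next(
                    (g for g, prefix in groups.items()
                     if info.filename.startswith(prefix)),
                    default,
                )
                if name not in outputs:
                    outputs[name] = zipfile.ZipFile(
                        out_dir / f"{name}.zip", "w", zipfile.ZIP_DEFLATED
                    )
                # Stream each member so large trajectory files are never
                # held in memory in full.
                with src.open(info) as fin, outputs[name].open(
                    info.filename, "w", force_zip64=True
                ) as fout:
                    shutil.copyfileobj(fin, fout)
        finally:
            for archive in outputs.values():
                archive.close()
    return sorted(outputs)
```

For example, `split_archive("repo.zip", "split", {"trajectories": "repo/outputs/", "input_data": "repo/data/"})` would write `trajectories.zip`, `input_data.zip`, and `other.zip`, so that a viewer needing only small files can skip the trajectory archive.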

benmwebb mentioned this issue Mar 29, 2017

Add references to sphere model ensemble PDB files for nup84, exosome and mediator #24

Closed

benmwebb added a commit that referenced this issue Apr 7, 2017

Update exosome example to split archives.

d24f0dd

This splits the externally referenced data into multiple files, to make it more convenient to download. Relates #25.

benmwebb added a commit that referenced this issue Apr 20, 2017

Update mediator example to split archives.

b1c510f

This splits the externally referenced data into multiple files, to make it more convenient to download. Relates #25.

benmwebb closed this as completed in dd5d214 May 17, 2017
