Add script for downloading local reference files and images #230

jashapiro · 2022-10-27T17:22:50Z

For the case when compute nodes may not have access to the internet, we will need to download all reference files and docker images before the workflow begins, as described in #211. This PR adds a script to perform these downloads, preserving the file structure that would be found on S3, which allows us to change only one parameter for running from these local files.
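As an illustration of "preserving the file structure", here is a minimal sketch (not the code in this PR) in which each file's key relative to the remote root is reused as its relative path under the local directory, so only the root needs to change. The function names and the `replace` flag are hypothetical.

```python
from pathlib import Path
import urllib.request


def local_path_for(key: str, refdir: Path) -> Path:
    """Map a remote key such as 'species/fasta/file.gz' to the same relative path under refdir."""
    return refdir / Path(key)


def download_ref(root_url: str, key: str, refdir: Path, replace: bool = False) -> Path:
    """Download one file, creating parent directories so the remote layout is mirrored."""
    outfile = local_path_for(key, refdir)
    if outfile.exists() and not replace:
        return outfile  # skip files that are already present
    outfile.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"{root_url.rstrip('/')}/{key}", outfile)
    return outfile
```

Because the relative paths are identical on both sides, a workflow that builds every reference path from a single root setting should only need that one setting changed.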

To make things a bit easier, the script also outputs a file, local_refs.yaml by default, which contains the settings that need updating (ref_rootdir is the only required one, but I also included assembly for clarity). I discovered that this setting cannot be changed in a user-provided config file that specifies params.ref_rootdir unless that config file also contains all of the affected param variables in reference_paths.config. This probably has to do with the order in which config files are read and some opaque precedence rules. So the solution is either to use --ref_rootdir on the command line, which will properly override the setting in reference_paths.config, or to use -params-file local_refs.yaml or similar. The latter is more traceable, so that is the direction I took with this script.
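A minimal sketch of what writing such a params file could look like, assuming only the two keys named above. The `write_params` helper is hypothetical, and the values are written by hand rather than with a YAML library to keep the sketch dependency-free.

```python
import json
from pathlib import Path


def write_params(paramfile: Path, ref_rootdir: Path, assembly: str) -> str:
    """Write a minimal YAML params file and return its text."""
    # json.dumps yields double-quoted strings, which are also valid YAML scalars
    lines = [
        f"assembly: {json.dumps(assembly)}",
        f"ref_rootdir: {json.dumps(str(ref_rootdir.resolve()))}",
    ]
    text = "\n".join(lines) + "\n"
    paramfile.write_text(text)
    return text
```

The resulting file would then be passed to the workflow with Nextflow's `-params-file` option, for example `-params-file local_refs.yaml`.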

Downloading STAR and Cell Ranger indexes is optional (these are large), as is downloading Docker or Singularity images to the local cache. I ran into some strange behavior where Singularity seemed unwilling to download files to one cache location if it had already downloaded them to another, so I force those pulls.
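For illustration, a forced pull could be sketched as below. This assumes the Singularity 3.x form `singularity pull --force <output file> <URI>`; the flattened output file name and the helper names are assumptions, so check how your Nextflow version names cached images before relying on them.

```python
import subprocess
from pathlib import Path


def singularity_pull_cmd(image: str, cache_dir: Path) -> list:
    """Build a `singularity pull` command that overwrites any existing image file."""
    # Hypothetical naming scheme: flatten the image name into a single file name
    image_file = cache_dir / (image.replace("/", "-").replace(":", "-") + ".img")
    return ["singularity", "pull", "--force", str(image_file), f"docker://{image}"]


def pull_image(image: str, cache_dir: Path) -> None:
    """Create the cache directory if needed and run the forced pull."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(singularity_pull_cmd(image, cache_dir), check=True)
```

Building the command separately from running it makes it easy to print or inspect the exact call before anything is downloaded.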

One unresolved issue is downloading any Space Ranger image that might be required; the script won't get those from the GitHub repo, but that is probably part of a larger issue if people are going to use that functionality. I will try to address how to handle this in the docs.

For now, the script is in the root of the repository. If there are thoughts on where it should ultimately live, please let me know!

Speaking of docs, those will come in a separate PR.

allyhawkins

Thanks for doing this! My one main idea is whether there is a way we could get the paths to the reference files from the config file we have, rather than directly declaring them here. I'm just worried about things getting out of sync in the future, but if it's too much of a headache, this way works too.

allyhawkins · 2022-10-28T14:08:50Z

get_refs.py

+
+## download all the files and put them in the correct locations ##
+print("Downloading reference files...")
+for path in ref_paths[0:2]:


Suggested change

-for path in ref_paths[0:2]:
+for path in ref_paths:

I don't think this should be here; otherwise it would only download the first two files.

Oh yes, that was for testing. I thought I took it out!

jashapiro · 2022-10-28T18:21:35Z

My one main idea is whether there is a way we could get the paths to the reference files from the config file we have, rather than directly declaring them here.

Okay, I have now implemented this. I thought I was being clever enough to make it work with the version of reference_paths.config that is in main, but there were more changes there than I expected, and the logic to handle them does not seem worth it at all. So if you want to test this, you will need to run it with --revision development.
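To illustrate the general idea of reading paths from the config rather than declaring them in the script, here is a rough sketch. It assumes the config contains simple quoted assignments such as `name = 'value'`; the real reference_paths.config may be structured differently, and the parser, the example text, and the bucket name are all hypothetical.

```python
import re

# Matches simple `name = 'value'` or `name = "value"` assignments
ASSIGNMENT = re.compile(r"""^\s*(\w+)\s*=\s*(['"])(.*?)\2\s*$""")


def parse_simple_config(text: str) -> dict:
    """Collect quoted string assignments from Nextflow-style config text."""
    params = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            params[match.group(1)] = match.group(3)
    return params


# Hypothetical excerpt; the real reference_paths.config may look different
example = """
params {
  assembly = 'Homo_sapiens.GRCh38.104'
  ref_rootdir = 's3://example-bucket/references'
}
"""
refs = parse_simple_config(example)
```

In practice the config text would be fetched from the repository at whichever revision is requested, which is why the file's format at that revision matters.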

sjspielman

I mostly left some textual comments here!

Speaking of docs, those will come in a separate PR.

Checking that this also will include some additional comments within the script?

This is almost certainly a problem on my end, but just in case it's not! I was running from the continuumio/miniconda3:4.10.3p0 container, and encountered this excitement:

(base) root@b815efb3cc5b:/home# python3 get_refs.py 
Getting list of required reference files
Downloading reference files...
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Getting homo_sapiens/ensembl-104/fasta/Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.gtf.gz
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.mitogenes.txt
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_intron.tx2gene_3col.tsv
Getting homo_sapiens/ensembl-104/annotation/Homo_sapiens.GRCh38.104.spliced_cdna.tx2gene.tsv
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/complete_ref_lens.bin
Getting homo_sapiens/ensembl-104/salmon_index/Homo_sapiens.GRCh38.104.spliced_intron.txome/ctable.bin
ERRO[2189] error waiting for container: invalid character 'c' looking for beginning of value

I'm blaming this on weird comms with the docker engine, but wanted to share just in case this happened to someone else and it's not just a me-problem. I'm re-running now in my conda base environment since Docker got weirdly borked after this and required a couple restarts for the daemon to get back up and running. So far, so good.

sjspielman · 2022-10-31T13:04:28Z

get_refs.py

+parser.add_argument("--paramfile", type=str,
+                    default="localref_params.yaml",
+                    help = "nextflow param file to write (default: `localref_params.yaml`)")
+parser.add_argument("--revision", type=str,


I might suggest this be named --version, --workflow_version, --release, or similar. Up to you.

I am using revision because that is the term used by nextflow. I don't love it, but I wanted to be consistent with the terminology used there.

Agreed for consistency! But it's a disappointing fact to learn as I start my nextflow learning adventure.

sjspielman · 2022-10-31T13:06:52Z

get_refs.py

+    ref_file =  urllib.request.urlopen(reffile_url)
+except urllib.error.URLError as e:
+    print(e.reason)
+    print(f"The file download failed for {reffile_url}, please check the URL for errors")


Suggested change

-print(f"The file download failed for {reffile_url}, please check the URL for errors")
+print(f"The file download failed for {reffile_url}. Please check the URL for errors.")
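For context, the error-handling pattern in this hunk can be sketched as a small self-contained function. This is an illustrative sketch, not the script's actual code; `fetch_text` is a hypothetical helper, and returning `None` on failure is one possible choice (exiting would be another).

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_text(url: str) -> Optional[str]:
    """Return the decoded body at url, or None (after printing a message) if the request fails."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so HTTP failures are caught here too
        print(e.reason)
        print(f"The file download failed for {url}. Please check the URL for errors.")
        return None
```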

sjspielman · 2022-10-31T13:28:47Z

get_refs.py

+parser.add_argument("--refdir", type=str,
+                    default="scpca-references",
+                    help = "destination directory for downloaded reference files")
+parser.add_argument("--replace",


I'd suggest --overwrite here instead of --replace. Maybe even overwrite_refs?

sjspielman · 2022-10-31T13:32:17Z

get_refs.py

+                    default="localref_params.yaml",
+                    help = "nextflow param file to write (default: `localref_params.yaml`)")
+parser.add_argument("--revision", type=str,
+                    default="main",


Wondering if there is a reason to set this default to development instead, since that matches the current functionality. But I am disagreeing with myself a good deal even as I write this, because that will get messy when this is eventually merged into main, and until the merge one can just provide it via the command line.

Yes... in theory it should work with main once merged, so this should only be a problem for us and the initial testers, who will get a bit more hand-holding.

sjspielman · 2022-10-31T13:34:20Z

get_refs.py

+ref_paths += [barcode_dir / f for f in barcode_files]
+
+## download all the files and put them in the correct locations ##
+print("Downloading reference files...")


Might be nice to add a blurb about "this will take a while, why not go take a coffee break?" Any kind of expectation-setting statement about runtime, ☕ or otherwise.

sjspielman · 2022-10-31T13:42:34Z

get_refs.py

+    pfile = Path(args.paramfile)
+    # check if paramfile exists & move old if needed
+    if pfile.exists():
+        print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")


Suggested change

-print(f"A file already exists at `{pfile}`, renaming previous file to `{pfile.name}.bak`")
+print(f"A file already exists at `{pfile}`. Renaming existing file to `{pfile.name}.bak` and writing new file to `{pfile}`.")
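The move-then-write behavior described in this message could look roughly like the following. It is a sketch under the assumption that the backup lives next to the original file; `backup_existing` is a hypothetical helper name.

```python
from pathlib import Path


def backup_existing(pfile: Path) -> Path:
    """Return the backup path for pfile, renaming pfile to it if pfile exists."""
    backup = pfile.with_name(pfile.name + ".bak")
    if pfile.exists():
        print(f"A file already exists at `{pfile}`. Renaming existing file to `{backup.name}`.")
        pfile.replace(backup)  # overwrites any older backup
    return backup
```

Note that with this approach only one generation of backup is kept: a second run replaces the earlier `.bak` file.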

sjspielman · 2022-10-31T13:43:22Z

get_refs.py

+        urllib.request.urlretrieve(file_url, outfile)
+    except urllib.error.URLError as e:
+        print(e.reason)
+        print(f"The file download failed for {file_url}, please check the URL for errors",


Suggested change

-print(f"The file download failed for {file_url}, please check the URL for errors",
+print(f"The file download failed for {file_url}. Please check the URL for errors.",

sjspielman · 2022-10-31T13:43:41Z

get_refs.py

+        container_file =  urllib.request.urlopen(containerfile_url)
+    except urllib.error.URLError as e:
+        print(e.reason)
+        print(f"The file download failed for {container_url}, please check the URL for errors")


Suggested change

-print(f"The file download failed for {container_url}, please check the URL for errors")
+print(f"The file download failed for {container_url}. Please check the URL for errors.")

allyhawkins

Just a few minor comments, but this looks good to me. I was able to test pulling docker and everything seemed to work for me.

allyhawkins · 2022-10-31T15:35:18Z

get_refs.py

+# get assembly and root location
+assembly = refs.get("assembly", "NA")
+root_parts = refs.get("ref_rootdir").split('://')
+if root_parts[0] == 's3':


Can you add a comment here that you are grabbing the bucket name? I was a bit confused here about what indices 0 and 1 were.

Oh, this isn't quite correct, I realized (or as flexible as it needs to be). I will add comments as I update.
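One way to make the pieces of the root explicit, sketched here with the standard library's `urlsplit` rather than a manual split on `://`. This is not necessarily how the script ended up handling it; `split_root` and the example bucket name are hypothetical.

```python
from urllib.parse import urlsplit


def split_root(ref_rootdir: str):
    """Split a root such as 's3://bucket/prefix' into (scheme, bucket, prefix)."""
    parts = urlsplit(ref_rootdir)
    if parts.scheme == "s3":
        # for S3 URIs the network location is the bucket name
        return "s3", parts.netloc, parts.path.strip("/")
    # anything else is treated as a plain path or URL and left untouched
    return parts.scheme, "", ref_rootdir
```

For example, `split_root("s3://example-bucket/refs/v1")` separates the bucket (`example-bucket`) from the prefix (`refs/v1`), which avoids relying on positional indices into a split list.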

jashapiro · 2022-10-31T17:46:57Z

This is almost certainly a problem on my end, but just in case it's not! I was running from the continuumio/miniconda3:4.10.3p0 container, and encountered this excitement:
(base) root@b815efb3cc5b:/home# python3 get_refs.py 

I'm not sure what that would be, to be honest... I'm somewhat surprised this worked as written above: without specifying development I would have expected failure based on how the input file changed.

Seems related to docker/for-mac#5139... There should be no reason to run this in docker, as long as you have some version of python3 installed; if I am using something not available in an earlier version of python3, I'd like to know about it!

Checking that this also will include some additional comments within the script?

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

sjspielman · 2022-11-01T13:17:33Z

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

Mostly looking for a docstring or so at the top of the script with 1-2 sentences about what the script does and why it might be useful.

jashapiro · 2022-11-01T13:34:36Z

I added a few more comments, but I was not planning to add much more. Was there something specific you were looking for?

Mostly looking for a docstring or so at the top of the script with 1-2 sentences about what the script does and why it might be useful.

Yes, that should be there; done!

sjspielman

👍

jashapiro added 14 commits September 8, 2022 10:45

add script to download reference files

472fc5c

update reference files to use params

5ed43b4

should make local files a bit easier

Fix barcode file

f26a508

add containerfile

9b8c912

Merge remote-tracking branch 'origin/development' into jashapiro/loca…

5786463

…l-refs

Add python reference file

850772d

write params file

48ed950

add docker and singularity pulls

fcb2673

default param file name

4f9232f

Add messages for containers

407db7d

make star index optional

8312600

Force singularity

72d343a

missing makeJson

aaab20f

Add cell ranger option & download

f1d36bb

jashapiro requested review from allyhawkins and sjspielman October 27, 2022 17:22

remove bash script

1a1a2af

allyhawkins reviewed Oct 28, 2022

parse reference file locations from repo

c910a14

jashapiro requested a review from allyhawkins October 28, 2022 18:21

sjspielman reviewed Oct 31, 2022

allyhawkins approved these changes Oct 31, 2022

jashapiro added 3 commits October 31, 2022 13:25

Improve root URI parsing

293fded

Updates from review (semicolons, arg names)

f2d71dc

move refs intialization

9ba2bcb

a few more comments

154b1b6

Add header comments

d64f075

sjspielman approved these changes Nov 1, 2022

jashapiro merged commit 3b8b5a3 into development Nov 1, 2022

jashapiro deleted the jashapiro/local-refs branch November 1, 2022 19:23

jashapiro mentioned this pull request Nov 23, 2022

Add to documentation that compute nodes need internet access #204

Closed
