Snakemake workflow #52

Merged
merged 39 commits into from
Apr 21, 2023

rnmitchell
Contributor

This PR will create a snakemake workflow to simplify running the entire lusSTR pipeline on either a single file or a set of files.

@rnmitchell rnmitchell marked this pull request as draft March 9, 2023 14:47
Member

Nope.

@standage
Member

No concerns about the Snakefile at the moment. I'd be happy to chat about some streamlining ideas some time, but that would be icing on the cake.

@rnmitchell
Contributor Author

Alright so I've been trying to get lusSTR into a snakemake workflow so that you can just put in the command lusstr all and it runs the entire STR pipeline. I've stored the arguments in a config file that can be easily edited. The strs.smk snake file itself does run successfully, now I've been trying to incorporate it into the lusSTR cli. And here is where I have absolutely no idea what I'm doing. 😅 I've tried to refactor the code to do this (see the most recent commit), but continue to run into the error:

Traceback (most recent call last):
  File "/Users/rebecca.mitchell/mambaforge/envs/lusstr/bin/lusstr", line 33, in <module>
    sys.exit(load_entry_point('lusSTR', 'console_scripts', 'lusstr')())
  File "/Users/rebecca.mitchell/mambaforge/envs/lusstr/bin/lusstr", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/Users/rebecca.mitchell/mambaforge/envs/lusSTR/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 94, in load
    module = import_module(match.group('module'))
  File "/Users/rebecca.mitchell/mambaforge/envs/lusSTR/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'lusSTR.__main__'

I think I need your assistance with this @standage. No rush and we can definitely talk in person about what I'm trying to do. My goal is to have two workflows: one for STRs and one for SNPs. At this time, I just have a placeholder for the SNPs one and I figured once I can successfully run the STR side, I'll fill that one in. Just trying to learn how to create this atm 😃
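
For context on the traceback above: a console_scripts entry point has the form `module:function`, and the loader imports the module part before it looks up the function. The following is a minimal sketch, using only the standard library, of how one might check whether that module can be located. The helper name `entry_point_module_exists` and the `json`-based specs are illustrative assumptions, not part of lusSTR.

```python
import importlib.util


def entry_point_module_exists(spec: str) -> bool:
    """Return True if the module half of a 'module:function' spec can be located."""
    module_name, _, _ = spec.partition(":")
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path is itself missing.
        return False


# Illustrative specs against the standard library (not lusSTR):
print(entry_point_module_exists("json.tool:main"))      # a module that exists
print(entry_point_module_exists("json.__main__:main"))  # a module that does not
```

If the same check fails for the module named in the project's entry-point string, either the module file is missing from the installed package or the string points at the wrong module.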

@standage
Member

We've done thin CLI wrappers around Snakemake workflows for several projects now, and I really like it as a strategy. I'd love to take a look at this myself, but if I can't make enough time soon anyone else on my team has experience and should be qualified to help you hash it out!

@rnmitchell
Contributor Author

Sounds good!

Comment on lines +21 to +28
def main(args):
    Path(args.workdir).mkdir(parents=True, exist_ok=True)
    final_dest = f"{args.workdir}/config.yaml"
    config = resource_filename("lusSTR", "data/config.yaml")
    final_config = edit_config(config, args)
    with open(final_dest, "w") as file:
        yaml.dump(final_config, file)

Contributor Author

I created the command lusstr config to create the config file. Settings can be specified as command line arguments. If no arguments are provided, the default settings are used and the user can edit manually. If a working directory is specified, the config file is dumped there, otherwise to the cwd.
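
The behavior described here (defaults unless a setting is given on the command line) could be sketched roughly as follows. This is not the PR's `edit_config`, whose body is not shown in this thread. The function name `apply_overrides`, the flag names, and the default values are all hypothetical; only the key names echo snippets in this PR.

```python
import argparse

# Hypothetical defaults; the keys mirror names seen in this PR's snippets,
# but the values are placeholders.
DEFAULTS = {"output": "lusstr_output", "samp_input": "/path/to/input", "nocombine": False}


def apply_overrides(defaults: dict, args: argparse.Namespace) -> dict:
    """Copy the defaults, then overwrite only the settings given on the command line."""
    config = dict(defaults)
    for key, value in vars(args).items():
        if key in config and value is not None:
            config[key] = value
    return config


parser = argparse.ArgumentParser(prog="config-sketch")
parser.add_argument("--output", default=None)
parser.add_argument("--samp_input", default=None)
# default=None lets the sketch tell "flag absent" apart from an explicit False.
parser.add_argument("--nocombine", action="store_true", default=None)

print(apply_overrides(DEFAULTS, parser.parse_args([])))
print(apply_overrides(DEFAULTS, parser.parse_args(["--output", "run1", "--nocombine"])))
```

The resulting dictionary is what would then be written out with `yaml.dump`, as in the snippet under review.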

Member

Sounds good.

Comment on lines 20 to 29
def main(args):
    snakefile = resource_filename("lusSTR", "workflows/snps.smk")
    pretarget = "annotate" if args.filter else "all"
    result = snakemake.snakemake(
        snakefile, config=args.config, targets=pretarget,
        workdir=args.work_dir
    )
    if result is not True:
        raise SystemError('Snakemake failed')

Contributor Author

The snps command is not ready yet. I figured I'd do another PR after this one (since this one is already quite a lot). This is just a placeholder.

Member

Python has an Exception class dedicated for this scenario: I'd recommend clearing out—or commenting out—the current code in the main function and replacing it with a raise NotImplementedError('SNP workflow implementation pending') statement.
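
A minimal sketch of that suggestion, assuming the subcommand's `main(args)` signature stays as shown in the snippet above. The calling code at the bottom is only there to illustrate what a user would hit.

```python
def main(args):
    # Placeholder until the SNP workflow lands in a follow-up PR.
    raise NotImplementedError("SNP workflow implementation pending")


try:
    main(args=None)
except NotImplementedError as err:
    print(f"not available yet: {err}")
```

Unlike `SystemError`, which Python reserves for interpreter-internal failures, `NotImplementedError` tells the caller plainly that the feature has not been written yet.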

Comment on lines +10 to +18
## placeholder until I update for snps

configfile: "config.yaml"
output_name = config["output"]
input_name = config["samp_input"]
software = config["output_type"]
prof = config["profile_type"]
data = config["data_type"]
filter_sep = config["filter_sep"]

Contributor Author

Also a placeholder for the snps snake file

Comment on lines 249 to 252
if args.separate:
    indiv_files(autosomal_final_table, input_name, ".txt")
else:
    autosomal_final_table.to_csv(args.out, sep="\t", index=False)

Contributor Author

I removed the separate argument for the annotate step. This was really just used when lusSTR files were being fed into LLAMAS which required separate files for each sample. lusSTR however requires a single file with all samples to be fed into the next step, so it's unnecessary.

@rnmitchell
Contributor Author

This is ready for further review @standage. I added a new command (lusstr config) to create the config file to store all the settings for the STR workflow, allowing the user to specify settings via the command line or they could open the file once created and manually edit it (probably the preferred way for non-bioinformatics folks 😄 ). The entire workflow doesn't have to be run (can just run up to the annotate step or just the format step if wanted).

I also updated all tests and all are passing with the exception of test_snps.py, as I haven't updated any of the snp workflow at this time. Whenever you have some time to poke around and give any advice and try out the tests, that would be great!

@rnmitchell
Contributor Author

I do still need to update the README.

@rnmitchell
Contributor Author

I do still need to update the README.

Nevermind! updated!

Member

@standage standage left a comment

Overall, this looks pretty good. A few comments, but hopefully should be quick to resolve.

One thing you might consider is that the various lusstr subcommands give users the flexibility to run the STR analysis workflow step-wise, so having to specify format, annotate, or all seems a bit redundant. Why is the user going to run lusstr strs if they don't want to run all? There might be a good reason for this, and I could probably formulate the contours of that reason myself. But I'm curious what you think. If we cut out that argument, that could simplify the interface a bit (I'm not sure it would simplify the workflow implementation at all). If you decide that it's necessary and should stay, how would you feel about making it an option with all as the default?
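
One possible reading of "an option with all as the default" is an optional positional argument, sketched below with `argparse`. The argument name `target` and the help text are assumptions; the three choices come from the comment above.

```python
import argparse

parser = argparse.ArgumentParser(prog="lusstr strs")
# Hypothetical: an optional positional that falls back to "all" when omitted.
parser.add_argument(
    "target",
    nargs="?",
    choices=["format", "annotate", "all"],
    default="all",
    help="last workflow step to run (default: all)",
)

print(parser.parse_args([]).target)
print(parser.parse_args(["annotate"]).target)
```

With this shape, an invocation that names the target explicitly keeps working, while omitting it runs the whole workflow.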

setup.py Outdated
            "lusSTR/tests/data/NGS_stutter_test/*",
            "lusSTR/workflows/*",
            "lusSTR/wrappers/*",
        ]
    },
    include_package_data=True,
    install_requires=["pandas>=1.0", "openpyxl>=3.0.6"],

Member

New package dependencies need to be declared here at a minimum.

  • snakemake
  • pyyaml(?)

    data["separate"] = True
if args.nocombine:
    data["nocombine"] = True
if args.efm_sep:

Member

I get an error when I run lusstr config with no arguments. It points to this line, which doesn't appear to be a perfect match for any option declared in the argument parser. I assume this should be if args.efm?

Contributor Author

Nope, old argument that I forgot to remove from that function. There were two different separate pieces, one for the annotate step and one for the filter step. Now only the filter step argument remains.
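
The failure mode discussed here can be reproduced in isolation: reading an attribute from an `argparse.Namespace` that no `add_argument` call declared raises `AttributeError`, even when the command is run with no arguments. The parser below is a stand-in with one hypothetical flag, not the lusSTR parser.

```python
import argparse

parser = argparse.ArgumentParser(prog="config-sketch")
parser.add_argument("--nocombine", action="store_true")  # hypothetical flag
args = parser.parse_args([])

print(args.nocombine)            # declared, so the attribute exists
print(hasattr(args, "efm_sep"))  # never declared on this parser

try:
    args.efm_sep
except AttributeError as err:
    print(f"AttributeError: {err}")
```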

Comment on lines 274 to 277
def test_snakemake(command, output, format_out, annot_out, all_out, tmp_path):
    config = str(tmp_path / "config.yaml")
    inputfile = data_file("UAS_bulk_input/Positive Control Sample Details Report 2315.xlsx")
    exp_output = data_file(output)

Member

Running this data through lusstr strs all on the command line went seamlessly for me.

@rnmitchell
Contributor Author

So do you mean by running lusstr format or lusstr annotate? Those subcommands no longer work since I refactored all the code for snakemake.

@standage
Member

Ah, gotcha. I saw that lusSTR still had a subcommand interface, but glossed over the fact that the old subcommands aren't supported any more. Carry on!

@rnmitchell
Contributor Author

I believe this is ready now, @standage!

@rnmitchell rnmitchell marked this pull request as ready for review April 20, 2023 16:02
Member

@standage standage left a comment

LGTM!

@standage
Member

It looks like some filter tests are failing due to the DataFrame append method not being found. I noted that pandas 2.0 was released recently and probably has some breaking changes. We need to pin both minimum and maximum version numbers for pandas now, it looks like: pandas>=1.0,<2.0.

SNP tests failed as well. If you're saving the SNP workflow for another PR, you should probably disable those tests for now.
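
Putting the dependency notes from this thread together, the `install_requires` list in setup.py might end up looking something like the sketch below. Only the pandas and openpyxl specifiers are taken from the discussion; snakemake and pyyaml are left unpinned because no versions were stated (and pyyaml was flagged with a question mark earlier).

```python
# Sketch of the dependency list for setup.py. Only the pandas and openpyxl
# specifiers come from this thread; the other two entries are unpinned
# because no version numbers were given.
install_requires = [
    "pandas>=1.0,<2.0",  # upper bound because pandas 2.0 removed DataFrame.append
    "openpyxl>=3.0.6",
    "snakemake",
    "pyyaml",
]
```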

@rnmitchell
Contributor Author

Ok will look into this!

@rnmitchell
Contributor Author

Ok looks like things are passing now! Everything good for you?

@standage standage merged commit 6730754 into master Apr 21, 2023
@standage standage deleted the snakemake branch April 21, 2023 18:57