Metagenomics Mode (MetaFlye and Binning) #160

Open
Tracked by #15
LeeBergstrand opened this issue Apr 19, 2024 · 10 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@LeeBergstrand
Collaborator

LeeBergstrand commented Apr 19, 2024

We have previously discussed adding metagenomics compatibility by running Flye in meta mode and doing genome binning.

@LeeBergstrand LeeBergstrand mentioned this issue Apr 19, 2024
11 tasks
@LeeBergstrand LeeBergstrand changed the title from "Optional genome binning -- but how to do this? It might end up as a separate tool from rotary. Discussion might be needed." to "Metagenomics Mode (MetaFly and Binning)" Apr 19, 2024
@LeeBergstrand LeeBergstrand changed the title from "Metagenomics Mode (MetaFly and Binning)" to "Metagenomics Mode (MetaFlye and Binning)" Apr 19, 2024
@jmtsuji
Collaborator

jmtsuji commented Apr 22, 2024

@LeeBergstrand For now, I'd suggest leaving genome binning out of the MVP. I think the pipeline should already be compatible with metagenomes up until the end of the circularization step (i.e., before annotation), although I'd need to double check this just to make sure. This level of metagenome compatibility might be enough for the MVP -- users can use rotary to assemble metagenomes with properly closed circular contigs, and then they can handle genome binning themselves. Once the MVP is out, we could consider a meta-mode for rotary as an extension. How does this sound?

@jmtsuji
Collaborator

jmtsuji commented Apr 22, 2024

P.S. The current config file already has a way to turn meta mode on or off for Flye, so that aspect is already addressed. Meta mode is sometimes helpful for genome assemblies (e.g., if you're not sure if the culture is pure... I wonder if it might also help with assembling differentially abundant plasmids).
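As a rough illustration of that toggle, here is a minimal sketch of how a config flag could map onto the Flye command line. The config key `flye_meta_mode` and the helper `build_flye_command` are hypothetical names for illustration, not rotary's actual ones; `--meta`, `--nano-hq`, `--out-dir`, and `--threads` are standard Flye options.

```python
from typing import List


def build_flye_command(reads: str, out_dir: str, meta_mode: bool,
                       threads: int = 4, read_type: str = "--nano-hq") -> List[str]:
    """Assemble a Flye command line, adding --meta only when requested."""
    cmd = ["flye", read_type, reads, "--out-dir", out_dir, "--threads", str(threads)]
    if meta_mode:
        # Meta mode targets uneven coverage, e.g. metagenomes or impure cultures.
        cmd.append("--meta")
    return cmd


# Hypothetical config key; the real key name in rotary's config may differ.
config = {"flye_meta_mode": True}
print(" ".join(build_flye_command("sample.fastq.gz", "assembly", config["flye_meta_mode"])))
```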

@LeeBergstrand
Collaborator Author

> @LeeBergstrand For now, I'd suggest leaving genome binning out of the MVP. I think the pipeline should already be compatible with metagenomes up until the end of the circularization step (i.e., before annotation), although I'd need to double check this just to make sure. This level of metagenome compatibility might be enough for the MVP -- users can use rotary to assemble metagenomes with properly closed circular contigs, and then they can handle genome binning themselves. Once the MVP is out, we could consider a meta-mode for rotary as an extension. How does this sound?

@jmtsuji This sounds good to me. To me, it's a low priority at this time.

@LeeBergstrand
Collaborator Author

Here are some things to think about down the road:

  • Are there binners that work with Nanopore or Hybrid data?
  • Would we only be binning off the Illumina data?

@jmtsuji
Collaborator

jmtsuji commented Apr 23, 2024

@LeeBergstrand Good points. My guess is that existing genome binners (e.g., MetaBAT2) should work fine with Illumina, Nanopore, or hybrid data. MetaBAT2 just uses coverage info of the contigs (obtained from BAM files) and the contig sequences themselves to guide genome binning, in my understanding. So long as read mapping is accurate and the contigs are error-free, I think genome binning from a mix of different read types should be OK. It would be worthwhile to check this carefully later on, though.
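To make the coverage-plus-sequence idea concrete, here is a minimal sketch of the two commands a MetaBAT2 run typically needs: one to summarize per-contig depth from sorted BAM files, and one to bin using that depth table. The helper name and file paths are placeholders for illustration. Passing Illumina and Nanopore BAMs together is exactly the untested assumption discussed above, not a validated setup.

```python
from typing import List, Tuple


def build_metabat2_commands(contigs: str, bams: List[str], depth_file: str,
                            bin_prefix: str) -> Tuple[List[str], List[str]]:
    """Return (depth command, binning command) for a basic MetaBAT2 run."""
    # Step 1: per-contig coverage from one or more sorted BAM files.
    depth_cmd = ["jgi_summarize_bam_contig_depths", "--outputDepth", depth_file, *bams]
    # Step 2: bin contigs using their sequences plus the coverage table.
    bin_cmd = ["metabat2", "-i", contigs, "-a", depth_file, "-o", bin_prefix]
    return depth_cmd, bin_cmd


depth_cmd, bin_cmd = build_metabat2_commands(
    "contigs.fasta", ["illumina.bam", "nanopore.bam"], "depth.txt", "bins/bin")
print(" ".join(depth_cmd))
print(" ".join(bin_cmd))
```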

@LeeBergstrand
Collaborator Author

@jmtsuji This is becoming more and more of an issue for me. We are finding that more and more of the genomes we process are actually co-cultures, even though they were originally thought to be single-strain.

@jmtsuji jmtsuji added enhancement New feature or request question Further information is requested and removed long_term labels Oct 11, 2024
@jmtsuji
Collaborator

jmtsuji commented Oct 11, 2024

@LeeBergstrand Thanks for picking up this thread again. Yeah, it sounds like adding some basic genome binning could be helpful even for "pure culture" genome work.

We would probably just need some basic binning rules for rotary -- for example, map the reads to the assembled contigs (within the same sample), then just run 1 genome binner and split out the contigs. Then, the annotation module could be run on each bin separately. This might be pretty simple to implement. (Later on, we could always consider adding more genome binners and aggregating their results to improve binning accuracy, but I am not sure if this would improve things much given that the cultures should generally have a pretty simple microbial community.)
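The "split out the contigs" step could look roughly like the sketch below, which groups contigs from one assembly by bin so that each group could be handed to annotation separately. It assumes the binner's output has already been reduced to a contig-to-bin mapping; the function names and the `unbinned` label are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            # Keep only the ID, i.e. the first whitespace-delimited token.
            header, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_contigs_by_bin(fasta_text: str, contig_to_bin: Dict[str, str],
                         unbinned_label: str = "unbinned") -> Dict[str, List[Tuple[str, str]]]:
    """Group contigs by bin; contigs missing from the mapping go to unbinned_label."""
    bins = defaultdict(list)
    for contig_id, sequence in parse_fasta(fasta_text):
        bins[contig_to_bin.get(contig_id, unbinned_label)].append((contig_id, sequence))
    return dict(bins)


example = ">c1 len=4\nACGT\n>c2\nGG\nCC\n>c3\nTT\n"
print(split_contigs_by_bin(example, {"c1": "bin1", "c2": "bin1"}))
```

Keeping unbinned contigs in their own group, rather than dropping them, would leave room to decide later whether small plasmids that escape binning should still be annotated.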

One potential issue we would need to address is how to handle binning of true isolates. The last time I tested binning tools carefully (a few years ago), they generally errored out if they could not produce at least 2 bins. We should see if this is still the case. If so, then we would need some strategy (e.g., based on CheckM2 scores) to figure out if the raw contigs are likely for a single isolate and then skip binning if that is the case.
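One possible shape for that check is sketched below: treat the unbinned contigs as a single genome, score them, and skip binning when they already look like one clean isolate. The function name and both thresholds are assumptions for illustration, not tested values, and the scores would have to come from a prior CheckM2 run on the raw contigs.

```python
def should_skip_binning(completeness: float, contamination: float,
                        min_completeness: float = 90.0,
                        max_contamination: float = 5.0) -> bool:
    """Return True when the raw contigs already look like one clean genome.

    Both scores are percentages. The default thresholds are placeholders
    and would need tuning on real isolate and co-culture data.
    """
    return completeness >= min_completeness and contamination <= max_contamination


print(should_skip_binning(99.1, 0.4))   # looks like a true isolate
print(should_skip_binning(98.7, 43.0))  # likely a co-culture, so bin it
```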

Also, we could consider changing the default Flye mode to --meta in the config file. My guess is that this might make some assemblies of true isolates worse in a few edge cases, but if the input data quality is good, it would have limited impact on isolate assemblies. Based on a quick look at the methods of the metaFlye paper, I assume the way that repeats in the assembly graph are identified in metaFlye should still work for isolates, but it might be more prone to errors than the algorithm used in the original Flye. I don't have any real evidence, though. I have seen some discussion on X that some folks prefer to use metaFlye by default. The alternative would be to try to predict if a dataset is pure or not before assembly and then choose the Flye mode based on that, but this approach might be too complicated.

@LeeBergstrand Any thoughts?

@LeeBergstrand
Collaborator Author

LeeBergstrand commented Oct 11, 2024

@jmtsuji, Questions:

  1. How would polishing affect binning? Do you want to bin before or after polishing?
  2. How would a mixed metagenome affect our circularization code? Would you like to bin before circularization?

Where would the optimal place to put binning be?

@LeeBergstrand
Collaborator Author

LeeBergstrand commented Oct 11, 2024

Right now, a vital issue is that Rotary needs to understand the concept of sub-samples (bins). We use the following design pattern throughout Rotary:

rule annotation:
    input:
        summaries=expand("{sample}/{sample}_annotation_summary.zip",sample=SAMPLE_NAMES),

In this pattern, we frequently use the SAMPLE_NAMES variable. However, this will not work when there are bins.

Fixing this is going to require significant refactoring.

I suggest waiting until we refactor things into pipeline-independent modules before pursuing binning. That way, you can call the annotation module on the bins or the single genomes.
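To illustrate why a flat sample list falls short, here is a minimal sketch of target generation that is aware of bins. The `bins_by_sample` mapping, the function name, and the per-bin path layout are all hypothetical. In a real Snakemake workflow the bin names are only known after the binner has run, so something like a checkpoint would probably be needed to build this mapping.

```python
from typing import Dict, List


def annotation_targets(bins_by_sample: Dict[str, List[str]]) -> List[str]:
    """Build annotation summary paths per (sample, bin), or per sample if unbinned."""
    targets = []
    for sample, bins in bins_by_sample.items():
        if bins:
            # Binned sample: one annotation summary per bin (hypothetical layout).
            for bin_id in bins:
                targets.append(f"{sample}/{bin_id}/{sample}_{bin_id}_annotation_summary.zip")
        else:
            # Single genome: same path pattern as the existing rule.
            targets.append(f"{sample}/{sample}_annotation_summary.zip")
    return targets


print(annotation_targets({"isolateA": [], "cocultureB": ["bin1", "bin2"]}))
```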

@LeeBergstrand
Collaborator Author

> @jmtsuji, Questions:
>
>   1. How would polishing affect binning? Do you want to bin before or after polishing?
>   2. How would a mixed metagenome affect our circularization code? Would you like to bin before circularization?
>
> Where would the optimal place to put binning be?

Another option is for Rotary to have a meta-mode, but we do things in two steps. You run rotary in normal mode, and we give you a list of genomes that CheckM flags as contaminated. Then you take these samples and manually do a second run with them in meta-mode. The meta-mode flag in the config turns both flye-meta and binning on or off.
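A minimal sketch of the two-step idea, assuming the first run yields a per-sample contamination percentage: one helper lists the samples to rerun, and another derives both switches from a single flag. The config keys `meta_mode`, `flye_meta`, and `binning`, the function names, and the 5% cutoff are hypothetical, not rotary's actual settings.

```python
from typing import Dict, List


def flag_contaminated(contamination_by_sample: Dict[str, float],
                      max_contamination: float = 5.0) -> List[str]:
    """List samples whose contamination (percent) exceeds the cutoff."""
    return sorted(name for name, value in contamination_by_sample.items()
                  if value > max_contamination)


def resolve_meta_mode(config: dict) -> dict:
    """Derive the Flye meta and binning switches from one meta_mode flag."""
    meta = bool(config.get("meta_mode", False))
    return {**config, "flye_meta": meta, "binning": meta}


# Step 1: normal run, then report which samples look contaminated.
rerun = flag_contaminated({"A": 1.2, "B": 12.5, "C": 7.0})
# Step 2: the user reruns only those samples with meta-mode switched on.
print(rerun, resolve_meta_mode({"meta_mode": True, "samples": rerun}))
```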

It really depends on where binning happens. The later in the pipeline binning occurs, the easier it will be to add the bin wildcards. There are also some modularization tools that might help here too.
