
Questions related to interpretation of phased assembly results #245

Open
rfinkers opened this issue Apr 5, 2024 · 39 comments

@rfinkers

rfinkers commented Apr 5, 2024

I'm attempting an assembly (relatively heterozygous diploid genome; 5 Gb haploid size; with HiFi, ONT & Hi-C data).

In the directory 7-consensus, the unitig-popped fasta is 7.3 GB; however, the unitig-popped haplotype1, haplotype2, and unassigned fastas are 0 bytes in size.

Does this imply that there is no contribution of the ONT data to the phasing, even though there are reads in ont_subset.fasta.gz?

The final assembly.fasta is still 7.3 GB, with the haplotype1 and haplotype2 fastas both being 1.3 GB in size. Is this phasing solely the contribution of the Hi-C data?

What would be good parameters to re-attempt this assembly, to deal with the higher diversity in this species (compared to human)?

Tnx!

@skoren
Member

skoren commented Apr 5, 2024

The initial assembly file (the 7.3 Gbp one) is still phased; it's just not split into haplotypes. Each contig is from a single haplotype based on the HiFi + ONT data. In the second case, the assembly.fasta will still be the full assembly (haplotype1 + haplotype2 + unassigned), so it makes sense that it has not changed. It may have linked some sequences together within haplotypes that couldn't be linked without Hi-C, making them longer, which would be reflected in the sequence stats.

I suspect the issue is that the assembly is almost completely phased already and there isn't anything for Hi-C to do except assign contigs to a haplotype. This requires homology detection, which may not be tolerating the diversity of this species vs. human. You can try increasing the --haplo-divergence parameter from the default of 0.05 to 0.10 or 0.2 and see if more of the assembly is assigned. Perhaps @Dmitry-Antipov has other suggestions. If you can share the colors file and the noseq.gfa file from the 8-hicPipeline folder, that should give more info on what is happening with the assembly.
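For context, a minimal sketch of a rerun with that flag (hedged: the read files below are placeholders, and the exact option names should be checked against verkko --help for the installed version; see the fuller re-run procedure later in this thread before reusing an existing run directory):

# placeholder inputs: hifi.fastq.gz, ont.fastq.gz, hic_R1/R2.fastq.gz
verkko -d asm_rerun \
  --hifi hifi.fastq.gz \
  --nano ont.fastq.gz \
  --hic1 hic_R1.fastq.gz --hic2 hic_R2.fastq.gz \
  --haplo-divergence 0.1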

@Dmitry-Antipov
Contributor

Without the graph and colors it is hard to say anything, but in any case, such a large unassigned file is not normal.
Increasing --haplo-divergence is the first idea for me too.

Which species is it? Do you expect large rDNA arrays, and what is the estimated level of heterozygosity?

@rfinkers
Author

rfinkers commented Apr 8, 2024

(attachment: share.zip)
Thanks @skoren, and yes, I was mixing up terminology. See the files attached for feedback; I'll rerun the assembly with different --haplo-divergence settings and assess the impact on the final results.

@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Apr 10, 2024

Aha, this is actually another problem. The graph nodes (which are constructed from ONT & HiFi, not Hi-C) are quite fragmented. This is not normal; those regions should be resolved with ONT reads. Without that, there is just not enough Hi-C signal to assign labels to those short haplotype-specific regions - for some chromosome pairs there are thousands of nodes.
We expect to get larger haploblocks with HiFi & ONT, and then phase/assign labels for them with Hi-C.

The cause seems to be a verkko problem, but in the graph simplification/ONT alignment stages rather than in Hi-C phasing - I see a huge number of fake bulges with very different coverage between "haplotypes". @skoren, can you have a look?
(attached image: fake_bulges)

--haplo-divergence should not actually change anything here.

@Dmitry-Antipov
Contributor

Yet another hypothesis - is it a cell line or a tissue sample? Possibly those numerous 5x vs. 50x bubbles represent problems with the cell line?

@skoren
Member

skoren commented Apr 10, 2024

The genome seems to be a combination of quite diverged regions and some extremely homozygous regions (like the one @Dmitry-Antipov posted above). In normal homozygous genomes those low-coverage nodes would be simplified and you'd end up with a single large node, but here, because the region is at double the coverage of most of the genome, they aren't. I do think that --haplo-divergence will assign more of the assembly, but these very homozygous regions are likely to remain in the unassigned bin and probably belong in both haplotypes.

@Dmitry-Antipov
Contributor

Yep, Sergey is right - I did not notice that, in addition to the under-resolved problem, there are some unlabeled large nodes; those labels can be fixed with --haplo-divergence. But it definitely will not help with the fragmented+unassigned problem I was writing about.

@rfinkers
Author

Gentlemen, thanks so far; there is definitely some food for thought here before I provide feedback (next week). In the meantime, I'll rerun the final steps with a larger haplotype divergence setting and include that in the feedback as well.

@skoren
Member

skoren commented Apr 11, 2024

When you re-run with the haplo-divergence change, I'd suggest also adding the --haploid option. You don't need to re-run the full pipeline from scratch; in fact it's better if you don't, so you can keep the read/node correspondences to compare the results. This means duplicating your current asm folder and removing all the assembly.* files along with the 5-*, 6-*, 7-*, and 8-* steps. That should better handle these very homozygous regions and pop those bubbles. It will likely still not assign them if the entire component is homozygous, but at least the assembly will be less fragmented.
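A sketch of that clean-up, assuming the existing run is in ./asm and using the same placeholder read files as above; verkko/Snakemake should then reuse the earlier steps and only regenerate what was removed:

cp -r asm asm_rerun                                 # keep the original run untouched for comparison
cd asm_rerun && rm -rf assembly.* 5-* 6-* 7-* 8-* && cd ..
verkko -d asm_rerun \
  --hifi hifi.fastq.gz --nano ont.fastq.gz \
  --hic1 hic_R1.fastq.gz --hic2 hic_R2.fastq.gz \
  --haploid --haplo-divergence 0.2                  # the two options discussed above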

@rfinkers
Author

rfinkers commented Jun 6, 2024

Still working on understanding this issue. Besides the example you picked out, where that part of the genome is relatively homozygous (on one chromosome), on other chromosomes we might have much higher diversity. Could the diversity be too high, causing a failure to detect that two homologues belong to the same chromosome and resulting in these sequences not being exported to the phased fasta files? This would presumably also impact your previous suggestion of assembling with '--haploid'?

@skoren
Member

skoren commented Jun 6, 2024

Right, there are two issues: one is that there is high diversity in some regions, and two is that there are some very homozygous regions. The --haplo-divergence change would help with the first one; without proper haplotype phasing, the Hi-C signal won't be there to push the two haplotypes apart. The change I suggested (--haploid) would address the second one by more aggressively removing bubbles in this assembly, hopefully extending the sequences for Hi-C phasing.

@rfinkers
Author

rfinkers commented Jun 9, 2024

No improvement with --haploid and --haplo-divergence 0.2. What divergence does the value of 0.2 imply?

Dmitry-Antipov self-assigned this Jun 10, 2024
@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Jun 10, 2024

Could you share the graph and colors for the most recent run once again?

What divergence does the value of 0.2 imply?

It should help to detect homologous regions and thus better assign colors (via Hi-C phasing) to them. But the corresponding regions (i.e. nodes in the graph) need to be long enough for this to work.

@rfinkers
Author

(attachment: share.tar.gz)
Tnx, please find the info attached. The species is a tricky diploid with a haploid genome estimated at 5 Gb. At least one chromosome is estimated to be (mostly) homozygous; heterozygosity between the two genomes on the other chromosomes is estimated to be above 3-5%, but likely varies from chromosome to chromosome. The prediction workflows out there have difficulties with these estimates, though. This relates to my divergence question: is 0.2 high enough (though it is the maximum) to detect homologous regions in diverse species? Looking forward to your insights.

@rfinkers
Author

As an addition to the above: HiFi heterozygous/homozygous depth = 40x/80x and ONT heterozygous/homozygous depth = 11x/21x.

@Dmitry-Antipov
Contributor

I've checked and still see both problems we discussed above.
Can you also share the node sequences (8-hicPipeline/unitigs.fasta & 8-hicPipeline/unitigs.hpc.fasta)?

As an alternative to the fastas, I can look at mashmap's mappings computed on your side - 8-hicPipeline/run_mashmap.sh & 8-hicPipeline/mashmap.out.
The last file can be big; you can safely filter out short alignments with
awk '$11 >= 50000' mashmap.out > mashmap50.out
and send only the smaller mashmap50.out.

@rfinkers
Author

rfinkers commented Jun 17, 2024

With respect to sharing the fastas: we'll have to see how to do this in a secure manner. I'll generate the mashmap50.out file as an alternative. But since the mashmap.out file was cleaned up, I'll rerun 8-hicPipeline/run_mashmap.sh, which will take some days. Not sure if this helps, but unitigs.fasta (7.4 GB) and unitigs.hpc.fasta (5.2 GB) are the raw sizes. Haploid genome size is approx. 4.5-5 Gb.

@Dmitry-Antipov
Contributor

Yeah, sharing fastas can be a sensitive issue. I need them exactly to look at mashmap's results, so running it on your side and sending us the resulting mappings is perfectly fine.

@rfinkers
Author

(attachment: mashmap60.out.gz)
mashmap50.out would be too large to upload (GitHub limits), so I used awk '$11 >= 60000' mashmap.out > mashmap60.out instead. Hope this is fine. Looking forward to hearing your insights.

@Dmitry-Antipov
Contributor

OK, so the mashmap "alignments" are really fragmented; we just do not see enough similarity to phase the corresponding nodes.
E.g. utig4-14078 is likely homologous to utig4-18672, but the total length of the homologous stretches reported by mashmap is just 2.5 Mb of a >20 Mb node.
So, returning to your question, it looks like 0.2 is really not enough here.
You can actually try something like 0.25 for --haplo-divergence, but mashmap is extremely slow at low sequence identity (verkko's haplo-divergence equals (100 - mashmap_percent_identity)/100, so effectively it would be running mashmap with --pi 75) and we were never patient enough to check the results.
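The relationship described above, as a purely illustrative one-liner:

d=0.25   # candidate --haplo-divergence value
awk -v d="$d" 'BEGIN{printf "haplo-divergence %s -> mashmap --pi %g\n", d, 100 - 100*d}'
# prints: haplo-divergence 0.25 -> mashmap --pi 75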

We can make some improvements on our side; to test that they are at all reasonable we'll also need hic.byread.compressed (just the pairwise Hi-C mapping counts) from 8-hicPipeline.

@skoren, do you have any suggestions for improving bulge removal in the haploid part of this genome (see attached figure)?
(attached image: erroneous_bulges)

@rfinkers
Author

(attachments: xaa.gz, xab.gz)
OK, I needed to split the file in two in order to stay within GitHub's upload limits.

@Dmitry-Antipov, if I read your suggestion correctly, it would be OK to re-execute the run_mashmap.sh script with the modified --pi setting. Correct? Or would a rerun of the complete pipeline with the modified --haplo-divergence 0.25 setting be necessary?

@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Jun 20, 2024

You can edit that script and then rerun hic_phasing.sh to see whether the number of non-phased nodes (the phased ones are listed in hicverkko.colors.tsv) is significantly reduced or not.
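A sketch of that check, assuming the generated run_mashmap.sh currently passes --pi 80 (the value implied by --haplo-divergence 0.2) and that hicverkko.colors.tsv lists one phased node per line:

cd 8-hicPipeline
sed -i 's/--pi 80/--pi 75/' run_mashmap.sh   # hypothetical edit: 0.25 divergence -> 75% identity
bash hic_phasing.sh                          # rerun phasing as suggested above
wc -l hicverkko.colors.tsv                   # more lines than before = more nodes phased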

@skoren
Member

skoren commented Jun 30, 2024

On the bubble-popping issue: verkko doesn't expect such a wide range of heterozygosity within the same genome, and we rely on a global coverage estimate from the graph. This is why the --haploid option didn't help. Some chromosomes/components are diploid and their average coverage is per haplotype, while those with the unpopped bubbles are almost homozygous, so the coverage there is per haplotype * 2. These are considered repeats and so aren't processed. I've made a branch to address this (using local component coverage rather than global coverage) and the results look much more reasonable on the graph you shared. Once it's tested more and integrated into master, you should be able to re-run from the 5-untip step.

@rfinkers
Author

rfinkers commented Jul 2, 2024

The size of hicverkko.colors.tsv from the previous run and after rerunning hic_phasing.sh is similar. There are a few changes, but the majority of contigs are common to both files. Food for thought. Also, I'm still puzzled by the 5x coverage bubbles in the graph above. Haploid HiFi depth is 40x / homozygous 80x. Could these bubbles not have come from the incorporation of the ONT or Hi-C data?
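For a more direct comparison than file size, something like the following (a sketch; the old/ and new/ paths are hypothetical and it assumes the node name is in the first column of hicverkko.colors.tsv):

comm -12 <(cut -f1 old/8-hicPipeline/hicverkko.colors.tsv | sort) \
         <(cut -f1 new/8-hicPipeline/hicverkko.colors.tsv | sort) | wc -l   # nodes phased in both runs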

@skoren
Member

skoren commented Jul 2, 2024

The coverage on those nodes is based solely on HiFi k-mers, so they couldn't have come from the ONT data (which would have 0 HiFi k-mers), and Hi-C data doesn't add any sequence to the graph. I suspect there is either some low-level somatic variation (if this is a cell line or similar) or some kind of systematic error.

@rfinkers
Author

rfinkers commented Jul 3, 2024

It's a diploid plant species with a haploid genome of ~5 Gb. I'm slowly getting additional information from orthogonal datasets. Some chromosomes are similar, some chromosomes are diverged. At ~7.5 Gb, the total size of the assembly is not that bad (Verkko / p_utg Hifiasm ~8 Gb). I see signals that contigs from some chromosomes are nicely phase-separated, while in others the Hamming rate is extremely high. I can share some more insights, but not via this medium. The update in terms of "using local component coverage and not global coverage" you commented on three days ago sounds like one strategy that would push things forward. I'm happy to give it a try, even if it is still in a test branch, but will wait until you give the go-ahead @skoren. @Dmitry-Antipov, does it make sense to look further at the mashmap output / hicverkko.colors.tsv? Or is it clear enough that this was not the way forward?

@Dmitry-Antipov
Contributor

I have all the data needed for now. I believe it's possible to see at least some improvement on the unphased contigs, but I just haven't implemented the corresponding fix yet. I hope to update this issue at the beginning of next week.

@rfinkers
Author

rfinkers commented Jul 4, 2024

I have some other things to do, but I can focus on this project again in August.

@rfinkers
Author

@Dmitry-Antipov, is there an update (or an updated branch) for me to test? Tnx!

@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Aug 10, 2024

Hi,
sorry for the delay, I didn't finish the experiments before going on vacation.

You can try the het_fixes branch.
The constants used there are very aggressive on phasing (attaching the phasing results for the last shared graph), but some long nodes are still not phased - e.g. for utig4-18391 it is really unclear which node should be its homolog based on sequence similarity.
(attachment: hicverkko.colors10.csv)

Anyway, I'd be very careful with the phasing results here, and definitely check them with a gene-based verification such as compleasm.
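A hedged sketch for obtaining that branch (the repository location is assumed to be the main verkko GitHub repo; build and installation should follow that repo's README):

git clone https://github.com/marbl/verkko.git verkko-het_fixes
cd verkko-het_fixes
git checkout het_fixes    # the branch mentioned above
# then build/install per the README and point the rerun at this copy of verkko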

@rfinkers
Author

rfinkers commented Sep 9, 2024

I reran things from scratch, but am getting stuck processing files in the 8-hicPipeline step.
Both mashmap jobs complete (2 results from the first set; 1 result from the second step (paths2ref.mashmap)), but then the flow crashes. A restart under Slurm results in restarting a lot of the already-completed jobs (including the long (approx 1..5 weeks) mashmap jobs).

See the dag of jobs:
job count


HiC_rdnascaff 1
alignBWA 37
hicPhasing 1
indexBWA 1
mergeBWA 1
prepareScaffolding 1
runRukkiHIC 1
scaffoldMergeBWA 1
scaffoldPrefilterBAM 37
transformBWA 1
total 82

The error likely has to do with a file that cannot be found, but I have not yet figured out which one.

Is there a specific batch script left over by Snakemake (in the batch-script directory) to restart only the last part manually, to give more fine-grained control over where to start debugging?

@Dmitry-Antipov
Contributor

Which job crashed? The tail of its log file should help to figure out what happened.

If all the mashmap jobs have finished, there are quite a few scripts left to run: a bunch of scaffold_prefilter*.sh scripts, scaffold_mergeBWA.sh, and hic_scaff.sh. Possibly they have already been created in your 8-hicPipeline folder - do you see them?

My first guess is that rukki.paths.tsv & rukki.paths.gaf are the files that cause the rerun, because of suboptimal Snakemake pipeline design. Do you see them in the 8-hicPipeline folder?

@rfinkers
Author

rukki.paths.tsv & rukki.paths.gaf are there (are they identical?).

The scaffold_prefilter*.sh, scaffold_mergeBWA.sh and hic_scaff.sh are missing.

Error executing rule prepareScaffolding on cluster (jobid: 110, external: 109538, jobscript: /media/ont/ASM3.91/.snakemake/tmp.enu3oklq/verkko.prepareScaffolding.110.sh). For error details see the cluster log and the log files of the involved rule(s).

but I cannot find more detail.

@Dmitry-Antipov
Contributor

Then it likely means that the third mashmap run crashed and paths2ref.mashmap is not valid. You can check what the error was in the log, 8-hicPipeline/prepare_ref_mashmap.err.

@rfinkers
Author

---Preparing rukki paths fasta
---Running MashMap
[mashmap] MashMap v3.1.3
[mashmap] Reference = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Query = [../8-hicPipeline/paths.hpc.fasta]
[mashmap] Kmer size = 19
[mashmap] Sketch size = 690
[mashmap] Segment length = 10000 (read split allowed)
[mashmap] No block length filtering
[mashmap] Chaining gap max = 10000
[mashmap] Mappings per segment = 1
[mashmap] Percentage identity threshold = 80%
[mashmap] Do not skip self mappings
[mashmap] No hypergeometric filter
[mashmap] Mapping output file = /dev/fd/63
[mashmap] Filter mode = 3 (1 = map, 2 = one-to-one, 3 = none)
[mashmap] Execution threads = 8
[mashmap::skch::Sketch::build] minmer windows picked from reference = 737188003
[mashmap::skch::Sketch::index] unique minmers = 40515556
[mashmap::skch::Sketch::computeFreqHist] Frequency histogram of minmer interval points = (2, 10102799) ... (704532, 1)
[mashmap::skch::Sketch::computeFreqHist] With threshold 0.001%, ignore minmers occurring >= 47924 times during lookup.
[mashmap::map] time spent computing the reference index: 417.324 sec
[mashmap::skch::Map::mapQuery] WARNING, no .fai index found for ../8-hicPipeline/paths.hpc.fasta, reading the file to sum sequence length (slow)
[mashmap::skch::Map::mapQuery] mapped 100.00% @ 1.26e+04 bp/s elapsed: 05:01:41:40 remain: 00:00:00:006:04
[mashmap::skch::Map::mapQuery] count of mapped reads = 8976, reads qualified for mapping = 8978, total input reads = 8978, total input bp = 5522795566
[mashmap::map] time spent mapping the query: 4.38e+05 sec
[mashmap::map] mapping results saved in: /dev/fd/63

@Dmitry-Antipov
Contributor

This output looks normal - strange.

Can you rerun verkko with the Snakemake dry-run option (verkko <> --snakeopts '--dry-run' > dry_rerun.log) and upload the resulting log?
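A hedged, concrete form of that command (the read files stand in for whatever the original run used, and -d must point at the existing run directory):

verkko -d asm \
  --hifi hifi.fastq.gz --nano ont.fastq.gz \
  --hic1 hic_R1.fastq.gz --hic2 hic_R2.fastq.gz \
  --snakeopts '--dry-run' > dry_rerun.log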

@rfinkers
Author

(attachment: dry_rerun.log)
Sure, see the attachment.

@Dmitry-Antipov
Contributor

Dmitry-Antipov commented Sep 19, 2024

OK, it seems that there's a bug in our pipeline logic, which is present in the v2.2 release too and affects runs with the --cleanup option, which you likely used. Thank you for the help with locating it.

The simplest way to use the current run is to skip the scaffolding step completely. Hi-C phasing is already done since you have the rukki.paths.gaf file. We added scaffolding after the v2.1 release, so the results should be similar to what you had before, except that more contigs will be phased because of my changes for higher heterozygosity in the het_fixes branch.
To do so, run verkko --paths <old_assembly>/8-hicPipeline/rukki.paths.gaf --assembly <old_assembly> -d <new_output_dir> --hifi <> --ont <>
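A concrete form of that command with hypothetical names (asm_old is the existing run directory, asm_consensus a new output directory; depending on the verkko version the ONT flag may be spelled --nano, so check verkko --help):

verkko --paths asm_old/8-hicPipeline/rukki.paths.gaf --assembly asm_old \
  -d asm_consensus --hifi hifi.fastq.gz --nano ont.fastq.gz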

If you actually want to run scaffolding and not only phasing, I'll create instructions, but since we do not want to rerun the mashmap jobs it would be a bit tricky.

However, we have had no chance to test scaffolding on such heterozygous data (and the algorithm relies on the diploid structure of the genome), so not running scaffolding may well be the right choice.

@rfinkers
Author

OK, thanks. I'll try that over the weekend. I'm interested in the scaffolding as well, but that can be a later experiment. I'll first check the phasing step and validate it with orthogonal datasets. Will keep you informed. Tnx.
