Strange bias between percent_modified and valid_coverage from modkit pileup #262

Open
flofloflo-cell opened this issue Sep 13, 2024 · 8 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@flofloflo-cell

Hi,

I am interested in context-specific methylation events and I am seeing a strange association between the predicted percentage of modification and the coverage at a given site. I followed the default dorado > modkit pipeline from the nanopore documentation (auto model selection from dorado and auto threshold setting from modkit).

In the plot below, intermediate levels of modification become less frequent as the coverage goes up.

  • Is there some coverage-based correction I missed during the pileup or another step (for example, a correction that pushes percent_modified toward 0 or 1 to call each site as either 'only modified' or 'not modified')?

  • The median depth in this sample is 104 and I do not understand why there should be a higher chance of observing intermediate levels of methylation at the lower end of the coverage distribution (<50) than at the higher end (>150). If this were the result of sampling bias, wouldn't the plot be symmetrical around the median coverage?

image
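
For anyone who wants to reproduce this kind of plot, a rough sketch is below (not exactly what I ran; the bedMethyl column indices are assumptions based on the documented layout, so double-check them against the modkit documentation):

```python
# Rough sketch: load the modkit pileup bedMethyl and plot percent_modified
# against valid_coverage as a 2D histogram. The column indices used in the
# rename (0-based) are assumptions based on the bedMethyl layout -- check
# them against the modkit documentation before reusing this.
import pandas as pd
import matplotlib.pyplot as plt

# bedMethyl is headerless; fields after column 9 may be space-delimited,
# so split on any whitespace.
obs = pd.read_csv("pileup.bed", sep=r"\s+", header=None)
obs = obs.rename(columns={0: "chrom", 1: "start", 3: "mod_code", 5: "strand",
                          9: "valid_coverage", 10: "percent_modified"})

plt.hist2d(obs["valid_coverage"], obs["percent_modified"], bins=(60, 50))
plt.colorbar(label="number of sites")
plt.xlabel("valid_coverage")
plt.ylabel("percent_modified")
plt.show()
```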

Thanks!!

@Theo-Nelson

Hi flofloflo-cell,

Not the developer, but I think for this issue it would be helpful if you (a) posted the modkit filtering thresholds that were calculated and (b) ran modkit sample-probs as specified here to see if you have lower-quality modification calls in your analysis (nanoporetech/dorado#951 (comment)).
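
If it helps, plotting the sample-probs table can be as simple as something like this (the column names here are guesses; check the header of the probabilities.tsv that modkit writes and adjust accordingly):

```python
# Hypothetical sketch for plotting the modkit sample-probs output.
# The column names used here ("code", "probability", "count") are guesses;
# check the header of your probabilities.tsv and adjust as needed.
import pandas as pd
import matplotlib.pyplot as plt

probs = pd.read_csv("probabilities.tsv", sep="\t")

fig, ax = plt.subplots()
for code, sub in probs.groupby("code"):
    # one histogram of sampled call probabilities per modification code
    ax.bar(sub["probability"], sub["count"], width=0.01, alpha=0.6, label=str(code))
ax.set_xlabel("call probability")
ax.set_ylabel("count")
ax.legend(title="mod code")
plt.show()
```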

Feel free to post the sampled histograms for feedback. Best of luck!

Sincerely,
Theo

@flofloflo-cell
Author

Hi @Theo-Nelson,

Thanks for the feedback. For this sample, I got the following filtering thresholds calculated by modkit:

  • Using filter threshold 0.7832031 for A
  • Threshold of 0.640625 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.

I ran the sample-probs command and plotted the probabilities.tsv output:
image

As a test, I tried increasing --filter-threshold to 0.9 for A and replicated the plot from my previous post using only As, but I am still getting the same pattern:
image

Best,
Flo

@ArtRand
Contributor

ArtRand commented Sep 19, 2024

Hello @flofloflo-cell,

I ran a quick simulation, and it seems that the pattern you're seeing is expected. Here's how it works:

$$\begin{aligned} \textbf{C} &\sim \text{Pois}(100) \\ \textbf{M} &\sim \text{Beta}(0.5, 0.5) \\ n_{x} &\leftarrow \textbf{C} \\ p_{x} &\leftarrow \textbf{M} \\ \text{me}_{x} &\leftarrow \text{B}(n_{x}, p_{x}) \\ \text{\%mod}_{x} &= \frac{\text{me}_{x}}{n_{x}} \times 100\% \end{aligned}$$

Where

  • $\textbf{C}$ is the distribution over coverages at a position; I'm using this to simulate the "valid coverage".
  • $\textbf{M}$ is the distribution over $p$, the probability of a read reporting methylation.
  • $n_{x}$ is the coverage at position $x$.
  • $p_{x}$ is the probability that a read reports methylation at position $x$.
  • $\text{me}_{x}$, the number of reads that report methylation at position $x$, is simulated as a binomial draw given $n_{x}$ reads of coverage, each with probability $p_{x}$ of reporting methylation.

I simulated 30K sites.
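
In numpy, the setup is roughly the following (a sketch of the model above rather than my exact code):

```python
# Sketch of the simulation described above: coverage ~ Poisson(100),
# per-site methylation probability ~ Beta(0.5, 0.5), and the methylated
# read count drawn as Binomial(n_x, p_x).
import numpy as np

rng = np.random.default_rng(42)
n_sites = 30_000

n_x = rng.poisson(lam=100, size=n_sites)   # valid coverage at each site
p_x = rng.beta(0.5, 0.5, size=n_sites)     # probability a read reports methylation
me_x = rng.binomial(n_x, p_x)              # reads reporting methylation per site
pct_mod = 100.0 * me_x / n_x               # percent modified per site
```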

If we look at the marginal distributions of the coverage and the percent modification, we can see the expected pattern given the simulation setup, including a lower, but present, density of sites with intermediate levels of methylation.

image

When I attempt to make your heatmap, I can see a qualitatively similar pattern (although I think the strange semi-circles are just an artifact of the binning):

image
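
The heatmap itself is just a 2D histogram of the simulated arrays from the sketch above, something like:

```python
# 2D histogram of percent modified vs. valid coverage for the simulated
# sites (n_x and pct_mod from the sketch above); the semi-circle shapes
# largely come from the binning.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
counts, xedges, yedges, im = ax.hist2d(n_x, pct_mod, bins=(60, 50))
fig.colorbar(im, ax=ax, label="number of sites")
ax.set_xlabel("valid coverage")
ax.set_ylabel("percent modified")
plt.show()
```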

Overall, my explanation for the pattern is that intermediate $p$ values around 50% methylation are rare, so the joint probability of intermediate $p_{x}$ and high (or low) $n_{x}$ is also quite low. That's why you see the half-moon shape in the middle. To see intermediate methylation levels at higher coverage you'll need very high coverage overall, but of course these sites will still only constitute a tiny fraction of the population. I also think the shading in your plots isn't quite showing the full range of density (not claiming that mine is perfect either).

Your probability histograms look completely normal to me. As an aside, the more recent versions of modkit will produce html plots for you when you use --hist.

ArtRand added the question label Sep 19, 2024
@flofloflo-cell
Author

Hello @ArtRand,

Thank you for the reply. Running a simulation is definitely a great idea. The circular patterns/strange semi-circles on the left of the plot are indeed an artifact of the shrinking set of possible percent_modified values as valid_coverage goes down.

Nevertheless, I feel that the patterns in the simulation and the observed data are not exactly alike. I placed two boxes on each plot, centered around the mean of valid_coverage, which in my opinion shows that the simulated pattern is more symmetrical than the observed one.
image

I also reproduced the simulation in order to compare it to the observed data more accurately (except that I resampled from the observed valid_coverage instead of the Poisson distribution to get a more comparable spread on the x-axis).
image

Intermediate methylation levels are indeed rare in my dataset, and the simulation plot definitely shows that the joint probability of intermediate $p_x$ and high/low $n_x$ is low. In the observed data plot, however, the joint probability of intermediate $p_x$ seems higher at low $n_x$ than at high $n_x$, which does not feel intuitive at all. If intermediate methylation can already be observed at lower coverage, it is also not intuitive that even higher sequencing depth would be required to observe intermediate methylation at higher coverage.

To provide another example, below is the data (from the same dataset) but for a specific 6mA target motif (~15,000 adenines). Positions with methylation < 80% represent ~5% of the dataset (>700 adenines), and the same asymmetry can be observed.
image

Thanks for the help!

Best,
Flo

@marcus1487
Contributor

You have to match both marginal distributions in your simulation. Could you also sample from the percent methylated distribution? This will give a good indication of whether there is a linked effect between coverage and percent methylation, or whether the simulated distribution matches the observed one.
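
Something along these lines would do it (a sketch; `obs` is assumed to be a DataFrame holding the bedMethyl records with valid_coverage and percent_modified columns):

```python
# Sketch of matching both marginals: draw coverage and methylation fraction
# independently from the observed empirical distributions, then add the
# binomial sampling step. If this reproduces the observed joint distribution,
# there is no linked effect between coverage and percent methylation.
import numpy as np

rng = np.random.default_rng(1)
n_sites = len(obs)

cov = obs.loc[obs["valid_coverage"] > 0, "valid_coverage"].to_numpy()
n_x = rng.choice(cov, size=n_sites, replace=True)
p_x = rng.choice(obs["percent_modified"].to_numpy() / 100.0, size=n_sites, replace=True)
me_x = rng.binomial(n_x, p_x)
pct_mod = 100.0 * me_x / n_x
```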

@flofloflo-cell
Author

Hi @marcus1487,

That's a good point, since the distribution of the observed percent_modified does look quite different from a Beta(0.5, 0.5) distribution. I have reproduced the plot, sampling from both marginal distributions (and also added side histograms and the median coverage as a red dotted line, for reference).

The simulated pattern got closer to the observed data, but the intermediate methylation region looks much more symmetrical between higher and lower coverage in the simulation than in the observed data, in particular when comparing 50-75x and 125-150x depth.

image

Best,
Flo

@ArtRand
Contributor

ArtRand commented Sep 26, 2024

@flofloflo-cell,

Do these plots combine the 6mA and 5mC(G) bedMethyl records? The percent-modified distribution makes me think you're mostly looking at 6mA calls (obviously there will be many more 6mA calls than CpG calls).

Another basic question: does the pattern change when you use --no-filtering during modkit pileup? It's possible that the intermediate $p_x$ contexts are lower confidence and are more often filtered out.

I agree that the latest pair of plots don't look similar, but it does feel like we're looking for subtle changes in the tails of these distributions. It might be worth running the simulation with the same number of samples as the observed data; I can see that the highest count in the observed plot is ~3x higher than in the simulation. To be honest, however, I don't expect the overall pattern to change much. Maybe also drop the column of zero percent methylation to see more of the gradation in the tail.

Let's look at just one modification at a time (6mA or 5mC), maybe you already have this, and try with --no-filtering to see if the pattern persists.
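
For the per-modification split and the count matching, something as simple as this should work (assuming `obs` carries a mod-code column; the single-letter codes below are assumptions, so check column 4 of your bedMethyl):

```python
# Keep one modification at a time and match the simulated site count to the
# observed one so the count scales are comparable. The mod codes are
# assumptions ("a" = 6mA, "m" = 5mC); check column 4 of your bedMethyl.
obs_6ma = obs[obs["mod_code"] == "a"]
obs_5mc = obs[obs["mod_code"] == "m"]

n_sites = len(obs_6ma)   # use this as the number of simulated sites
```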

@flofloflo-cell
Author

@ArtRand

Thanks for all the suggestions!

I started with the separate plots for 6mA and 5mC:
image

Then I generated the same plots with the --no-filtering option, which definitely look quite different and much more symmetrical:
image

I plotted 'with' against 'without' filtering for percent_modified and valid_coverage to understand where the shift is coming from. With the --no-filtering option, there is both a shift toward higher coverage and a shift toward intermediate methylation, which seem to fill in the gap. So as you mentioned, the intermediate $p_x$ contexts seem to have lower confidence and are more often filtered out, leading to an overall lower coverage.
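
The comparison was made roughly like this (a sketch rather than my exact code; `filt` and `nofilt` are assumed to be the filtered and --no-filtering bedMethyl files, loaded with the same column naming as in the earlier sketches):

```python
# Sketch of the with- vs. without-filtering comparison: join the two pileup
# outputs on position and modification code, then compare percent_modified
# and valid_coverage between the two runs.
import matplotlib.pyplot as plt

key = ["chrom", "start", "strand", "mod_code"]
merged = filt.merge(nofilt, on=key, suffixes=("_filt", "_nofilt"))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist2d(merged["percent_modified_filt"], merged["percent_modified_nofilt"], bins=50)
axes[0].set_xlabel("percent_modified (filtered)")
axes[0].set_ylabel("percent_modified (--no-filtering)")
axes[1].hist2d(merged["valid_coverage_filt"], merged["valid_coverage_nofilt"], bins=50)
axes[1].set_xlabel("valid_coverage (filtered)")
axes[1].set_ylabel("valid_coverage (--no-filtering)")
plt.tight_layout()
plt.show()
```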

In your opinion, considering how much data is getting filtered out, are the intermediate methylation values biologically meaningful, or are they a sampling-bias artifact resulting from the uncertainty of the basecalling model for these specific contexts?
image

Best,
Flo
