Commit
Merge branch 'ar/prep-031' into 'master'
Prepare v0.3.1

See merge request machine-learning/modkit!195
ArtRand committed Jun 22, 2024
2 parents 67a2eea + 2454273 commit 1e64664
Showing 8 changed files with 132 additions and 17 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,14 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.3.1]
### Fixes
- [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical.
- [dmr, segment] Add pseudo-count to avoid -inf in HMM.
- [find-motifs] Fix crash in exhaustive search.
### Adds
- [dmr] Allow specification of mod code-to-primary base on the command line with `--assign-code`.
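The pseudo-count fix listed above for `dmr` and `segment` can be illustrated with a generic sketch. This is not modkit's implementation; the function `log_prob` and its parameters are hypothetical and only show why adding a small pseudo-count keeps log-space HMM probabilities finite when an observed count is zero.

```python
import math


def log_prob(count, total, n_outcomes, pseudo=0.0):
    """Log of a smoothed frequency estimate (hypothetical helper).

    With pseudo == 0 a zero count gives probability 0, whose log is -inf.
    A positive pseudo-count is added to every outcome, so the estimate
    stays strictly positive and its log stays finite.
    """
    p = (count + pseudo) / (total + pseudo * n_outcomes)
    return math.log(p) if p > 0 else float("-inf")


# Zero count, no smoothing: the log-probability is -inf.
unsmoothed = log_prob(0, 10, 2)
# Same data with a pseudo-count of 1 per outcome: finite.
smoothed = log_prob(0, 10, 2, pseudo=1.0)
```

A single `-inf` term can dominate or invalidate a sum of log-probabilities along an HMM path, which is why smoothing of this kind is a common remedy.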

## [v0.3.1rc1]
### Fixes
- [find-motifs] Bug where error would be reported when output tables are specified. Fixes #195
37 changes: 32 additions & 5 deletions book/src/advanced_usage.md
@@ -797,7 +797,15 @@ Options:
row for each base modification call in that read using the same thresholding algorithm as
in pileup, or summary (see online documentation for details on thresholds). Passing this
option will cause `modkit` to estimate the pass thresholds from the data unless a
`--filter-threshold` value is passed to the command. (alias: --read-calls)
`--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
stdout, but note that you cannot stream this table and the raw extract table to stdout.
--pass-only
Only output base modification calls that pass the minimum confidence threshold. (alias:
pass)
--no-headers
Don't print the header lines in the output tables.
--reference <REFERENCE>
Path to reference FASTA to extract reference context information from. If no reference is
@@ -1415,7 +1423,7 @@ Options:
compared at each site.
--ref <REFERENCE_FASTA>
Path to reference fasta for used in the pileup/alignment.
Path to reference fasta for used in the pileup/alignment
--segment <SEGMENTATION_FP>
Run segmentation, output segmented differentially methylated regions to this file.
@@ -1454,10 +1462,20 @@ Options:
Preset HMM segmentation parameters for higher propensity to switch from "Same" to
"Different" state. Results will be shorter segments, but potentially higher sensitivity.
-m <MODIFIED_BASES>
-m, --base <MODIFIED_BASES>
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.
--assign-code <MOD_CODE_ASSIGNMENTS>
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.
--log-filepath <LOG_FILEPATH>
File to write logs to, it's recommended to use this option.
@@ -1564,9 +1582,18 @@ Options:
Prefix files in directory with this label.
--ref <REFERENCE_FASTA>
Path to reference fasta for the pileup.
-m <MODIFIED_BASES>
-m, --base <MODIFIED_BASES>
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.
--assign-code <MOD_CODE_ASSIGNMENTS>
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.
--log-filepath <LOG_FILEPATH>
File to write logs to, it's recommended to use this option.
-t, --threads <THREADS>
10 changes: 10 additions & 0 deletions book/src/intro_dmr.md
@@ -158,6 +158,16 @@ modkit dmr pair \

Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the [scoring section](./dmr_scoring_details.md) and [limitations](./limitations.md) for additional details.

### Note about modification codes
The `modkit dmr` commands require the `--base` option to determine which genome positions to compare, e.g. `--base C` tells `modkit` to compare methylation at cytosine bases.
You may use this option multiple times to compare methylation at multiple primary sequence bases.
It is possible that, during `pileup`, a read will have both a mismatch and a modification call, such as a C->A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
To filter out bedMethyl records like this, `modkit` uses the [SAM specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
For example, `h` is 5hmC and applies to cytosine bases, while `a` is 6mA and applies to adenine bases.
However, `modkit pileup` does not require that you use only modification codes in the specification.
If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use `--assign-code <mod_code>:<primary_base>` to indicate that the code applies to a given primary sequence base.
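
The filtering rule described in this note can be sketched in a few lines of Python. This is an illustrative sketch only, not modkit's code: `SPEC_CODES` holds just three codes from the SAM specification (`m`, `h`, `a`), and the helpers `parse_assignment` and `keep_record` are hypothetical names. The sketch assumes a bedMethyl record is kept only when its modification code maps to one of the bases passed with `--base`, with `--assign-code`-style strings extending the mapping.

```python
# Hypothetical sketch of the code-to-primary-base filter; not modkit's code.
# A small subset of modification codes from the SAM tags specification.
SPEC_CODES = {
    "m": "C",  # 5mC
    "h": "C",  # 5hmC
    "a": "A",  # 6mA
}


def parse_assignment(text):
    """Parse '<mod_code>:<primary_base>', e.g. 'x:C', into (code, base)."""
    code, sep, base = text.partition(":")
    if not sep or not code or base not in {"A", "C", "G", "T"}:
        raise ValueError(f"invalid assignment: {text!r}")
    return code, base


def keep_record(mod_code, bases, extra_assignments=()):
    """Return True if a record with mod_code should be used for these bases."""
    table = dict(SPEC_CODES)
    # Extra assignments extend (or override) the specification mapping.
    table.update(parse_assignment(a) for a in extra_assignments)
    primary = table.get(mod_code)
    # Codes with no known primary base are not used.
    return primary is not None and primary in bases
```

Under these assumptions, `keep_record("a", {"C"})` is false (a 6mA call does not apply to cytosine), and a custom code `x` is only used once it is assigned, as in `keep_record("x", {"C"}, ["x:C"])`.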


## Differential methylation output format
The output from `modkit dmr pair` (and for each pairwise comparison with `modkit dmr multi`) is (roughly)
a BED file with the following schema:
37 changes: 32 additions & 5 deletions docs/advanced_usage.html
@@ -956,7 +956,15 @@ <h2 id="extract"><a class="header" href="#extract">extract</a></h2>
row for each base modification call in that read using the same thresholding algorithm as
in pileup, or summary (see online documentation for details on thresholds). Passing this
option will cause `modkit` to estimate the pass thresholds from the data unless a
`--filter-threshold` value is passed to the command. (alias: --read-calls)
`--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
stdout, but note that you cannot stream this table and the raw extract table to stdout.

--pass-only
Only output base modification calls that pass the minimum confidence threshold. (alias:
pass)

--no-headers
Don't print the header lines in the output tables.

--reference &lt;REFERENCE&gt;
Path to reference FASTA to extract reference context information from. If no reference is
@@ -1562,7 +1570,7 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
compared at each site.

--ref &lt;REFERENCE_FASTA&gt;
Path to reference fasta for used in the pileup/alignment.
Path to reference fasta for used in the pileup/alignment

--segment &lt;SEGMENTATION_FP&gt;
Run segmentation, output segmented differentially methylated regions to this file.
@@ -1601,10 +1609,20 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
Preset HMM segmentation parameters for higher propensity to switch from "Same" to
"Different" state. Results will be shorter segments, but potentially higher sensitivity.

-m &lt;MODIFIED_BASES&gt;
-m, --base &lt;MODIFIED_BASES&gt;
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.


--assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.

--log-filepath &lt;LOG_FILEPATH&gt;
File to write logs to, it's recommended to use this option.

@@ -1709,9 +1727,18 @@ <h2 id="dmr-multi"><a class="header" href="#dmr-multi">dmr multi</a></h2>
Prefix files in directory with this label.
--ref &lt;REFERENCE_FASTA&gt;
Path to reference fasta for the pileup.
-m &lt;MODIFIED_BASES&gt;
-m, --base &lt;MODIFIED_BASES&gt;
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.
--assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.
--log-filepath &lt;LOG_FILEPATH&gt;
File to write logs to, it's recommended to use this option.
-t, --threads &lt;THREADS&gt;
8 changes: 8 additions & 0 deletions docs/intro_dmr.html
@@ -309,6 +309,14 @@ <h2 id="3-detecting-differential-modification-at-single-base-positions"><a class
--log-filepath dmr.log
</code></pre>
<p>Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the <a href="./dmr_scoring_details.html">scoring section</a> and <a href="./limitations.html">limitations</a> for additional details.</p>
<h3 id="note-about-modification-codes"><a class="header" href="#note-about-modification-codes">Note about modification codes</a></h3>
<p>The <code>modkit dmr</code> commands require the <code>--base</code> option to determine which genome positions to compare, e.g. <code>--base C</code> tells <code>modkit</code> to compare methylation at cytosine bases.
You may use this option multiple times to compare methylation at multiple primary sequence bases.
It is possible that, during <code>pileup</code>, a read will have both a mismatch and a modification call, such as a C-&gt;A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
To filter out bedMethyl records like this, <code>modkit</code> uses the <a href="https://samtools.github.io/hts-specs/SAMtags.pdf">SAM specification</a> (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
For example, <code>h</code> is 5hmC and applies to cytosine bases, while <code>a</code> is 6mA and applies to adenine bases.
However, <code>modkit pileup</code> does not require that you use only modification codes in the specification.
If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use <code>--assign-code &lt;mod_code&gt;:&lt;primary_base&gt;</code> to indicate that the code applies to a given primary sequence base.</p>
<h2 id="differential-methylation-output-format"><a class="header" href="#differential-methylation-output-format">Differential methylation output format</a></h2>
<p>The output from <code>modkit dmr pair</code> (and for each pairwise comparison with <code>modkit dmr multi</code>) is (roughly)
a BED file with the following schema:</p>
45 changes: 40 additions & 5 deletions docs/print.html
@@ -898,6 +898,14 @@ <h2 id="3-detecting-differential-modification-at-single-base-positions"><a class
--log-filepath dmr.log
</code></pre>
<p>Keep in mind that the MAP-based p-value provided in single-site analysis is based on a "modified" vs "unmodified" model, see the <a href="./dmr_scoring_details.html">scoring section</a> and <a href="./limitations.html">limitations</a> for additional details.</p>
<h3 id="note-about-modification-codes"><a class="header" href="#note-about-modification-codes">Note about modification codes</a></h3>
<p>The <code>modkit dmr</code> commands require the <code>--base</code> option to determine which genome positions to compare, e.g. <code>--base C</code> tells <code>modkit</code> to compare methylation at cytosine bases.
You may use this option multiple times to compare methylation at multiple primary sequence bases.
It is possible that, during <code>pileup</code>, a read will have both a mismatch and a modification call, such as a C-&gt;A mismatch and a 6mA call on that A, and you may not want to use that 6mA call when calculating the differential methylation metrics.
To filter out bedMethyl records like this, <code>modkit</code> uses the <a href="https://samtools.github.io/hts-specs/SAMtags.pdf">SAM specification</a> (page 9) of modification codes to determine which modification codes apply to which primary sequence bases.
For example, <code>h</code> is 5hmC and applies to cytosine bases, while <code>a</code> is 6mA and applies to adenine bases.
However, <code>modkit pileup</code> does not require that you use only modification codes in the specification.
If your bedMethyl has records with custom modification codes or codes that aren't in the specification yet, use <code>--assign-code &lt;mod_code&gt;:&lt;primary_base&gt;</code> to indicate that the code applies to a given primary sequence base.</p>
<h2 id="differential-methylation-output-format"><a class="header" href="#differential-methylation-output-format">Differential methylation output format</a></h2>
<p>The output from <code>modkit dmr pair</code> (and for each pairwise comparison with <code>modkit dmr multi</code>) is (roughly)
a BED file with the following schema:</p>
@@ -2026,7 +2034,15 @@ <h2 id="extract"><a class="header" href="#extract">extract</a></h2>
row for each base modification call in that read using the same thresholding algorithm as
in pileup, or summary (see online documentation for details on thresholds). Passing this
option will cause `modkit` to estimate the pass thresholds from the data unless a
`--filter-threshold` value is passed to the command. (alias: --read-calls)
`--filter-threshold` value is passed to the command. Use 'stdout' to stream this table to
stdout, but note that you cannot stream this table and the raw extract table to stdout.

--pass-only
Only output base modification calls that pass the minimum confidence threshold. (alias:
pass)

--no-headers
Don't print the header lines in the output tables.

--reference &lt;REFERENCE&gt;
Path to reference FASTA to extract reference context information from. If no reference is
@@ -2632,7 +2648,7 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
compared at each site.

--ref &lt;REFERENCE_FASTA&gt;
Path to reference fasta for used in the pileup/alignment.
Path to reference fasta for used in the pileup/alignment

--segment &lt;SEGMENTATION_FP&gt;
Run segmentation, output segmented differentially methylated regions to this file.
@@ -2671,10 +2687,20 @@ <h2 id="dmr-pair"><a class="header" href="#dmr-pair">dmr pair</a></h2>
Preset HMM segmentation parameters for higher propensity to switch from "Same" to
"Different" state. Results will be shorter segments, but potentially higher sensitivity.

-m &lt;MODIFIED_BASES&gt;
-m, --base &lt;MODIFIED_BASES&gt;
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.


--assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.

--log-filepath &lt;LOG_FILEPATH&gt;
File to write logs to, it's recommended to use this option.

@@ -2779,9 +2805,18 @@ <h2 id="dmr-multi"><a class="header" href="#dmr-multi">dmr multi</a></h2>
Prefix files in directory with this label.
--ref &lt;REFERENCE_FASTA&gt;
Path to reference fasta for the pileup.
-m &lt;MODIFIED_BASES&gt;
-m, --base &lt;MODIFIED_BASES&gt;
Bases to use to calculate DMR, may be multiple. For example, to calculate differentially
methylated regions using only cytosine modifications use --base C.
--assign-code &lt;MOD_CODE_ASSIGNMENTS&gt;
Extra assignments of modification codes to their respective primary bases. In general,
modkit dmr will use the SAM specification to know which modification codes are appropriate
to use for a given primary base. For example "h" is the code for 5hmC, so is appropriate
for cytosine bases, but not adenine bases. However, if your bedMethyl file contains custom
codes or codes that are not part of the specification, you can specify which primary base
they belong to here with --assign-code x:C meaning associate modification code "x" with
cytosine (C) primary sequence bases. If a code is encountered that is not part of the
specification, the bedMethyl record will not be used, this will be logged.
--log-filepath &lt;LOG_FILEPATH&gt;
File to write logs to, it's recommended to use this option.
-t, --threads &lt;THREADS&gt;
2 changes: 1 addition & 1 deletion docs/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/searchindex.json

Large diffs are not rendered by default.
