Merge pull request #223: Add weighted sampling docs

nextstrain · Aug 22, 2024 · 21e038d · 21e038d
2 parents ab86278 + c6084f3
commit 21e038d
Showing 1 changed file with 130 additions and 27 deletions.
diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -53,6 +53,7 @@ Selection method:
    * - Subsampling
      - * ``--subsample-max-sequences``
        * ``--group-by``
+       * ``--group-by-weights``
        * ``--sequences-per-group``
        * ``--probabilistic-sampling``
        * ``--no-probabilistic-sampling``
@@ -158,10 +159,11 @@ Subsampling
 
 Another common filtering operation is **subsampling**: selection of data using
 rules based on output size rather than individual sequence attributes. These are
-the sampling methods supported by ``augur filter`` and a final section for caveats:
+the sampling methods supported by ``augur filter``:
 
 .. contents::
    :local:
+   :depth: 2
 
 Random sampling
 ---------------
@@ -181,14 +183,25 @@ For example, limit the output to 100 sequences:
      --output-metadata subsampled_metadata.tsv
 
 Random sampling is easy to define but can expose sampling bias in some datasets.
-Consider uniform sampling to reduce sampling bias.
+Consider using grouped sampling to reduce sampling bias.
 
-Uniform sampling
+Grouped sampling
 ----------------
 
 ``--group-by`` allows you to partition the data into groups based on column
-values and sample uniformly. For example, sample evenly across countries over
-time:
+values and sample a number of sequences per group.
+
+Grouped sampling can be further divided into two types with a final section for
+caveats:
+
+.. contents::
+   :local:
+
+Uniform sampling
+~~~~~~~~~~~~~~~~
+
+By default (i.e. without ``--group-by-weights``), ``--group-by`` will sample
+uniformly across groups. For example, sample evenly across regions over time:
 
 .. code-block:: bash
 
@@ -197,15 +210,15 @@ time:
      --metadata data/metadata.tsv \
      --min-date 2012 \
      --exclude exclude.txt \
-     --group-by country year month \
+     --group-by region year month \
      --subsample-max-sequences 100 \
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
 An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``.
 This is useful if you care less about total sample size and more about having
 a fixed number of sequences from each group. For example, target one sequence
-per month from each country:
+per month from each region:
 
 .. code-block:: bash
 
@@ -214,19 +227,57 @@ per month from each country:
      --metadata data/metadata.tsv \
      --min-date 2012 \
      --exclude exclude.txt \
-     --group-by country year month \
+     --group-by region year month \
      --sequences-per-group 1 \
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
+Weighted sampling
+~~~~~~~~~~~~~~~~~
+
+``--group-by-weights`` can be specified in addition to ``--group-by`` to allow
+different target sizes per group. For example, target twice the amount of
+sequences from Asia compared to other regions using this ``weights.tsv`` file:
+
+.. list-table::
+   :header-rows: 1
+
+   * - region
+     - weight
+   * - Asia
+     - 2
+   * - default
+     - 1
+
+and command:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by region year month \
+     --group-by-weights weights.tsv \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+The weights file format is described in ``augur filter`` docs for
+``--group-by-weights``.
+
+Caveats
+~~~~~~~
+
 Probabilistic sampling
-----------------------
+``````````````````````
 
-It is possible to encounter situations in uniform sampling where the number of
-groups exceeds the target sample size. For example, consider a command with
-groups defined by ``--group-by country year month`` and target sample size
-defined by ``--subsample-max-sequences 100``. If the input contains data from 5
-countries over a span of 24 months, that could result in 120 groups.
+It is possible to encounter situations where the number of groups exceeds the
+target sample size. For example, consider a command with groups defined by
+``--group-by region year month`` and target sample size defined by
+``--subsample-max-sequences 100``. If the input contains data from 5 regions
+over a span of 24 months, that could result in 120 groups.
 
 The only way to target 100 sequences from 120 groups is to apply **probabilistic
 sampling** which randomly determines a whole number of sequences per group. This
@@ -239,11 +290,12 @@ is noted in the output:
    possible to have more than the requested maximum of 100 sequences after
    filtering.
 
-This is automatically enabled. To force the command to exit with an error in
-these situations, use ``--no-probabilistic-sampling``.
+This is automatically enabled. ``--no-probabilistic-sampling`` can be used with
+uniform sampling to force the command to exit with an error in these situations.
+It is always be enabled for weighted sampling.
 
-Caveats
--------
+Undersampling
+`````````````
 
 For these sampling methods, the number of targeted sequences per group does not
 take into account the actual number of sequences available in the input data.
@@ -257,17 +309,62 @@ Subsampling using multiple ``augur filter`` commands
 ====================================================
 
 There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
+does not suffice or is difficult to create. One such strategy is "tiered
+subsampling". In this strategy, mutually exclusive sets of filters, each
+representing a "tier", are sampled with different subsampling rules. This is
+commonly used to create geographic tiers. Consider this subsampling scheme:
+
+   Sample 200 sequences from Washington state and 100 sequences from the rest of
+   the United States.
+
+This can be approximated by first selecting all sequences from the United States
+then sampling with these weights:
+
+.. list-table::
+    :header-rows: 1
+
+    * - state
+      - weight
+    * - WA
+      - 200
+    * - default
+      - 2.04
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences sequences.fasta \
+     --metadata metadata.tsv \
+     --query "country == 'USA'" \
+     --group-by state \
+     --group-by-weights weights.tsv \
+     --subsample-max-sequences 300 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
+
+This approach has some caveats:
 
-   Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
+1. It relies on a calculation to determine weights, making it less intuitive:
 
-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
+   .. math::
+
+     {n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}}
+     =                        100 * \frac{1}{49}
+     \approx                 1.02
+
+2. Achieving a full *100 sequences from the rest of the United States* requires
+   at least 2 sequences from each of the remaining states. This may not be
+   possible if some states are under-sampled.
+
+Intuitiveness for caveat (1) can be improved by adding a comment to the weights
+file. However, caveat (2) is an inherent limitation of what is effectively
+uniform sampling across all other states. The only way to get around this in
+``augur filter`` is **random sampling** across states, but that is not possible
+when ``state`` is used as a grouping column.
+
+An alternative approach is to decompose this into multiple schemes, each handled
+by a single call to ``augur filter``. Additionally, there is an extra step to
+combine the intermediate samples.
 
    1. Sample 100 sequences from Washington state.
    2. Sample 50 sequences from the rest of the United States.
@@ -313,6 +410,12 @@ and ``--include`` to sample the data based on the intermediate strain list
 files. If the same strain appears in both files, ``augur filter`` will only
 write it once in each of the final outputs.
 
+.. note::
+
+   The 2nd sample does not use ``--group-by``, implying **random sampling**
+   across states. This differs from previous approach that used a single ``augur
+   filter`` command with weighted sampling.
+
 Generalizing subsampling in a workflow
 --------------------------------------