
[WIP] count_allele_calls #114

Merged · 16 commits · Aug 25, 2020

Conversation

@timothymillar (Collaborator) commented Aug 16, 2020

See issue #85

This implements an additional jitted function, count_call_alleles_ndarray_jit, for ndarrays only rather than doing it all in dask.
This approach seems to be in line with the goals outlined here, but I'm happy to replicate the approach of the original count_alleles function if that is preferred.

Likewise, I can rewrite count_alleles using this approach, which should improve performance (mainly on chunked arrays, due to njit(..., nogil=True)).

I haven't added numba to requirements.txt or setup.py because that is done in #76.

I guess we can add numba now for CI and fix the conflict later.
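Roughly, such an ndarray-only jitted counter might look like this (a hypothetical sketch; the function name is from this PR, but the body is illustrative and uses the -1 sentinel, whereas the PR itself also consults the mask variable):

import numpy as np
from numba import njit

@njit(nogil=True)
def count_call_alleles_ndarray_jit(g, n_alleles):
    # g: (variants, samples, ploidy) int8 genotype calls, with -1 meaning missing
    out = np.zeros((g.shape[0], g.shape[1], n_alleles), dtype=np.int32)
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            for k in range(g.shape[2]):
                a = g[i, j, k]
                if a >= 0:  # skip the missing sentinel
                    out[i, j, a] += 1
    return out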

@eric-czech (Collaborator) commented Aug 16, 2020

Thanks for picking this up @timothymillar!

I've been hesitant to wade into numba for things like counting (I put a few thoughts at https://github.com/pystatgen/sgkit/issues/49#issuecomment-674561391), but I agree it's probably the right choice here.

It's a shame that there doesn't seem to be a good way to do this with Dask/Xarray, but if we do lose the freedom in array backends, it would be nice to still support CPU/GPU easily. Have you thought at all about using a gufunc for this, which would make it easier to compile for either? I think the whole module could collapse down to something like this in that case, where using the sentinel missing value also simplifies things vs. using the separate mask variable:

import dask.array as da
import numba
import xarray as xr
from xarray import DataArray

@numba.guvectorize([numba.void(numba.int8[:], numba.int32[:], numba.int32[:])], '(n),(k)->(k)')
def count_alleles(x, _, out):
    # zero the output, then count each non-missing (>= 0) allele call
    out[:] = 0
    for v in x:
        if v >= 0:
            out[v] += 1
            
def count_call_alleles(ds) -> DataArray:
    G = da.asarray(ds.call_genotype)
    # This array is only necessary to tell dask/numba what the 
    # dimensions and dtype are for the output array
    O = da.empty(G.shape[:2] + (ds.dims['alleles'],), dtype='int32')
    O = O.rechunk(G.chunks[:2] + (-1,))
    return xr.DataArray(
        count_alleles(G, O),
        dims=('variants', 'samples', 'alleles'),
        name='call_allele_count'
    )

def count_variant_alleles(ds) -> DataArray:
    return (
        count_call_alleles(ds)
        .sum(dim='samples')
        .rename('variant_allele_count')
    )

If there isn't a big performance difference between a custom kernel/function and a sum over per-call allele counts, I'd also be a proponent of simply removing the count_variant_alleles function and leaving that to users in the future.

@timothymillar (Collaborator, Author)

Have you thought at all about using a gufunc for this

Good idea, I'll give it a go

where using the sentinel missing value also simplifies things vs using the separate mask variable

I was a little uncertain about the application of the mask, as there seem to be two options when it comes to a user masking out additional values (which I assume is an intended feature at some point):

  1. The user updates the mask (via some API) but the genotype arrays are unchanged, hence any function has to use the mask itself, either by updating a copy of the genotype calls (as in the current implementation of count_alleles) or by using the mask directly, as in this PR.
  2. The user updates the mask (via some API) and then applies it to their Dataset, which replaces the (alleles within) genotype calls with the sentinel value in the original Dataset (or in a copy).

Essentially, should functions working on a Dataset trust the mask or the sentinel values?

@eric-czech (Collaborator)

Essentially, should functions working on a Dataset trust the mask or the sentinel values?

The mask is only a convenience on ds.call_genotype < 0, so I think it's best to use the sentinel values until both are needed.

when it comes to a user masking out additional values (which I assume is an intended feature at some point)

We haven't talked about that yet (feel free to file an issue), but afaik Dask still doesn't support assignment, so I see that process as the following two steps (a sketch follows the list):

  1. An operation defines a transformation of the data array that makes additional values equal to the missing sentinel
  2. The mask is redefined as mask = arr < 0
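Since assignment isn't available, step 1 presumably goes through something like da.where (an illustrative sketch, not an agreed API; some_condition is a placeholder for whatever predicate selects the values to mask):

import dask.array as da

arr = da.asarray(ds["call_genotype"].data)
arr = da.where(some_condition, -1, arr)  # write the missing sentinel instead of assigning in place
mask = arr < 0                           # the mask is then rederived from the sentinel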

@timothymillar (Collaborator, Author)

@eric-czech The gufunc version is working nicely, but I'm having some issues satisfying the CI with the use of guvectorize.
Do you know what's causing the Sphinx doc issue?

A couple of notes about the implementation:

  • The gufunc returns an array with dtype uint8. I don't think this will be an issue, but if it is, the dtype of the dummy array can be used to indicate the output dtype.
  • I'm not sure about the docstring style preference, as there is a bit of variation in the repo.
  • I reversed your signature from '(n),(k)->(k)' to '(k),(n)->(n)' because the genotype vector has a length of ploidy, and k is commonly used for ploidy in the (polyploid) literature.

@eric-czech (Collaborator) commented Aug 19, 2020

Nice @timothymillar! This looks great.

The gufunc is returning an array with dtype uint8, I don't think this will be an issue but if it is then the dtype of the dummy array can be used to indicate the output dtype.

Perfect, makes sense.

I'm not sure on the docstring style preference as there is a bit of variation in the repo.

Me neither. I started using the Dask style like that, but then switched to referencing our ArrayLike type once that was added, as did @tomwhite. I was waiting until Sphinx was up and running before trying to overhaul the signatures, but https://github.com/pystatgen/sgkit/pull/124 would change a lot of it anyhow. I'm not sure what the standard should be until some more dust settles. Let us know if you have any thoughts on it.

I reversed your signature from '(n),(k)->(k)' to '(k),(n)->(n)' because the genotype vector has a length of ploidy and k is commonly used for ploidy in (polyploid) literature.

Good call!

Do you know what's causing the Sphinx doc issue?

Hmm I do not. Does it go away if you do from numba import guvectorize and use the decorator that way? I'm not sure if the issue is with that or the numba.int8[:] signature annotations. Perhaps it will work with strings like int8[:] instead?
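For example, the reversed '(k),(n)->(n)' signature with string type annotations would look something like this (a sketch, not the exact code in this PR):

from numba import guvectorize

@guvectorize(["void(int8[:], uint8[:], uint8[:])"], "(k),(n)->(n)", nopython=True)
def count_alleles(g, _, out):
    # g is one genotype call of length k (ploidy); out has length n (alleles)
    out[:] = 0
    for a in g:
        if a >= 0:
            out[a] += 1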

n_alleles = ds.dims["alleles"]
G = da.asarray(ds["call_genotype"])
shape = (G.chunks[0], G.chunks[1], n_alleles)
N = da.empty(n_alleles, dtype=np.uint8)
Collaborator:

Good idea using this instead! For the subsequent map_blocks call it might be a good idea to ensure this has only a single chunk so that the blocks broadcast. I can't imagine why anyone would ever configure the default chunk size to be so small that n_alleles items would result in multiple chunks, but guarding against it is probably a good idea.
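For instance (an untested sketch; da.empty accepts a chunks argument):

N = da.empty(n_alleles, dtype=np.uint8, chunks=-1)  # chunks=-1 forces a single chunk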

Collaborator (Author):

I can't imagine why anyone would ever configure the default chunk size to be so small that n_alleles items would result in multiple chunks

Neither, but this edge case should be handled correctly by this PR; see below.

Collaborator:

Ah I see. Thoughts on a da.asarray(ds["call_genotype"].chunk(chunks=dict(ploidy=-1))) then? I think ploidy chunks of size one could actually be pretty common when concatenating haplotype arrays.

I could also see the argument for leaving it up to users to interpret those error messages and rechunk themselves, so that they have to think through the performance implications, but I had already started being defensive about that in the GWAS regression methods. My rationale was that it's better to optimize for a less frustrating user experience than for making performance concerns prominent (and we could always add warnings for that later).

Collaborator:

Ohh nm, I see why it doesn't matter now (and the test for it). Ignore that.

shape = (G.chunks[0], G.chunks[1], n_alleles)
N = da.empty(n_alleles, dtype=np.uint8)
return xr.DataArray(
da.map_blocks(count_alleles, G, N, chunks=shape, drop_axis=2, new_axis=2),
@eric-czech (Collaborator) commented Aug 19, 2020:

Out of curiosity, what keeps this from working as count_alleles(G, N) instead? Do the block shapes need to be identical?

Collaborator (Author):

count_alleles(G, N) results in an error if G is chunked in the ploidy dimension. See comment below for an example.

Collaborator:

👍

@hammer mentioned this pull request Aug 19, 2020
@timothymillar (Collaborator, Author) commented Aug 19, 2020

Here's a minimal example explaining the use of map_blocks.

Setup:

import numpy as np
import dask.array as da

from sgkit.stats.aggregation import count_alleles

n_alleles = 2
genotypes = np.array(
    [[[ 0,  0],
      [ 0,  1],
      [ 1,  0]],
     [[-1,  0],
      [ 0, -1],
      [-1, -1]]], dtype=np.int8)

N = da.empty(n_alleles, dtype=np.uint8)
G = da.asarray(genotypes).rechunk((1,1,1))  # unlikely chunking

(All of the options below work correctly if G is not chunked in dimension 2, the ploidy dimension.)

Option 1: calling count_alleles directly:

count_alleles(G, N).compute()

Results in error:

ValueError: Core dimension `'k'` consists of multiple chunks. To fix, rechunk into a single chunk along this dimension or set `allow_rechunk=True`, but beware that this may increase memory usage significantly.

I didn't actually explore the use of allow_rechunk=True, but even if it achieves the same result as the approach below, I prefer the more explicit use of map_blocks.
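For completeness, that option would presumably look like the following (an untested sketch, reusing G, N, and count_alleles from the setup above):

da.apply_gufunc(count_alleles, "(k),(n)->(n)", G, N, allow_rechunk=True).compute()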

Option 2: naive use of map_blocks (my first attempt):

shape = (G.chunks[0], G.chunks[1], n_alleles)
da.map_blocks(count_alleles, G, N, chunks=shape).compute()

Results in incorrect array dimensions:

array([[[1, 0, 1, 0],
        [1, 0, 0, 1],
        [0, 1, 1, 0]],

       [[0, 0, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]], dtype=uint8)

Option 3: using the map_blocks kwargs drop_axis and new_axis to "forget" the ploidy dimension's shape:

shape = (G.chunks[0], G.chunks[1], n_alleles)
da.map_blocks(count_alleles, G, N, chunks=shape, drop_axis=2, new_axis=2).compute()

Result:

array([[[2, 0],
        [1, 1],
        [1, 1]],

       [[1, 0],
        [1, 0],
        [0, 0]]], dtype=uint8)

@timothymillar (Collaborator, Author)

Does it go away if you do from numba import guvectorize and use the decorator that way? I'm not sure if the issue is with that or the numba.int8[:] signature annotations. Perhaps it will work with strings like int8[:] instead?

One of those seems to have fixed it, thanks for the help!

nopython=True,
)
def count_alleles(g: ArrayLike, _: ArrayLike, out: ArrayLike) -> None:
"""Generaliszed U-function for computing per sample allele counts.
Collaborator:

nit: spelling

Collaborator (Author):

Thanks, not my strong point.

>>> import sgkit as sg
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> ds = simulate_genotype_call_dataset(n_variant=4, n_sample=2, seed=1)
>>> ds['call_genotype'].to_series().unstack().astype(str).apply('/'.join, axis=1).unstack() # doctest: +NORMALIZE_WHITESPACE
@eric-czech (Collaborator) commented Aug 19, 2020:

This is a good place for @tomwhite's code in https://github.com/pystatgen/sgkit/pull/58 now, fyi.

@eric-czech (Collaborator)

This is good to go as far as I'm concerned. @tomwhite / @ravwojdyla / @jeromekelleher, could one of you take a look as well? Two or more approvals seem appropriate for this one.

@timothymillar (Collaborator, Author) commented Aug 24, 2020

Just for reference, the following is a straightforward implementation using only xarray and dask, but its performance is much worse than the gufunc version:

import dask.array as da

def count_call_alleles(ds):
    G = ds["call_genotype"]
    n_allele = ds.dims["alleles"]
    G = G.expand_dims(dim="alleles", axis=-1)
    I = da.arange(n_allele, dtype='int8')
    A = G == I          # one-hot encoding of alleles (missing -1 never matches)
    AC = A.sum(axis=-2) # sum over the ploidy dimension
    return AC
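The broadcasting trick is easiest to see with plain numpy (a small illustrative example, not code from this PR):

import numpy as np

g = np.array([0, 1, -1], dtype=np.int8)  # one genotype call (ploidy 3, one allele missing)
alleles = np.arange(2, dtype=np.int8)    # allele indices [0, 1]
onehot = g[:, None] == alleles           # (ploidy, alleles) boolean one-hot encoding
onehot.sum(axis=0)                       # -> array([1, 1]); the -1 sentinel never matches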

@jeromekelleher (Collaborator) left a comment:

LGTM too, although I don't understand the details of how this is interacting with numba.

@tomwhite, @ravwojdyla I think we should have one more vote here.

@tomwhite (Collaborator) left a comment:

+1 this looks great @timothymillar.

@eric-czech (Collaborator)

Just for reference, the following is a straightforward implementation using only xarray and dask

Clever! I bet that's still much faster than da.apply_along_axis(np.bincount, axis=2).
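That alternative might look roughly like this (a sketch, assuming G and n_alleles as in the earlier example; note np.bincount rejects negative values, so the -1 sentinel has to be filtered out per call):

import numpy as np
import dask.array as da

def _bincount_call(g):
    # drop the missing sentinel, then pad the counts out to n_alleles
    return np.bincount(g[g >= 0], minlength=n_alleles)

AC = da.apply_along_axis(_bincount_call, 2, G, dtype=np.uint8, shape=(n_alleles,))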

Thanks again for picking this up @timothymillar, nicely done. I'll set it to merge.

@eric-czech added the auto-merge label Aug 24, 2020
@ravwojdyla (Collaborator)

@Mergifyio refresh

@mergify bot (Contributor) commented Aug 24, 2020

Command refresh: success
