Add `stats` submodule with `compute_completeness()` function #118

JBGreisman · 2021-12-09T22:56:48Z

This PR adds an initial pass at rs.stats.compute_completeness(). This function takes a DataSet as input, and returns the completeness by resolution bins. It is written to work for both merged and unmerged DataSet objects. It also takes an anomalous flag which can be used to compute the anomalous completeness.

docstring:

In [1]: rs.stats.compute_completeness?
Signature: rs.stats.compute_completeness(dataset, bins=10, anomalous=False, dmin=None)
Docstring:
Compute completeness of DataSet by resolution bin.

Parameters
----------
dataset : rs.DataSet
    DataSet object to be analyzed
bins : int
    Number of resolution bins to use
anomalous : bool
    Whether to compute the anomalous completeness
dmin : float
    Resolution cutoff to use. If no dmin is supplied, the maximum resolution
    reflection will be used

Returns
-------
rs.DataSet
    DataSet object summarizing the completeness by resolution bin
File:      ~/Documents/Hekstra_Lab/github/reciprocalspaceship/reciprocalspaceship/stats/completeness.py
Type:      function

Example:

In [2]: rs.stats.compute_completeness(dataset, dmin=1.5)
Out[2]: 
              completeness
49.47 - 3.31      0.978490
3.31 - 2.60       1.000000
2.60 - 2.26       1.000000
2.26 - 2.05       0.981612
2.05 - 1.90       0.996479
1.90 - 1.79       0.996871
1.79 - 1.69       0.994523
1.69 - 1.62       0.993740
1.62 - 1.55       0.989437
1.55 - 1.50       0.992178
overall           0.992333

To do:

Write tests
Add stats section to the documentation

codecov-commenter · 2021-12-09T23:06:45Z

Codecov Report

Merging #118 (fca6e52) into main (ae0e2f5) will decrease coverage by 0.19%.
The diff coverage is 96.12%.

❗ Current head fca6e52 differs from pull request most recent head 1f1b3fc. Consider uploading reports for the commit 1f1b3fc to get more accurate results

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
- Coverage   98.67%   98.48%   -0.20%     
==========================================
  Files          41       43       +2     
  Lines        1590     1712     +122     
==========================================
+ Hits         1569     1686     +117     
- Misses         21       26       +5

Flag	Coverage Δ
unittests	`98.48% <96.12%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
reciprocalspaceship/dtypes/base.py	`95.49% <50.00%> (-0.84%)`	⬇️
reciprocalspaceship/stats/__init__.py	`75.00% <75.00%> (ø)`
reciprocalspaceship/utils/binning.py	`90.62% <87.50%> (-9.38%)`	⬇️
reciprocalspaceship/__init__.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/anomalousdifference.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/batch.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/hklindex.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/intensity.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/m_isym.py	`100.00% <100.00%> (ø)`
reciprocalspaceship/dtypes/mtzint.py	`100.00% <100.00%> (ø)`
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae0e2f5...1f1b3fc. Read the comment docs.

JBGreisman · 2021-12-10T17:40:19Z

For unmerged data, this function currently works whether the reflections are provided in or out of the reciprocal ASU.

For merged reflections, computing anomalous completeness has the added difficulty of how to handle "filler" values for unobserved reflections. Although programs often use NaNs for fully unobserved reflections (so dropna() can be used to remove those extras), phenix and aimless use different conventions for cases where only one copy of a Friedel pair was observed. Specifically, phenix fills them as NaNs and Aimless just puts 0.0 for all values. I can't quite think of a general way to handle this discrepancy at the moment to make this function "just work" in all cases.

JBGreisman · 2022-02-24T15:46:11Z

Just to follow up on this, I think that the best option here to resolve the discrepancies in "filler values" across programs will be to add an argument to the call signature called filler_value=np.nan or unobserved_value=np.nan.

This can then be used to allow the user to achieve the desired behavior. I'll also update the docstring to be a bit more explicit about the behavior with: 1) unmerged data 2) merged data; 1-col anomalous and 3) merged data; 2-col anomalous.

I also think I will expand this method to support all forms of completeness that are reported by aimless. Namely:

Completeness (all): completeness of data before merging Friedel pairs (all reflections in +/- ASU)
Completeness (non-anomalous): completeness after merging Friedel pairs (all reflections in +ASU)
Completeness (anomalous): completeness of the anomalous data. Only accounts for acentric Bijvoet mates measured in both +/- ASU

…ompleteness

JBGreisman · 2022-03-27T21:22:24Z

This is ready to go in my mind:

> rs.mtzdump tests/data/algorithms/HEWL_SSAD_24IDC.mtz --embed

In [1]: mtz.sample(10)
Out[1]: 
          FreeR_flag     IMEAN  SIGIMEAN      I(+)    SIGI(+)      I(-)   SIGI(-)  N(+)  N(-)
H  K  L                                                                                      
33 1  6            5 11.798488 0.6114557 13.792131 0.85277236  9.688977  0.877203    48    40
44 10 4           12 90.181526  5.098182 90.181526   5.098182       0.0       0.0     8     0
25 16 9            6  164.7444 2.9361315  154.9134   4.022536 175.95505  4.295529    32    28
31 12 1           11  16.75307 0.4626905 14.979586 0.59099203 19.561523 0.7437062    38    34
30 6  9           13  92.40724 1.5158632 103.99409  2.2297392  82.44996 2.0670066    48    52
10 9  12           5  313.1947  3.920187 323.86127  5.6489615  303.2858 5.4446454    60    64
4  2  16           6 580.68256 10.024454  551.0918  14.140535  610.5782 14.213181    32    32
21 14 8           10 201.05835 2.4446104 221.15997  3.5505462 182.93996 3.3708513    60    64
35 5  8           19 268.03336 5.2958283 259.67307   7.117649  278.4012  7.926307    29    24
37 2  3           16 31.599096 0.6746577 30.608599 0.94752496 32.617615 0.9608345    60    60

In [2]: rs.stats.compute_completeness(mtz, anomalous=False,  unobserved_value=0)
Out[2]: 
              completeness
             non-anomalous
56.10 - 3.91      0.997613
3.91 - 3.06       1.000000
3.06 - 2.66       0.997613
2.66 - 2.41       1.000000
2.41 - 2.23       1.000000
2.23 - 2.09       0.997615
2.09 - 1.98       0.991304
1.98 - 1.89       0.994444
1.89 - 1.82       0.994449
1.82 - 1.70       0.529089
overall           0.915936

In [3]: rs.stats.compute_completeness(mtz, anomalous=True,  unobserved_value=0)
Out[3]: 
             completeness                        
                      all non-anomalous anomalous
56.10 - 3.91     0.998574      0.997613  1.000000
3.91 - 3.06      1.000000      1.000000  1.000000
3.06 - 2.66      0.998252      0.997613  0.999030
2.66 - 2.41      0.999136      1.000000  0.999057
2.41 - 2.23      1.000000      1.000000  1.000000
2.23 - 2.09      0.996159      0.997615  0.996313
2.09 - 1.98      0.992803      0.991304  0.994531
1.98 - 1.89      0.994510      0.994444  0.995036
1.89 - 1.82      0.994503      0.994449  0.995471
1.82 - 1.70      0.492174      0.529089  0.500952
overall          0.907413      0.915936  0.906873

kmdalton

On the whole the code style is good, and the tests seem adequate. I don't have too many comments, but I noted a few things that might be improved. I cannot claim that I know how robust this will be to different data sets, but it is a pretty solid start as far as I can tell.

I am going to approve, this but please consider making some of the changes.

reciprocalspaceship/stats/completeness.py

reciprocalspaceship/utils/binning.py

…erved_value` description

Fixes #59: Add stats submodule with compute_completeness() function

1f77d71

JBGreisman added API Interface Decisions enhancement Improvement to existing feature labels Dec 9, 2021

JBGreisman added 3 commits December 9, 2021 18:39

Add rs.stats.compute_completeness() to API reference

7e0783c

Add generic test of arguments with merged and unmerged HEWL

45281e3

Add test that all values are <= 1.0

a35d284

JBGreisman linked an issue Dec 10, 2021 that may be closed by this pull request

Add a function to compute completeness #59

Closed

JBGreisman added 2 commits December 9, 2021 19:14

Actually use the anomalous flag that is passed in the tests...

d3be059

Add additional tests for compute_completeness()

560b576

JBGreisman added 7 commits March 24, 2022 18:04

Merge branch 'main' into completeness

475abfc

Add MTZDtype.is_friedel_dtype() convenience function

42448e3

Add convenience function for assigning bins from bin edges

a0813f6

Update rs.stats.compute_completeness() to report different types of c…

927f681

…ompleteness

Add assign_with_binedges() to rs.utils namespace and add tests

784f8b1

Update compute_completeness() and tests

fca6e52

[skip ci] Update docs for rs.stats.compute_completeness()

872e8c0

kmdalton approved these changes Mar 28, 2022

View reviewed changes

reciprocalspaceship/stats/completeness.py Show resolved Hide resolved

reciprocalspaceship/utils/binning.py Outdated Show resolved Hide resolved

reciprocalspaceship/utils/binning.py Show resolved Hide resolved

Add return types, rename right to right_inclusive, and fix `unobs…

1f1b3fc

…erved_value` description

JBGreisman merged commit 7e5ca09 into main Mar 28, 2022

JBGreisman deleted the completeness branch March 28, 2022 20:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `stats` submodule with `compute_completeness()` function #118

Add `stats` submodule with `compute_completeness()` function #118

JBGreisman commented Dec 9, 2021 •

edited

Loading

codecov-commenter commented Dec 9, 2021 •

edited

Loading

JBGreisman commented Dec 10, 2021

JBGreisman commented Feb 24, 2022

JBGreisman commented Mar 27, 2022

kmdalton left a comment

Add stats submodule with compute_completeness() function #118

Add stats submodule with compute_completeness() function #118

Conversation

JBGreisman commented Dec 9, 2021 • edited Loading

codecov-commenter commented Dec 9, 2021 • edited Loading

Codecov Report

JBGreisman commented Dec 10, 2021

JBGreisman commented Feb 24, 2022

JBGreisman commented Mar 27, 2022

kmdalton left a comment

Choose a reason for hiding this comment

Add `stats` submodule with `compute_completeness()` function #118

Add `stats` submodule with `compute_completeness()` function #118

JBGreisman commented Dec 9, 2021 •

edited

Loading

codecov-commenter commented Dec 9, 2021 •

edited

Loading