Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stats submodule with compute_completeness() function #118

Merged
merged 14 commits into from
Mar 28, 2022
Merged

Conversation

JBGreisman
Copy link
Member

@JBGreisman JBGreisman commented Dec 9, 2021

This PR adds an initial pass at rs.stats.compute_completeness(). This function takes a DataSet as input, and returns the completeness by resolution bins. It is written to work for both merged and unmerged DataSet objects. It also takes an anomalous flag which can be used to compute the anomalous completeness.


docstring:

In [1]: rs.stats.compute_completeness?
Signature: rs.stats.compute_completeness(dataset, bins=10, anomalous=False, dmin=None)
Docstring:
Compute completeness of DataSet by resolution bin.

Parameters
----------
dataset : rs.DataSet
    DataSet object to be analyzed
bins : int
    Number of resolution bins to use
anomalous : bool
    Whether to compute the anomalous completeness
dmin : float
    Resolution cutoff to use. If no dmin is supplied, the maximum resolution
    reflection will be used

Returns
-------
rs.DataSet
    DataSet object summarizing the completeness by resolution bin
File:      ~/Documents/Hekstra_Lab/github/reciprocalspaceship/reciprocalspaceship/stats/completeness.py
Type:      function

Example:

In [2]: rs.stats.compute_completeness(dataset, dmin=1.5)
Out[2]: 
              completeness
49.47 - 3.31      0.978490
3.31 - 2.60       1.000000
2.60 - 2.26       1.000000
2.26 - 2.05       0.981612
2.05 - 1.90       0.996479
1.90 - 1.79       0.996871
1.79 - 1.69       0.994523
1.69 - 1.62       0.993740
1.62 - 1.55       0.989437
1.55 - 1.50       0.992178
overall           0.992333

To do:

  • Write tests
  • Add stats section to the documentation

@JBGreisman JBGreisman added API Interface Decisions enhancement Improvement to existing feature labels Dec 9, 2021
@codecov-commenter
Copy link

codecov-commenter commented Dec 9, 2021

Codecov Report

Merging #118 (fca6e52) into main (ae0e2f5) will decrease coverage by 0.19%.
The diff coverage is 96.12%.

❗ Current head fca6e52 differs from pull request most recent head 1f1b3fc. Consider uploading reports for the commit 1f1b3fc to get more accurate results

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
- Coverage   98.67%   98.48%   -0.20%     
==========================================
  Files          41       43       +2     
  Lines        1590     1712     +122     
==========================================
+ Hits         1569     1686     +117     
- Misses         21       26       +5     
Flag Coverage Δ
unittests 98.48% <96.12%> (-0.20%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
reciprocalspaceship/dtypes/base.py 95.49% <50.00%> (-0.84%) ⬇️
reciprocalspaceship/stats/__init__.py 75.00% <75.00%> (ø)
reciprocalspaceship/utils/binning.py 90.62% <87.50%> (-9.38%) ⬇️
reciprocalspaceship/__init__.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/anomalousdifference.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/batch.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/hklindex.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/intensity.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/m_isym.py 100.00% <100.00%> (ø)
reciprocalspaceship/dtypes/mtzint.py 100.00% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae0e2f5...1f1b3fc. Read the comment docs.

@JBGreisman JBGreisman linked an issue Dec 10, 2021 that may be closed by this pull request
@JBGreisman
Copy link
Member Author

For unmerged data, this function currently works whether the reflections are provided in or out of the reciprocal ASU.

For merged reflections, computing anomalous completeness has the added difficulty of how to handle "filler" values for unobserved reflections. Although programs often use NaNs for fully unobserved reflections (so dropna() can be used to remove those extras), phenix and aimless use different conventions for cases where only one copy of a Friedel pair was observed. Specifically, phenix fills them as NaNs and Aimless just puts 0.0 for all values. I can't quite think of a general way to handle this discrepancy at the moment to make this function "just work" in all cases.

@JBGreisman
Copy link
Member Author

Just to follow up on this, I think that the best option here to resolve the discrepancies in "filler values" across programs will be to add an argument to the call signature called filler_value=np.nan or unobserved_value=np.nan.

This can then be used to allow the user to achieve the desired behavior. I'll also update the docstring to be a bit more explicit about the behavior with: 1) unmerged data 2) merged data; 1-col anomalous and 3) merged data; 2-col anomalous.

I also think I will expand this method to support all forms of completeness that are reported by aimless. Namely:

  • Completeness (all): completeness of data before merging Friedel pairs (all reflections in +/- ASU)
  • Completeness (non-anomalous): completeness after merging Friedel pairs (all reflections in +ASU)
  • Completeness (anomalous): completeness of the anomalous data. Only accounts for acentric Bijvoet mates measured in both +/- ASU

@JBGreisman
Copy link
Member Author

This is ready to go in my mind:

> rs.mtzdump tests/data/algorithms/HEWL_SSAD_24IDC.mtz --embed

In [1]: mtz.sample(10)
Out[1]: 
          FreeR_flag     IMEAN  SIGIMEAN      I(+)    SIGI(+)      I(-)   SIGI(-)  N(+)  N(-)
H  K  L                                                                                      
33 1  6            5 11.798488 0.6114557 13.792131 0.85277236  9.688977  0.877203    48    40
44 10 4           12 90.181526  5.098182 90.181526   5.098182       0.0       0.0     8     0
25 16 9            6  164.7444 2.9361315  154.9134   4.022536 175.95505  4.295529    32    28
31 12 1           11  16.75307 0.4626905 14.979586 0.59099203 19.561523 0.7437062    38    34
30 6  9           13  92.40724 1.5158632 103.99409  2.2297392  82.44996 2.0670066    48    52
10 9  12           5  313.1947  3.920187 323.86127  5.6489615  303.2858 5.4446454    60    64
4  2  16           6 580.68256 10.024454  551.0918  14.140535  610.5782 14.213181    32    32
21 14 8           10 201.05835 2.4446104 221.15997  3.5505462 182.93996 3.3708513    60    64
35 5  8           19 268.03336 5.2958283 259.67307   7.117649  278.4012  7.926307    29    24
37 2  3           16 31.599096 0.6746577 30.608599 0.94752496 32.617615 0.9608345    60    60

In [2]: rs.stats.compute_completeness(mtz, anomalous=False,  unobserved_value=0)
Out[2]: 
              completeness
             non-anomalous
56.10 - 3.91      0.997613
3.91 - 3.06       1.000000
3.06 - 2.66       0.997613
2.66 - 2.41       1.000000
2.41 - 2.23       1.000000
2.23 - 2.09       0.997615
2.09 - 1.98       0.991304
1.98 - 1.89       0.994444
1.89 - 1.82       0.994449
1.82 - 1.70       0.529089
overall           0.915936

In [3]: rs.stats.compute_completeness(mtz, anomalous=True,  unobserved_value=0)
Out[3]: 
             completeness                        
                      all non-anomalous anomalous
56.10 - 3.91     0.998574      0.997613  1.000000
3.91 - 3.06      1.000000      1.000000  1.000000
3.06 - 2.66      0.998252      0.997613  0.999030
2.66 - 2.41      0.999136      1.000000  0.999057
2.41 - 2.23      1.000000      1.000000  1.000000
2.23 - 2.09      0.996159      0.997615  0.996313
2.09 - 1.98      0.992803      0.991304  0.994531
1.98 - 1.89      0.994510      0.994444  0.995036
1.89 - 1.82      0.994503      0.994449  0.995471
1.82 - 1.70      0.492174      0.529089  0.500952
overall          0.907413      0.915936  0.906873

Copy link
Member

@kmdalton kmdalton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole the code style is good, and the tests seem adequate. I don't have too many comments, but I noted a few things that might be improved. I cannot claim that I know how robust this will be to different data sets, but it is a pretty solid start as far as I can tell.

I am going to approve, this but please consider making some of the changes.

reciprocalspaceship/stats/completeness.py Show resolved Hide resolved
reciprocalspaceship/utils/binning.py Outdated Show resolved Hide resolved
reciprocalspaceship/utils/binning.py Show resolved Hide resolved
@JBGreisman JBGreisman merged commit 7e5ca09 into main Mar 28, 2022
@JBGreisman JBGreisman deleted the completeness branch March 28, 2022 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Interface Decisions enhancement Improvement to existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a function to compute completeness
3 participants