Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create restapi endpoint for counting full_types #4277

Merged
merged 2 commits into from
Nov 13, 2020

Conversation

ramirezfranciscof
Copy link
Member

@ramirezfranciscof ramirezfranciscof commented Jul 21, 2020

Fixes #4247
Fixes #4541

Work in progress: currently only the end leaves are counted, and not the container branches.

  • Changes in aiida/restapi/api.py and aiida/restapi/common/utils.py only add the new endpoint to relevant lists, and aiida/restapi/resources.py redirects the request to the new endpoint to an internal function.
  • The internal function (get_namespace) resides in aiida/restapi/translator/nodes/node.py, and is the same as the one requested for the full_types endpoint, except it has been adapted with an optional keyword to count the nodes. This in turn is passed to get_node_namespace inside of aiida/restapi/common/identifiers.py.
  • Most modifications where done in aiida/restapi/common/identifiers.py, where the namespace class is defined. The initialization was adapted to include a counter, which receives an None value unless the option to count nodes was selected when calling get_node_namespace. This value gets returned when the get_description method is called (when requesting the endpoint).

@ramirezfranciscof ramirezfranciscof marked this pull request as draft July 21, 2020 15:46
@codecov
Copy link

codecov bot commented Jul 21, 2020

Codecov Report

Merging #4277 (b29e955) into develop (810dc56) will increase coverage by 0.08%.
The diff coverage is 89.29%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4277      +/-   ##
===========================================
+ Coverage    79.41%   79.48%   +0.08%     
===========================================
  Files          482      482              
  Lines        35287    35326      +39     
===========================================
+ Hits         28020    28076      +56     
+ Misses        7267     7250      -17     
Flag Coverage Δ
django 73.65% <89.29%> (+0.08%) ⬆️
sqlalchemy 72.82% <89.29%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/restapi/api.py 76.09% <ø> (ø)
aiida/restapi/common/config.py 100.00% <ø> (ø)
aiida/restapi/common/utils.py 74.08% <ø> (ø)
aiida/restapi/resources.py 98.09% <77.78%> (+1.54%) ⬆️
aiida/restapi/common/identifiers.py 78.69% <91.12%> (+7.17%) ⬆️
aiida/restapi/translator/nodes/node.py 84.92% <100.00%> (+0.71%) ⬆️
aiida/transports/plugins/local.py 81.54% <0.00%> (+0.26%) ⬆️
aiida/orm/nodes/data/cif.py 86.92% <0.00%> (+0.27%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 810dc56...b29e955. Read the comment docs.

@ramirezfranciscof
Copy link
Member Author

For reference: I just tried this with the 2D materials database and the results seem to indicate that the time of the full_types entrypoint shouldn't have changed much, whereas the time of the full_types_count should be at most ten times slower. Here are a few plots of 100 calls to the get_namespace method (for the develop branch, this PR with count=False and this PR with count=True respectively):

results_develop
develop branch

results_false
PR with count=False

results_true
PR with count=True

The numbers remained mostly the same for django or sqlalchemy backends. I'll see if I can try some bigger database.

@ramirezfranciscof ramirezfranciscof marked this pull request as ready for review August 9, 2020 15:31
@CasperWA CasperWA added the pr/work-in-progress PR that is still work in progress but already needs discussion label Aug 26, 2020
@ramirezfranciscof
Copy link
Member Author

ramirezfranciscof commented Sep 7, 2020

Ok, I was checking on the status of this and noticed the work in progress (WIP) tag set by @CasperWA . I'm not sure exactly what would be interpreted as a WIP but just in case let me clarify the current situation:

  1. Re testing it with the materials cloud: this would be the next step, I already asked @vgranata if she could do it. Also, for valeria: in case you saw the WIP tag and thought I was still working on this, maybe also because of my last comment (see 3), this is not the case, you have green light to do the test.
  2. Re code reviewing: I ideally wanted to check 1 before doing a proper code review. If that is the meaning of the WIP (the website check being the "work"), then I guess its ok to keep it.
  3. Re my last comment: in the end I wasn't still able to get the bigger database (I am getting a "not enough space" error when trying to import it) so that will likely not happen. Anyways I would thing that the timings with the middle sized one are enough to go on. So this should not be the reason for the WIP tag.

@vgranata
Copy link

@ramirezfranciscof Thank you Francisco for clarifying this.
@asle85 Elsa offered to look and test this PR.

@elsapassaro
Copy link
Contributor

I've tested the new endpoint on the AiiDA databases used in Materials Cloud: it's very fast for small databases, and for the biggest ones (until ~1.4 million nodes) it takes around 3 seconds.

@giovannipizzi which level of specification do we want to show in the statistics pie chart? Is it ok to have the namespaces (or labels) of the leaves, i.e. namespaces that do not have subspaces? Does it make sense to include this information in the nodes/statistics endpoint?

@giovannipizzi
Copy link
Member

Thanks! I think it's fast enough. @asle85 - I don't know, I think we need to check together a couple of examples to decide. If this PR anyway gives enough info to create any piechart, I suggest to move on and merge it, and we can discuss what to show in the pie chart in an issue on the issue tracker of one of the Materials Cloud repositories.

@ramirezfranciscof ramirezfranciscof removed the pr/work-in-progress PR that is still work in progress but already needs discussion label Sep 30, 2020
@ramirezfranciscof
Copy link
Member Author

Ok, good; I just need now someone to review the code and I'll merge. I requested to @sphuber , but he may be busy with other stuff, so feel free to request to another person or even do it yourself if you have the time @giovannipizzi .

@ramirezfranciscof
Copy link
Member Author

ramirezfranciscof commented Oct 7, 2020

@flavianojs I cannot add you as reviewer not sure why (maybe you need to be part of the aiidateam and you are not?), but yeah, as we told you feel free to take a look and comment.

@chrisjsewell
Copy link
Member

I'll look through this tomorrow @ramirezfranciscof, pester me if you don't hear anything lol

@ramirezfranciscof
Copy link
Member Author

Hey @chrisjsewell : friendly reminder if you can check this.

@chrisjsewell
Copy link
Member

chrisjsewell commented Oct 12, 2020

Yep it generally looks fine to me 👍
However, given that this is a PR on counting, I think the test should actually make sure that the count is correct 😉

"Ideally", I'd say re-write the test file as pytest (using the fixtures), then just use pytest-regressions to validate the whole JSON blob, since its all deterministic. But that's a lot of hassle!

But, at the very least, check that the top-level counter is 9

    data = {
        'counter':
        9,
        'full_type':
        'node.%|',
        'label':
        'node',
        'namespace':
        'node',
        'path':
        'node',
        'subspaces': [{
            'counter':
            6,
            'full_type':
            'data.%|',
            'label':
            'Data',
            'namespace':
            'data',
            'path':
            'node.data',
            'subspaces': [{
                'counter':
                1,
                'full_type':
                'data.array.%|',
                'label':
                'array',
                'namespace':
                'array',
                'path':
                'node.data.array',
                'subspaces': [{
                    'counter': 1,
                    'full_type': 'data.array.kpoints.KpointsData.|',
                    'label': 'KpointsData',
                    'namespace': 'kpoints',
                    'path': 'node.data.array.kpoints',
                    'subspaces': []
                }]
            }, {
                'counter': 1,
                'full_type': 'data.cif.CifData.|',
                'label': 'CifData',
                'namespace': 'cif',
                'path': 'node.data.cif',
                'subspaces': []
            }, {
                'counter': 2,
                'full_type': 'data.dict.Dict.|',
                'label': 'Dict',
                'namespace': 'dict',
                'path': 'node.data.dict',
                'subspaces': []
            }, {
                'counter': 1,
                'full_type': 'data.folder.FolderData.|',
                'label': 'FolderData',
                'namespace': 'folder',
                'path': 'node.data.folder',
                'subspaces': []
            }, {
                'counter': 1,
                'full_type': 'data.structure.StructureData.|',
                'label': 'StructureData',
                'namespace': 'structure',
                'path': 'node.data.structure',
                'subspaces': []
            }]
        }, {
            'counter':
            3,
            'full_type':
            'process.%|%',
            'label':
            'Process',
            'namespace':
            'process',
            'path':
            'node.process',
            'subspaces': [{
                'counter':
                3,
                'full_type':
                'process.calculation.%|%',
                'label':
                'Calculation',
                'namespace':
                'calculation',
                'path':
                'node.process.calculation',
                'subspaces': [{
                    'counter': 1,
                    'full_type': 'process.calculation.calcfunction.CalcFunctionNode.|',
                    'label': 'CalcFunctionNode',
                    'namespace': 'calcfunction',
                    'path': 'node.process.calculation.calcfunction',
                    'subspaces': []
                }, {
                    'counter': 2,
                    'full_type': 'process.calculation.calcjob.CalcJobNode.|',
                    'label': 'CalcJobNode',
                    'namespace': 'calcjob',
                    'path': 'node.process.calculation.calcjob',
                    'subspaces': []
                }]
            }]
        }]
    }

if process_type == '':
builder.append(orm.Data, filters={'node_type': {'==': node_type}})
else:
builder.append(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should also really be a test for this scenario

Copy link
Member

@chrisjsewell chrisjsewell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment

@elsapassaro
Copy link
Contributor

@ramirezfranciscof Since we are adding a new endpoint, it makes sense to release a new minor version of the REST API, i.e change the version to 4.1.0 in the config. In this way we can detect from the version number if the endpoint will be there or not.

@ramirezfranciscof ramirezfranciscof force-pushed the issue4247 branch 2 times, most recently from da9dcc1 to cb78300 Compare November 6, 2020 16:00
@ramirezfranciscof ramirezfranciscof force-pushed the issue4247 branch 4 times, most recently from c81604d to 0899e0e Compare November 11, 2020 16:41
@sphuber
Copy link
Contributor

sphuber commented Nov 11, 2020

Thanks for the rebase @ramirezfranciscof that helps a lot. I will start going through it now. First question that I have now straight away: can you give me examples of the process nodes that have an empty process type string and those that have None (apparently)? I am not even sure how that one happens. Is there then null in the database column?

@ramirezfranciscof
Copy link
Member Author

ramirezfranciscof commented Nov 11, 2020

There should be calcfunctions with process_type=None in this database https://dev-aiida-dev.materialscloud.org/curated-cofs/api/v4/ andprocess_type='' in this other one https://dev-aiida-dev.materialscloud.org/2dtopo/api/v4/. The second database can be downloaded from here https://archive.materialscloud.org/record/2020.86, I don't know to which archive it corresponds the first one though, @asle85 ?

EDIT: I think it might be this one for the None nodes https://archive.materialscloud.org/record/2020.133

Comment on lines 118 to 125
filters['process_type'] = {'==': process_type}
else:
filters['process_type'] = {'or': [{'==': ''}, {'==': None}]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are going to account for process types that are both an empty string or None, then we might as well get rid of the conditional entirely and just always set filters['process_type'] = {'==': process_type}. This single line will have exactly the same effect as the current four lines. We should add a comment that currently (probably due to a bug in the migrations) the process type can be empty or None at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I don't think it would have the same effect because currently if process_type is either '' or None, it will search for both '' and None. Your change would make None just search for None and '' just search for ''.

The reason I grouped these together is that this function takes the full_type string (that looks like node.type.Descriptor|process.type.descriptor) and splits it into the node_type and the process_type. It is possible to pass an empty process_type in this way (node.type.Descriptor|) but it is not possible to do so with a null process type (as there is no simple way to distinguish between ' None' string and None in node.type.Descriptor|None).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you are saying that the full types will look like node.type|None both when process_type is '' as well as None? In that case, you are right that what I proposed is not the same. I am still worried about all these edge cases that really shouldn't exist and we should figure out how and why they came about. At the very least, you should add a comment here to explain that your additional clause is because there (erroneously) exist process nodes with a process_type equal to empty string or null

Copy link
Member Author

@ramirezfranciscof ramirezfranciscof Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you are saying that the full types will look like node.type|None both when process_type is '' as well as None?

Nono, it will show node.type| both when it is '' and when it is None. I mentioned the node.type|None example to illustrate that it is not easy to separate the None case from the '' case because you need to encode this into a string (the full_type) and therefore you would then also need to distinguish between the null None and the string 'None'.

I am still worried about all these edge cases that really shouldn't exist and we should figure out how and why they came about. At the very least, you should add a comment here to explain that your additional clause is because there (erroneously) exist process nodes with a process_type equal to empty string or null

Ok, I can add that, and we could open an issue. I mean, it might be the case that this was a bug in some migrations that already got fixed, but nevertheless these nodes will exist in databases so I think it is not bad to be able to catch these in the most securely way possible.

aiida/restapi/common/identifiers.py Outdated Show resolved Hide resolved
aiida/restapi/common/identifiers.py Outdated Show resolved Hide resolved
tests/restapi/test_identifiers.py Outdated Show resolved Hide resolved
tests/restapi/test_identifiers.py Outdated Show resolved Hide resolved
tests/restapi/test_identifiers.py Outdated Show resolved Hide resolved
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ramirezfranciscof . I have some suggestions and questions. I already reviewed the first commit in a different tab, but those don't show up in this review. I hope they won't get lost and show up...

aiida/restapi/common/identifiers.py Outdated Show resolved Hide resolved
aiida/restapi/resources.py Show resolved Hide resolved
aiida/restapi/resources.py Outdated Show resolved Hide resolved
aiida/restapi/resources.py Outdated Show resolved Hide resolved
tests/restapi/test_statistics.py Outdated Show resolved Hide resolved
tests/restapi/conftest.py Outdated Show resolved Hide resolved
tests/restapi/conftest.py Outdated Show resolved Hide resolved
@ramirezfranciscof ramirezfranciscof force-pushed the issue4247 branch 4 times, most recently from 841b511 to dc77945 Compare November 12, 2020 10:57
@CasperWA
Copy link
Contributor

CasperWA commented Nov 12, 2020

Hey, just want to say I have some corrections that should go into these files as well. I have implemented them in the #4337 PR, but I haven't had time to update it yet, so just highlighting some things that may be interesting to implement here:

https://github.com/aiidateam/aiida-core/pull/4337/files#diff-4bb11213826a021effbfd69c334387fd7c536eb5579f1f3a6bb0bd372770f671R72

Ah - and I see now I am missing uploading the stuff that touches on generating the node and process types. Since it's on my home desktop computer and I'm currently in the office, I'll have to push that tonight, if I remember. It just simplifies the creation of the full_type field value, ensuring it doesn't fail if any of the parts should be None or empty strings.

Edit: If this information is completely out-of-scope of this PR, then please disregard it 😅

@ramirezfranciscof
Copy link
Member Author

Ah - and I see now I am missing uploading the stuff that touches on generating the node and process types. Since it's on my home desktop computer and I'm currently in the office, I'll have to push that tonight, if I remember. It just simplifies the creation of the full_type field value, ensuring it doesn't fail if any of the parts should be None or empty strings.

Edit: If this information is completely out-of-scope of this PR, then please disregard it 😅

@CasperWA haha, well yes it is in scope, this is one of the changes I implemented here. Please do try to push the changes asap and ping me when you do; we were trying to get this PR merge for tomorrow's release. Do you remember what changes you made related to this that you could share now?

@CasperWA
Copy link
Contributor

Ah - and I see now I am missing uploading the stuff that touches on generating the node and process types. Since it's on my home desktop computer and I'm currently in the office, I'll have to push that tonight, if I remember. It just simplifies the creation of the full_type field value, ensuring it doesn't fail if any of the parts should be None or empty strings.
Edit: If this information is completely out-of-scope of this PR, then please disregard it sweat_smile

@CasperWA haha, well yes it is in scope, this is one of the changes I implemented here. Please do try to push the changes asap and ping me when you do; we were trying to get this PR merge for tomorrow's release. Do you remember what changes you made related to this that you could share now?

I believe I simplified this function to a single f-string:

def construct_full_type(node_type, process_type):
"""Return the full type, which uniquely identifies any `Node` with the given `node_type` and `process_type`.
:param node_type: the `node_type` of the `Node`
:param process_type: the `process_type` of the `Node`
:return: the full type, which is a unique identifier
"""
if node_type is None:
process_type = ''
if process_type is None:
process_type = ''
return f'{node_type}{FULL_TYPE_CONCATENATOR}{process_type}'

Remove the try-except here:

try:
node_entry['full_type'] = construct_full_type(node_entry['node_type'], node_entry['process_type'])
except KeyError:
node_entry['full_type'] = None

Using instead the .get method.

The latter change here relies on what full_type's value should be if any of node_type or process_type are None or empty strings. If the desired result is that the value should be None/null, then the construct_full_type function should reflect this, not the latter part, where the function is used.

Maybe you've already fixed this (didn't check, sorry)? But that's at least what I can remember for now :) I'll have to push what I have locally tonight otherwise.

@ramirezfranciscof
Copy link
Member Author

@CasperWA Ok, thanks, I see you've updated your branch. I think we are both dealing with the problem of atipical process_types but in different parts of the restAPI code, so our changes seem to be complementary. I think my modifications shouldn't affect yours, but you can rebase once we merge this and let me know if there is any problem.

@sphuber @chrisjsewell ready for a new pass.

@CasperWA
Copy link
Contributor

@CasperWA Ok, thanks, I see you've updated your branch. I think we are both dealing with the problem of atipical process_types but in different parts of the restAPI code, so our changes seem to be complementary. I think my modifications shouldn't affect yours, but you can rebase once we merge this and let me know if there is any problem.

@ramirezfranciscof I was more thinking of moving some of the changes over to this PR, not if the changes in each of the PRs are compatible with each other.
The changes I have made were simply fixes to some issues I experience, i.e., they can be implemented in this PR as well if that was your wish. Especially because I saw you were touching similar aspects of the REST API code.

But I'll leave it up to you and just rebase whenever, if it comes to that 👍

@ramirezfranciscof
Copy link
Member Author

ramirezfranciscof commented Nov 13, 2020

@CasperWA Yeah, we can also do that. I can check with @sphuber if he thinks its better to do that: this one has to go today, so it will depend on how much work would be to incorporate both together. For starters, we might need to rebase your branch over the first commit of this one (title "Further considerations for None or empty types"?), check that all tests work, and then I would rebase over yours. If you want maybe you can test the first rebase in a separate branch to verify that it does it automatically and all tests work and confirm here.

@chrisjsewell would you be able to give this a final pass before our meeting at 15?

@CasperWA
Copy link
Contributor

@CasperWA Yeah, we can also do that. I can check with @sphuber if he thinks its better to do that: this one has to go today, so it will depend on how much work would be to incorporate both together. For starters, we might need to rebase your branch over the first commit of this one (title "Further considerations for None or empty types"?), check that all tests work, and then I would rebase over yours. If you want maybe you can test the first rebase in a separate branch to verify that it does it automatically and all tests work and confirm here.

I think you misunderstand me. My PR should not be incorporated into this PR. But I implemented some minor fixes in my PR that may be relevant to get into this one. But it seems it will be too difficult, so let's keep it separate and I will update my PR accordingly.

The `process_type` attribute has changed over the years; currently it
must have some descriptor for processes and be None for data types.
Apparently this has not only been the case, and thus old databases may
have both data and process nodes with either empty strings ('') and/or
None entries in their `process_type` attributes.

Additionally, there were some problems with how the unregistered entry
points were considered that made it impossible to query for them.

In order to consider all of this when filtering and doing statistics,
it has been decided to:

1) Group all instances of a given node_type that have either '' or None
as their process_type in the same `full_type` (`node_type|` ) and hence
always query for both when the `process_type` is missing.

2) Remove the `aiida.descriptor:` and the `no-entry-point` from the
`process_type` part of unregistered processes. This was interfeering
when the `full_type` was given to return the filtering options to query
for these processes.

Tests were adapted to test this new compatibility aspects.
This feature returns a namespace tree of the available node types in the
database (data node_types + process process_types) with the addition of
a count at each leaf / branch. It also has the option of doing so for a
single user, if the pk is provided as an option.
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @ramirezfranciscof

@sphuber sphuber dismissed chrisjsewell’s stale review November 13, 2020 14:33

Changes addressed

@sphuber sphuber merged commit 62ed643 into aiidateam:develop Nov 13, 2020
@sphuber sphuber deleted the issue4247 branch November 13, 2020 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants