Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Validate dependencies.yaml using jsonschema #29

Merged
merged 20 commits into from
Feb 8, 2023

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Nov 28, 2022

This PR enables validating the contents of a dependencies.yaml file directly without doing any processing. The schema is encoded using JSON Schema and validated using the Python implementation. The new Python code is fairly minimal, and it would be even shorter except that I leveraged the object-oriented API to show all errors in a file instead of simply showing the first error using jsonschema.validate. The majority of the new lines are from the schema definition. The validation is injected into the normal CLI usage so that schemas are always validated before dependency files are generated, ensuring that developers see useful errors about why their dependencies.yaml file is invalid rather than opaque runtime errors when dfg fails to use the file.

@vyasr
Copy link
Contributor Author

vyasr commented Nov 28, 2022

As an example, applying this patch:

--- a/tests/examples/no-specific-match/dependencies.yaml
+++ b/tests/examples/no-specific-match/dependencies.yaml
@@ -1,14 +1,11 @@
 files:
   all:
-    output: conda
     requirements_dir: output/actual
     matrix:
       cuda: ["11.8"]
     includes:
       - cudatoolkit
-channels:
-  - rapidsai
-  - conda-forge
+channels: 1234
 dependencies:
   cudatoolkit:
     specific:

and rerunning tests results in

------------------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------------------
Error #1:
        'output' is a required property

        Failed validating 'required' in schema['properties']['files']['patternProperties']['.*']:
            {'properties': {'conda_dir': {'type': 'string'},
                            'includes': {'items': {'type': 'string'},
                                         'type': 'array'},
                            'matrix': {'type': 'object'},
                            'output': {'if': {'type': 'array'},
                                       'then': {'items': {'type': 'string'}},
                                       'type': ['string', 'array']},
                            'requirements_dir': {'type': 'string'}},
             'required': ['output', 'includes'],
             'type': 'object'}

        On instance['files']['all']:
            {'includes': ['cudatoolkit'],
             'matrix': {'cuda': ['11.8']},
             'requirements_dir': 'output/actual'}
Error #2:
        1234 is not of type 'array', 'string'

        Failed validating 'type' in schema['properties']['channels']:
            {'if': {'type': 'array'},
             'then': {'items': {'type': 'string'}},
             'type': ['array', 'string']}

        On instance['channels']:
            1234
========================================================================================= short test summary info =========================================================================================
FAILED tests/test_examples.py::test_error_examples[no-specific-match] - RuntimeError: The provided dependencies data is invalid.
====================================================================================== 1 failed, 13 passed in 0.53s =======================================================================================

@vyasr vyasr changed the title Validate dependencies.yaml using jsonschema feat: Validate dependencies.yaml using jsonschema Nov 28, 2022
@vyasr
Copy link
Contributor Author

vyasr commented Dec 13, 2022

Just making some notes of my planned next steps here to help reviewers understand where I was going with this and identify any blind spots I may have:

  • I will move the schema out of the Python file into a JSON file so that it can be maintained separately.
  • I would like to implement some form of versioning for the schema. The easiest would be to simply tie its version to the package version.
  • I would like to consider removing support for values that are either strings or arrays of strings. I'm not convinced of the value-add given that it leads to undesirable complexity in parsing (all the _ensure_list nonsense that we have to make sure to do, which is especially problematic because without that you could silently allow unexpected behavior because the in operator in Python will do a substring search when the right operand is a string rather than a collection).

Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the inline-comments, some general recommendations:

  1. We should strive for significantly increasing the composability of the schema file. Using $ref links, it is possible to define sub-schemas, such as RequirementStrings, RequirementStrings (or RequirementStringList), and Requirements which could be either one of the these two.
  2. As already mentioned, the schema file should be placed in a separate file.
  3. We need to think about how the schema is maintained and also how it is published. One of the easiest approaches is to tie it to the package version and then access it directly via the GitHub raw or download endpoint. In this way the version would be directly encoded as part of the URL. The downside of that approach is that the schema version would potentially change even if there are no changes to it. Finally, it would need to be understood that the schema is part of the API (even if the user-documentation does not directly reference it) and must therefor be backwards-compatible unless a release with breaking changes is made.
  4. We should also validate the schema itself as part of the tests for this repository.

Overall, I would recommend to refactor this schema to be significantly more modular before actually checking its correctness. I find it extremely hard to read right now and it would take me significantly more effort compared to refactoring it first and then comparing it to the current non-formalized specification as part of the documentation.

@vyasr vyasr marked this pull request as ready for review February 4, 2023 00:27
@vyasr
Copy link
Contributor Author

vyasr commented Feb 4, 2023

Thanks to a makeover from @csadorf I think this PR is ready for review. @ajschmidt8 let me know what you think of it.

One minor note, I snuck in an isort bugfix.

@vyasr vyasr requested a review from ajschmidt8 February 8, 2023 22:25
@vyasr vyasr merged commit f7e8234 into rapidsai:main Feb 8, 2023
@github-actions
Copy link

github-actions bot commented Feb 8, 2023

🎉 This PR is included in version 1.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@vyasr vyasr deleted the feat/jsonschema_validation branch February 9, 2023 20:33
vyasr added a commit to vyasr/dependency-file-generator that referenced this pull request Feb 17, 2023
This PR enables validating the contents of a dependencies.yaml file
directly without doing any processing. The schema is encoded using [JSON
Schema](https://json-schema.org/) and validated using [the Python
implementation](https://python-jsonschema.readthedocs.io/). The new
Python code is fairly minimal, and it would be even shorter except that
I leveraged the object-oriented API to show all errors in a file instead
of simply showing the first error using `jsonschema.validate`. The
majority of the new lines are from the schema definition. The validation
is injected into the normal CLI usage so that schemas are always
validated before dependency files are generated, ensuring that
developers see useful errors about why their dependencies.yaml file is
invalid rather than opaque runtime errors when dfg fails to use the
file.

---------

Co-authored-by: Simon Adorf <sadorf@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants