Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove packaging of test data and download them on demand #123

Merged
merged 6 commits into from
May 30, 2023

Conversation

ericpre
Copy link
Member

@ericpre ericpre commented May 24, 2023

I had a go at removing the packaging of the test file and downloading on demand when running the test suite.

Summary of the approach:

  • test data live in this repository, just a different folder
  • adding test data during development is similar as before
    • add the file to the git repository
    • update registry.txt by calling the rsciio.tests.registry_utils.update_registry function
  • the test data are not packaged
  • the test data are downloaded automatically if there are not locally available when running the test suite. It slow down running because:
    • it takes about 3 min to download everything because there are many small files and the overhead is significant.
    • running several worker is possible, which I suspect is a bug pytest-xdist or pytest

Example of the download all test data before running the test suite: https://github.com/hyperspy/rosettasciio/actions/runs/5071177296/jobs/9107256002?pr=123

Drawback

  • Running the test suite will fail, it the user doesn't write access to the test folder.
  • There is an issue with generating hash on windows of text, xml file (most likely related to line ending) and it doesn't seem that we can solve it. However, we would workaround with a custom pre-commit hook that we could run from the pull request and that will be skipped on windows.

Progress of the PR

  • Move test data a the rsciio/tests/data folder,
  • [n/a] update docstring (if appropriate),
  • [n/a] update user guide (if appropriate),
  • update contributing CONTRIBUTING.rst once we are happy with the approach
  • Check setting base url plays well with our setup in rosettasciio to pick tagged or development branch
  • add an changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting changelog entry in the readthedocs doc build of this PR (link in github checks)
  • add tests,
  • ready for review.

After the PR is merged

  • Update base url from my fork to the hyperspy organisation.
  • Add custom pre-commit hook to update the registry so that the registry can be updated by running pre-commit from the PR (add a comment)
  • follow up with the zip file

@codecov
Copy link

codecov bot commented May 24, 2023

Codecov Report

Patch coverage: 87.50% and no project coverage change.

Comparison is base (6210e4e) 85.24% compared to head (cf4486b) 85.24%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #123   +/-   ##
=======================================
  Coverage   85.24%   85.24%           
=======================================
  Files          73       73           
  Lines        9034     9042    +8     
  Branches     2045     2046    +1     
=======================================
+ Hits         7701     7708    +7     
  Misses        872      872           
- Partials      461      462    +1     
Impacted Files Coverage Δ
rsciio/conftest.py 80.00% <87.50%> (+8.57%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ericpre ericpre changed the title Pooch Remove packaging of test data and download them on demand May 24, 2023
rsciio/tests/test_bruker.py Outdated Show resolved Hide resolved
rsciio/tests/test_emd.py Show resolved Hide resolved
rsciio/tests/test_tiff.py Show resolved Hide resolved
MY_PATH = os.path.dirname(__file__)
ZIPF = os.path.join(MY_PATH, "edax_files.zip")
TEST_DATA_PATH = Path(__file__).parent / "data" / "edax"
ZIPF = TEST_DATA_PATH / "edax_files.zip"
TMP_DIR = tempfile.TemporaryDirectory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tmp_path here as for other plugins?

Copy link
Contributor

@sem-geologist sem-geologist May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is good chance to get rid of zip and other packed data (I mean leave it loose - not packed within zip or gz)?
Why:

  1. bandwitch saving is redundant, as git compresses data behind anyway.
  2. size is moot point, if this is going to be downloaded on demand
  3. updating with new files, requires huge binary differences, getting rid of zip/gz collections, only new file will be added as difference to the git history.
  4. finally, these tmp_path would be not needed (albeit if tests would be rewritten to use rsciio to read the data explicitly, and retrieved list/dictionary fed to hyperspy to generate signal for testing, for some formats loading could be straight from zip file, without any tmp_folders)

Copy link
Member Author

@ericpre ericpre May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should try to remove of the zip files. Is it ok, if we do this in a follow PR? It would be good to test updating register, etc.

Just to avoid misunderstand, the expected usage is to download the files on demand when running from a non-development installation. For developer, there will be no change in workflow (well, you can remove the data folder and it will be downloaded again).
The main aim is to be able to run the test suite from the conda-forge feedstock or the hyperspy-bundle.

  1. finally, these tmp_path would be not needed (albeit if tests would be rewritten to use rsciio to read the data explicitly, and retrieved list/dictionary fed to hyperspy to generate signal for testing, for some formats loading could be straight from zip file, without any tmp_folders)

I am not sure about that. What would be the benefit? For development purpose I usually prefer to open the file directly, sometimes using different approach, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if zip files are get rid (replaced with unpacked content), then no tmp_path is needed as all files can be just opened directly (using Hyperspy load). That is prefered way.

Sorry for convoluted message. What I had meant some of readers (undocumented behavior) are able to open not only physical files, but can accept the opened file or other BytesIO like opened object, i.e. file from within zip archive - no tmp folder then is needed, as byte-stream is read and parsed directly seamlessly from zip. I know that that works with tiffs. Ofc for test files that feature rather not good due to 1,2,3 reasons given before. However in the future we should consider documenting this behavior as gz, zip are one of most common available well documented, tested, widely supported containers. I.e. it is nice to be able to keep whole SEM session for one sample within single file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the zip file business, for the scenario where tmp_path are used for testing saving file, we should keep this pattern, because we don't need to worry about removing them, user having write access to the folder, etc. What do you think?

This make me think that to run the test suite, the user will need write access to the tests folder, which is not guaranty in multi user installation. I have updated the summary of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use tmp_path here as for other plugins?

Same as above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course for saving tests the situation is different, and rather it can't be evaded anyhow.

@jlaehne jlaehne linked an issue May 24, 2023 that may be closed by this pull request
@jlaehne jlaehne requested a review from CSSFrancis May 24, 2023 21:24
@CSSFrancis
Copy link
Member

@ericpre Very cool! I'll probably try to check this out tomorrow to see if I can get it to work. The rsciio.tests.registry_utils.update_registry function seems reasonable but I wonder if people will get stuck on that. Is it worth it to see if we can use pre-commit to get that to check for new files and then automatically edit the registry.txt file?

@ericpre
Copy link
Member Author

ericpre commented May 25, 2023

@ericpre Very cool! I'll probably try to check this out tomorrow to see if I can get it to work. The rsciio.tests.registry_utils.update_registry function seems reasonable but I wonder if people will get stuck on that. Is it worth it to see if we can use pre-commit to get that to check for new files and then automatically edit the registry.txt file?

Yes, this is a good idea, even if not everyone uses pre-commit, it will be useful to request pre-commit to update it from a comment in a PR.
In any case, if the registry is out of date, the Build / Test Packaging will fail, so it will not fall through the cracks!

Can we leave this for another PR? :)

@CSSFrancis
Copy link
Member

Yes, this is a good idea, even if not everyone uses pre-commit, it will be useful to request pre-commit to update it from a comment in a PR. In any case, if the registry is out of date, the Build / Test Packaging will fail, so it will not fall through the cracks!

Can we leave this for another PR? :)

Sounds good to me. Otherwise I think it looks good.

@jlaehne
Copy link
Contributor

jlaehne commented May 27, 2023

TODO before merging: Edit CONTRIBUTING.rst to update the path where test files should be placed (and if needed how to update the registry). The doc currently contains the comment [change for pooch] at the appropriate point.

rsciio/conftest.py Fixed Show fixed Hide fixed
CONTRIBUTING.rst Outdated Show resolved Hide resolved
CONTRIBUTING.rst Outdated Show resolved Hide resolved
CONTRIBUTING.rst Outdated Show resolved Hide resolved
CONTRIBUTING.rst Outdated Show resolved Hide resolved
CONTRIBUTING.rst Outdated Show resolved Hide resolved
CONTRIBUTING.rst Outdated Show resolved Hide resolved
@ericpre
Copy link
Member Author

ericpre commented May 30, 2023

Everyone is happy with it, merging to continue on the other PRs.

@ericpre ericpre merged commit 45ee267 into hyperspy:main May 30, 2023
@ericpre ericpre mentioned this pull request Jun 2, 2023
21 tasks
jlaehne added a commit that referenced this pull request Jun 2, 2023
Follow up of #123 (Remove packaging of test data)
@jlaehne jlaehne mentioned this pull request Jun 15, 2023
9 tasks
@ericpre ericpre added this to the v0.1.0 initial release milestone Oct 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use pooch library to manage test files
4 participants