
Support mpi4py 4.x.x #1618

Merged: 11 commits merged into main from fix/mpi4py-4-support on Sep 2, 2024
Conversation

@JuanPedroGHM (Member) commented Aug 15, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Not sure why this only surfaced now, but MPI was complaining that our count and displacement tuples contained floats when they should contain only int32 values. The changelog does not make a specific point about this; maybe mpi4py took care of it before, but now it is up to us.

This caused many of our tests to fail, because several of the functions using Allgatherv did not properly cast the count and displacement tuples to integers.
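
To illustrate the failure mode, here is a minimal, hypothetical sketch of the kind of explicit cast that is needed; it is not code from this PR, and the way the float counts arise, the array sizes, and the bare Allgatherv call are all made up for the example:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank contributes rank + 1 elements. Deriving counts/displacements from
# float arithmetic (e.g. numpy float arrays) leaves them as floats, which
# triggers the MPI complaints described above under mpi4py 4.x.
counts = np.array([r + 1 for r in range(size)], dtype=np.float64)
displs = np.concatenate(([0.0], np.cumsum(counts)[:-1]))

# The fix: cast explicitly to plain Python ints before building the buffer spec.
counts = tuple(int(c) for c in counts)
displs = tuple(int(d) for d in displs)

sendbuf = np.full(rank + 1, float(rank), dtype=np.float64)
recvbuf = np.empty(sum(counts), dtype=np.float64)
comm.Allgatherv([sendbuf, MPI.DOUBLE], [recvbuf, counts, displs, MPI.DOUBLE])
```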

Issue/s resolved: #1598

Changes proposed:

  • In the as_buffer method in communication.py, added extra lines that explicitly cast the elements of the count and displs tuples to int, to avoid unexpected casting errors further down the call stack.
  • Removed the upper version limit on mpi4py in the dependencies list to allow mpi4py==4.0.0.
  • Corrected some of the function signatures to properly signal that some types are optional.
  • Properly skip tests if multiple nodes are required but only one is available (looking at you, DASO).
  • Tested using OpenMPI 4.1 and Python 3.11, on 1 and 2 nodes on HoreKa, both with and without GPUs.

Type of change

  • Bug fix (could have some other breaking changes down the line)

Does this change modify the behaviour of other functions? If so, which?

yes

All the communication functions that make use of as_buffer to send/recv objects through MPI.

@JuanPedroGHM linked an issue Aug 15, 2024 that may be closed by this pull request
@JuanPedroGHM changed the title from "Support mpi4py 4.0.0" to "Support mpi4py 4.x.x" on Aug 15, 2024
@JuanPedroGHM marked this pull request as ready for review August 15, 2024 11:20
@JuanPedroGHM added the bug (Something isn't working), MPI (Anything related to MPI communication), and benchmark PR labels on Aug 15, 2024

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.07%. Comparing base (15c4478) to head (6d08262).
Report is 5 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| heat/core/tests/test_suites/basic_test.py | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1618   +/-   ##
=======================================
  Coverage   92.07%   92.07%           
=======================================
  Files          83       83           
  Lines       12144    12158   +14     
=======================================
+ Hits        11181    11194   +13     
- Misses        963      964    +1     
| Flag | Coverage Δ |
|---|---|
| unit | 92.07% <93.75%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.


@ClaudiaComito added this to the 1.5.0 milestone Aug 15, 2024

@JuanPedroGHM (Member, Author) commented Aug 19, 2024

Benchmarks results - Sponsored by perun

| function | mpi_ranks | device | metric | value | ref_value | std | % change | type | alert | lower_quantile | upper_quantile |
|---|---|---|---|---|---|---|---|---|---|---|---|
| matmul_split_1 | 4 | CPU | RUNTIME | 0.137711 | 0.11414 | 0.0272603 | 20.6512 | jump-detection | True | nan | nan |
| concatenate | 4 | CPU | RUNTIME | 0.155889 | 0.179785 | 0.00335936 | -13.2915 | jump-detection | True | nan | nan |
| apply_inplace_standard_scaler_and_inverse | 4 | CPU | RUNTIME | 0.00894644 | 0.0069137 | 0.000970233 | 29.4017 | jump-detection | True | nan | nan |
| matmul_split_0 | 4 | GPU | RUNTIME | 0.0530454 | 0.0480567 | 0.042959 | 10.3808 | jump-detection | True | nan | nan |
| qr_split_1 | 4 | GPU | RUNTIME | 0.0760624 | 0.0669101 | 0.0216329 | 13.6784 | jump-detection | True | nan | nan |
| concatenate | 4 | GPU | RUNTIME | 0.197229 | 0.13092 | 0.0547443 | 50.6482 | jump-detection | True | nan | nan |
| lanczos | 4 | CPU | RUNTIME | 0.24407 | 0.649483 | 0.00172366 | -62.421 | trend-deviation | True | 0.245424 | 1.18205 |
| hierachical_svd_tol | 4 | CPU | RUNTIME | 0.0516526 | 0.14904 | 0.000279719 | -65.3431 | trend-deviation | True | 0.0518328 | 0.250819 |
| kmeans | 4 | CPU | RUNTIME | 0.311722 | 0.998816 | 0.00207136 | -68.7908 | trend-deviation | True | 0.31745 | 1.79788 |
| kmedians | 4 | CPU | RUNTIME | 0.388455 | 1.19528 | 0.00221733 | -67.5009 | trend-deviation | True | 0.392783 | 2.12779 |
| kmedoids | 4 | CPU | RUNTIME | 0.702011 | 1.29547 | 0.0206439 | -45.8102 | trend-deviation | True | 0.703679 | 1.99692 |
| apply_inplace_robust_scaler_and_inverse | 4 | CPU | RUNTIME | 2.38853 | 5.69061 | 0.0169252 | -58.0269 | trend-deviation | True | 2.41384 | 9.92578 |
| qr_split_1 | 4 | GPU | RUNTIME | 0.0760624 | 0.0659502 | 0.0216329 | 15.3331 | trend-deviation | True | 0.0615888 | 0.0690311 |
| lanczos | 4 | GPU | RUNTIME | 0.595215 | 0.583195 | 0.0106197 | 2.06108 | trend-deviation | True | 0.579026 | 0.588987 |

Grafana Dashboard
Last updated: 2024-09-02T06:55:57Z


Comment on lines 152 to 153
len(TestCase.get_hostnames()) > 1,
"Test only works on single node, file creation is not synchronized across nodes",
@mtar (Collaborator) commented Aug 20, 2024

Suggested change
- len(TestCase.get_hostnames()) > 1,
- "Test only works on single node, file creation is not synchronized across nodes",
+ len(TestCase.get_hostnames()) > 1 or not os.environ.get('TMPDIR'),
+ "Test only works on single node, file creation is not synchronized across nodes. Set the environment variable 'TMPDIR' to a path globally accessible by all nodes otherwise.",

@JuanPedroGHM (Member, Author):

Could this cause problems when TMPDIR is set only on some of the nodes? Can that happen? I'm also not sure if this guarantees that TMPDIR is globally accessible by all the nodes.

@mtar (Collaborator):

The user has to set this variable explicitly. I hope they know what they're doing then. Do I give them too much credit here? 🤔

@JuanPedroGHM (Member, Author):

I guess if it is meant for us, it should be good enough. I don't think there are a lot of people out there running the tests.

@@ -33,7 +33,7 @@
"Topic :: Scientific/Engineering",
],
install_requires=[
"mpi4py>=3.0.0, <4.0.0",
"mpi4py>=3.0.0",
@mtar (Collaborator):

Did you run it on 3.x.x?

@JuanPedroGHM (Member, Author):

Not sure what you mean by that. I installed mpi4py 4.0.0 manually.

@mtar (Collaborator):

Have you checked whether the changes are backwards compatible with the older versions? If not, we have to raise the minimum dependency or add some more lines reflecting that.

@JuanPedroGHM (Member, Author):

You are right, I did not check for backwards compatibility. I think the only problem with that one might be the rename of buffer/memory, but I will run some more tests.
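
For reference, a minimal compatibility sketch for the buffer/memory rename; this is an assumption about how both major versions could be bridged, not necessarily what this PR does:

```python
from mpi4py import MPI

# mpi4py 4.x exposes the raw-memory type as MPI.buffer, while 3.x calls it
# MPI.memory; keep a single alias so the rest of the code base stays agnostic.
try:
    MPI_buffer = MPI.buffer  # mpi4py >= 4.0
except AttributeError:
    MPI_buffer = MPI.memory  # mpi4py 3.x
```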


@JuanPedroGHM requested a review from mtar August 28, 2024 07:34

@mtar (Collaborator) left a comment

Looks good 👍


github-actions bot commented Sep 2, 2024

Thank you for the PR!

@JuanPedroGHM merged commit affdebf into main Sep 2, 2024
7 of 8 checks passed
@mtar deleted the fix/mpi4py-4-support branch October 21, 2024 13:33
Labels
benchmark PR, bug (Something isn't working), MPI (Anything related to MPI communication), PR talk

Projects
None yet

Development
Successfully merging this pull request may close these issues: Support mpi4py 4.x

3 participants