Fix CaltechHPC environment #5959

Merged
knelli2 merged 2 commits into sxs-collaboration:develop from caltechhpc_skylake
May 10, 2024

Conversation

nilsvu
Member

@nilsvu nilsvu commented May 1, 2024

  • The previous environment had an issue with loading numeric initial data (took ~40min, now takes 16 seconds).
  • It also ran BBH evolutions extremely slowly for some reason (about 2M of simulation time in 24 hours).
  • It was actually built for cascadelake, not skylake.

Kyle and I reinstalled the environment on skylake, so we should be able to run on all available nodes on CaltechHPC now. I confirmed that the IO issues are fixed and a BBH runs at reasonable speed.

Proposed changes

Upgrade instructions

Code review checklist

  • The code is documented and the documentation renders correctly. Run
    make doc to generate the documentation locally into BUILD_DIR/docs/html.
    Then open index.html.
  • The code follows the stylistic and code quality guidelines listed in the
    code review guide.
  • The PR lists upgrade instructions and is labeled bugfix or
    new feature if appropriate.

Further comments

@nilsvu nilsvu added the priority critical for progress label May 1, 2024
@nilsvu nilsvu requested a review from knelli2 May 1, 2024 07:26
@nilsvu
Member Author

nilsvu commented May 1, 2024

Marking priority because BBH evolutions are essentially broken on CaltechHPC at the moment.

@knelli2
Contributor

knelli2 commented May 1, 2024

I'm testing this now. I'll report back once I'm confident everything is working smoothly (or as non-roughly as possible)

Contributor

@knelli2 knelli2 left a comment

Ok I have tested this out and am now confident that this will work on all nodes of CaltechHPC (skylake, cascadelake, icelake).

Note that we had to add the environment variable FI_PROVIDER=tcp to the skylake-2024-04 environment. The mpi module kept trying to use UCX for the fabric, but we can't use UCX because of #3886, so we go with TCP instead.
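
For illustration, a minimal sketch of how such a setting could be exported from a shell environment file. `FI_PROVIDER` is the variable named above; the helper name `spectre_set_fabric_provider` is hypothetical, and where the line actually lives in the skylake-2024-04 environment is an assumption.

```shell
# Sketch only: pin the fabric provider to TCP so the MPI module does not
# try to use UCX. The function name is hypothetical, not from the PR.
spectre_set_fabric_provider() {
    export FI_PROVIDER=tcp
}

spectre_set_fabric_provider
echo "FI_PROVIDER=${FI_PROVIDER}"
```

Because the variable is exported, any job launched from the same shell inherits it.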

Comment on lines 45 to 56
- -D ENABLE_PARAVIEW=ON \
+ -D ENABLE_PARAVIEW=OFF \
Contributor

Were there issues finding paraview? This is really only an issue for the CLI so not an immediate priority.

Member Author

No issues. I sent a request to the CaltechHPC admins to build ParaView with the same Python etc. as the rest of the build. They haven't responded yet.

Comment on lines 13 to 14

spectre_load_modules() {
Contributor

[random line] Should we remove the caltech_hpc_gcc_icelake.sh env now and rename this one to just caltech_hpc_gcc.sh? I think that'll be less confusing for users.

Member Author

Yes I think so. Now or later?

Contributor

I'd say now. Just rip the Band-Aid off. Be sure to also change the cluster installation instructions, maybe adding a little description that whatever node you build on, you can only run on that type of node or newer.

Member Author

ok done
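
The rule discussed in this thread (a build only runs on the node type it was built on, or a newer one) can be sketched as a small shell check. The helper names `arch_rank` and `can_run_on` are hypothetical and not part of the repository; the ordering skylake, cascadelake, icelake from oldest to newest is the only thing the sketch assumes.

```shell
# Sketch only: rank the CaltechHPC node types from oldest to newest.
# Unknown names make the function fail.
arch_rank() {
    case "$1" in
        skylake) echo 1 ;;
        cascadelake) echo 2 ;;
        icelake) echo 3 ;;
        *) return 1 ;;
    esac
}

# Succeeds if a build made on node type $1 can run on node type $2,
# i.e. if the run node is the same type or newer. Returns 2 for unknown names.
can_run_on() {
    build_rank=$(arch_rank "$1") || return 2
    run_rank=$(arch_rank "$2") || return 2
    [ "$run_rank" -ge "$build_rank" ]
}

can_run_on skylake icelake && echo "skylake build can run on icelake"
can_run_on cascadelake skylake || echo "cascadelake build cannot run on skylake"
```

This is why building on skylake, the oldest type, covers all three node types.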

@nilsvu nilsvu force-pushed the caltechhpc_skylake branch 5 times, most recently from 45f6225 to 0ba915b Compare May 10, 2024 00:00
@nilsvu nilsvu changed the title Fix CaltechHPC skylake environment Fix CaltechHPC environment May 10, 2024
nilsvu added 2 commits May 9, 2024 21:11
This allows extra params like `-p reservation=sxs`
to be passed to BBH pipeline commands.
- The previous environment had an issue with loading numeric
  initial data (took ~40min, now takes 16 seconds).
- It also ran BBH evolutions extremely slowly for some reason.
- It was actually built for cascadelake, not skylake.

Kyle and I reinstalled the environment on skylake, so we
should be able to run on all available nodes on CaltechHPC
now. I confirmed that the IO issues are fixed and a BBH runs
at reasonable speed.
@knelli2 knelli2 merged commit 10195a8 into sxs-collaboration:develop May 10, 2024
22 checks passed
@nilsvu nilsvu deleted the caltechhpc_skylake branch May 10, 2024 17:03
Labels
priority critical for progress