Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in hpc-cluster-slurm.yaml #4

Merged

Conversation

heyealex
Copy link
Contributor

@heyealex heyealex commented Oct 15, 2021

Default values for terraform variables have been taking precedence over
global variables, so the default for the login-node was taken rather
than the global variable, unlike the controller which had not default
set.

To remedy this:

  • The default was removed from zone for the login node
  • Precendence has been modified to pull from globals before defaults

In addition to this, a few other fixes have been included:

  • Updating the omnia branch name to include the version
  • Better error handling in the expand step
  • Tests to cover new behavior in applyGlobalVariables
  • .tflint.hcl was added to cover pre-commit tests

Default values for terraform variables have been taking precedence over
global variables, so the default for the login-node was taken rather
than the global variable, unlike the controller which had not default
set.

To remedy this:
* The default was removed from zone for the login node
* Precendence has been modified to pull from globals before defaults

In addition to this, a few other fixes have been included:
* Updating the omnia branch name to include the version
* Better error handling in the expand step
* Tests to cover new behavior in applyGlobalVariables
@heyealex heyealex requested a review from cboneti October 15, 2021 00:12
@heyealex heyealex changed the base branch from main to develop October 15, 2021 01:17
Copy link
Member

@cboneti cboneti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the few comments here.
I suggest we merge this into develop first. We test it a little more and then merge develop into main.

}
if err := bc.applyGlobalVariables(); err != nil {
log.Fatal(err)
}
bc.expandVariables()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pick one behavior for error handling. expandVariables handles its own errors, whereas the other two are handled here...
I think that if the errors that happened in combineLabels and in aplyGlobalVariables are unrecoverable, then we should call logFatal within these functions, otherwise, I see no good reason to propagate the error until expand, but not further. It just feels inconsistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was done primarily to help with testing some of these lower level function, where we can return an error and verify that it is what we expected. I think we need to select a level at which it makes sense to actually handle the errors, which is probably in the function called by bc.expand() and bc.validate(), where we can have the most intuitive error messages, but keep the high level functions in config.go clean. What are your thoughts on that?

I also added a task for improving error handling in the future, possibly using a tool like github.com/pkg/errors to process and wrap errors between functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bugs me big time, but ok to push as is, since it is an urgent patch. Then we should agree on how to handle errors.
My big inclination is towards "intent" (Unrecoverable errors should fail immediately), but I see that testing can be harder (but not impossible).

pkg/config/expand.go Outdated Show resolved Hide resolved
This commit creates a more explicit flow showing the precedence for
sourcing a setting from explicit to global to default, and failing if it
can't be found anywhere. Updating tests to reflect this as well.
Moved templates from their own directory to within reswriter as is
required by internal embed tooling and removed a test that is not
properly self-contained.
* linter fixes for go and terraform

Adding package comments and various other go-lint related fixes. This
also updates the python script for omnia install and fixes an outdated
branch name for omnia.
Copy link
Member

@cboneti cboneti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, as long as we capture follow actions such as added unit tests and homogeneous error testing approach

}
if err := bc.applyGlobalVariables(); err != nil {
log.Fatal(err)
}
bc.expandVariables()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bugs me big time, but ok to push as is, since it is an urgent patch. Then we should agree on how to handle errors.
My big inclination is towards "intent" (Unrecoverable errors should fail immediately), but I see that testing can be harder (but not impossible).

@heyealex heyealex merged commit f75a44a into GoogleCloudPlatform:develop Oct 15, 2021
@heyealex heyealex deleted the bugfix/hpc-slurm-zone-mismatch branch October 15, 2021 18:27
heyealex pushed a commit that referenced this pull request Apr 27, 2022
Uses `deployment_name` for the generated directory
nick-stroud referenced this pull request in nick-stroud/hpc-toolkit Jul 8, 2022
Add startup script example using seperate instance template module
heyealex pushed a commit that referenced this pull request Nov 12, 2022
nick-stroud pushed a commit that referenced this pull request Aug 14, 2024
OFE - Terraform 1.4 fix, reservations and validation of compatible disk types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants