Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing Parameter Tables #414

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

mabruzzo
Copy link
Collaborator

@mabruzzo mabruzzo commented Aug 5, 2024

Overview

This PR introduces some minor miscellaneous parameter-refactoring (we start using std::string_view during parsing) and introduces the idea of Parameter Tables to the parameter file.

About Parameter Tables

This is an idea that shows up in a number of different file-formats (in Enzo-E we call these parameter-groups). In particular, I'm borrowing the nomenclature and conventions from toml files1.

The idea is that you declare a table by putting the table name between brackets. Consider the following snippet

key1=4

[table]
key1=5
key2=false

[table.take2]
key1=6

In this snippet, "key", "table.key1", "table.key2", and "table.take2.key1" all refer to independent parameters. (Note: the table.take2.key1 parameter can be defined even if the [table] section was completely removed).

We use the dotted-parameter notation in the codebase and on the command-line to specify the full name of a parameter. Technically, TOML allows you to use the dotted notation inside a parameter-file, but we don't support that (yet -- it's a little complicated to do that).

I tried to match all the TOML conventions. There are 2 key consequences:

  1. If you already explicitly defined a table, it is an error to explicitly define the parameter again
  2. If you define a dotted table-name, like [my.table], the file acts as though you have implicitly defined the [my] table if you haven't already defined it. It is OK to explicitly define the [my] table after you explicitly define the [my.table] table

Why is this useful

From an organizational perspective, I think this is very useful. Athena++ and Enzo-E do something comparable, and it makes the meaning of parameters simpler to understand. This is something I wanted to add for a while, but there wasn't a strong motivation for doing it.

In fact there is also a practical usage -- provided by the new ParameterMap::Enforce_Table_Content_Uniform_Access_Status method.

  • The impetus for doing this was refactoring some cooling/chemistry stuff (I'll post that PR immediately afterwords). While I was doing that, I encountered the common situation where some chemistry/cooling solvers need some extra parameters that aren't required by other solvers. I wanted to raise some kind of error if an unused parameter was specified.
  • I realized if we could logically group together the chemistry/cooling parameters, and we could leverage the fact that we track which parameters are accessed from the parameter-file, we could automatically raise a new error.

Footnotes

  1. In general, I firmly believe we should be evolving cholla's parameter-file format to be a strict-subset of TOML. TOML is nice because it is well-specified (unlike INI) and it has nice support for shallowly nested parameter tables/group (for deeper nesting, you probably want something more complex). Importantly it is also mostly compatible with the existing format (the main obstacle, is that toml expects all string arguments to be enclosed in quotes -- I'm confident we can migrate to that).

@mabruzzo
Copy link
Collaborator Author

mabruzzo commented Aug 9, 2024

I cannot replicate the single failing test when I manually run the tests on the CRC.

@mabruzzo
Copy link
Collaborator Author

mabruzzo commented Aug 9, 2024

The particular test in question that is failing, see here, totally baffles me!

There is no part of this test that is remotely related to parameter-file parser. In detail, this test consists of:

  • passing in a single cell with $\rho = 1, \rho v_x = 0, \rho v_y = 0, \rho v_z = 0, E = 1, \gamma = 5/3, \delta x = \delta y = \delta z = 1$ to the function that computes the inverse timestep (without any CFL number).
  • You can actually work this out by hand. Since $v_x = v_y= v_z =0$, this is $dt^{-1} = c_s/ \delta x = \delta x^{-1} \sqrt{\gamma (\gamma - 1) E / \rho} = \sqrt{10}/3$. This is exactly the expected result (1.054...).
  • The test is failing and saying that we get a value that is 50% larger. This frankly makes no sense to me how that could possibly happen! (It has absolutely nothing to do with parameter parsing -- unless we are somewhere reading in parameters from within the test-runner and those properties are bleeding over into tests in unexpected ways).
  • The most frustrating part is I absolutely can't replicate this issue myself on ppc-n0. I have tried running it a few times myself.

I have restarted this pipeline a few times. The very first time it ran, it started to hang on a few MHD tests. The second time
it ran, this error was the only error. Subsequently I tried running it again and more things started to fail. Finally I tried running it a 4th time and got a message that the whole node went down.

When ppc-n4 comes back up, I guess I will try pushing new versions of this test and start printing things out in order to try to debug it that way (unless somebody knows of a better way to do it1 -- @bcaddy, do you have any ways?)

Footnotes

  1. for example, circleci offers the capability for you to ssh into a test write after it fails. From the little bit of searching that I did, I don't think Jenkins supports doing that

@mabruzzo
Copy link
Collaborator Author

mabruzzo commented Aug 9, 2024

The weirdest part is that #415 directly includes the changes from this PR and the tests all pass there

@bcaddy
Copy link
Collaborator

bcaddy commented Aug 9, 2024

PR #397 is having the same issue with the same test failing with the same incorrect test value. I have no clue what the issue is. I tried running the tests manually on C-3PO, PPC-n0 and PPC-n4 and they all passed when run manually. At this point my only guess is that it's either an issue with Jenkins or a race condition. If it was a race condition I would expect it to fail intermittently but so far it's been very consistent. If it was a hardware fault I would expect it to fail on every PR, not just these two. I think @evaneschneider was looking into it but I don't know what the status of that is.

Printing is the best way I've found to interrogate the state within a Jenkins run, there might be a Jenkins plugin that helps but idk. When I'm having issues with a single thing like this I'll disable all the matrix elements except the problematic one so it runs a little faster. I did look into CircleCI when we were first doing tests, they wanted like $3k a month for what we needed.

@bcaddy
Copy link
Collaborator

bcaddy commented Aug 9, 2024

The issue does appear to be attached to specific PRs and their contents. When I reverted everything in #397 it ran fine but as soon as I added the changes back it failed. That PR only changes the names of some variables that are totally unrelated so it really shouldn't impact anything

@mabruzzo
Copy link
Collaborator Author

mabruzzo commented Aug 9, 2024

PR #397 is having the same issue with the same test failing with the same incorrect test value. I have no clue what the issue is. I tried running the tests manually on C-3PO, PPC-n0 and PPC-n4 and they all passed when run manually. At this point my only guess is that it's either an issue with Jenkins or a race condition. If it was a race condition I would expect it to fail intermittently but so far it's been very consistent. If it was a hardware fault I would expect it to fail on every PR, not just these two. I think @evaneschneider was looking into it but I don't know what the status of that is.

Gotcha. That makes me feel a lot better about this issue (I thought I broke something)

Printing is the best way I've found to interrogate the state within a Jenkins run, there might be a Jenkins plugin that helps but idk. When I'm having issues with a single thing like this I'll disable all the matrix elements except the problematic one so it runs a little faster. I did look into CircleCI when we were first doing tests, they wanted like $3k a month for what we needed.

Yeah, I would expect CircleCI to be much too expensive for anything GPU-related

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants