
Design simplification #728

Open
pavelkomarov opened this issue Sep 30, 2024 · 5 comments
pavelkomarov commented Sep 30, 2024

I have some suggestions after reading through more of the documentation:

  • More of the objects' parameters, like the Controller's tfinal and a State's q values and a Solver's BCs, should be arguments to the respective constructors. I don't want to have to futz with object parameters after object creation, and I don't want to be able to create an object that isn't usable because something isn't yet set. Take inspiration from scikit-learn, with those long parameter lists, mostly kwargs set to defaults, with BIG docstrings explaining how each controls the object.
  • There should be fewer ways to do things. I want one right way. That way can be flexible, but right now, e.g., initializing a Solution from a file, from a frame number in memory, or from a lower-level list of dimensions involves a lot of pretty different methods, each of which takes brain space. Give me one function or one constructor with well-documented parameters to handle the different cases. Removing support for less-used options or folding them into others would greatly shorten the code and make it easier to maintain. For example, a Controller can iterate over basically a linspace of times, over a list of output times, or just emit an output frame every kth iteration. If you say "always give me a list", that covers the first two use cases (because I can inline a linspace call somewhere), and you can remove the other two paths from the code.
  • States take a Domain, Patch, Grid, or list of Dimensions at initialization, and then the Solution, which contains States, also takes geometry. Why? Can't it get the geometry by asking its sub-objects for their geometries? Maybe there is a good reason, but I found this confusing and possibly redundant.
  • The documentation says Patch "contains a reference to a nearly identical Grid object". If they're nearly identical, why are there two, neither inheriting from the other? Remember the Zen of Python, "There should be one-- and preferably only one --obvious way to do it."
  • Limiters are still confusing to me. The fact they're set by number instead of by name or enum makes them all the more abstruse. Instead of a dictionary from int->function, do string->function or something.
  • problem_data feels like it should belong to the Solver for a given problem, not to the State.
  • What is the difference between compute_p and compute_F? I get the sense p is defined over the whole geometry, and F can only be a single value. But they're both derived quantities. I say the concepts should be merged: You can either compute a derived quantity over the whole field, or you can sum out a scalar value.
  • I'm finding myself not using the run_app_from_main utility. It's easier to call run on a Controller and then call the plot function or do anything else I want with it afterward. I feel like trying to make a command-line utility with htmlplot=True and other specially-named arguments isn't necessary and again gives me an extra way to do things (also hides where some things are actually happening), when my degrees of freedom already feel overwhelming.
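The "always give me a list" idea from the second bullet could look roughly like this. This is only a sketch: `Controller`, `out_times`, and `run` here are illustrative names, not PyClaw's actual API.

```python
import numpy as np

# Hypothetical single-path Controller that always takes a list of output times.
class Controller:
    def __init__(self, out_times):
        self.out_times = list(out_times)

    def run(self):
        # Stand-in for the real solve loop: produce one "frame" per output time.
        return [f"frame at t={t}" for t in self.out_times]

# Use case 1: evenly spaced outputs -- just inline a linspace call.
evenly = Controller(np.linspace(0.0, 1.0, 5)).run()

# Use case 2: an explicit, irregular list of output times.
explicit = Controller([0.0, 0.1, 0.5, 1.0]).run()
```

One constructor parameter then covers both the linspace-style and explicit-list use cases, and the every-kth-iteration path could be dropped.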

Of all of them, the first is the most important.

In my opinion, on the surface, at the level of examples, it should be far fewer than 300 lines to get going. A lot of complexity can be abstracted or deduplicated away. Seduce me as a user.

That said, I think the division of things into the various .py files makes pretty good sense. The fact that the Riemann solvers live in a totally different package was my biggest point of confusion. There is just a lot of code; I find myself grepping to find things often.

ketch commented Oct 1, 2024

Thanks for compiling and sharing your impressions as a new user! There is a lot of food for thought here. I will make some comments although I can't guarantee that I will reply to continued discussion on these points.

More of the objects' parameters, like the Controller's tfinal and a State's q values and a Solver's BCs, should be arguments to the respective constructors.

This is a matter of style and what you're comfortable with. I understand that you are familiar and comfortable with a different style.

There should be fewer ways to do things. I want one right way. That way can be flexible, but right now, e.g., initializing a Solution from a file, a frame number in memory, from a lower-level list of dimensions, etc. constitute a lot of pretty different methods, each of which takes brain space.

PyClaw is not only a package itself, but is also used by other packages, and is used by people with different backgrounds and skills. Because of that, it's set up to be used in different ways. This is not optimal for you, but each of the methods you mention is needed and used by someone.

The documentation says Patch "contains a reference to a nearly identical Grid object". If they're nearly identical, why are there two, neither inheriting from the other?

Patches are an abstraction needed for adaptive mesh refinement. If you're not using AMR, you don't need to know about or work with Patches, only Grids.

Limiters are still confusing to me. The fact they're set by number instead of by name or enum makes them all the more abstruse.

This is historical and could probably be improved. Feel free to make a concrete suggestion (PR). But you can set them by name in a sense, using things like pyclaw.limiters.tvd.MC. If this doesn't seem to be clear from the documentation, let me know.

problem_data feels like it should belong to the Solver for a given problem, not to the State.

Can you explain why? In my view, problem_data is like aux except that it contains scalar values rather than fields.

What is the difference between compute_p and compute_F? I get the sense p is defined over the whole geometry, and F can only be a single value. But they're both derived quantities. I say the concepts should be merged:

If I recall correctly, they are separate because for large parallel runs it's essential to know in advance if you need to do an allreduce (which is required for F but not for p).
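A toy illustration of that distinction, with plain Python lists standing in for per-rank data and a `sum` standing in for an MPI allreduce (not PyClaw code):

```python
# Pretend each "rank" owns one chunk of the grid.
chunks = [[1.0, 2.0], [3.0, 4.0]]

def compute_p(local_q):
    # p is a field: each rank computes its own cells independently,
    # with no communication.
    return [q ** 2 for q in local_q]

def compute_F(local_q):
    # F is a single scalar over the whole domain: each rank computes a
    # partial sum, and the partial sums must then be combined.
    return sum(q ** 2 for q in local_q)

local_p = [compute_p(c) for c in chunks]  # result stays distributed
F = sum(compute_F(c) for c in chunks)     # stand-in for an MPI allreduce
```

Because F requires that final combining step across all processes, the parallel backend has to know ahead of time which kind of quantity it is computing.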

I'm finding myself not using the run_app_from_main utility.

That's okay; you can ignore it. It is used heavily by others, myself included.

In my opinion, on the surface, at the level of examples, it should be far fewer than 300 lines to get going.

I'm not sure what you're referring to here. Of course, some of the available examples are much longer than others, and not every one of them would be a good example to start with. The basic example at https://www.clawpack.org/pyclaw/tutorial.html is about 20 lines.

pavelkomarov commented Oct 2, 2024

This is a matter of style and what you're comfortable with. I understand that you are familiar and comfortable with a different style.

Sure, it's a different style, but it's more than that. It forces a certain helpfully explicit structure: When I see y = f(x, a, b, c), it lets me know the exact information that function needs in order to compute its output. If I instead have obj.x = x, obj.a = a, etc. followed by y = obj.f(), then I don't even know whether any of those parameters are used to compute y without looking at the source code for f! That makes it hard to know what setup is necessary before making a call.
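The contrast can be made concrete with a sketch (illustrative names only, not PyClaw's API):

```python
# Attribute-setting style: nothing in the interface tells you which
# fields run() actually needs, and a half-configured object is legal.
class SimMutable:
    def run(self):
        return self.tfinal * self.dt  # fails unless both were set elsewhere

# Constructor style: the signature documents exactly what a usable
# object requires, and an unusable object cannot be constructed.
class Sim:
    def __init__(self, tfinal, dt=0.01):
        self.tfinal = tfinal
        self.dt = dt

    def run(self):
        return self.tfinal * self.dt

result = Sim(tfinal=1.0).run()
```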

Right now you have a lot of parameters in your __init__s that get set to default values. To your credit, you do have significant docstrings at the tops of classes, e.g. https://github.com/clawpack/pyclaw/blob/master/src/pyclaw/solver.py, where parameters are documented as attributes, and those are helpful. However, I still believe turning these into function kwargs makes the code easier to follow and makes important function calls easier to set up, because the structure itself suggests which things I need to create or do before I'm ready to make the call.

I'm not advocating for a totally functional paradigm. It can make sense to hold data in objects and have methods operate on that data. But consider what's easier to follow in different situations. Constructors generally take parameter values to set to object variables. Right now a lot of parameters feel hidden, buried in the source code or among a heap of others in the documentation pages, so it can feel difficult to find which ones I need to set and what I need to set them to. Contrast with function-specific docstrings that specify exactly how each input affects the behavior of that specific function. I get to not read about parameters that are irrelevant to the call I want to make, and I get to not read about cases where parameters might be used in other ways elsewhere in the object. Confusion is less likely. I can do the thing I want to do with minimum viable information.

each of the methods you mention is needed and used by someone.

I don't think "there's a lot downstream" is in itself a sufficient reason to preserve the status quo. You can always create a new version where old functionality is deprecated and eventually phase it out altogether, and users can keep using particular older versions if they want. I believe there should be a conscious drive toward simplification every time changes are made to any code, generally; otherwise code entropy can overtake a project. "What is the minimal set of options that covers the use cases?" If there need to be several, maybe they can be made to feel similar (like how I suggested always passing a list of times to the Controller). Often a lot of different options need to come into being, and they can only be unified afterward, with the benefit of perspective and time. But that smoothing process does make a better library.
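One standard way to phase out an old code path without breaking anyone: keep the old entry point, emit a DeprecationWarning, and forward to the new one. The function names below are hypothetical, just to show the pattern.

```python
import warnings

def set_output_times(times):
    """New, single way: always take an explicit list of times."""
    return list(times)

def set_output_every_kth(t0, dt, k, n):
    """Old path, kept for compatibility but marked deprecated."""
    warnings.warn(
        "set_output_every_kth is deprecated; pass an explicit list "
        "to set_output_times instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Forward to the one remaining implementation.
    return set_output_times(t0 + i * dt * k for i in range(n))

times = set_output_every_kth(0.0, 0.1, 2, 3)  # still works, but warns
```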

Patches

Can we always safely get away with using Patch? If so, can we fold the Grid functionality into that class and squeeze the two concepts into one? There's cognitive overhead if I have to know to use one object in the serial case and another in the parallel case. Even if I only want to run serially, I'm left scratching my head: "What is this Patch thing?"

problem_data

This data is unique to each problem, right? There is a unique Riemann solver (typically in Fortran) for each problem, so it feels like the wrapping Solver (in Python) is the more natural one-to-one home for constants unique to each problem.

large parallel runs it's essential to know in advance if you need to do an allreduce

Is there a way to make that distinction automatically in the background somehow, so the user doesn't have to be faced with both compute_ functions?
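One way the two functions could be unified while still telling the backend up front whether communication is needed: make the reduction an explicit parameter. This is a hypothetical interface, not anything in PyClaw; `derived_quantity` is an invented name.

```python
def derived_quantity(q, func, reduce=None):
    field = [func(v) for v in q]  # always compute the field (like compute_p)
    if reduce is None:
        return field              # field result: no communication needed
    return reduce(field)          # scalar result: backend knows to allreduce

q = [1.0, 2.0, 3.0]
p = derived_quantity(q, lambda v: v * v)              # field over the domain
F = derived_quantity(q, lambda v: v * v, reduce=sum)  # single scalar
```

The presence or absence of `reduce` is known before the run starts, so a parallel backend could still decide in advance whether an allreduce is required.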

300 lines

I was being slightly snarky there. Sorry. Shorter examples are definitely possible, but I've found myself drawn to the gallery and the example source code links there, because they're very findable. But they're not super accessible. They're often made to show off extra functionality of the library. Even so, setting up more complex problems does take a significant number of lines, on the order of 100, maybe. That's a barrier to entry. The easiest way I see to save lines is to turn create-object, set-param, set-param, set-param, ... into create-object(param, param, param ...). There are many other ways too; saving lines is a creative exercise. I find it gratifying, though, because entropy tends to be exponential in the code size.

mandli commented Oct 7, 2024

Before I launch into my own particular thoughts, I would say that many of your points are valid, but that the realities of scientific software development can get in the way. We do not get funded to develop, maintain, and improve the software beyond specific funding efforts and the small bit of free-time any one developer has. The resulting tower-of-cards we tend to build is a more profound problem that bears discussion, but as of now has no great solution.

Now off my soap-box:

  1. I would agree with the premise here that we could/should restructure some constructors to allow instantiating the objects directly, the way you have suggested. There are many questions as to where keyword arguments should stop and start, as well as what sane default values are. Note that many times we instantiate required objects to None, as there is not an obvious default. If you have some specific suggestions, I am sure we would be happy to look into them. For instance, the time in the State object I could see being an argument to the constructor rather than assigned after creation.
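The None-sentinel concern and the constructor-argument suggestion are compatible; a required argument without an obvious default can still live in the constructor and be validated eagerly instead of failing later. A sketch with illustrative names (not PyClaw's actual State class):

```python
class State:
    def __init__(self, q, time=0.0, problem_data=None):
        # Validate eagerly: refuse to construct an unusable object.
        if q is None:
            raise ValueError("State requires initial data q")
        self.q = q
        self.t = time
        # Avoid a mutable default argument; None means "empty".
        self.problem_data = {} if problem_data is None else problem_data

s = State(q=[0.0, 1.0], time=0.5)
```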
  2. I respectfully disagree: "downstream" use is an important and sufficient reason to be very cautious in changing interfaces. PyClaw is built upon other codes "upstream" as well, and I would be peeved if something changed out from under us without obvious ways to replicate the behavior or deal with versions. Better is to maintain the past interfaces and develop new ones that can coexist. At some point, maybe we can consider deprecating some of these older interfaces, but again, this must be done very judiciously.
  3. Patch vs. Grid is an important distinction, despite them looking almost identical. The documentation tries to show what the distinction is, but if you are not using that functionality, it is difficult to grasp the why. That section in the documentation could probably be improved from a basic user perspective to make it easier to see if one would want to make the distinction. Historically, keeping the terms separate helped to transition from one type of solver to another.
  4. For problem_data I can see the point, but the reasoning for putting it in the Solution is that the data in there can change and evolve, and it's an easy place to put data that does not need to be gridded. It's also based on legacy stuff, so see the above regarding "downstream" and "upstream".
  5. The compute_p and compute_F distinction I am not entirely sure about, but compute_p calculates a value at every grid cell and compute_F computes a single value for the entire domain. The parallel business comes in as an allreduce for the sum across processes for the single value.

pavelkomarov commented Oct 7, 2024

I totally get you on the no direct funding house of cards thing. Why has the export-as-PDF feature in Jupyter notebook caused an HTTP 500 error for years? I buy the cathedral vs the bazaar argument that open source is often better, but there really are cases where the logic breaks down because incentives just aren't there. In scientific computing, where the abstruse technical details are at a maximum and money is at a minimum, where there are few eyes to make all bugs shallow and few hands to lighten the load, it can be pretty challenging, unless you get super popular because ML hype or something.

I don't have the latitude to spend a huge portion of my time improving clawpack, nor do I have the kind of over-arching knowledge to spot all the opportunities for improvement or take care of them efficiently. But I do have a maybe usefully-naive perspective as a user, and I will probably continue to need to play that role, so if I can make that incrementally better for future-me as well as others, I'm keen. I just want to make sure we're on the same page about what we can and should try to tackle, so I don't pour time into a PR that can't be merged or something. (I once did that, and I'm still bitter about it lol, still trying to eat that guy's lunch in terms of user numbers: exportify vs exportify. My app is first on Google--for now.) To that end, I appreciate you guys being responsive on this thread.

Personally, as someone who's gone through the ML side and long been disillusioned with its limitations, I think sims of this kind can help extend models' predictive and sensing power in the future. (If we ever have a call with @nathankutz, we can talk more about that vision.) If we want to position this library to take more user share over time, so our bugs can be shallow and our street cred (or even published results) as contributors can have increasing value, then earlier improvements can pay greater dividends.

ketch commented Oct 8, 2024

@pavelkomarov Thanks for your thoughts. For any PR that will require a significant amount of work, it's definitely a good idea to open an issue for discussion first to avoid wasting anyone's time.

While we are certainly motivated to make the software as easy to use as we can, this is a secondary priority after enabling research. I'm not sure that "taking more user share" is a good goal for Clawpack; we don't get paid by the users and in a sense more users just means more work that we have little time to do. But improvements that enable people to effectively use the code with less help / less problems are something I want very much, and I think that's what you're aiming for too.
