
Redefines equality (==) for VerticallyStretchedRectilinearGrids #2019

Merged: 4 commits merged into main from tc/pickup on Oct 28, 2021

Conversation

@tomchor (Collaborator) commented Oct 17, 2021

Closes #2018

@navidcy (Collaborator) commented Oct 17, 2021

@tomchor did you try the MWE in #2018 with this fix and it's all good?

src/Grids/vertically_stretched_rectilinear_grid.jl (outdated review thread)
field of the two elements. (The default behavior is using `===`, which makes it so that
`grid1 == deepcopy(grid1)` returns `false` for `VerticallyStretchedRectilinearGrid`s.)
"""
function Base.:(==)(grid1::VerticallyStretchedRectilinearGrid, grid2::VerticallyStretchedRectilinearGrid)
Collaborator:

How about other types of grids? Curvilinear grids, for example? Would they show the same "problem" as the VerticallyStretchedRectilinearGrid? Why don't we redefine `Base.:(==)` to work for `AbstractGrid`?

Collaborator (author):

I don't know about the fields of other grids, so I wanted to be conservative for now. This seems to be related to array comparisons and, since vertically stretched grids need an array, this error gets thrown. Do other grids have arrays in their structs?

Collaborator:

Yeap. Other grids have them.

No need to be conservative. Let's fix it for all grids!
Would the == as above fail for RegularRectilinearGrids?

Collaborator (author):

Would the == as above fail for RegularRectilinearGrids?

I think that would actually work for all grids without change, but I haven't tested it.

Member:

== works (now) for RegularRectilinearGrid because it is isbits I think. We may not need a special method for this grid. The same goes for RegularLatitudeLongitudeGrid.

For any grid that contains arrays, we need to define our own numeric equality operator ==.
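
For illustration, a field-wise numeric `==` along these lines could look like the sketch below. This is not the code in this PR; looping over `fieldnames` is just one way to compare every field with `==` (element-wise for arrays) rather than with the default `===`.

# Sketch only (not this PR's implementation): compare two grids field by field
# with ==, so that array-valued fields are compared by value, not by identity.
function Base.:(==)(grid1::VerticallyStretchedRectilinearGrid,
                    grid2::VerticallyStretchedRectilinearGrid)
    return all(fieldnames(typeof(grid1))) do name
        getproperty(grid1, name) == getproperty(grid2, name)
    end
end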

@tomchor (Collaborator, author) commented Oct 17, 2021

@tomchor did you try the MWE in #2018 with this fix and it's all good?

It fixes it for CPUs, but I get an error for GPU architecture. It's likely that we need to allow scalar operations due to ==. I'm planning on trying that tomorrow and I won't merge before it also works for GPUs. (I was wondering if we should add this as a test...)

Co-authored-by: Navid C. Constantinou <navidcy@users.noreply.github.com>
@navidcy self-requested a review on October 17, 2021 23:20
@navidcy (Collaborator) left a review comment:

A test would be nice to come along with this PR. I can help out with that if you want.

@navidcy (Collaborator) commented Oct 17, 2021

It fixes it for CPUs, but I get an error for GPU architecture. It's likely that we need to allow scalar operations due to ==. I'm planning on trying that tomorrow and I won't merge before it also works for GPUs. (I was wondering if we should add this as a test...)

Hm... Whatever we do should work for both CPUs and GPUs. @glwagner and I were thinking about equality between grids at some point. Probably (I'm not sure -- you can double check that) the fix in this PR fails on GPUs because, in that case, when the checkpointer tries to compare the two grids, the one saved on disk has Arrays while the other one has CuArrays. If that's the case, we need a solution that remedies this and calls two grids "equal" regardless of the device they live on. Some of these issues are related to or discussed in #1825 and #1998.

@tomchor (Collaborator, author) commented Oct 17, 2021

A test would be nice to come along with this PR. I can help out with that if you want.

Thanks! I'll take you up on that if you don't mind. Given that you said that other grids also have arrays, it might be best to expand checkpoint testing for all grids?

@navidcy (Collaborator) commented Oct 17, 2021

A test would be nice to come along with this PR. I can help out with that if you want.

Thanks! I'll take you up on that if you don't mind. Given that you said that other grids also have arrays, it might be best to expand checkpoint testing for all grids?

Yeap. But first we need to sort out the GPU/CPU issue. For example, if two grids are identical but the arrays of one live on the CPU and those of the other on the GPU, should they be considered equal? I believe in that case we should add methods so that:

julia> grid1 == grid2
true

julia> grid1 === grid2
false

@navidcy (Collaborator) commented Oct 17, 2021

But I would suggest we postpone this until after #1998 so that we don't do double work. Also, putting the architecture in grids will (potentially) make it easier to compare grids across devices.

@tomchor (Collaborator, author) commented Oct 18, 2021

But I would suggest we postpone this until after #1998 so that we don't do double work. Also, putting the architecture in grids will (potentially) make it easier to compare grids across devices.

I don't know if I agree that waiting is best, though. It seems like #1998 is still a draft so it might take a while, and a bug is a bug. I'm gonna have to fix this anyway to move on with my research, since I need to pick up my simulations and I'd like to use vertically stretched grids.

Also, I think we can pretty easily define a method that returns grid1 == grid2 even when the grids are on different architectures. We just gotta not compare architecture in the function I already wrote, no?
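
(For illustration only, not code from this PR: skipping a hypothetical architecture field when comparing could look like the sketch below. Comparing a CPU array against a GPU array would additionally need CUDA.@allowscalar, as discussed further down.)

# Hypothetical sketch: compare every field except :architecture, so that grids
# differing only in the device their arrays live on can still compare equal.
function grids_equal_ignoring_architecture(grid1, grid2)
    names = filter(!=(:architecture), collect(fieldnames(typeof(grid1))))
    return all(name -> getproperty(grid1, name) == getproperty(grid2, name), names)
end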

@navidcy (Collaborator) commented Oct 18, 2021

@tomchor, you can always do

pkg> add Oceananigans#tc/pickup

and continue research if that does the job.

I'm just a bit reluctant to apply a quick fix that works for only one case, only to have to do the same work again for the lat/lon grid, for example. I'm very keen to solve this bug before #1998 -- postponing until after #1998 was just a suggestion. But the solution should be one that covers all grids and all architectures, not something that works only for one grid and leaves the bug for the others.

Perhaps let's see what others chip in on this?

@tomchor (Collaborator, author) commented Oct 18, 2021

@tomchor, you can always do

pkg> add Oceananigans#tc/pickup

and continue research if that does the job.

I'm just a bit reluctant to apply a quick fix that works for only one case, only to have to do the same work again for the lat/lon grid, for example. I'm very keen to solve this bug before #1998 -- postponing until after #1998 was just a suggestion. But the solution should be one that covers all grids and all architectures, not something that works only for one grid and leaves the bug for the others.

Perhaps let's see what others chip in on this?

Oh yeah, I didn't mean to half-fix it. I do believe it should be a "proper" fix. I just think that adding architecture to grids (#1998) isn't gonna change what the fix is. I'll have more info tomorrow after I actually try to do this for GPUs.

@glwagner (Member) left a review comment:

This should return false if the grids have different topologies.

@tomchor (Collaborator, author) commented Oct 18, 2021

This should return false if the grids have different topologies.

It does. And currently it also returns false if the architecture is different, although I don't know if that's the behavior we want.

@tomchor (Collaborator, author) commented Oct 18, 2021

Regarding the comment here

I think there are probably a few other things to fix here, because we can't "restore" a grid with a GPU architecture.

Fixing this fully really requires finishing #1998 first, and overhauling the checkpointing infrastructure to match so that the checkpointer is "architecture aware". We should also add tests for checkpointing with other grid types.

I'm not sure what you mean by "restoring" a grid with GPU architecture. This last commit was enough for me to pick up my GPU-architecture simulation with a VerticallyStretchedGrid. For example, the following MWE works in this PR:

using Oceananigans
using CUDA

grid1 = VerticallyStretchedRectilinearGrid(size=(2, 2, 2),
                                           architecture=GPU(),
                                           x=(0, 1), y=(0, 1), z_faces=k -> k,
                                           halo=(3, 3, 3))

grid2 = VerticallyStretchedRectilinearGrid(size=(2, 2, 2),
                                           architecture=GPU(),
                                           x=(0, 1), y=(0, 1), z_faces=k -> k,
                                           halo=(3, 3, 3))

# Sanity check: the two grids should compare equal
if CUDA.@allowscalar grid1 != grid2
    error("grid1 and grid2 are not equal")
end

model = NonhydrostaticModel(grid=grid1, architecture=GPU())

progress(sim) = @info "Iteration: $(sim.model.clock.iteration), time: $(round(Int, sim.model.clock.time))"
simulation = Simulation(model, Δt=1, stop_time=10, progress=progress)
simulation.output_writers[:chk_writer] = Checkpointer(model;
                                         dir=".",
                                         prefix = "mwe",
                                         schedule = TimeInterval(2),
                                         force = false,
                                         cleanup = true,
                                         )

run!(simulation)

simulation.stop_time = 20

run!(simulation, pickup=true)

@tomchor (Collaborator, author) commented Oct 18, 2021

So, everything that I've tested so far leads me to believe this is working with vertically stretched grids for now. I've tested a few examples with both CPU and GPU architectures and they've picked up successfully. (Although it's possible I'm missing something...) However, since @navidcy and @glwagner have reservations about this fix (regarding #1998), I won't expand it to all abstract grids for now. I think more discussion is needed.

I'll leave this PR open though since I need this fix to complete my research, if that's okay.

@@ -194,7 +194,7 @@ function set!(model, filepath::AbstractString)

     # Validate the grid
     checkpointed_grid = file["grid"]
-    model.grid == checkpointed_grid ||
+    CUDA.@allowscalar model.grid == checkpointed_grid ||
Member:

Would it be better to allow scalar operations specifically where they are needed? Since we don't need scalar operations for all equality comparisons, we may want to check within ==(grid1, grid2) where applicable.

Collaborator (author):

I tried that first but it failed for some reason. I couldn't really figure out why, but I agree we should strive to use scalar operations only when necessary.

@glwagner (Member) commented:

Can we just compare 1) the topology and 2) all of the nodes? I think that would be sufficient to determine equality. I'm worried that if we don't explicitly check the topology we might run into some edge cases where the nodes / spacings are the same but the topologies are different (these would possibly be pathological, with 0 halo in a periodic / flat direction, but still good to check explicitly, I think...).

When comparing numeric equality for nodes, we should add @allowscalar as needed there.

I think we should add a test that equality works as expected, as well. We should also add a test for checkpointing on stretched grids.

I think it's ok if we don't support numeric equality for other grids --- we can build them up one at a time. They are specific to each grid.

The main downside here is that we need to refactor our grid implementation to have a single RectilinearGrid (there's no reason to have different rectilinear grid types, since we can dispatch on the case that certain directions have constant spacing easily). So this code will go away when we make that change. But I think it's a useful incremental improvement for now while we still have two rectilinear grids, so I'm fine to have it go in once it's cleaned up.

Because of this:

julia> using CUDA

julia> a = rand(2)
2-element Vector{Float64}:
 0.8207604162394306
 0.3815099688071648

julia> b = CuArray(a)
2-element CuArray{Float64, 1}:
 0.8207604162394306
 0.3815099688071648

julia> CUDA.@allowscalar a == b
true

I think this will work if the checkpointed grid is deserialized onto the CPU, even though model.grid is on the GPU. I think this is what we want, so it's fortunate that the above works.
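
For reference, a sketch of that "topology plus nodes" comparison might look like the code below. This is illustrative only: topology, nodes, and Face are Oceananigans names, but the exact nodes call signature used here is an assumption, and the merged implementation may differ.

# Illustrative sketch, not the PR's final code: grids are "equal" if they have
# the same topology and the same node locations, with @allowscalar permitting
# an element-wise comparison between an Array and a CuArray.
using Oceananigans
using CUDA: @allowscalar

function grids_match(grid1, grid2)
    topology(grid1) == topology(grid2) || return false
    x1, y1, z1 = nodes((Face, Face, Face), grid1)
    x2, y2, z2 = nodes((Face, Face, Face), grid2)
    return @allowscalar (x1 == x2 && y1 == y2 && z1 == z2)
end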

@glwagner (Member) commented:

I'll leave this PR open though since I need this fix to complete my research, if that's okay.

It'd be great to get this PR through. I don't think you need the PR open for your research though... ? You can use any code you like for that. We should keep this PR open if we are planning to finish it and merge it into main.

@tomchor (Collaborator, author) commented Oct 18, 2021

I'll leave this PR open though since I need this fix to complete my research, if that's okay.

It'd be great to get this PR through. I don't think you need the PR open for your research though... ? You can use any code you like for that. We should keep this PR open if we are planning to finish it and merge it into main.

True, I guess I just need the branch/code.

I intend to merge this though. It's just that apparently #1998 needs to be finished first, according to @navidcy at least, so I thought I'd wait for that. Are you suggesting we move on this one before other PRs and just for vertically stretched grids?

@glwagner (Member) commented:

I don't think we need #1998 first. I think we may need different == for different grid types, but I'm not totally sure.

@navidcy (Collaborator) commented Oct 19, 2021

I intend to merge this though. It's just that apparently #1998 needs to be finished first, according to @navidcy at least, so I thought I'd wait for that. Are you suggesting we move on this one before other PRs and just for vertically stretched grids?

Nah... Well, I may have initially thought that #1998 was important to finish first, but I take it back :)
Let's make this work for all grids in the simplest manner possible and merge it?

@navidcy added the "bug 🐞 Even a perfect program still has bugs" and "grids 🗺️" labels on Oct 25, 2021
@navidcy merged commit 1cea014 into main on Oct 28, 2021
@navidcy (Collaborator) commented Oct 28, 2021

@tomchor you should be able to use v0.63.3 now. Once you've made sure that everything works for you, delete this branch.

@tomchor deleted the tc/pickup branch on October 28, 2021 03:24
Development

Successfully merging this pull request may close this issue: Picking up a simulation fails with VerticallyStretchedGrids
3 participants