Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we store architecture in grid? #1825

Closed
glwagner opened this issue Jul 5, 2021 · 16 comments · Fixed by #2078
Closed

Should we store architecture in grid? #1825

glwagner opened this issue Jul 5, 2021 · 16 comments · Fixed by #2078

Comments

@glwagner
Copy link
Member

glwagner commented Jul 5, 2021

In the general case of a pre-computed grid or mesh, the architecture on which the grid is stored has be known. Thus in general, the architecture of a grid must be provided.

The few special cases to this rule are grids with metrics that are either constant or can be computed on the fly, like RegularRectilinearGrid. Up to recently RegularRectilinearGrid was the only option, so we hadn't the motivation to ponder whether architecture belonged as a model property, field property, or grid property...

Putting architecture in grid would mean that we don't have to specify it in model constructors. Since grid is so central we wouldn't really need architecture as a property anywhere else. For example, we wouldn't need to store architecture in every.single.field object, as we do now. Alleviating this boilerplate makes the change feel especially right...

The main downside is a loss of flexibility for those grids that are truly architecture-independent, like RegularRectilinearGrid and RegularLatitudeLongitudeGrid. On the other hand, we have to specify architecture somewhere and rarely helpful to mix CPU and GPU computations using the same grid.

@francispoulin
Copy link
Collaborator

Thanks for the nice summary @glwagner .

I vote for storing architecture in grid.

@navidcy
Copy link
Collaborator

navidcy commented Jul 7, 2021

architecture lives in the Grid in FourierFlows.jl (if that's an argument.. haha)

@navidcy
Copy link
Collaborator

navidcy commented Jul 7, 2021

Also, float_type is part of the grid and all field defined on the grid could inherit that, right?
(perhaps this is already the case?)

@glwagner
Copy link
Member Author

glwagner commented Jul 7, 2021

eltype(grid) propagates everywhere because of defaults like

new_data(arch, grid, loc) = new_data(eltype(grid), arch, grid, loc)

grid seems like a good place for users to specify architecture and the floating point type. I think the API is better if important things like that only need to be specified once.

@glwagner
Copy link
Member Author

glwagner commented Jul 9, 2021

@ali-ramadhan would be good to get your thoughts before doing anything. Does this help / hurt memory distributed models?

@francispoulin
Copy link
Collaborator

Out of curiosity, when we create a PR, is it possible to find out how long the tests take compared to before? It seems like it might be useful when making a change like this, as well as pretty much anything else.

@glwagner
Copy link
Member Author

glwagner commented Jul 9, 2021

The amount of time it takes each step to run is reported by buildkite:

image

@francispoulin
Copy link
Collaborator

Thanks. And I guess you can go to the latest merge to find out how long it used to take? It seems like showing the difference might be helpful, in case there is a big difference but that is probably wishing for the stars.

@glwagner
Copy link
Member Author

glwagner commented Jul 9, 2021

Right, you can reference prior PRs to compare old build times to a new build time.

Custom reports might be possible, I don't know! Could be worth checking out the buildkite documentation. If you want something written into the PR I think you have to build a bot or something. Also possible, probably not "hard" once you know what to do.

@navidcy
Copy link
Collaborator

navidcy commented Jul 9, 2021

But since all tests run on a computer at MIT the time it takes also depends on how many tests are running (eg how many prs were just open) and other random factors.

@glwagner
Copy link
Member Author

glwagner commented Jul 9, 2021

Right, there are some caveats on how you interpret those times. They aren't a benchmark.

@francispoulin
Copy link
Collaborator

I guess bennchmarks are the best way to tell and we should run those often enough to see whether anything has changed singificantly. Thanks.

@glwagner
Copy link
Member Author

Looking at the code for DistributedIncompressibleModel:

function DistributedIncompressibleModel(; architecture, grid, model_kwargs...)
i, j, k = architecture.local_index
Rx, Ry, Rz = architecture.ranks
my_connectivity = architecture.connectivity
Nx, Ny, Nz = size(grid)
Lx, Ly, Lz = length(grid)
# Pull out endpoints for full grid.
xL, xR = grid.xF[1], grid.xF[Nx+1]
yL, yR = grid.yF[1], grid.yF[Ny+1]
zL, zR = grid.zF[1], grid.zF[Nz+1]
# Make sure we can put an integer number of grid points in each rank.
# Will generalize in the future.
# TODO: Check that we have enough grid points on each rank to fit the halos!
@assert isinteger(Nx / Rx)
@assert isinteger(Ny / Ry)
@assert isinteger(Nz / Rz)
nx, ny, nz = Nx÷Rx, Ny÷Ry, Nz÷Rz
lx, ly, lz = Lx/Rx, Ly/Ry, Lz/Rz
x₁, x₂ = xL + (i-1)*lx, xL + i*lx
y₁, y₂ = yL + (j-1)*ly, yL + j*ly
z₁, z₂ = zL + (k-1)*lz, zL + k*lz
# FIXME? local grid might have different topology!
my_grid = RegularRectilinearGrid(topology=topology(grid), size=(nx, ny, nz), x=(x₁, x₂), y=(y₁, y₂), z=(z₁, z₂), halo=halo_size(grid))

suggests to me that it's important to include architecture when constructing grid, even for grids that are device independent. Specifically, our architecture object combines information about the memory layout and the device that's used for computation (CPU or GPU). RegularRectilinearGrid is device independent, but, apparently, is in practice not architecture in dependent (because we need to restrict ourselves to knowledge of just a "local" grid for local computations).

One complication is that, apparently, the pressure solver requires the global grid:

pressure_solver = haskey(model_kwargs, :pressure_solver) ? Dict(model_kwargs)[:pressure_solver] :
DistributedFFTBasedPoissonSolver(architecture, grid, my_grid)

We'll have to look into that in more detail to understand what needs to be done for that, and for other solvers like the PreconditionedConjugateGradientSolver.

@glwagner
Copy link
Member Author

@ali-ramadhan can we "reconstruct" the full grid parameters needed to build the DistributedFFTPoissonSolver, rather than passing full_grid as an argument to the constructor?

function DistributedFFTBasedPoissonSolver(arch, full_grid, local_grid)
topo = (TX, TY, TZ) = topology(full_grid)
λx = poisson_eigenvalues(full_grid.Nx, full_grid.Lx, 1, TX())
λy = poisson_eigenvalues(full_grid.Ny, full_grid.Ly, 2, TY())
λz = poisson_eigenvalues(full_grid.Nz, full_grid.Lz, 3, TZ())
I, J, K = arch.local_index
λx = λx[(J-1)*local_grid.Ny+1:J*local_grid.Ny, :, :]
eigenvalues = (; λx, λy, λz)
transform = PencilFFTs.Transforms.FFT!()
proc_dims = (arch.ranks[2], arch.ranks[3])
plan = PencilFFTs.PencilFFTPlan(size(full_grid), transform, proc_dims, MPI.COMM_WORLD)
storage = PencilFFTs.allocate_input(plan)
return DistributedFFTBasedPoissonSolver(plan, full_grid, local_grid, eigenvalues, storage)
end

I think we just need an Allgather to infer the global number of grid points and global grid extent from each process-local grid.

DistributedShallowWaterModel doesn't use the global grid at all except to build the process-local grids.

@glwagner
Copy link
Member Author

glwagner commented Jul 15, 2021

As for the API, it might simplify matters to have architecture as a positional argument, so that we dispatch on it. So we might end up with syntax that looks something like

grid = RegularRectilinearGrid(CPU(), float_type=Float32, size=(1, 1, 1), x=(0, 1), y=(1, 2), z=(-3, 0))

We can also add an architecture kwarg if others think that's better, eg

grid = RegularRectilinearGrid(architecture=CPU(), float_type=Float32, size=(1, 1, 1), x=(0, 1), y=(1, 2), z=(-3, 0))

We will have to translate the architecture kwarg to a positional arg (so that we can dispatch on it for the purpose of Oceananigans.Distributed). This adds some boilerplate for all grids, but isn't ultimately a huge concern.

@christophernhill might have some useful comments.

@francispoulin
Copy link
Collaborator

Dispatching over architecture sounds like a good idea to me, but I don't pretend to understand all the details of the two approaches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants