Tracking progress for merging the dofhandlers #629
**Construction**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000))
ip_v = Lagrange{2,RefCube,2}()
ip_s = Lagrange{2,RefCube,1}()

@btime DofHandler($grid);      # 124.676 ns (7 allocations: 320 bytes)
@btime MixedDofHandler($grid); # 136.625 μs (8 allocations: 15.26 MiB)

@btime add!(Ref(dh)[], $(:v), $2, $ip_v) setup=(dh=DofHandler($grid)) evals=1; # 41.000 ns (3 allocations: 240 bytes)
@btime add!(dh, $(:v), $2, $ip_v) setup=(dh=MixedDofHandler($grid)) evals=1;   # 11.676 ms (11 allocations: 18.00 MiB)

function setup_dh(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}())
    add!(dh, :s, 1, Lagrange{2,RefCube,1}())
    return dh
end

@btime close!(dh) setup=(dh=setup_dh($grid, $DofHandler)) evals=1;      # 647.299 ms (224 allocations: 565.04 MiB)
@btime close!(dh) setup=(dh=setup_dh($grid, $MixedDofHandler)) evals=1; # 873.991 ms (297 allocations: 791.71 MiB)
```

Edit: After @fredrikekre's efforts last night (#637, #639, #642, #643) the full construction of both dofhandlers is about twice as fast now while consuming roughly half as much memory. The same set-up as above gives:

```julia
function full_construction(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}())
    add!(dh, :s, 1, Lagrange{2,RefCube,1}())
    close!(dh)
end

@btime full_construction($grid, $DofHandler);      # 346.665 ms (131 allocations: 363.19 MiB)
@btime full_construction($grid, $MixedDofHandler); # 356.083 ms (146 allocations: 496.37 MiB)
```

(Note that the fine-grained benchmarks from above are indeed not that meaningful, as some memory allocations happen in different functions for the two dofhandlers.)
**Constraints**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000));
∂Ω = union(
    getfaceset(grid, "left"),
    getfaceset(grid, "right"),
    getfaceset(grid, "top"),
    getfaceset(grid, "bottom"),
);
dbc = Dirichlet(:v, ∂Ω, (x, t) -> [0, 0]);
ip_v = Lagrange{2,RefCube,2}();
ip_s = Lagrange{2,RefCube,1}();

function setup_dhclosed(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}())
    add!(dh, :s, 1, Lagrange{2,RefCube,1}())
    close!(dh)
    return dh
end
function setup_ch(grid, T)
    dh = setup_dhclosed(grid, T)
    return ConstraintHandler(dh)
end
function setup_ch2(grid, T, dbc)
    dh = setup_dhclosed(grid, T)
    ch = ConstraintHandler(dh)
    add!(ch, dbc)
    return ch
end

@btime ConstraintHandler(dh) setup=(dh=setup_dhclosed($grid, $DofHandler)) evals=1;      # 9.560 μs (12 allocations: 992 bytes)
@btime ConstraintHandler(dh) setup=(dh=setup_dhclosed($grid, $MixedDofHandler)) evals=1; # 10.130 μs (12 allocations: 992 bytes)

@btime add!(ch, $dbc) setup=(ch = setup_ch($grid, $DofHandler)) evals=1;      # 38.962 ms (8108 allocations: 21.91 MiB)
@btime add!(ch, $dbc) setup=(ch = setup_ch($grid, $MixedDofHandler)) evals=1; # 3.617 ms (8124 allocations: 4.04 MiB)

@btime close!(ch) setup=(ch = setup_ch2($grid, $DofHandler, $dbc)) evals=1;      # 2.288 s (12071 allocations: 288.21 MiB)
@btime close!(ch) setup=(ch = setup_ch2($grid, $MixedDofHandler, $dbc)) evals=1; # 2.198 s (12071 allocations: 288.21 MiB)
```
**CellIterator Microbenchmark**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000));

function setup_dhclosed(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}())
    add!(dh, :s, 1, Lagrange{2,RefCube,1}())
    close!(dh)
    return dh
end
function setup_cc(grid, T, flags)
    dh = setup_dhclosed(grid, T)
    return CellCache(dh, flags)
end

@btime CellCache(dh) setup=(dh=setup_dhclosed($grid, $DofHandler));      # 1.062 μs (4 allocations: 480 bytes)
@btime CellCache(dh) setup=(dh=setup_dhclosed($grid, $MixedDofHandler)); # 629.000 ns (4 allocations: 480 bytes)

@btime CellIterator(dh) setup=(dh=setup_dhclosed($grid, $DofHandler));      # 857.000 ns (4 allocations: 480 bytes)
@btime CellIterator(dh) setup=(dh=setup_dhclosed($grid, $MixedDofHandler)); # 3.763 μs (4 allocations: 480 bytes)

@btime reinit!(cc, 1) setup=(cc=setup_cc($grid, $DofHandler, $(UpdateFlags(true, true, true))));      # 54.031 ns (0 allocations: 0 bytes)
@btime reinit!(cc, 1) setup=(cc=setup_cc($grid, $MixedDofHandler, $(UpdateFlags(true, true, true)))); # 57.543 ns (0 allocations: 0 bytes)
```
**Integrated test (Poisson)**

```julia
using Ferrite, SparseArrays
using BenchmarkTools

function assemble_element!(Ke::Matrix, fe::Vector, cellvalues::CellScalarValues)
    n_basefuncs = getnbasefunctions(cellvalues)
    fill!(Ke, 0)
    fill!(fe, 0)
    for q_point in 1:getnquadpoints(cellvalues)
        dΩ = getdetJdV(cellvalues, q_point)
        for i in 1:n_basefuncs
            δu = shape_value(cellvalues, q_point, i)
            ∇δu = shape_gradient(cellvalues, q_point, i)
            fe[i] += δu * dΩ
            for j in 1:n_basefuncs
                ∇u = shape_gradient(cellvalues, q_point, j)
                Ke[i, j] += (∇δu ⋅ ∇u) * dΩ
            end
        end
    end
    return Ke, fe
end

function assemble_global(cellvalues::CellScalarValues, K::SparseMatrixCSC, dh::Union{DofHandler, MixedDofHandler})
    n_basefuncs = getnbasefunctions(cellvalues)
    Ke = zeros(n_basefuncs, n_basefuncs)
    fe = zeros(n_basefuncs)
    f = zeros(ndofs(dh))
    assembler = start_assemble(K, f)
    for cell in CellIterator(dh)
        reinit!(cellvalues, cell)
        assemble_element!(Ke, fe, cellvalues)
        assemble!(assembler, celldofs(cell), Ke, fe)
    end
    return K, f
end

function assemble_heat(T)
    grid = generate_grid(Quadrilateral, (100, 100))
    dim = 2
    ip = Lagrange{dim, RefCube, 1}()
    qr = QuadratureRule{dim, RefCube}(2)
    cellvalues = CellScalarValues(qr, ip)
    dh = T(grid)
    add!(dh, :u, 1)
    close!(dh)
    K = create_sparsity_pattern(dh)
    ch = ConstraintHandler(dh)
    ∂Ω = union(
        getfaceset(grid, "left"),
        getfaceset(grid, "right"),
        getfaceset(grid, "top"),
        getfaceset(grid, "bottom"),
    )
    dbc = Dirichlet(:u, ∂Ω, (x, t) -> 0)
    add!(ch, dbc)
    close!(ch)
    @btime assemble_global($cellvalues, $K, $dh);
end

assemble_heat(DofHandler)      # 4.724 ms (12 allocations: 80.69 KiB)
assemble_heat(MixedDofHandler) # 4.711 ms (12 allocations: 80.69 KiB)
```
I think bringing down the constructor times for …
Yea, and also I am not sure it is very useful to compare with such granularity. I would set up a benchmark that does all of constructing, adding fields, distribute in …
Not sure if I can fully agree here. I think it makes sense to at least check that we do not have severe performance regressions in some simple operations (e.g. due to type instability or unwanted allocations).
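For concreteness, such an end-to-end benchmark could look like the sketch below. It only reuses calls that already appear in this thread; no timings were recorded for it:

```julia
using Ferrite, BenchmarkTools

# Time the whole pipeline per handler type instead of each step in isolation:
# construction, adding fields, dof distribution in close!, constraints, and
# the sparsity pattern.
function full_pipeline(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}())
    add!(dh, :s, 1, Lagrange{2,RefCube,1}())
    close!(dh)
    ch = ConstraintHandler(dh)
    add!(ch, Dirichlet(:v, getfaceset(grid, "left"), (x, t) -> [0, 0]))
    close!(ch)
    return create_sparsity_pattern(dh)
end

grid = generate_grid(Quadrilateral, (100, 100))
@btime full_pipeline($grid, DofHandler);
@btime full_pipeline($grid, MixedDofHandler);
```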
**Basic functionality**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (10, 10))

dh = DofHandler(grid)
add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(dh)

mixed_dh = MixedDofHandler(grid)
add!(mixed_dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(mixed_dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(mixed_dh)

# does the same thing anyways
@btime ndofs($dh);       # 2.125 ns (0 allocations: 0 bytes)
@btime ndofs($mixed_dh); # 2.125 ns (0 allocations: 0 bytes)

@btime ndofs_per_cell($dh);       # 2.125 ns (0 allocations: 0 bytes)
@btime ndofs_per_cell($mixed_dh); # 2.083 ns (0 allocations: 0 bytes)

@btime dof_range($dh, $(:v));       # 37.298 ns (0 allocations: 0 bytes)
@btime dof_range($mixed_dh, $(:v)); # 47.781 ns (0 allocations: 0 bytes)

@btime celldofs($dh, $15);       # 51.756 ns (1 allocation: 240 bytes)
@btime celldofs($mixed_dh, $15); # 46.249 ns (1 allocation: 240 bytes)

dofs = Vector{Int}(undef, ndofs_per_cell(dh, 15));
@btime celldofs!($dofs, $dh, $15);       # 7.675 ns (0 allocations: 0 bytes)
@btime celldofs!($dofs, $mixed_dh, $15); # 24.072 ns (0 allocations: 0 bytes)
```

Edit: The time difference between the …
**Postprocessing**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000))

dh = DofHandler(grid)
add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(dh)

mixed_dh = MixedDofHandler(grid)
add!(mixed_dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(mixed_dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(mixed_dh)

u = rand(ndofs(dh))

# point evaluation
points = [2 * rand(Vec{2}) - ones(Vec{2}) for _ in 1:1000]
ph = PointEvalHandler(grid, points)
@btime get_point_values($ph, $dh, $u, $(:v));       # 98.417 μs (28 allocations: 17.44 KiB)
@btime get_point_values($ph, $mixed_dh, $u, $(:v)); # 142.250 μs (31 allocations: 17.56 KiB)

# reshaping from dof-order to nodal order (part of vtk export)
@btime reshape_to_nodes($dh, $u, $(:v));       # 22.863 ms (6 allocations: 22.93 MiB)
@btime reshape_to_nodes($mixed_dh, $u, $(:v)); # 234.516 ms (10 allocations: 22.93 MiB)

# vtk export
filename = joinpath(tempdir(), "test")
@btime vtk_point_data(vtk, $dh, $u) setup=(vtk=vtk_grid($filename, $grid)) evals=1;       # 852.222 ms (147 allocations: 52.54 MiB)
@btime vtk_point_data(vtk, $mixed_dh, $u) setup=(vtk=vtk_grid($filename, $grid)) evals=1; # 1.268 s (158 allocations: 52.54 MiB)
```

Edit: Fixing #631 is likely to fix the performance gaps in …
Those are equally fast for me: …
Can this be explained by a difference in Julia versions or the machines used?
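To rule that out, both environments could be reported (a minimal sketch using standard Julia tooling):

```julia
using InteractiveUtils, Pkg

versioninfo()         # Julia version, OS, CPU, thread count
Pkg.status("Ferrite") # installed Ferrite version
```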
**Sparsity pattern**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (100, 100))

dh = DofHandler(grid)
add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(dh)

mixed_dh = MixedDofHandler(grid)
add!(mixed_dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(mixed_dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(mixed_dh)

# without coupling
@btime create_sparsity_pattern($dh);       # 34.338 ms (24 allocations: 246.05 MiB)
@btime create_sparsity_pattern($mixed_dh); # 34.367 ms (24 allocations: 246.05 MiB)
@btime create_symmetric_sparsity_pattern($dh);       # 18.820 ms (24 allocations: 130.69 MiB)
@btime create_symmetric_sparsity_pattern($mixed_dh); # 18.739 ms (24 allocations: 130.69 MiB)

# with coupling (`field_coupling` was not defined in the original snippet;
# assumed here to be a Bool matrix saying which fields couple, e.g.:)
field_coupling = [true true; true false]
@btime create_sparsity_pattern($dh; coupling=$field_coupling);       # 26.282 ms (31 allocations: 175.80 MiB)
@btime create_sparsity_pattern($mixed_dh; coupling=$field_coupling); # 27.622 ms (36 allocations: 175.80 MiB)
@btime create_symmetric_sparsity_pattern($dh; coupling=$field_coupling);       # 15.894 ms (31 allocations: 95.57 MiB)
@btime create_symmetric_sparsity_pattern($mixed_dh; coupling=$field_coupling); # 15.278 ms (36 allocations: 95.57 MiB)
```

Edit: Coupling benchmarks updated after #650.
**CellIterator II**

Can't reproduce the difference in constructing:

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000));

dh = DofHandler(grid)
add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(dh)

mixed_dh = MixedDofHandler(grid)
add!(mixed_dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(mixed_dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(mixed_dh)

@btime CellCache($dh);       # 74.359 ns (4 allocations: 480 bytes)
@btime CellCache($mixed_dh); # 73.665 ns (4 allocations: 480 bytes)

@btime CellIterator($dh);       # 74.820 ns (4 allocations: 480 bytes)
@btime CellIterator($mixed_dh); # 9.708 μs (4 allocations: 480 bytes)

cc_dh = CellCache(dh);
cc_mixed_dh = CellCache(mixed_dh)
@btime reinit!($cc_dh, $1);       # 20.938 ns (0 allocations: 0 bytes)
@btime reinit!($cc_mixed_dh, $1); # 22.503 ns (0 allocations: 0 bytes)
```
**Constraints: apply_analytical, affine constraints, periodic bcs**

```julia
# continues from the set-up above (grid, dh, mixed_dh)
f(x) = x ⋅ x
u = zeros(ndofs(dh))
@btime apply_analytical!($u, $dh, $(:s), $f);       # 25.878 ms (14 allocations: 2.27 KiB)
@btime apply_analytical!($u, $mixed_dh, $(:s), $f); # 323.590 ms (76 allocations: 34.50 MiB)

lc = AffineConstraint(1, [2 => 5.0, 3 => 3.0], 1.0)
@btime add!(ch, $lc) setup=(ch = ConstraintHandler($dh));       # 11.596 ns (0 allocations: 0 bytes)
@btime add!(ch, $lc) setup=(ch = ConstraintHandler($mixed_dh)); # 11.177 ns (0 allocations: 0 bytes)

φ(x) = x - Vec{2}((1.0, 0.0))
face_mapping = collect_periodic_faces(grid, "left", "right", φ)
pdbc = PeriodicDirichlet(:v, face_mapping, [1, 2])
# Add the constraint to the constraint handler
@btime add!(ch, $pdbc) setup=(ch=ConstraintHandler($dh));       # 737.750 μs (4091 allocations: 1.24 MiB)
@btime add!(ch, $pdbc) setup=(ch=ConstraintHandler($mixed_dh)); # 735.458 μs (4091 allocations: 1.24 MiB)
```

The difference in …
**Renumbering**

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (100, 100))

dh = DofHandler(grid)
add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(dh)

mixed_dh = MixedDofHandler(grid)
add!(mixed_dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
add!(mixed_dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
close!(mixed_dh)

@btime renumber!($dh, $(ndofs(dh):-1:1));             # 596.500 μs (4 allocations: 22.56 KiB)
@btime renumber!($mixed_dh, $(ndofs(mixed_dh):-1:1)); # 598.459 μs (4 allocations: 22.56 KiB)

@btime renumber!($dh, $(DofOrder.FieldWise()));       # 6.330 ms (94 allocations: 5.87 MiB)
@btime renumber!($mixed_dh, $(DofOrder.FieldWise())); # 6.067 ms (95 allocations: 5.87 MiB)

@btime renumber!($dh, $(DofOrder.ComponentWise()));       # 6.016 ms (113 allocations: 4.33 MiB)
@btime renumber!($mixed_dh, $(DofOrder.ComponentWise())); # 5.875 ms (114 allocations: 4.33 MiB)
```

Edit: Updated with new benchmarks after #645.
This changes `FieldHandler.cellset` to be a sorted `OrderedSet` instead of a `Set`. This ensures that loops over sub-domains are done in ascending cell order. Since e.g. cells, node coordinates, and dofs are stored in ascending cell order this gives a significant performance boost to loops over sub-domains, i.e. assembly-style loops. In particular, this removes the performance gap between `MixedDofHandler` and `DofHandler` in the `create_sparsity_pattern` benchmark in #629. This is a minimal/initial step towards #625 that can be done before the `DofHandler` merge and rework of `FieldHandler`/`SubDofHandler`.
This changes `FieldHandler.cellset` to be a `BitSet` (which is sorted) instead of a `Set`. This ensures that loops over sub-domains are done in ascending cell order. Since e.g. cells, node coordinates and dofs are stored in ascending cell order this gives a significant performance boost to loops over sub-domains, i.e. assembly-style loops. In particular, this removes the performance gap between `MixedDofHandler` and `DofHandler` in the `create_sparsity_pattern` benchmark in #629. This is a minimal/initial step towards #625 that can be done before the `DofHandler` merge and rework of `FieldHandler`/`SubDofHandler`.
This patch uses `BitSet` in `apply_analytical!` and `reshape_to_nodes` for `MixedDofHandler`. The benefit here is twofold: computing the intersection is much faster (basically just bitwise `&`) and the subsequent looping over the cells is done in ascending cell order. This closes the performance gap between `MixedDofHandler` and `DofHandler` in the benchmarks from #629 of `apply_analytical!`, `reshape_to_nodes`, and `vtk_point_data`. For example, here are the benchmark results for `apply_analytical!`:

```
387.853 ms (72 allocations: 34.50 MiB)  # MixedDofHandler master
 55.262 ms (38 allocations: 553.45 KiB) # MixedDofHandler patch
 41.861 ms (14 allocations: 2.27 KiB)   # DofHandler master/patch
```
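For intuition, a minimal standalone sketch (not Ferrite code) of why `BitSet` helps here: intersection reduces to bitwise operations on the packed chunks, and iteration always yields elements in ascending order, unlike `Set`:

```julia
using BenchmarkTools

a, b = Set(1:1_000_000), Set(500_000:1_500_000)
A, B = BitSet(1:1_000_000), BitSet(500_000:1_500_000)

@btime intersect($a, $b); # hash-based, element by element
@btime intersect($A, $B); # bitwise & on the packed chunks, much faster

# Iterating a BitSet is always in ascending order:
collect(Iterators.take(intersect(A, B), 3)) # [500000, 500001, 500002]
```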
After #660 I think this issue can be closed since the `MixedDofHandler` is now equally performant (at least where it really matters). 🎉
Before merging the dofhandlers, the `MixedDofHandler` must be able to do everything that the `DofHandler` does and should ideally be equally fast. This issue keeps track of how far we've come along that way.

**Progress tracking**

The focus here is to compare how a `MixedDofHandler` performs on a concrete grid with all fields on the full domain, i.e. how well `MixedDofHandler` works as a drop-in replacement for `DofHandler`.

Benchmarking code should go in the comments, ideally one category at a time. However, if a method does not work / does not perform well, open a separate issue about it and reference this one. That way it will be easier to keep an overview and do small reviewable PRs to fix issues.
Syntax: the given syntax works and yields correct results for `MixedDofHandler`.
Performance: performance with `MixedDofHandler` is comparable to `DofHandler` + there is a benchmark of it!

- `DofHandler(grid)`
- `add!(dh, name, dim[, ip])`
- `close!(dh)`
- `ndofs(dh)`
- `ndofs_per_cell(dh[, cell])`
- `dof_range(dh, field_name)`
- `celldofs(dh, i)`
- `celldofs!(dofs, dh, i)`
- `renumber!(dh, order)`
- `renumber!(dh, DofOrder.FieldWise())`
- `renumber!(dh, DofOrder.ComponentWise())`
- `create_sparsity_pattern(dh)`
- `create_sparsity_pattern(dh; coupling)`
- `create_symmetric_sparsity_pattern(dh)`
- `create_symmetric_sparsity_pattern(dh; coupling)`
- `ConstraintHandler(dh)`
- `add!(ch, dbc::Dirichlet)`
- `add!(ch, ac::AffineConstraint)`
- `add!(ch, pdbc::PeriodicDirichlet)`
- `close!(ch)`
- `apply_analytical!(a, dh, fieldname, f, cellset)`
- `CellCache(dh)`
- `reinit!(cc, i)`
- `CellIterator(dh, cellset)`
- `get_point_values(ph, dh, dof_values[, fieldname])`
- `reshape_to_nodes(dh, u, fieldname)`
- `vtk_point_data(vtk, dh, u)`
Note that not all benchmarks are equally important. We should discuss, based on the results, where regressions are acceptable and where they aren't.
**Benchmarking**

A base set-up for benchmarking can look like this:
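(A minimal sketch, assuming the same two-field set-up used throughout the comments:)

```julia
using Ferrite
using BenchmarkTools

grid = generate_grid(Quadrilateral, (1000, 1000))

function setup_dhclosed(grid, T)
    dh = T(grid)
    add!(dh, :v, 2, Lagrange{2,RefCube,2}()) # quadratic vector field
    add!(dh, :s, 1, Lagrange{2,RefCube,1}()) # linear scalar field
    close!(dh)
    return dh
end

dh = setup_dhclosed(grid, DofHandler)
mixed_dh = setup_dhclosed(grid, MixedDofHandler)
```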