Add caliper annotations to quest_candidates_example #1419
Conversation
Here's an example of the CUDA-BVH output with caliper report, and an example of the CUDA-Implicit Grid output with caliper report (both outputs collapsed in dropdowns).
@@ -434,6 +446,7 @@ template <typename ExecSpace>
axom::Array<IndexPair> findCandidatesBVH(const HexMesh& insertMesh,
                                         const HexMesh& queryMesh)
{
  AXOM_ANNOTATE_BEGIN("initializing BVH");
Minor: Would it make sense to remove the explicit timers now that we have caliper?
Having both will cause the outer wrapper to include timings for the inner one, and in this case, the caliper timings will include the SLIC formatting and logging times.
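For illustration, a minimal sketch of what that cleanup might look like, assuming the matching AXOM_ANNOTATE_END macro and the example's existing axom::utilities::Timer/SLIC_INFO pattern (the names and messages here are illustrative, not the PR's exact code):

// Current pattern (roughly): an explicit timer wraps the annotated region, so
// the timer and the caliper region overlap, and whichever is outermost also
// measures the SLIC formatting/logging of the other.
axom::utilities::Timer timer(true);
AXOM_ANNOTATE_BEGIN("initializing BVH");
// ... build the BVH ...
AXOM_ANNOTATE_END("initializing BVH");
timer.stop();
SLIC_INFO("0: Initialized BVH in " << timer.elapsedTimeInSec() << " seconds.");

// Suggested simplification: keep only the caliper annotation and read the
// per-region time from the caliper report instead of a hand-rolled timer.
AXOM_ANNOTATE_BEGIN("initializing BVH");
// ... build the BVH ...
AXOM_ANNOTATE_END("initializing BVH");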
@bmhan12 thanks for adding the caliper stuff and showing the performance data. I need to pore over it a bit more. I will let you know if I have questions.
// copy pairs back to host and into return array
AXOM_ANNOTATE_BEGIN("copy pairs to host");
It's surprising that this loop takes so long (around 10 seconds on both platforms that you showed!). It would be interesting to explore where that time is being spent.
The only thing that sticks out to me is candidatePairs.emplace_back(), where we don't reserve the size of candidatePairs ahead of time. Any chance that each write is causing the array to expand by a single element, with a full copy each time (e.g. rather than reserving a buffer that's twice as big as the current one)?
A quick test would be to call candidatePairs.reserve(candidates_v.size()) before that loop and see what that does to the timings (see the sketch below). A different quick test might be to switch candidatePairs to a std::vector instead of axom::Array and see what the performance looks like.
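A rough sketch of that first quick test, under the assumption that the loop appends one IndexPair per device candidate (candidates_v and query_idx are stand-in names for this example's result arrays, not necessarily the PR's identifiers):

// Hypothetical: reserve the full capacity before the per-element appends so
// each emplace_back() no longer has a chance to trigger a reallocation.
candidatePairs.reserve(candidates_v.size());
for(axom::IndexType i = 0; i < candidates_v.size(); ++i)
{
  // construct the (query cell, candidate cell) pair in place
  candidatePairs.emplace_back(query_idx[i], candidates_v[i]);
}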
Looks like that's exactly what's happening:
Lines 1484 to 1490 in 6f5eaa3:

template <typename T, int DIM, MemorySpace SPACE>
template <typename... Args>
inline void Array<T, DIM, SPACE>::emplace_back(Args&&... args)
{
  static_assert(DIM == 1, "emplace_back is only supported for 1D arrays");
  emplace(size(), std::forward<Args>(args)...);
}

Lines 1428 to 1436 in 6f5eaa3:

template <typename T, int DIM, MemorySpace SPACE>
template <typename... Args>
inline void Array<T, DIM, SPACE>::emplace(IndexType pos, Args&&... args)
{
  reserveForInsert(1, pos);
  OpHelper {m_allocator_id, m_executeOnGPU}.emplace(m_data, pos, std::forward<Args>(args)...);
}

Lines 1635 to 1660 in 6f5eaa3:

template <typename T, int DIM, MemorySpace SPACE>
inline T* Array<T, DIM, SPACE>::reserveForInsert(IndexType n, IndexType pos)
{
  assert(n >= 0);
  assert(pos >= 0);
  assert(pos <= m_num_elements);

  if(n == 0)
  {
    return m_data + pos;
  }

  IndexType new_size = m_num_elements + n;
  if(new_size > m_capacity)
  {
    dynamicRealloc(new_size);
  }

  OpHelper {m_allocator_id, m_executeOnGPU}.move(m_data, pos, m_num_elements, pos + n);

  updateNumElements(new_size);
  return m_data + pos;
}
A quick test would be to call candidatePairs.reserve( candidates_v.size() ) before that loop and see what that does to the timings.
A different quick test might be to switch candidatePairs to a std::vector instead of axom::Array and see what the performance looks like.
Surprisingly, the switch to std::vector with reserve() called beforehand gave an order of magnitude improvement in the "copy pairs to host" timing. reserve() with axom::Array did not make any noticeable difference from what I saw.
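For reference, a minimal sketch of the std::vector variant described here, with the same caveat that the index-array names are assumed for illustration and may differ from the example's actual code:

// Hypothetical std::vector replacement for the host-side copy: reserve()
// up front, then append one pair per candidate returned from the query.
std::vector<IndexPair> candidatePairs;
candidatePairs.reserve(candidates_v.size());
for(axom::IndexType i = 0; i < candidates_v.size(); ++i)
{
  candidatePairs.emplace_back(query_idx[i], candidates_v[i]);
}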
CUDA-BVH output with std::vector:
$ lrun -n 1 -g 1 ./examples/quest_candidates_example_ex -i ../../ucart23z.cycle_000000.root -q ../../ucart23z_shifted.cycle_000000.root -p 2 -m bvh --caliper report
[INFO]
Parsed parameters:
* First Blueprint mesh to insert: '../../ucart23z.cycle_000000.root'
* Second Blueprint mesh to query: '../../ucart23z_shifted.cycle_000000.root'
* Verbose logging: false
* Spatial method: 'Bounding Volume Hierarchy (BVH)'
* Resolution: 'Not Applicable'
* Runtime execution policy: 'cuda'
[INFO] Reading Blueprint file to insert: '../../ucart23z.cycle_000000.root'...
[INFO] Mesh bounding box is { min:(-1,-1,-1); max:(1,1,1); range:<2,2,2> }.
[INFO] Reading Blueprint file to query: '../../ucart23z_shifted.cycle_000000.root'...
[INFO] Mesh bounding box is { min:(-0.995,-0.995,-0.995); max:(1.005,1.005,1.005); range:<2,2,2> }.
[INFO] Finished reading in Blueprint files.
[INFO] Running BVH candidates algorithm in execution Space: [CUDA_EXEC]
[INFO] 0: Initialized BVH.
[INFO] 1: Queried candidate bounding boxes.
[INFO] 2: Initialized candidate pairs (on device).
[INFO] 3: Moved candidate pairs to host.
[INFO] Stats for query
-- Number of insert-BVH mesh hexes 8,000,000
-- Number of query mesh hexes 8,000,000
-- Total possible candidates 64,000,000,000,000
-- Candidates from BVH query 63,521,199
[INFO] Mesh had 63,521,199 candidates pairs
Path Min time/rank Max time/rank Avg time/rank Time %
quest candidates example 5.479472 5.479472 5.479472 99.997047
load Blueprint meshes 5.093233 5.093233 5.093233 92.948415
load Blueprint hexahedron mesh 4.996896 4.996896 4.996896 91.190336
find candidates 0.374126 0.374126 0.374126 6.827578
initializing BVH 0.071898 0.071898 0.071898 1.312089
BVH::initialize 0.071845 0.071845 0.071845 1.311119
LinearBVH::buildImpl 0.071836 0.071836 0.071836 1.310968
build_radix_tree 0.049027 0.049027 0.049027 0.894715
RadixTree::allocate 0.019380 0.019380 0.019380 0.353667
transform_boxes 0.001531 0.001531 0.001531 0.027946
reduce_abbs 0.006291 0.006291 0.006291 0.114814
get_mcodes 0.000524 0.000524 0.000524 0.009565
sort_mcodes 0.002679 0.002679 0.002679 0.048886
array_counting 0.000064 0.000064 0.000064 0.001160
raja_stable_sort 0.002605 0.002605 0.002605 0.047547
reorder 0.009178 0.009178 0.009178 0.167485
build_tree 0.000508 0.000508 0.000508 0.009262
propagate_abbs 0.008895 0.008895 0.008895 0.162326
LinearBVH::allocate 0.014821 0.014821 0.014821 0.270475
emit_bvh_parents 0.004830 0.004830 0.004830 0.088151
query candidates 0.056463 0.056463 0.056463 1.030416
BVH::findBoundingBoxes 0.054495 0.054495 0.054495 0.994504
LinearBVH::findCandidatesImpl 0.054346 0.054346 0.054346 0.991788
PASS[1]:count_traversal 0.021871 0.021871 0.021871 0.399136
exclusive_scan 0.000111 0.000111 0.000111 0.002033
allocate_candidates 0.004732 0.004732 0.004732 0.086359
PASS[2]:fill_traversal 0.027616 0.027616 0.027616 0.503972
write candidate pairs 0.012771 0.012771 0.012771 0.233071
copy pairs to host 0.223014 0.223014 0.223014 4.069871
CUDA-Implicit Grid output with std::vector:
$ lrun -n 1 -g 1 ./examples/quest_candidates_example_ex -i ../../ucart23z.cycle_000000.root -q ../../ucart23z_shifted.cycle_000000.root -p 2 -m implicit --caliper report
[INFO]
Parsed parameters:
* First Blueprint mesh to insert: '../../ucart23z.cycle_000000.root'
* Second Blueprint mesh to query: '../../ucart23z_shifted.cycle_000000.root'
* Verbose logging: false
* Spatial method: 'Implicit Grid'
* Resolution: '0'
* Runtime execution policy: 'cuda'
[INFO] Reading Blueprint file to insert: '../../ucart23z.cycle_000000.root'...
[INFO] Mesh bounding box is { min:(-1,-1,-1); max:(1,1,1); range:<2,2,2> }.
[INFO] Reading Blueprint file to query: '../../ucart23z_shifted.cycle_000000.root'...
[INFO] Mesh bounding box is { min:(-0.995,-0.995,-0.995); max:(1.005,1.005,1.005); range:<2,2,2> }.
[INFO] Finished reading in Blueprint files.
[INFO] Running Implicit Grid candidates algorithm in execution Space: [CUDA_EXEC]
[INFO] 0: Initialized Implicit Grid.
[INFO] 1: Queried candidate bounding boxes.
[INFO] 2: Initialized candidate pairs (on device).
[INFO] 3: Moved candidate pairs to host.
[INFO] Stats for query
-- Number of insert mesh hexes 8,000,000
-- Number of query mesh hexes 8,000,000
-- Total possible candidates 64,000,000,000,000
-- Candidates from Implicit Grid query 63,521,199
[INFO] Mesh had 63,521,199 candidates pairs
Path Min time/rank Max time/rank Avg time/rank Time %
quest candidates example 7.952440 7.952440 7.952440 99.997933
load Blueprint meshes 5.054810 5.054810 5.054810 63.561687
load Blueprint hexahedron mesh 4.985796 4.985796 4.985796 62.693869
find candidates 2.884664 2.884664 2.884664 36.273199
initializing implicit grid 0.126207 0.126207 0.126207 1.586988
query candidates 0.912268 0.912268 0.912268 11.471313
write candidate pairs 1.354428 1.354428 1.354428 17.031247
copy pairs to host 0.443690 0.443690 0.443690 5.579181
Linked this finding to related #287
…dup, remove Timer usage
Also, for my own future reference, this is the script I used to collect timing data for the table (script collapsed in dropdown). It collects data for ten runs and dumps them into files labeled by the RAJA policy and spatial index used. The script is pretty rough around the edges and not completely automated: it assumes you have an allocation, have configured and compiled a single Axom build for each system, etc. The data also requires some post-processing; I go through my code editor to grab values and do a quick plug into Excel to calculate averages.
This PR adds caliper annotations to quest_candidates_example.
As part of this, I also re-ran my test scripts using the same setup as before to get the average numbers (in seconds) for the spatial index performance. In addition, I added numbers for rzwhippet with 112 threads.
Notably, the initialization times for both bvh and implicit grid are an order of magnitude faster than before for HIP and CUDA (previous PR #1278 for comparison):
Same testing setup as last time, but with caliper:
time ./examples/quest_candidates_example_ex -i ucart23z.cycle_000000.root -q ucart23z_shifted.cycle_000000.root -p <raja policy number> -m <method, either "bvh" or "implicit"> --caliper report
The runs were launched with flux run -N 1 -g 1, lrun -n 1 -g 1, salloc -N 1 -n 36 for rzgenie, and salloc -N 1 -n 112 for rzwhippet.
ucart23z is an 8,000,000 element mesh, while ucart23z_shifted is the same mesh but shifted slightly.