Unitrace mpi -host issue #75

Open
sdeyati opened this issue Aug 10, 2024 · 3 comments

@sdeyati

sdeyati commented Aug 10, 2024

Unitrace does not work when the application is launched with the mpirun -host option.

@Sarbojit2019
Contributor

@sdeyati, please provide more details about what is not working in Unitrace when MPI's '-host' option is used.
Also provide the MPI version and system details to help with debugging.

@sdeyati
Author

sdeyati commented Aug 19, 2024

System_Info.txt
I have attached a system info file with all the system information.

@sdeyati
Author

sdeyati commented Aug 19, 2024

Example output when running with mpirun -host:

=== Device #0 Metrics ===

Kernel, GpuTime[ns], GpuCoreClocks[cycles], AvgGpuCoreFrequencyMHz[MHz], GpuSliceClocksCount[events], AvgGpuSliceFrequencyMHz[MHz], L3_BYTE_READ[bytes], L3_BYTE_WRITE[bytes], GPU_MEMORY_BYTE_READ[bytes], GPU_MEMORY_BYTE_WRITE[bytes], XVE_ACTIVE[%], XVE_STALL[%], XVE_BUSY[events], XVE_THREADS_OCCUPANCY_ALL[%], XVE_COMPUTE_THREAD_COUNT[threads], XVE_ATOMIC_ACCESS_COUNT[messages], XVE_BARRIER_MESSAGE_COUNT[messages], XVE_INST_EXECUTED_ALU0_ALL[events], XVE_INST_EXECUTED_ALU1_ALL[events], XVE_INST_EXECUTED_XMX_ALL[events], XVE_INST_EXECUTED_SEND_ALL[events], XVE_INST_EXECUTED_CONTROL_ALL[events], XVE_PIPE_ALU0_AND_ALU1_ACTIVE[%], XVE_PIPE_ALU0_AND_XMX_ACTIVE[%], XVE_INST_EXECUTED_ALU0_ALL_UTILIZATION[%], XVE_INST_EXECUTED_ALU1_ALL_UTILIZATION[%], XVE_INST_EXECUTED_SEND_ALL_UTILIZATION[%], XVE_INST_EXECUTED_CONTROL_ALL_UTILIZATION[%], XVE_INST_EXECUTED_XMX_ALL_UTILIZATION[%], QueryBeginTime[ns], CoreFrequencyMHz[MHz], XveSliceFrequencyMHz[MHz], ReportReason, ContextIdValid, ContextId, SourceId, StreamMarker,
, 40960, 65536, 1600, 65520, 1599, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741601280, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65520, 1599, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741642240, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65560, 1600, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741683200, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65520, 1599, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741724160, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65560, 1600, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741765120, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65520, 1599, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741806080, 1599, 0, 1, 0, 0, 0, 0,
, 40960, 65536, 1600, 65520, 1599, 0, 0, 0, 0, 0.000000, 0.000000, 0, 0.000000, 0, 0, 0, 0, 0, 0, 0, 0, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 144741847040, 1599, 0, 1, 0, 0, 0, 0,
, 40960,

Example output when running with mpirun without the -host option:

=== Device #0 Metrics ===

Kernel, GpuTime[ns], GpuCoreClocks[cycles], AvgGpuCoreFrequencyMHz[MHz], GpuSliceClocksCount[events], AvgGpuSliceFrequencyMHz[MHz], L3_BYTE_READ[bytes], L3_BYTE_WRITE[bytes], GPU_MEMORY_BYTE_READ[bytes], GPU_MEMORY_BYTE_WRITE[bytes], XVE_ACTIVE[%], XVE_STALL[%], XVE_BUSY[events], XVE_THREADS_OCCUPANCY_ALL[%], XVE_COMPUTE_THREAD_COUNT[threads], XVE_ATOMIC_ACCESS_COUNT[messages], XVE_BARRIER_MESSAGE_COUNT[messages], XVE_INST_EXECUTED_ALU0_ALL[events], XVE_INST_EXECUTED_ALU1_ALL[events], XVE_INST_EXECUTED_XMX_ALL[events], XVE_INST_EXECUTED_SEND_ALL[events], XVE_INST_EXECUTED_CONTROL_ALL[events], XVE_PIPE_ALU0_AND_ALU1_ACTIVE[%], XVE_PIPE_ALU0_AND_XMX_ACTIVE[%], XVE_INST_EXECUTED_ALU0_ALL_UTILIZATION[%], XVE_INST_EXECUTED_ALU1_ALL_UTILIZATION[%], XVE_INST_EXECUTED_SEND_ALL_UTILIZATION[%], XVE_INST_EXECUTED_CONTROL_ALL_UTILIZATION[%], XVE_INST_EXECUTED_XMX_ALL_UTILIZATION[%], QueryBeginTime[ns], CoreFrequencyMHz[MHz], XveSliceFrequencyMHz[MHz], ReportReason, ContextIdValid, ContextId, SourceId, StreamMarker,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 51392, 1254, 51515, 1257, 37192192, 1152, 5267840, 640, 39.828487, 43.461124, 1, 41.914211, 1792, 0, 14018, 7168, 1006711, 8440357, 276466, 38039, 0.001820, 0.000000, 0.031059, 4.362075, 1.197926, 0.164823, 36.572033, 105886781440, 1599, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 32768, 800, 32800, 800, 52968960, 0, 7039360, 0, 99.878021, 0.000027, 1, 49.939026, 0, 0, 23156, 0, 528143, 14676476, 435884, 45885, 0.000000, 0.000000, 0.000000, 3.594179, 2.966327, 0.312262, 99.878021, 105886822400, 1599, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 49240, 20369, 413, 20366, 413, 32738816, 0, 4319488, 0, 100.340034, 0.000000, 1, 49.658131, 0, 0, 14276, 0, 325946, 9061920, 268967, 28297, 0.000000, 0.000000, 0.000000, 3.572415, 2.947917, 0.310139, 99.319946, 105886871640, 1499, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 32680, 14657, 448, 14688, 449, 23468544, 0, 3195520, 0, 99.776604, 0.003830, 1, 50.602531, 0, 0, 10383, 0, 239186, 6660264, 197797, 20780, 0.000000, 0.000000, 0.000000, 3.634922, 3.005931, 0.315795, 101.216377, 105886904320, 1499, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 21294, 519, 21295, 519, 33897472, 0, 4564096, 0, 99.955978, 0.000587, 1, 49.980042, 0, 0, 14802, 0, 342513, 9535614, 283075, 29876, 0.000000, 0.000000, 0.000000, 3.590223, 2.967193, 0.313160, 99.952347, 105886945280, 1499, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 18432, 450, 18440, 450, 29799424, 0, 3935360, 0, 100.005424, 0.000000, 1, 50.000679, 0, 0, 12689, 0, 296241, 8261568, 245001, 25797, 0.000000, 0.000000, 0.000000, 3.585967, 2.965712, 0.312270, 100.005424, 105886986240, 1499, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 81920, 22754, 277, 22748, 277, 36967936, 0, 4893952, 0, 100.010994, 0.000000, 1, 50.008793, 0, 0, 16030, 0, 366535, 10193568, 302737, 31840, 0.000000, 0.000000, 0.000000, 3.596617, 2.970601, 0.312429, 100.024178, 105887068160, 1399, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 21164, 516, 21120, 515, 34225152, 0, 4517504, 0, 100.028404, 0.000000, 1, 50.014202, 0, 0, 14717, 0, 340009, 9464784, 280942, 29569, 0.000000, 0.000000, 0.000000, 3.593507, 2.969236, 0.312511, 100.031960, 105887109120, 1399, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 18469, 450, 18540, 452, 29526016, 0, 3984128, 0, 99.987862, 0.000000, 1, 49.993931, 0, 0, 13242, 0, 299339, 8303904, 246795, 25947, 0.000000, 0.000000, 0.000000, 3.603923, 2.971314, 0.312392, 99.975731, 105887150080, 1399, 0, 1, 0, 0, 0, 0,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 18432, 450, 18440, 450, 29273088, 0, 3955712, 0, 99.989151, 0.000000, 1, 49.994576, 0, 0, 13008, 0, 297142, 8261232, 245353, 25800, 0.000000, 0.000000, 0.000000, 3.596873, 2.969973, 0.312306, 100.001358, 105887191040, 1399, 0, 1, 0, 0, 0, 0,
"hgemm_noco

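For anyone comparing the two dumps: in the -host run the Kernel column is empty and the L3, memory and XVE counters are all zero, while the run without -host reports the hgemm kernel name and non-zero counters. Below is a minimal Python sketch for checking a saved dump for this symptom. The function name summarize_metrics and the idea of saving the output as text are assumptions for illustration, not part of Unitrace, and the sample rows are cut down to the first three columns of rows shown above.

```python
import csv
import io


def summarize_metrics(text):
    """Return (named, unnamed): counts of data rows with and without a kernel name."""
    # skipinitialspace=True drops the blank after each comma, so quoted
    # kernel names that follow a delimiter are still parsed as one field.
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    named = unnamed = 0
    for row in rows:
        # Skip blank lines, "=== Device #N Metrics ===" banners and the header row.
        if len(row) < 3 or row[0].startswith("===") or row[0] == "Kernel":
            continue
        if row[0].strip():
            named += 1
        else:
            unnamed += 1
    return named, unnamed


# Sample rows abbreviated to the first three columns of the output in this issue.
sample = '''=== Device #0 Metrics ===

Kernel, GpuTime[ns], GpuCoreClocks[cycles],
, 40960, 65536,
"hgemm_nocopy_nn_64x40_4x8[SIMD16 {56; 1; 1} {64; 8; 1}]", 40960, 51392,
'''

named, unnamed = summarize_metrics(sample)
print(f"rows with kernel name: {named}, rows without: {unnamed}")
```

A non-zero count of rows without a kernel name would match the -host output shown above.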