
Reduce number of calls to rocprof #384

Merged
8 commits merged into ROCm:dev from the fewer-runs branch on Jul 18, 2024

Conversation

benrichard-amd (Contributor)

Reduces number of calls to rocprof by improving perfmon coalescing.
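For readers unfamiliar with the term, here is a minimal sketch of the coalescing idea, assuming a greedy first-fit packing and made-up per-block counter limits; the block names, limits, and function names are illustrative, not Omniperf's actual implementation:

```python
# Illustrative sketch only: pack requested counters into as few rocprof passes
# as possible, subject to assumed per-block hardware counter limits.
from collections import defaultdict

MAX_COUNTERS_PER_BLOCK = {"SQ": 8, "TCC": 4, "TA": 2}  # assumed limits, not real ones

def coalesce(counters):
    """Greedy first-fit: each returned pass corresponds to one rocprof invocation."""
    passes = []  # each pass maps block name -> list of counters
    for counter in counters:
        block = counter.split("_")[0]          # e.g. "SQ_WAVES" -> "SQ"
        limit = MAX_COUNTERS_PER_BLOCK.get(block, 1)
        for p in passes:
            if len(p[block]) < limit:          # room left for this block in this pass
                p[block].append(counter)
                break
        else:
            new_pass = defaultdict(list)       # no existing pass has room: open a new one
            new_pass[block].append(counter)
            passes.append(new_pass)
    return passes

if __name__ == "__main__":
    requested = ["SQ_WAVES", "SQ_INSTS_VALU", "TCC_HIT", "TCC_MISS", "TA_BUSY"]
    for i, p in enumerate(coalesce(requested)):
        print(f"pass {i}:", dict(p))
```

Each resulting pass corresponds to one rocprof invocation, so tighter packing directly cuts the number of application replays.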

@coleramos425 (Collaborator)

I reviewed the PR and, in general, things look good. The one thing I wanted to review closely was the Omniperf metrics that are explicitly dependent on SQ_*.csv (indicated by the coll_level field in the yaml configs), to ensure the results still match the original implementation, i.e.
[screenshot: Omniperf metrics with coll_level dependencies]
See: Code search results (github.com)
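For anyone who wants to reproduce that search locally, a rough sketch that walks the metric yaml configs and reports entries carrying a coll_level field; the config directory, schema traversal, and helper name are assumptions, not a documented Omniperf interface:

```python
# Rough illustration only: walk the metric config YAMLs and list entries that
# carry a coll_level field. Directory layout and schema are assumptions.
import glob
import yaml  # PyYAML

def metrics_with_coll_level(config_dir="path/to/omniperf/configs"):  # placeholder path
    hits = []
    for path in glob.glob(f"{config_dir}/**/*.yaml", recursive=True):
        with open(path) as f:
            doc = yaml.safe_load(f) or {}
        stack = [(path, doc)]
        while stack:                      # scan nested dicts/lists for 'coll_level'
            src, node = stack.pop()
            if isinstance(node, dict):
                if "coll_level" in node:
                    hits.append((src, node["coll_level"]))
                stack.extend((src, v) for v in node.values())
            elif isinstance(node, list):
                stack.extend((src, v) for v in node)
    return hits

for path, level in metrics_with_coll_level():
    print(level, path)
```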

Things line up for the most part; however, when contrasting the two I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that the counter <-> output file mapping stays the same, in which case I think we can narrow this down to run-to-run variation. @benrichard-amd, could you confirm my assumption? My results are below:

v1 (original code vs. original code)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/orig_run2/MI300A/ -b 11.2.10 13.2.5

[screenshot: v1 vs. v1 analyze comparison]

v2 (original code vs. ben's modification)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/bens_run1/MI300A/ -b 11.2.10 13.2.5

[screenshot: v1 vs. v2 analyze comparison]

@coleramos425 (Collaborator)

Other than the above inquiry, my only other request before we merge is that you sign off on your commits per DCO requirements, e.g.
[screenshot: example of a DCO sign-off]

@benrichard-amd (Contributor, Author)

Hi @coleramos425,

Things line up for the most part; however, when contrasting the two I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that the counter <-> output file mapping stays the same, in which case I think we can narrow this down to run-to-run variation. @benrichard-amd, could you confirm my assumption? My results are below:

I think this is run-to-run variation; it might depend on the workload. I ran the same experiment using the occupancy.hip sample workload (MI300X):

original code vs original code:
[screenshot: occupancy.hip, original code vs. original code]

original code vs ben's modification:
[screenshot: occupancy.hip, original code vs. ben's modification]

Signed-off-by: benrichard-amd <ben.richard@amd.com>
Signed-off-by: benrichard-amd <ben.richard@amd.com>
Interleave TCC channel counters in output file, e.g. TCC_HIT[0] TCC_ATOMIC[0] ... TCC_HIT[1] TCC_ATOMIC[1]

Signed-off-by: benrichard-amd <ben.richard@amd.com>
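A hedged illustration of the column ordering the commit above describes; the counter names and the helper function are hypothetical:

```python
# Illustrative only: order per-channel TCC counter columns channel-by-channel,
# e.g. TCC_HIT[0], TCC_ATOMIC[0], ..., TCC_HIT[1], TCC_ATOMIC[1], ...
def interleave_tcc_columns(counters, num_channels):
    cols = []
    for ch in range(num_channels):
        for counter in counters:
            cols.append(f"{counter}[{ch}]")
    return cols

print(interleave_tcc_columns(["TCC_HIT", "TCC_ATOMIC"], 2))
# ['TCC_HIT[0]', 'TCC_ATOMIC[0]', 'TCC_HIT[1]', 'TCC_ATOMIC[1]']
```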
Omniperf analyze expects the accumulate counters to be in SQ_*.csv files.

Since these files also contain PMC counters (we are trying to
fit as many counters into each file as possible to minimize runs),
we need to include these SQ_*.csv files in pmc_perf.csv.

Signed-off-by: benrichard-amd <ben.richard@amd.com>
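As a rough sketch of the merge step the commit above describes: the file names follow the commit message, but the concatenation logic, function name, and row-alignment assumption are illustrative rather than the actual Omniperf code.

```python
# Sketch under stated assumptions: fold the per-run SQ_*.csv outputs into
# pmc_perf.csv so the PMC counters they also carry are visible to the analyze step.
import glob
import pandas as pd

def build_pmc_perf(workload_dir):
    frames = [pd.read_csv(p) for p in sorted(glob.glob(f"{workload_dir}/SQ_*.csv"))]
    if frames:
        # Assumes each SQ_*.csv has one row per dispatch, in the same order,
        # which is what a column-wise concat relies on.
        merged = pd.concat(frames, axis=1)
        merged.to_csv(f"{workload_dir}/pmc_perf.csv", index=False)
```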
Signed-off-by: benrichard-amd <ben.richard@amd.com>
Signed-off-by: benrichard-amd <ben.richard@amd.com>
Ran into rocprof error:
ROCProfiler: fatal error: input metric'TCC_EA0_RDREQ[16]' not supported on this hardware: gfx942

gfx942 has 16 channels, not 32.

Signed-off-by: benrichard-amd <ben.richard@amd.com>
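For context, a minimal sketch of the per-architecture channel-count guard implied by the commit above; only the gfx942 value of 16 and the previous blanket assumption of 32 come from the commit message, everything else is hypothetical:

```python
# Sketch based on the commit message: gfx942 exposes 16 TCC channels, while the
# old code assumed 32 everywhere (hence TCC_EA0_RDREQ[16] being rejected).
TCC_CHANNELS = {"gfx942": 16}          # only this value comes from the commit

def tcc_channel_count(arch, default=32):
    # a default of 32 mirrors the previous blanket assumption
    return TCC_CHANNELS.get(arch, default)

def tcc_counter_columns(counter, arch):
    # keep channel indices below the per-arch channel count
    return [f"{counter}[{ch}]" for ch in range(tcc_channel_count(arch))]

print(len(tcc_counter_columns("TCC_EA0_RDREQ", "gfx942")))  # 16
```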
Signed-off-by: benrichard-amd <ben.richard@amd.com>
@coleramos425 (Collaborator)

LGTM

@coleramos425 merged commit 3d4b48d into ROCm:dev on Jul 18, 2024
8 of 10 checks passed
@coleramos425 (Collaborator)

For the record, for a vanilla Omniperf profiling run (i.e. no IP block filtering) this PR reduces the number of required application replays from 24 to 15. In the mixbench profiling study below, I found this leads to a ~27% runtime improvement (32.28 s -> 23.67 s, i.e. (32.28 - 23.67) / 32.28 ≈ 26.7%).

Note: This test was run using a production rocprofiler build. We can expect an even larger improvement when this change is combined with profiler performance enhancements in a future release.

                Runtime (sec)   # of required application replays
Original code   32.28           24
This PR         23.67           15

CC: @koomie

@benrichard-amd deleted the fewer-runs branch on July 18, 2024 at 19:26
benrichard-amd added a commit that referenced this pull request on Sep 9, 2024
…m-rel-6.2 (#422)

* Improve perfmon coalescing

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Interleave TCC channel counters

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Remove duplicate normal counters
Interleave TCC channel counters in output file, e.g. TCC_HIT[0] TCC_ATOMIC[0] ... TCC_HIT[1] TCC_ATOMIC[1]

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Save accumulate counters to SQ_ files

Omniperf analyze expects the accumulate files to be in SQ_*.csv files.

Since these files also contain PMC counters (we are trying to
fit as many counters into each file as possible to minimize runs),
we need to include these SQ_*.csv files in pmc_perf.csv.

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Update to work with rocprof v1

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Remove unused method

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Set correct number of TCC channels for gfx942

Ran into rocprof error:
ROCProfiler: fatal error: input metric'TCC_EA0_RDREQ[16]' not supported on this hardware: gfx942

gfx942 has 16 channels, not 32.

Signed-off-by: benrichard-amd <ben.richard@amd.com>

* Fix code formatting

Signed-off-by: benrichard-amd <ben.richard@amd.com>

---------

Signed-off-by: benrichard-amd <ben.richard@amd.com>