Warnings in trace_link.py when running Chakra on AMD GPUs #128
Comments
To solve issue 1: in Nvidia Kineto traces, each entry in the JSON file contains both an "External id" and a "correlation" field in its args:

```json
{
  "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent", "pid": 2012624, "tid": 1142494784,
  "ts": 1720537333825191, "dur": 1,
  "args": {
    "External id": 350,
    "cbid": 147, "correlation": 350
  }
}
```

AMD traces look like this, with no "correlation" field in args:

```json
{
  "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
  "ts": 1720537542569197, "dur": 32,
  "args": {
    "External id": 131
  }
}
```

It is unclear if …
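As a quick way to check whether a given trace is affected by this difference, here is a small hedged sketch (standard library only; filtering on a "gpu" category prefix is an assumption based on the excerpt above, not Chakra behavior) that counts GPU events lacking a correlation field:

```python
import json

def count_missing_correlation(trace_path):
    # Load the Kineto (Chrome trace) JSON and inspect its events.
    with open(trace_path) as f:
        events = json.load(f).get("traceEvents", [])
    missing = [
        e for e in events
        if e.get("cat", "").startswith("gpu")        # e.g. gpu_memcpy, as in the AMD excerpt
        and "External id" in e.get("args", {})
        and "correlation" not in e.get("args", {})
    ]
    print(f"{len(missing)} GPU events have an External id but no correlation field")
    return missing
```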
Thanks for sharing this, @rohitdwivedula. We had a chat with the PyTorch profiler team, and they advised us to use the correlation ID to link GPU operators with the launcher operators. Previously, we used the external ID for linking CPU operators in a Chakra host trace and a Chakra device trace. It turned out that the external ID field is not stable, so we are currently using the rf_id field.
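As a rough illustration of correlation-based linking (a simplified sketch built around the event categories in the excerpts above; it is not the actual trace_link implementation, and the "kernel" category is an assumption):

```python
from collections import defaultdict

def link_by_correlation(trace_events):
    # Group GPU-side events under the CPU-side launcher (cuda_runtime)
    # event that carries the same correlation ID.
    launchers = {}
    gpu_events = defaultdict(list)
    for event in trace_events:
        corr = event.get("args", {}).get("correlation")
        if corr is None:
            continue
        cat = event.get("cat", "")
        if cat == "cuda_runtime":
            launchers[corr] = event
        elif cat.startswith("gpu") or cat == "kernel":
            gpu_events[corr].append(event)
    # Pair each launcher with the GPU work sharing its correlation ID.
    return {c: (launchers[c], gpu_events.get(c, [])) for c in launchers}
```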
Hi @TaekyungHeo - am hoping to open a PR to try to fix this issue and had a quick question. Currently, PyTorch's Kineto traces on AMD GPUs do not contain the correlation field, so as a hacky workaround we modified the Kineto JSON by copying each event's External id into a correlation field. After making this one change to the JSON, we ran chakra_trace_link again. Question: would it be possible for us to upstream the change in our fork (essentially adding all …)?

Option 1: We add a section to the documentation with the hacky fix mentioned above for AMD hardware. Before passing the Kineto trace to chakra_trace_link, it is preprocessed with a script along these lines:

```python
import json

def process_kineto_file(infile, outfile):
    # Read the Kineto (Chrome trace) JSON.
    with open(infile, 'r') as f:
        data = json.load(f)
    # For events that have an "External id" but no "correlation",
    # copy the External id into a correlation field.
    for event in data['traceEvents']:
        args = event.get('args', {})
        if 'External id' in args and 'correlation' not in args:
            args['correlation'] = args['External id']
    with open(outfile, 'w') as f:
        json.dump(data, f, indent=2)
```

Option 2: Inside the …
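As a usage note for the Option 1 helper above (the patched file name is purely illustrative; the chakra_trace_link invocation mirrors the one in the reproduction steps further down):

```python
# Hypothetical invocation of the Option 1 helper; the patched file name is illustrative.
process_kineto_file("KINETO_demo.json", "KINETO_demo_patched.json")
# The patched trace would then be linked as usual, e.g.:
#   chakra_trace_link --pytorch-et-file ET_demo.json \
#       --kineto-file KINETO_demo_patched.json --output-file LINKED.json
```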
Thanks for raising this issue, @rohitdwivedula. I prefer Option 1, mainly because this issue needs to be fixed in PyTorch. We faced a ton of issues around this problem in the past and had to make sure Kineto was doing the right thing to get consistent behavior: simple traces would work, but more complex ones would fail.
When using chakra_trace_link on AMD Instinct MI210 GPUs, a bunch of warnings crop up when linking the Kineto and ET JSON files.

Steps to Reproduce
1. Run python3 toy_model_train.py so that KINETO_demo.json and ET_demo.json are generated (a sketch of how such a script can collect both traces follows this section).
2. Run chakra_trace_link --pytorch-et-file ET_demo.json --kineto-file KINETO_demo.json --output-file LINKED.json
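The issue does not include toy_model_train.py itself, so the following is only a minimal sketch of how such a script could emit both files, assuming a recent PyTorch where torch.profiler provides ExecutionTraceObserver; the model, iteration count, and file names other than those in the steps above are placeholders:

```python
import torch
from torch.profiler import ExecutionTraceObserver, ProfilerActivity, profile

# Placeholder model/training step standing in for toy_model_train.py.
model = torch.nn.Linear(1024, 1024).cuda()
data = torch.randn(64, 1024, device="cuda")

# Record the PyTorch execution trace (ET_demo.json).
et = ExecutionTraceObserver()
et.register_callback("ET_demo.json")
et.start()

# Record the Kineto trace for the same iterations.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        model(data).sum().backward()

et.stop()
et.unregister_callback()
prof.export_chrome_trace("KINETO_demo.json")  # Kineto/Chrome trace JSON
```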
Environment Details
fc6bfa8c5adb18ea581498445aeb448d5af1bb60
884a1f0154a16e2c170e456f8027f2646c9108ae
Possible Causes
1. Linking relies on the correlation field in the JSON objects. However, on AMD GPUs, the PyTorch Kineto traces do not contain the correlation field (see this PyTorch issue for more information).
2. In the is_cuda_launch_op function (link), the cuda_launch_operations list does not contain operation names such as hipLaunchKernel (see the sketch below).
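Purely as an illustration of the second point, here is a minimal sketch of what recognizing HIP launch operations alongside the CUDA ones could look like; the function name, set names, and HIP entries other than hipLaunchKernel are assumptions, not the actual Chakra code:

```python
# Hypothetical sketch of treating HIP launch operations like CUDA ones;
# the sets below are illustrative, not the actual cuda_launch_operations list.
CUDA_LAUNCH_OPERATIONS = {"cudaLaunchKernel", "cudaMemcpyAsync"}
HIP_LAUNCH_OPERATIONS = {"hipLaunchKernel", "hipExtLaunchKernel", "hipMemcpyAsync"}

def is_gpu_launch_op(op_name: str) -> bool:
    # Treat both CUDA and HIP runtime launch calls as launcher operators.
    return op_name in (CUDA_LAUNCH_OPERATIONS | HIP_LAUNCH_OPERATIONS)
```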