
Encode communicator groups in Chakra traces #140

Merged
5 commits merged into mlcommons:main on Sep 6, 2024

Conversation

Contributor

@JoongunPark commented Jul 24, 2024

Summary

Encoding communicator groups in Chakra traces is essential for accurately simulating collective communication when multiple communicator groups are present. With the latest PyTorch version, communicator groups can be collected in both Chakra host traces (PyTorch execution traces) and Chakra device traces (Kineto traces). In Chakra host traces, a process_group:init operator lists the communicator groups available in the run. In addition, every collective communication operator carries fields in its attributes that correlate it with a communicator group; the pg_name field is used for this correlation. Chakra device traces likewise include communicator group information in ncclDevKernel_* operators.

Below is an example with AllReduce.

{
  "ph": "X",
  "cat": "kernel",
  "name": "ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<4096ul>)",
  "pid": 0,
  "tid": 60,
  "args": {
    "External id": 14728,
    "queued": 0,
    "device": 0,
    "context": 1,
    "stream": 60,
    "correlation": 136816,
    "registers per thread": 96,
    "shared memory": 89296,
    "blocks per SM": 0.222222,
    "warps per SM": 3.777778,
    "grid": [24, 1, 1],
    "block": [544, 1, 1],
    "est. achieved occupancy %": 0,
    "Collective name": "allreduce",
    "In msg nelems": 6291456,
    "Out msg nelems": 6291456,
    "Group size": 2,
    "dtype": "BFloat16",
    "In split size": "[]",
    "Out split size": "[]",
    "Process Group Name": "27",
    "Process Group Description": "undefined",
    "Process Group Ranks": "[0, 1]"
  }
}

The kernel's args include "Group size," "Process Group Name," "Process Group Description," and "Process Group Ranks."

Most of the information, except for Process Group Name, is redundant since it is already defined in the metadata, as shown in the example below.

"distributedInfo": {
  "backend": "nccl",
  "rank": 0,
  "world_size": 8,
  "pg_count": 67,
  "pg_config": [
    {"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "pg_size": 8, "ranks": [0, 1, 2, 3, 4, 5, 6, 7]},
    {"pg_name": "1", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 2]},
    {"pg_name": "2", "pg_desc": "undefined", "backend_config": "cpu:gloo,cuda:gloo", "pg_size": 2, "ranks": [0, 2]},
    {"pg_name": "9", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 2]},
    {"pg_name": "10", "pg_desc": "undefined", "backend_config": "cpu:gloo,cuda:gloo", "pg_size": 2, "ranks": [0, 2]},
    {"pg_name": "17", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 1, "ranks": [0]},
    {"pg_name": "25", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 4, "ranks": [0, 1, 4, 5]},
    {"pg_name": "27", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 1]},
    {"pg_name": "31", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 4]},
    {"pg_name": "32", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 4]},
    {"pg_name": "33", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 1, "ranks": [0]},
    {"pg_name": "43", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 4, "ranks": [0, 1, 2, 3]},
    {"pg_name": "45", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 4, "ranks": [0, 1, 2, 3]},
    {"pg_name": "47", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 1]},
    {"pg_name": "51", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 1, "ranks": [0]},
    {"pg_name": "59", "pg_desc": "undefined", "backend_config": "cuda:nccl", "pg_size": 2, "ranks": [0, 2]},
    {"pg_name": "60", "pg_desc": "undefined", "backend_config": "cpu:gloo,cuda:gloo", "pg_size": 2, "ranks": [0, 2]}
  ],
  "nccl_version": "2.22.3"
}
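
For illustration only (not part of this PR), below is a minimal Python sketch that resolves each kernel's "Process Group Name" against pg_config. The file path follows the Test Plan below, and the assumption that the kernel events and distributedInfo live in the same Kineto trace file under a traceEvents list is mine:

import json

# Kineto device trace for rank 0; path taken from the Test Plan below.
with open("gpt3_126m_1.1.0-chakra.0.0.4/kineto_0.json") as f:
    trace = json.load(f)

# Map pg_name -> communicator group metadata from distributedInfo.pg_config.
pg_by_name = {pg["pg_name"]: pg for pg in trace["distributedInfo"]["pg_config"]}

# Correlate each NCCL kernel with its communicator group via "Process Group Name".
for event in trace.get("traceEvents", []):
    args = event.get("args", {})
    pg_name = args.get("Process Group Name")
    if event.get("cat") == "kernel" and pg_name in pg_by_name:
        pg = pg_by_name[pg_name]
        print(f'{event["name"]}: ranks={pg["ranks"]}, pg_size={pg["pg_size"]}')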

This PR allows users to identify the process_group:init operator by explicitly classifying that node as a METADATA node. It also explicitly encodes pg_name as an attribute of collective communication operators. Finally, it updates ETFeeder so that simulators can parse and access the pg_name field easily.

Test Plan

Generate Chakra HDT traces.

for rank in 0 1 2 3 4 5 6 7; do
    chakra_trace_link --chakra-host-trace gpt3_126m_1.1.0-chakra.0.0.4/et_${rank}.json --chakra-device-trace gpt3_126m_1.1.0-chakra.0.0.4/kineto_${rank}.json --output-file gpt3_126m_1.1.0-chakra.0.0.4/rank_${rank}.json   
    chakra_converter PyTorch --input gpt3_126m_1.1.0-chakra.0.0.4/rank_${rank}.json --output gpt3_126m_1.1.0-chakra.0.0.4/rank.${rank}.et
done
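
As a quick sanity check on the linked traces (an aside, not part of the original test plan), one can scan rank_0.json for nodes that carry a pg_name attribute. The layout assumed below, a top-level "nodes" list where each node has an "attrs" list of name/value pairs, reflects my understanding of the PyTorch execution trace schema and may differ between schema versions:

import json

# Linked host+device trace produced by chakra_trace_link above.
with open("gpt3_126m_1.1.0-chakra.0.0.4/rank_0.json") as f:
    et = json.load(f)

# Assumed layout: each node carries an "attrs" list of {"name": ..., "value": ...}.
for node in et.get("nodes", []):
    attrs = {a.get("name"): a.get("value") for a in node.get("attrs", [])}
    if "pg_name" in attrs:
        print(node.get("name"), "-> pg_name:", attrs["pg_name"])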

Check through Jsonizer

for rank in 0 1 2 3 4 5 6 7; do
    chakra_jsonizer --input_filename gpt3_126m_1.1.0-chakra.0.0.4/rank_${rank}.json --output_filename gpt3_126m_1.1.0-chakra.0.0.4/rank.${rank}.json
done

Test ETFeeder with ASTRA-Sim

mv ../gpt3_126m_1.1.0-chakra.0.0.4/ ./gpt3 
./build/astra_analytical/build/bin/AstraSim_Analytical_Congestion_Unaware --workload-configuration=/home/un-gpu/Project/jpark/astra-sim/gpt3/rank --system-configuration=./inputs/system/Ring.json --network-configuration=./inputs/network/analytical/Ring.yml --remote-memory-configuration=./inputs/remote_memory/analytical/no_memory_expansion.json

Code in ASTRA-Sim using ETFeeder

  // Print the communicator group (pg_name) of every GPU collective node.
  if (!node->is_cpu_op() && node->type() == ChakraNodeType::COMM_COLL_NODE) {
    if (!node->pg_name().empty()) {
      cout << "Node Name: " << node->name() << endl;
      cout << "Process Group Name: " << node->pg_name() << endl;
    }
  }

Trace

The traces were collected with PyTorch execution trace schema 1.1.0-chakra.0.0.4.
gpt3_126m_1.1.0-chakra.0.0.4.zip

@JoongunPark requested a review from a team as a code owner on July 24, 2024 at 18:15

github-actions bot commented Jul 24, 2024

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@TaekyungHeo added the enhancement (New feature or request) label on Jul 24, 2024
@TaekyungHeo changed the title from "Encoding Process Group information in Chakra traces" to "Encode communicator groups in Chakra traces" on Jul 26, 2024
Contributor

TaekyungHeo commented Jul 26, 2024

Thank you for your contribution, @JoongunPark.

  1. I have updated the PR summary. Please review it and update it if needed.
  2. If you recall the PyTorch version that you used, please add it to the PR summary. Some users want to know the exact version number.
  3. I have merged "Identify process group init nodes as METADATA nodes" (#109) into this PR. Please check if the updated PR works for ASTRA-sim.
  4. Please rebase your PR to the latest main branch.

@JoongunPark
Contributor Author

Thank you for your review, @TaekyungHeo!
All of your action items have been addressed.
Please check that the changes meet the requirements!

Below is the output from the trace linker and converter.

[2024-07-26 15:51:34,182] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_0.json.
[2024-07-26 15:51:34,182] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:51:40 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.0.et.
[2024-07-26 15:51:50,943] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_1.json.
[2024-07-26 15:51:50,943] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:51:56 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.1.et.
[2024-07-26 15:52:07,224] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_2.json.
[2024-07-26 15:52:07,224] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:52:12 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.2.et.
[2024-07-26 15:52:22,995] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_3.json.
[2024-07-26 15:52:22,995] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:52:28 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.3.et.
[2024-07-26 15:52:39,483] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_4.json.
[2024-07-26 15:52:39,483] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:52:45 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.4.et.
[2024-07-26 15:52:56,729] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_5.json.
[2024-07-26 15:52:56,729] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:53:02 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.5.et.
[2024-07-26 15:53:13,582] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_6.json.
[2024-07-26 15:53:13,582] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:53:19 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.6.et.
[2024-07-26 15:53:30,121] trace_link.py:48 [INFO]: Linking process successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank_7.json.
[2024-07-26 15:53:30,121] trace_link.py:49 [INFO]: Please run the chakra_converter for further postprocessing.
INFO [07/26/2024 03:53:35 PM] Conversion successful. Output file is available at gpt3_126m_1.1.0-chakra.0.0.4/rank.7.et.

I also verified that ASTRA-Sim prints pg_name using the code above.

Node Name:  ncclDevKernel_SendRecv(ncclDevKernelArgsStorage<4096ul>)
Process Group Name: 31
Node Name:  ncclDevKernel_SendRecv(ncclDevKernelArgsStorage<4096ul>)
Process Group Name: 34
Node Name:  ncclDevKernel_SendRecv(ncclDevKernelArgsStorage<4096ul>)
Process Group Name: 40
Node Name:  ncclDevKernel_SendRecv(ncclDevKernelArgsStorage<4096ul>)
Process Group Name: 37
Node Name:  ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<4096ul>)
Process Group Name: 28

...

Contributor Author

JoongunPark commented Aug 27, 2024

Hello!
I wanted to check whether there are any further discussions or concerns regarding this PR. This update is crucial for proper communication simulation support in ASTRA-Sim, and I hope we can proceed with merging it soon.

@srinivas212 merged commit 73edb74 into mlcommons:main on Sep 6, 2024
9 checks passed
@github-actions bot locked and limited conversation to collaborators on Sep 6, 2024