Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis #57

Open
wants to merge 5 commits into main
Conversation

JoongunPark
Contributor

@JoongunPark JoongunPark commented May 10, 2024

Summary

This PR introduces dependencies from GPU operators to CPU operators using the critical path analysis in HolisticTraceAnalysis (HTA). In Chakra's simulation flow, postprocessors such as the trace linker and the converter merge Chakra host traces with Chakra device traces and encode dependencies between operators. Currently, the dependencies encoded by these postprocessors run only from CPU operators to GPU operators. However, dependencies can also run in the other direction, namely when a CPU operator must wait for a GPU operator to complete. To identify such cases, this PR uses the synchronization dependencies produced by HTA's critical path analysis: a synchronization dependency is recorded when a CPU operator has to wait for a dispatched GPU operator to finish, which is exactly the relationship we want to capture.
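
For illustration, here is a minimal sketch (the workload below is hypothetical, not the workload used in this PR) of a pattern that produces such a GPU-to-CPU synchronization dependency: the CPU blocks in cudaDeviceSynchronize until the dispatched GPU kernels have completed.

import torch

def workload() -> float:
    # GPU kernels are dispatched asynchronously with respect to the CPU.
    a = torch.randn(4096, 4096, device="cuda")
    b = a @ a
    # The CPU now blocks until the GPU work finishes (cudaDeviceSynchronize),
    # which HTA reports as a synchronization dependency (GPU -> CPU).
    torch.cuda.synchronize()
    return b.sum().item()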

Please note that the Kineto trace must be collected with CUDA synchronization events enabled so that HTA can recover these dependencies:

from torch.autograd.profiler import _ExperimentalConfig, profile

with profile(
    use_kineto=True,
    use_cuda=True,
    # Required so Kineto records the CUDA sync events that HTA relies on.
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    workload()
prof.export_chrome_trace("kineto_0.json")  # output path is illustrative

Test Plan

Download and Install HTA.

git clone https://github.com/facebookresearch/HolisticTraceAnalysis.git
cd HolisticTraceAnalysis
git checkout d731cc2e2249976c97129d409a83bd53d93051f6
git submodule update --init
pip install -r requirements.txt
pip install -e .

Next, you need to collect traces by following the instructions here: pytorch/pytorch#105187.

After that, you can load sync dependencies and print them out with the following script:

import argparse
import logging
import os
from typing import Dict, List

from hta.analyzers.critical_path_analysis import CPEdgeType
from hta.trace_analysis import TraceAnalysis

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def load_sync_dependencies(
    rank: int, kineto_file: str, annotation: str = "ProfilerStep", instance_id: int = 0
) -> Dict[int, List[int]]:
    """
    Load synchronization dependencies using Holistic Trace Analysis (HTA).

    Args:
        rank (int): Rank for the input Kineto trace.
        kineto_file (str): Path to the Kineto trace file.
        annotation (str): Annotation to use for the analysis. Defaults to "ProfilerStep".
        instance_id (int): Instance ID for the analysis. Defaults to 0.

    Returns:
        Dict[int, List[int]]: A dictionary mapping end event's external ID to a list of start event's external IDs
            that have synchronization dependencies.
    """
    sync_dependencies = {}
    trace_analysis = TraceAnalysis(trace_dir=os.path.dirname(kineto_file))
    cp_graph, success = trace_analysis.critical_path_analysis(rank=rank, annotation=annotation, instance_id=instance_id)
    if not success:
        logger.error("Failed to load Critical Path Graph")
        return sync_dependencies

    raw_events = trace_analysis.t.get_raw_trace_for_one_rank(rank=rank)["traceEvents"]
    for edge in cp_graph.critical_path_edges_set:
        if edge.type == CPEdgeType.SYNC_DEPENDENCY:
            start_event_id, end_event_id = cp_graph.get_events_for_edge(edge)
            start_event, end_event = raw_events[start_event_id], raw_events[end_event_id]
            if "External id" in end_event["args"] and "External id" in start_event["args"]:
                start_event_external_id = start_event["args"]["External id"]
                end_event_external_id = end_event["args"]["External id"]
                start_event_name = start_event["name"]
                end_event_name = end_event["name"]
                if start_event_external_id != end_event_external_id:
                    # Record the mapping from the end event's external ID to the start events' external IDs.
                    sync_dependencies.setdefault(end_event_external_id, []).append(start_event_external_id)
                    print(
                        f"start_event_id {start_event_id}, end_event_id {end_event_id}, "
                        f"start_event_external_id {start_event_external_id}, end_event_external_id {end_event_external_id}, "
                        f"start_event_name '{start_event_name}', end_event_name '{end_event_name}'"
                    )
            else:
                logger.warning(
                    f"Synchronization dependency from event {start_event_id} to event {end_event_id} will "
                    "not be considered due to missing external IDs."
                )
    return sync_dependencies


def main() -> None:
    """
    Main function to parse arguments and load synchronization dependencies.
    """
    parser = argparse.ArgumentParser(description="Load and print synchronization dependencies from Kineto traces.")
    parser.add_argument("--input", type=str, required=True, help="Path to the Kineto trace file.")
    parser.add_argument("--rank", type=int, required=True, help="Rank for the input traces.")
    args = parser.parse_args()

    load_sync_dependencies(args.rank, args.input)


if __name__ == "__main__":
    main()

You can run it with the following command:

$ python sync_dep.py --input ~/Downloads/cuda-sync/kineto_0.json --rank 0 > /tmp/out

cuda-sync.zip

/tmp/out

start_event_id 24868, end_event_id 24874, start_event_external_id 94785, end_event_external_id 94792, start_event_name 'void multi_tensor_apply_kernel<TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<6>, DistAdamWithParamRemaindersFunctor<float>, float*, float, float, float, float, float, float, adamMode_t, float)', end_event_name 'cudaDeviceSynchronize'
start_event_id 24536, end_event_id 24650, start_event_external_id 13847, end_event_external_id 91874, start_event_name 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)', end_event_name 'cudaDeviceSynchronize'

Two synchronization dependencies are identified with the script. In this test, we focus on the dependency between 'ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)' and 'cudaDeviceSynchronize'.

Let's confirm this observation with a trace visualizer. You can view Kineto traces with https://perfetto.dev/. Searching for ncclDevKernel_ReduceScatter_Sum_f32_RING_LL shows that it is a GPU kernel (category field) with an external ID of 13847. Near that kernel, in the CPU row of the visualization, you can find cudaDeviceSynchronize with an external ID of 94792; it is a cuda_runtime operator. Because cuda_runtime operators are not considered simulatable in the toolchains, the closest subsequent CPU operator, aten::empty, with an external ID of 16392, should depend on the GPU kernel, as sketched below.
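
To make that mapping concrete, here is a rough sketch, not the trace linker's actual implementation, of how a sync dependency reported on a cuda_runtime call could be re-attached to the closest subsequent CPU operator (the event fields follow the Kineto JSON layout; the helper name is made up):

from typing import Any, Dict, List, Optional

def closest_cpu_op_after(sync_end_ts: int, cpu_ops: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    # cpu_ops is assumed to hold Kineto-style CPU operator events with a "ts"
    # start timestamp (microseconds); cuda_runtime events are excluded so the
    # dependency lands on a simulatable operator such as aten::empty.
    candidates = [op for op in cpu_ops if op["ts"] >= sync_end_ts]
    return min(candidates, key=lambda op: op["ts"]) if candidates else None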

Let's see whether the synchronization dependency is properly encoded by chakra_trace_link. Make sure you have installed Chakra:

$ pip install .

Run chakra_trace_link:

chakra_trace_link \
  --pytorch-et-file /Users/theo/Downloads/et_0.json \
  --kineto-file /Users/theo/Downloads/kineto_0.json \
  --output-file ~/megatron_0.json \
  --rank 0

You can open ~/megatron_0.json and confirm that the sync dependency is encoded via the sync_dep field:

        {
            "id": 15899,
            "name": "ncclDevKernel_ReduceScatter_Sum_f32_RING_LL(ncclDevComm*, unsigned long, ncclWork*)",
            "ctrl_deps": 15898,
            "inputs": {
                "values": [
                    [
                        87,
                        49,
                        576716800,
                        52428800,
                        4,
                        "cuda:0"
                    ],
                    [
                        90,
                        49,
                        629145600,
                        52428800,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    [
                        52428800
                    ],
                    [
                        52428800
                    ]
                ],
                "types": [
                    "Tensor(float)",
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [],
                "shapes": [],
                "types": []
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13651
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": -1
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 7
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 4
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": ""
                }
            ],
            "inclusive_dur": 44160,
            "exclusive_dur": 44160,
            "ts": 1719249141376319,
            "inter_thread_dep": 15685,
            "cat": "kernel",
            "ph": "X",
            "stream": 64
        },
        {
            "id": 15982,
            "name": "aten::detach",
            "ctrl_deps": 29,
            "inputs": {
                "values": [
                    [
                        20832,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "outputs": {
                "values": [
                    [
                        20842,
                        1363,
                        0,
                        1,
                        4,
                        "cuda:0"
                    ]
                ],
                "shapes": [
                    []
                ],
                "types": [
                    "Tensor(float)"
                ]
            },
            "attrs": [
                {
                    "name": "rf_id",
                    "type": "uint64",
                    "value": 13722
                },
                {
                    "name": "fw_parent",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "seq_id",
                    "type": "int64",
                    "value": 19404
                },
                {
                    "name": "scope",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "tid",
                    "type": "uint64",
                    "value": 1
                },
                {
                    "name": "fw_tid",
                    "type": "uint64",
                    "value": 0
                },
                {
                    "name": "op_schema",
                    "type": "string",
                    "value": "aten::detach(Tensor(a) self) -> Tensor(a)"
                }
            ],
            "inclusive_dur": 17,
            "exclusive_dur": 11,
            "ts": 1719249141527040,
            "inter_thread_dep": 15901,
            "sync_dep": 15899
        },
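
As a quick sanity check (a small helper of my own, not part of the PR; the top-level "nodes" key follows the PyTorch execution trace layout, and the path is illustrative), you can scan the linked trace for operators that carry a sync_dep field:

import json
import os

with open(os.path.expanduser("~/megatron_0.json")) as f:
    trace = json.load(f)

# Operators in the linked PyTorch execution trace live under "nodes".
for node in trace.get("nodes", []):
    if "sync_dep" in node:
        print(f'node {node["id"]} ({node["name"]}) has sync_dep -> {node["sync_dep"]}')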

Run chakra_converter:

chakra_converter --input_filename ~/megatron_0.json \
    --output_filename megatron_0.chakra \
    --input_type PyTorch \
    --log_filename /tmp/rank_0

Here are the traces that I used:
cuda-sync.zip
Resnet-50.zip
llama2.zip

@JoongunPark JoongunPark requested a review from a team as a code owner May 10, 2024 22:10

github-actions bot commented May 10, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@JoongunPark JoongunPark force-pushed the HolisticTraceAnalysis branch 2 times, most recently from 0076451 to 742abd4 on May 10, 2024 22:41
@JoongunPark
Contributor Author

Hi, how is the review going?

@TaekyungHeo
Contributor

Hi, @JoongunPark. We did not get a chance to review and test because we have an urgent task internally. Thank you for your patience.

@srinivas212
Contributor

@JoongunPark - we may need 1-2 more weeks since we are setting up integration tests internally as we speak. We will try to expedite this asap. Thank you for your patience.

@JoongunPark JoongunPark mentioned this pull request Jun 25, 2024
@TaekyungHeo TaekyungHeo changed the title Support Synchronization Dependency with Holistic Trace Analysis Support Synchronization Dependency with HolisticTraceAnalysis Jun 25, 2024
@TaekyungHeo TaekyungHeo changed the title Support Synchronization Dependency with HolisticTraceAnalysis Support GPU-to-CPU Synchronization Dependency with HolisticTraceAnalysis Jun 25, 2024
@TaekyungHeo TaekyungHeo changed the title Support GPU-to-CPU Synchronization Dependency with HolisticTraceAnalysis Support GPU-to-CPU synchronization dependency with HolisticTraceAnalysis Jun 25, 2024
@TaekyungHeo TaekyungHeo removed their request for review June 25, 2024 22:21
@TaekyungHeo TaekyungHeo added the enhancement New feature or request label Jun 27, 2024
@TaekyungHeo TaekyungHeo force-pushed the HolisticTraceAnalysis branch 2 times, most recently from 0eae1ab to 882e10a on July 1, 2024 12:54
@JoongunPark
Contributor Author

JoongunPark commented Jul 1, 2024

I have tested with Taekyung's latest enhancement. It works well in my environment (Python 3.10.13, Linux 5.15.0-105-generic).
Below is the log that I obtained after converting the traces into Chakra HDT.

Llama2 
Node ID 14063 now has an synchonization dependency on Node ID 13783
Node ID 13782 now has an synchonization dependency on Node ID 13779
Node ID 13784 now has an synchonization dependency on Node ID 13779
Node ID 13837 now has an synchonization dependency on Node ID 13779
Node ID 13852 now has an synchonization dependency on Node ID 13783
Node ID 13779 now has an synchonization dependency on Node ID 13776
Node ID 13781 now has an synchonization dependency on Node ID 13776
Node ID 13831 now has an synchonization dependency on Node ID 13776
Node ID 13849 now has an synchonization dependency on Node ID 13780
Resnet-50
Node ID 4864 now has an synchonization dependency on Node ID 4861
Node ID 4866 now has an synchonization dependency on Node ID 4861
Node ID 5270 now has an synchonization dependency on Node ID 4865

Also, as he mentioned, the code now builds the sync dependency with the closest subsequent CPU operator instead of the cuda_runtime op.

@TaekyungHeo TaekyungHeo force-pushed the HolisticTraceAnalysis branch 3 times, most recently from eba5712 to eff2dc6 on July 3, 2024 09:45
@TaekyungHeo TaekyungHeo force-pushed the HolisticTraceAnalysis branch 3 times, most recently from 3441a2c to c5db738 on July 13, 2024 15:44
TaekyungHeo and others added 5 commits July 23, 2024 13:51
Co-authored-by: Joongun Park <8554137+JoongunPark@users.noreply.github.com>
rvinaybharadwaj pushed a commit to rvinaybharadwaj/chakra that referenced this pull request Sep 23, 2024
Update et_feeder for compatibility with Chakra schema v0.0.4