[PTI-SDK][Discussion] Handling of host events with PTI-SDK PoC #51

Open
Thyre opened this issue Dec 19, 2023 · 2 comments

Thyre commented Dec 19, 2023

Handling of host events with Level Zero PTI-SDK PoC

This issue is meant as a discussion: we would like to know how potential host events will be handled in PTI-SDK so that we can prepare.

What is the current situation in the PTI-SDK PoC (as of December 19th, 2023)?

PTI-SDK PoC offers a simple interface for potential tools. Put simply, a tool registers two callback functions: one returns a buffer upon request, and the other is called when that buffer is flushed.
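The two-callback model can be sketched without the real library. The following is a minimal, self-contained mock and not the PTI-SDK API: `mock_set_callbacks`, `mock_flush`, `mock_record` and the `tool_*` functions are hypothetical names invented for illustration, and the record layout is an assumption. The real registration call and record types should be taken from the PTI-SDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout; real PTI-SDK records look different. */
typedef struct {
    int                kind;
    unsigned long long start_ns;
    unsigned long long end_ns;
} mock_record;

typedef void (*buffer_requested_fn)(unsigned char **buf, size_t *size);
typedef void (*buffer_completed_fn)(unsigned char *buf, size_t size, size_t used);

buffer_requested_fn on_requested;
buffer_completed_fn on_completed;

/* "Library" side: remember the two callbacks provided by the tool. */
void mock_set_callbacks(buffer_requested_fn req, buffer_completed_fn done)
{
    assert(req != NULL && done != NULL);
    on_requested = req;
    on_completed = done;
}

/* "Library" side: ask the tool for a buffer, fill it, hand it back. */
void mock_flush(const mock_record *records, size_t count)
{
    unsigned char *buf  = NULL;
    size_t         size = 0;
    on_requested(&buf, &size);

    size_t fit = (buf != NULL) ? size / sizeof(mock_record) : 0;
    if (count > fit) {
        count = fit; /* drop what does not fit; a real library would request more buffers */
    }
    size_t used = count * sizeof(mock_record);
    if (used > 0) {
        memcpy(buf, records, used);
    }
    on_completed(buf, size, used);
}

/* Tool side: state updated while walking flushed buffers. */
size_t             tool_records_seen;
unsigned long long tool_total_ns;

void tool_buffer_requested(unsigned char **buf, size_t *size)
{
    *size = 16 * sizeof(mock_record);
    *buf  = malloc(*size);
    if (*buf == NULL) {
        *size = 0;
    }
}

void tool_buffer_completed(unsigned char *buf, size_t size, size_t used)
{
    (void)size;
    /* Walk only the valid part of the buffer, one record at a time. */
    for (size_t off = 0; off + sizeof(mock_record) <= used; off += sizeof(mock_record)) {
        mock_record rec;
        memcpy(&rec, buf + off, sizeof rec);
        tool_records_seen++;
        tool_total_ns += rec.end_ns - rec.start_ns;
    }
    free(buf);
}
```

The point of the sketch is that the tool only sees records once the buffer is handed back, which is what makes this model awkward for host events.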

Tools can enable or disable individual parts of this interface so that they only see the operations of interest. For us, the following view kinds are the interesting ones:

```c
void
scorep_level0_event_device_tracing_enable()
{
    UTILS_DEBUG( "Enable tracing views!" );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_KERNEL );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_MEM_COPY );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_MEM_FILL );
    ptiViewEnable( PTI_VIEW_LEVEL_ZERO_CALLS );
    ptiViewEnable( PTI_VIEW_COLLECTION_OVERHEAD );
}
```

We decided against SYCL and OpenCL, since we already have an adapter for OpenCL and would prefer a standardised SYCL adapter at some point.

During the buffer flush event, we receive information about the device, queue and context. This is enough to reconstruct our internal structure and write events.

There is one issue on our side, however: right now, we are not able to write a profile or trace successfully. This comes down to a single problem: there are no host events (with PTI-SDK only)!

How Score-P handles accelerators in other adapters

I mostly work on our OpenMP adapter, including support for OpenMP offloading, but I will try to explain this as well as I can.

Score-P includes several adapters for accelerator libraries, including ROCprofiler/ROCtracer, CUPTI and (in development) OpenMP offload. All of these adapters follow a principle similar to the PTI-SDK PoC: there is some kind of buffer in which events are stored. At some point, this buffer is flushed and we can write the events to locations based on streams, contexts and so on.

For this, devices need to be known before we write the events. OpenMP offload is especially tricky, since events arrive on threads not known to Score-P (essentially helper threads). Here, the libraries diverge a bit, but the idea is the same in principle: callbacks that are triggered on the host.

OpenMP offload takes the simplest approach. At some point a device needs to be initialized, and we get an ompt_callback_device_initialize callback with all required information. For CUPTI, we register a callback via cuptiSubscribe; for ROCtracer, we use roctracer_enable_op_callback. Whenever a callback is invoked, we try to find the context/stream and create our internal structures if they are not found.
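The "find the context/stream, create it if missing" step can be sketched as follows. This is a minimal illustration under stated assumptions, not Score-P's actual implementation: `queue_entry`, `get_or_create_queue` and `on_host_api_enter` are hypothetical names, handles are modelled as plain integers, and the table is not thread-safe (a real adapter would need locking).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_QUEUES 64

/* Hypothetical internal structure: one entry per (context, queue) pair. */
typedef struct {
    uintptr_t context;
    uintptr_t queue;
    int       location_id; /* stands in for a tool-internal location */
} queue_entry;

queue_entry queue_table[MAX_QUEUES];
size_t      queue_count;

/* Return the existing entry for (context, queue), or create a new one.
 * Returns NULL if the table is full. */
queue_entry *get_or_create_queue(uintptr_t context, uintptr_t queue)
{
    for (size_t i = 0; i < queue_count; i++) {
        if (queue_table[i].context == context && queue_table[i].queue == queue) {
            return &queue_table[i];
        }
    }
    if (queue_count == MAX_QUEUES) {
        return NULL;
    }
    queue_entry *entry = &queue_table[queue_count];
    entry->context     = context;
    entry->queue       = queue;
    entry->location_id = (int)queue_count;
    queue_count++;
    return entry;
}

/* Hypothetical host-side callback, invoked when the application enters an
 * API call that references a context and a queue. */
void on_host_api_enter(uintptr_t context, uintptr_t queue)
{
    queue_entry *entry = get_or_create_queue(context, queue);
    assert(entry != NULL); /* table exhausted */
    (void)entry;           /* a real adapter would write the host event here */
}
```

Because the lookup runs inside a host callback, the structures exist before any device events for that queue show up in a flushed buffer.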

In the case of the PTI-SDK PoC, there is no such thing (yet). There are only events in a buffer related to the devices. All host events would need to be registered through the low-level Level0 interface, which seems counterintuitive.

Questions

Will PTI-SDK handle any kind of host events, similar to CUPTI, rocTracer and other frameworks?

In the current state, tool developers would need to implement parts of both the Level0 interface and PTI-SDK to get a functional adapter. To be honest, that is still easier than implementing everything with Level0 alone. If that's the plan going forward, there should be at least a short guide on how to implement things. The examples in this repository can be overwhelming to look at. The Tools Programming Guide here doesn't help either, especially since API Tracing, which would be the most interesting section for us, is being deprecated. The new (?) interface can instead be found hidden in the Level0 repository (see here).

How will those host events be delivered to the tool?

Looking at _pti_view_kind, I fear that we will receive host events the same way we get accelerator events: in a buffer, at some point during program execution. Put simply, this will not work for our tool, since we require the events of a location to be added in timestamp order. PTI-SDK would be the exception here, with all other APIs delivering the events on time.
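To illustrate the ordering constraint, here is a minimal sketch of a location that only accepts events in non-decreasing timestamp order. It is an assumption-laden model, not Score-P's real interface: `location` and `location_write_event` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a location that records events in timestamp order. */
typedef struct {
    uint64_t last_timestamp;
    size_t   events_written;
} location;

/* Append an event to the location. Returns false (and writes nothing) if
 * the event is older than the last event already written. */
bool location_write_event(location *loc, uint64_t timestamp)
{
    assert(loc != NULL);
    if (timestamp < loc->last_timestamp) {
        return false;
    }
    loc->last_timestamp = timestamp;
    loc->events_written++;
    return true;
}
```

A host thread that has already written events up to timestamp 300 (for example, from another adapter) cannot accept a buffered host event stamped 200 that only arrives at flush time; that is the situation buffered delivery of host events would create.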

@jfedorov
Contributor

Hello @Thyre,
thank you very much for your feedback!
If I understood you correctly, you are asking about callback APIs like the ones CUPTI has.
If so, this is reasonably high on our list, although at the moment it is not the first priority compared with, for example, overhead and start/stop.

A question for you: are you expecting such callbacks for both the Level0 and SYCL runtime APIs? Or would callbacks for a low-level API such as Level0 suffice within some reasonable time frame?

@Thyre
Author

Thyre commented Dec 23, 2023

Having callbacks for Level0 only would be sufficient, as we want to focus on that first.
For SYCL, we hope that the standard gets a standardized tools API at some point in the future, which isn't the case yet AFAIK.
