Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/trace events #268

Draft
wants to merge 31 commits into
base: v0.7-dev
Choose a base branch
from

Conversation

westoncadena
Copy link

Added Tracing capabilities to YGM's communicator across a job. This includes:

  • Tracing every ygm::async() call
  • Tracing MPI sends
  • Tracing MPI receives

Weston James Cadena and others added 30 commits May 31, 2024 16:41
…ration_events

adding changes from v0.7-dev to keep feature/duration_events up to date
Copy link
Collaborator

@steiltre steiltre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. One feature that would be good to add is the ability to turn tracing on and off from the comm so that we can set up tracing at a more granular level.

Not requesting any changes, but we may want to consider other levels of tracing information. We currently have coarse-grained tracing at the level of MPI operations and fine-grained tracing at the level of individual async calls in YGM and their associated receipt and processing at their destination, but we do not have any tracking of individual active messages as they are routed through processes. This may generate an amount of data we don't want to handle, but may be useful information for us.

<< "async_counts = " << stats.get_async_count()
<< std::endl;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a block of debug code

m_send_buffer_bytes += header_bytes;
}

// TODO: Add tracing header
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer a TODO?


std::memcpy(&*iter, &bytes,
sizeof(header_t::dest)); // TODO:: TYPO? should it be
// header_t::message_size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are right!

@@ -4,6 +4,7 @@
// SPDX-License-Identifier: MIT

#pragma once
#include <iomanip>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be able to remove

if (h.dest == m_layout.rank() || (h.dest == -1 && h.message_size == 0)) {
uint16_t lid;
iarchive.loadBinary(&lid, sizeof(lid));
m_lambda_map.execute(lid, this, &iarchive);
m_recv_count++;
stats.rpc_execute();

// TODO: IMPLEMENTING Async 'e'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still relevant/needing to be done?

m_send_buffer_bytes += header_bytes;

size_t traciing_header_bytes = pack_tracing_header(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

misspelled "tracing"

@@ -908,11 +1066,34 @@ inline void comm::handle_next_receive(std::shared_ptr<std::byte[]> buffer,
flush_to_capacity();
}
} else {
// TODO: load binary for tracing header if it exists
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove TODO

uint16_t lid;
iarchive.loadBinary(&lid, sizeof(lid));
m_lambda_map.execute(lid, this, &iarchive);
m_recv_count++;
stats.rpc_execute();

// TODO: IMPLEMENTING Async 'e'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO still relevant?

@@ -528,6 +638,10 @@ inline void comm::flush_send_buffer(int dest) {
static size_t counter = 0;
if (m_vec_send_buffers[dest].size() > 0) {
mpi_isend_request request;

m_next_message_id += size();
request.id = m_next_message_id;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_next_message_id is being used and manipulated outside of an if (config.trace) block. This made sense for tracking when MPI_Ireceives started and stopped, but currently isn't useful and leads to missing IDs in our trace files.

It is better to have the tracer handle the IDs for events and have the comm request new ones whenever they are needed to avoid increments of m_next_message_id scattered around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants