
Adding ProfilerService to Node for diagnostics. Closes #223 #225

Merged: 5 commits into main, Aug 7, 2023

Conversation

edavalosanaya
Collaborator

Added a ProfilerService for the Node and made non-breaking changes to DataChunk, ProcessorService, and PollerService to compute latency and payload sizes.

Still need to add a method to enable/disable diagnostic logging from the front-facing API of the Engine.

…nges in non-breaking DataChunk, ProcessorService, and PollerService to compute latency and payload sizes.
@edavalosanaya edavalosanaya self-assigned this Aug 5, 2023
…low making unittest transferring the data from the Node->Worker->Manager.
… to the Orchestrator, the NodeDiagnostics dataclass was embedded to the NodeState.
@edavalosanaya edavalosanaya marked this pull request as ready for review August 7, 2023 01:41
@edavalosanaya edavalosanaya merged commit 662d976 into main Aug 7, 2023
2 checks passed
@edavalosanaya edavalosanaya deleted the 223-diagnostics-service branch August 7, 2023 01:41
@edavalosanaya
Collaborator Author

These are the new configuration options in the chimerapyrc.yaml:

diagnostics:
  deque-length: 10000
  interval: 10
  logging-enabled: false

Instead of adding a new parameter to the class constructor, you can enable the logging of diagnostics info via the logging-enabled variable. The diagnostics information is transmitted from Node->Worker->Manager->Orchestrator at a fixed interval, set via the interval config variable (not sure whether 10 seconds is too fast or just right).
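The interval-based reporting described above can be sketched as a background asyncio task. This is a minimal illustration, not the actual ChimeraPy code; the `collect` and `send` callables stand in for whatever the real ProfilerService and Worker expose:

```python
import asyncio

# Hypothetical sketch of interval-based diagnostics reporting: a background
# task wakes up every `interval` seconds and forwards a diagnostics snapshot
# up the Node->Worker->Manager chain, until `stop` is set.
async def report_loop(interval: float, collect, send, stop: asyncio.Event):
    while not stop.is_set():
        await asyncio.sleep(interval)
        send(collect())

# Usage (names are illustrative):
#   asyncio.create_task(report_loop(10, profiler.snapshot, worker.send, stop_event))
```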
These are the agreed metrics (NodeDiagnostics):

latency (ms)
payload_size (KB)
memory_usage (KB)
cpu_usage (%)
num_of_steps (int)
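The agreed metrics could be carried in a dataclass along these lines. This is a hypothetical sketch whose field names simply mirror the list above; the actual NodeDiagnostics definition in the codebase may differ:

```python
from dataclasses import dataclass

# Hypothetical sketch of the NodeDiagnostics dataclass (fields mirror the
# agreed metrics; not necessarily the actual definition).
@dataclass
class NodeDiagnostics:
    latency: float = 0.0       # average step latency, in ms
    payload_size: float = 0.0  # total payload transmitted, in KB
    memory_usage: float = 0.0  # process memory usage, in KB
    cpu_usage: float = 0.0     # process CPU utilization, in %
    num_of_steps: int = 0      # steps completed during the interval
```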

In the Latency Node implementation, all the metrics were computed during the step. Now, in the ProfilerService, only the latency and payload size information is extracted during the step. Memory, CPU, and num_of_steps are determined at fixed intervals (with an async timer). Latency and payload information are computed after receiving data from predecessor nodes, i.e., source nodes don't have these metrics, while step and sink nodes do.

Since multiple steps might occur during the fixed interval, the DataChunk's payload size and latency metrics are stored in deques. The reported latency is the average of all recorded instances, and the reported payload size is the sum of all the DataChunks' payloads.
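The per-interval aggregation described above boils down to an average over one deque and a sum over the other. A minimal sketch, assuming a hypothetical `summarize` helper rather than the actual ProfilerService API:

```python
from collections import deque

# Hypothetical helper showing the aggregation rule: latency is averaged
# over all recorded steps, payload size is summed across all DataChunks
# seen during the interval.
def summarize(latencies: deque, payload_sizes: deque):
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    total_payload = sum(payload_sizes)
    return avg_latency, total_payload

# maxlen corresponds to the deque-length config option above
latencies = deque([2.0, 4.0, 6.0], maxlen=10000)  # ms
payloads = deque([1.5, 0.5, 2.0], maxlen=10000)   # KB
summarize(latencies, payloads)  # → (4.0, 4.0)
```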
