Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add an option to delete metrics from multiproc files by the provided key #988

Closed
wants to merge 1 commit into from
Closed

feat: Add an option to delete metrics from multiproc files by the provided key #988

wants to merge 1 commit into from

Conversation

alekseiapa
Copy link

Changes Made: This merge request introduces a new feature in the multiprocess mode that allows metrics deletion. The new functionality is encapsulated in a method called delete_value within the MmapedDict class.

Motivation: Until now, there was no option to remove individual metrics once they are written in multiprocess mode. It limited the flexibility of our system in dynamic scenarios where metrics might need to be removed post-write based on certain conditions. This created unnecessary clutter and could lead to memory inefficiencies over time.

How it Works: The delete_value method takes a key as an argument, which corresponds to the metric you want to delete. If the metric exists, it removes it from the dictionary and decreases the overall used memory size accordingly.

Testing: New unittest cases have been added in TestMmapedDict. These tests verify correct deletion of existing metrics, handle attempts of deleting non-existing keys, check the unaffected state of other metrics upon deletion of one, and write-new after delete operations.

Impact: The new deletion feature will allow more flexible and efficient handling of metrics in multiprocess mode. However, it should be used judiciously considering the operation involves memory manipulation.

The update doesn't affect the existing feature set and maintains backward compatibility.

@alekseiapa
Copy link
Author

Hi @csmarchbanks ! May I ask you to take a look at the PR ? Thanks in advance.

Copy link
Member

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I don't think this is safe, if a scrape came along on a different process while writing the left/right data I think it could end up being corrupted?

If that is true a different approach could be taken, for example we could use a specific NaN value to signal that a metric has been deleted, possibly 0x7ff0000000000002 since that is the stale NaN used in Prometheus. That wouldn't lower the disk usage but would be safer.

@alekseiapa
Copy link
Author

alekseiapa commented Dec 20, 2023

🤔 I don't think this is safe, if a scrape came along on a different process while writing the left/right data I think it could end up being corrupted?

If that is true a different approach could be taken, for example we could use a specific NaN value to signal that a metric has been deleted, possibly 0x7ff0000000000002 since that is the stale NaN used in Prometheus. That wouldn't lower the disk usage but would be safer.

@csmarchbanks Thank you for looking into this. I do not think that would be corrupted because we lock the deletion process of a specific metric. In addition, I am making this request so that I do not need to monkey patch the values module with a customized method. I have been using this approach for a few months and have not encountered any problems yet. Of course we can apply the Nan value approach. I'll take a look at this

@csmarchbanks
Copy link
Member

I believe the lock is only in the process updating, not in the process doing the read so unless we can guarantee an atomic write I have concerns about this. It would be fairly rare to have happen (metric deletion during a scrape).

Another approach would be to create a new file during deletion and then move it to the same name as the old file, like what is down for the textfile. That should be atomic and would clean up disk space in the long term (though double it in the short term).

@alekseiapa
Copy link
Author

@csmarchbanks Hi. Thank you for your review. I went with the temporary file approach. May I ask you to take another look?

@alekseiapa
Copy link
Author

@csmarchbanks Hi. Could you please review the changes ?

Copy link
Member

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! I left one comment, but haven't had the time to fully test this yet due to some leave. Hopefully next week.

prometheus_client/values.py Show resolved Hide resolved
…vided key

Signed-off-by: Aleksei A <aapaseev@bk.ru>
@alekseiapa
Copy link
Author

Hi, @csmarchbanks . Thanks for the review. Could you please give it another look ? Thanks in advance.

@alekseiapa
Copy link
Author

Hi @csmarchbanks . The suggested change has been applied

Copy link
Member

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I gave this a further review today and have a couple more questions/concerns:

  1. How do you intend to use this? To fully remove a metric I think each PID would have to do the removal. For example, add a second PID into one of your tests.
    • At a minimum we would need documentation that it is necessary to delete each PID individually, but it might just be too confusing/prone to errors.
  2. I don't think this is safe to do for counters, and could cause an inappropriate counter reset. This is why mark_process_dead only removes gauges.

@anChaOs
Copy link

anChaOs commented Feb 1, 2024

Hi, @csmarchbanks .

I think this pr is a possible solution to solve this issue(#916 ).
When it is necessary to call .remove() and .clear(), just call .delete() first.

Is that right?

@csmarchbanks
Copy link
Member

Yes, it would be a solution to those, or at least .remove(). If we can figure out the edge cases.

@alekseiapa
Copy link
Author

Hi, @csmarchbanks .

I think this pr is a possible solution to solve this issue(#916 ). When it is necessary to call .remove() and .clear(), just call .delete() first.

Is that right?

Hi @anChaOs @csmarchbanks I use this approach in my env for the Gauge type of metrics only. So should I leave the functionality for this type only?

@csmarchbanks
Copy link
Member

Probably only Gauge to start with. I am also not sure if it should be implemented for any of the Gauge aggregations? Those still behave similarly where a removal of just one pid might be confusing when the output is aggregated.

I am also torn about just implementing remove instead of the new delete function. Keeping remove might be nice to avoid any new surface area?

@alekseiapa alekseiapa closed this by deleting the head repository Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants