Metrics Monitoring: A minimal viable product #348
Conversation
Users can now run Clipper as usual and it will spin up a metric monitor, under one condition: there is a Clipper exporter Docker image. <Simon Mo>
`rpc.py` is modified so that the metric monitor is put in place. The metric exporter now runs alongside the model closure and communicates with it via a pipe. The Prometheus client is updated whenever a new replica is added. `docker_metric_utils.py` holds all metric-related code that would otherwise go in `docker_container_manager.py`, to reduce clutter. Prometheus listens on port 1390, a port specified in `container_manager.py` as a constant. The `prometheus_client` package requirement is added in three places: `requirements.txt`, `setup.py`, and `RPCDockerfile`. I still need to work on the integration test though. <Simon Mo>
Changed the corresponding files so the frontend exporter can be built. - `build_docker_image.sh` and `docker_metric_utils.py` agree on the name 'frontend-exporter' - the Python file and Dockerfile are in place <Simon Mo>
This is just a simple modification of the basic_query example. It tests with two replicas to see whether the Prometheus client is monitoring all three node exporters.
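For reference, here is a minimal, self-contained sketch of the pipe-based exporter pattern this commit describes: the model process pushes a metric dict into a multiprocessing pipe, and a prometheus_client collector in a sibling process serves the latest values over HTTP. The wiring, metric names, and use of port 1390 here are illustrative assumptions, not the exact rpc.py code.

from multiprocessing import Pipe, Process
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


class PipeCollector(object):
    """Reports the most recent metric dict received over the pipe."""

    def __init__(self, pipe_conn):
        self.pipe_conn = pipe_conn

    def collect(self):
        latest = None
        while self.pipe_conn.poll():   # drain the pipe, keep only the newest dict
            latest = self.pipe_conn.recv()
        if latest:
            for name, val in latest.items():
                yield GaugeMetricFamily(name=name, documentation=name, value=val)


def run_exporter(pipe_conn, port=1390):
    REGISTRY.register(PipeCollector(pipe_conn))
    start_http_server(port)            # exposes /metrics for Prometheus to scrape
    while True:
        time.sleep(1)


if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    Process(target=run_exporter, args=(child_conn,), daemon=True).start()
    # The model loop would push its metric dict after every prediction:
    parent_conn.send({"model_pred_count": 1})
    time.sleep(5)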
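A rough sketch of the check such a test can perform, assuming Prometheus is reachable on localhost:1390 (the constant mentioned above) and that three exporters (the query frontend plus two model replicas) are expected; the address and the count are assumptions.

import requests


def all_targets_up(prom_addr="http://localhost:1390", expected=3):
    resp = requests.get(prom_addr + "/api/v1/targets")
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]
    healthy = [t for t in targets if t["health"] == "up"]
    return len(healthy) >= expected


if __name__ == "__main__":
    assert all_targets_up(), "Not all exporters are being scraped yet"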
Can one of the admins verify this patch?
jenkins okay to test
This looks like a great MVP for monitoring. I just had a few small comments.
# Metric Section
query_frontend_metric_name = "query_frontend_exporter-{}".format(
    random.randint(0, 100000))
Let's use the same random integer as the query_frontend_name
Fixed.
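One possible shape of that fix, drawing the random suffix once and reusing it for both names (the exact variable names are assumed from the snippet above):

import random

suffix = random.randint(0, 100000)
query_frontend_name = "query_frontend-{}".format(suffix)
query_frontend_metric_name = "query_frontend_exporter-{}".format(suffix)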
@@ -0,0 +1,69 @@
import requests
The containers directory is for stuff related to model containers. Can you create a separate monitoring directory (CLIPPER_ROOT/monitoring) and put this file in it?
Can you also put a short README in the monitoring directory that provides instructions on how to access the Prometheus server once it is up?
Got it! This is a much better idea.
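For the README, one quick way to confirm the server is reachable once it is up (assuming the port 1390 constant mentioned earlier; the expression UI itself is just http://localhost:1390 in a browser):

import requests

resp = requests.get("http://localhost:1390/api/v1/query", params={"query": "up"})
print(resp.json())  # shows which exporters Prometheus currently sees as up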
containers/python/rpc.py
Outdated
def collect(self):
    curr = None
    while self.pipe_conn.poll():
        curr = self.pipe_conn.recv()
If there are multiple items in the pipe, won't we overwrite all except the last item? It seems like curr should be a list you append to?
The purpose is to just get the last item. Prometheus scrapes for the latest metric every 5 seconds, and the metric is updated only on each GET call.
Because the prediction time is variable, in the rpc loop, we push the most recent metric dictionary into the pipe after each prediction call. In the metric collector, we just need to get the most recent one.
I changed the name to lastest_metric_dict for better clarity.
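Sketch of the producer side described here: after each prediction in the RPC loop, a snapshot of the metric dict is pushed into the pipe, and collect() only keeps the newest one. The names (pipe_parent_conn, pred_metric, predict_fn) are illustrative, not the PR's exact identifiers.

def handle_prediction(pipe_parent_conn, pred_metric, predict_fn, inputs):
    outputs = predict_fn(inputs)
    pred_metric["model_pred_count"] += 1
    pipe_parent_conn.send(dict(pred_metric))  # snapshot; the collector drains and keeps the last
    return outputs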
containers/python/rpc.py
Outdated
if curr:
    for name, val in curr.items():
        try:
            yield GaugeMetricFamily(name, 'help', value=val)
What is the "help" string for?
This is a required argument for the constructor, but it isn't exposed to the end user.
I changed it to:
yield GaugeMetricFamily(name=name,
documentation=name, # Required Argument
value=val)
for clarity.
Updated the code according to Dan's comments and moved the front-end container to a separate directory.
@dcrankshaw It looks like Jenkins isn't starting. Can you start the check manually?
jenkins ok to test
Test FAILed.
Test FAILed.
Test FAILed.
Pretty sure it isn't an issue with my code, but it now appears often enough.
@dcrankshaw Help please. It seems
Yeah, there's an issue with our Jenkins tests. It causes the master branch builder to be super flaky too. I haven't had time to chase it down. Can you file an issue with the above comment?
Sure! Done.
Test FAILed.
This is cool! I got it working and even just playing around with the bare bones Prometheus is pretty cool. I'm excited to start building out our monitoring infrastructure.
Just a couple minor comments and then this is good to go.
containers/python/rpc.py
Outdated
@@ -515,13 +515,16 @@ def __init__(self, pipe_child_conn):
        self.pipe_conn = pipe_child_conn

    def collect(self):
        curr = None
        lastest_metric_dict = None
s/lastest/latest
Fixed
monitoring/README.md
Outdated
@@ -0,0 +1,13 @@
This module is related Clipper's metric monitoring function. For full design, see [Design Doc.](https://docs.google.com/document/d/10whRxCc97gOJl4j2lY6R-v7cI_ZoAMVcNGPG_9oj6iY/edit?usp=sharing)
"is related to"
fixed
@@ -306,6 +312,15 @@ def run(self):

    response.send(socket, self.event_history)

    pred_metric['model_pred_count'] += 1
You're accumulating the counts here, but just sending samples of the latencies. This is why I was confused about how you were dequeuing from the pipe. In order to handle the latencies correctly you should be creating histograms. For this MVP PR, let's just accumulate the count. So delete the *_time metrics.
Got it. I was thinking about this recently. I deleted the *_time metrics.
If we can finalize (at least for now) all the metrics that need to be collected, I will initialize them in the Prometheus collector process just once [1] and update them from each dictionary in the pipe, so that every metric record is updated in the metric process (this especially matters for the histogram and summary data types).
[1]: the current code initializes a new gauge every time the metric is updated.
Yeah that's definitely the right way to do it. Let's merge this PR, then you can start on one that tracks the timing metrics with histograms.
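A minimal sketch of that follow-up direction: create each metric once in the exporter process and update it from every dict that arrives through the pipe, so histogram buckets accumulate across scrapes. The metric names, bucket boundaries, and dict keys here are assumptions, not the PR's final API.

from prometheus_client import Counter, Histogram, start_http_server

pred_count = Counter("model_pred_count", "Number of predictions served")
pred_latency = Histogram(
    "model_pred_latency_seconds",
    "Per-prediction latency in seconds",
    buckets=(0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0))


def exporter_loop(pipe_conn, port=1390):
    start_http_server(port)              # expose /metrics for Prometheus
    while True:
        metric_dict = pipe_conn.recv()   # one dict per prediction call
        pred_count.inc(metric_dict.get("pred_count_delta", 1))
        if "pred_latency_seconds" in metric_dict:
            pred_latency.observe(metric_dict["pred_latency_seconds"])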
yield GaugeMetricFamily(
    name=name,
    documentation=name,  # Required Argument
    value=val)
s/lastest/latest
fixed
Test FAILed.
So that it won't clash with other tests in progress.
Test PASSed.
LGTM. Fix the merge conflicts then I'll get this merged.
Test FAILed.
jenkins test this please
Test PASSed.
This PR sets up the basic structure of monitoring and provides basic functionality:
The RPC image needs to be rebuilt. A new frontend-exporter image is also created.
Full implementation details are in the commit messages.
[Simon Mo]