
Metrics Monitoring: A minimal viable product #348

Merged
merged 36 commits into from
Jan 11, 2018

Conversation

simon-mo
Contributor

This PR sets up the basic structure of monitoring and provides basic functionality:

  • It exports query frontend metrics through a coupled node exporter.
  • It exports model container metrics (a prediction count and three kinds of latencies, in seconds) through a node exporter that runs in the same container but in a separate Python process.
  • It adds a mechanism to spin up a Prometheus monitor and update it when necessary.

The RPC image needs to be rebuilt. A new frontend-exporter image is also created.

The full implementation details are in the commit messages.
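As a rough illustration of the export side described above, here is a minimal sketch using `prometheus_client`. The metric names, values, and helper function are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: export a prediction count plus a latency gauge
# with prometheus_client. All names here are illustrative assumptions.
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()

pred_count = Gauge('model_pred_count', 'Predictions served',
                   registry=registry)
pred_time = Gauge('model_pred_time', 'Last prediction latency (s)',
                  registry=registry)

def record_prediction(latency_s):
    """Update the gauges after one prediction call."""
    pred_count.inc()
    pred_time.set(latency_s)

record_prediction(0.021)

# generate_latest() renders the Prometheus text exposition format; in a
# real exporter, start_http_server(port) would serve it over HTTP.
print(generate_latest(registry).decode().splitlines()[:3])
```

In the actual exporter, `start_http_server` would make these samples scrapeable by the Prometheus monitor.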

<Simon Mo>

Users can now run Clipper as usual and have a metric monitor
spun up, under the condition:
1. There is a Clipper exporter Docker image.

<Simon Mo>
`rpc.py` is modified to add the metric monitor. The metric
exporter now runs alongside the model closure and communicates
via a pipe. The Prometheus client is updated whenever a new replica
is added.
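A stdlib-only sketch of the pipe wiring this describes (the function names are hypothetical; the real logic lives in `rpc.py`):

```python
# Hypothetical sketch: the model process pushes a metric snapshot into a
# multiprocessing.Pipe after every prediction; the exporter drains the
# pipe and keeps only the newest snapshot when Prometheus scrapes.
from multiprocessing import Pipe

def push_metrics(conn, metrics):
    # Called in the RPC loop after each prediction.
    conn.send(dict(metrics))

def collect_latest(conn):
    # Called on each Prometheus scrape: drain everything, keep the last.
    latest = None
    while conn.poll():
        latest = conn.recv()
    return latest

if __name__ == '__main__':
    exporter_conn, model_conn = Pipe()
    metrics = {'model_pred_count': 0}
    for _ in range(3):
        metrics['model_pred_count'] += 1
        push_metrics(model_conn, metrics)
    print(collect_latest(exporter_conn))  # {'model_pred_count': 3}
```

The pipe buffers snapshots between scrapes, so the exporter sees at most one drain loop's worth of backlog per GET.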

`docker_metric_utils.py` holds all the metric-related code that would
otherwise clutter `docker_container_manager.py`. Prometheus
listens on port 1390, a port specified in `container_manager.py` as a
constant.

The requirement for the `prometheus_client` package is added in three
places: `requirements.txt`, `setup.py`, and `RPCDockerfile`.

I still need to work on the integration test, though.

<Simon Mo>
Changed the corresponding files so that the frontend exporter is built:
- `build_docker_image.sh` and `docker_metric_utils.py` agree on the name
'frontend-exporter'
- the Python file and Dockerfile are in place

<Simon Mo>
This is just a simple modification of the basic_query example.
It tests with two replicas to check that the Prometheus client
is monitoring all three node exporters.
@simon-mo simon-mo requested a review from dcrankshaw December 20, 2017 06:02
@simon-mo simon-mo self-assigned this Dec 20, 2017
@AmplabJenkins

Can one of the admins verify this patch?

@dcrankshaw
Contributor

jenkins okay to test

Contributor

@dcrankshaw dcrankshaw left a comment

This looks like a great MVP for monitoring. I just had a few small comments.


# Metric Section
query_frontend_metric_name = "query_frontend_exporter-{}".format(
    random.randint(0, 100000))
Contributor

Let's use the same random integer as the query_frontend_name

Contributor Author

Fixed.

@@ -0,0 +1,69 @@
import requests
Contributor

The containers directory is for stuff related to model containers. Can you create a separate monitoring directory (CLIPPER_ROOT/monitoring) and put this file in it?

Contributor

Can you also put a short README in the monitoring directory that provides instructions on how to access the Prometheus server once it is up?

Contributor Author

Got it! This is a much better idea.

def collect(self):
    curr = None
    while self.pipe_conn.poll():
        curr = self.pipe_conn.recv()
Contributor

If there are multiple items in the pipe, won't we overwrite all except the last item? It seems like curr should be a list you append to?

Contributor Author

The purpose is to get just the last item. Prometheus scrapes the latest metric every 5 seconds, and the metric is only updated on each GET call.

Because the prediction time is variable, in the RPC loop we push the most recent metric dictionary into the pipe after each prediction call. In the metric collector, we just need to get the most recent one.

I changed the name to lastest_metric_dict for better clarity.

if curr:
    for name, val in curr.items():
        try:
            yield GaugeMetricFamily(name, 'help', value=val)
Contributor

What is the "help" string for?

Contributor Author

This is a required argument for this constructor, but it is not visible to the end user.
I changed it to:

yield GaugeMetricFamily(name=name,
                        documentation=name,  # Required argument
                        value=val)

for clarity.

Update the code according to Dan's comments. Moved the front-end
container to a separate directory.
@simon-mo
Contributor Author

simon-mo commented Jan 4, 2018

@dcrankshaw It looks like Jenkins isn't starting. Can you start the check manually?

@dcrankshaw
Contributor

jenkins ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/824/
Test FAILed.

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

Don't merge. I commented out the metrics section to let Jenkins build new Docker images. Resolved

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/825/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/826/
Test FAILed.

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

@dcrankshaw Help please.

It seems `clipper_admin_tests.py` causes the failure. It fails with the same error but in different unit tests that have nothing to do with the code I changed:

  • test_inspect_instance_returns_json_dict: Bind for 0.0.0.0:37967 failed: port is already allocated. 829 log.
  • test_python_closure_deploys_successfully: Error starting userland proxy: listen tcp 0.0.0.0:38198: bind: address already in use. 828 log.
  • test_link_not_registered_model_to_app_fails : Error starting userland proxy: listen tcp 0.0.0.0:39526: bind: address already in use. 822 log (for batch predict).

@dcrankshaw
Contributor

Yeah, there's an issue with our Jenkins tests. It makes the master branch builder super flaky too. I haven't had time to chase it down. Can you file an issue with the above comment?

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

Sure! Done.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/830/
Test FAILed.

Contributor

@dcrankshaw dcrankshaw left a comment

This is cool! I got it working and even just playing around with the bare bones Prometheus is pretty cool. I'm excited to start building out our monitoring infrastructure.

Just a couple minor comments and then this is good to go.

@@ -515,13 +515,16 @@ def __init__(self, pipe_child_conn):
        self.pipe_conn = pipe_child_conn

    def collect(self):
        curr = None
        lastest_metric_dict = None
Contributor

s/lastest/latest

Contributor Author

Fixed

@@ -0,0 +1,13 @@
This module is related Clipper's metric monitoring function. For full design, see [Design Doc.](https://docs.google.com/document/d/10whRxCc97gOJl4j2lY6R-v7cI_ZoAMVcNGPG_9oj6iY/edit?usp=sharing)
Contributor

"is related to"

Contributor Author

fixed

@@ -306,6 +312,15 @@ def run(self):

response.send(socket, self.event_history)

pred_metric['model_pred_count'] += 1
Contributor

You're accumulating the counts here, but just sending samples of the latencies. This is why I was confused about how you were dequeuing from the pipe. In order to handle the latencies correctly you should be creating histograms. For this MVP PR, let's just accumulate the count. So delete the *_time metrics.

Contributor Author

Got it. I was thinking about this recently. I deleted the *_time metrics.

If we can finalize (at least for now) all the metrics that need to be collected, I will initialize them in the Prometheus collector process just once [1] and run updates through each dictionary in the pipe, so that every metric record is updated in the metric process (this especially matters for the histogram and summary data types).

[1]: the current process initializes a new gauge every time the metric is updated.
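A possible shape for that "initialize once, update from the pipe" approach, sketched with assumed metric names (not a committed design):

```python
# Hypothetical sketch: create each Prometheus metric once at collector
# start-up, then apply every dictionary drained from the pipe to the
# live metric objects, instead of constructing a fresh GaugeMetricFamily
# on every update.
from multiprocessing import Pipe
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Fixed, pre-declared metric set (names are illustrative assumptions).
METRICS = {
    'model_pred_count': Gauge('model_pred_count', 'Predictions served',
                              registry=registry),
}

def drain_and_update(conn):
    """Apply every metric dict waiting in the pipe to the live metrics."""
    while conn.poll():
        for name, val in conn.recv().items():
            if name in METRICS:
                METRICS[name].set(val)

model_conn, exporter_conn = Pipe()
model_conn.send({'model_pred_count': 3})
model_conn.send({'model_pred_count': 5})
drain_and_update(exporter_conn)
print(registry.get_sample_value('model_pred_count'))  # 5.0
```

Because the metric objects persist across scrapes, the same pattern extends to histograms and summaries, whose internal state must accumulate between updates.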

Contributor

Yeah that's definitely the right way to do it. Let's merge this PR, then you can start on one that tracks the timing metrics with histograms.

yield GaugeMetricFamily(
    name=name,
    documentation=name,  # Required Argument
    value=val)
Contributor

s/lastest/latest

Contributor Author

fixed

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/831/
Test FAILed.

So that it won't clash with other tests in progress
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/832/
Test PASSed.

Contributor

@dcrankshaw dcrankshaw left a comment

LGTM. Fix the merge conflicts then I'll get this merged.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/849/
Test FAILed.

@dcrankshaw
Contributor

jenkins test this please

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/850/
Test PASSed.

@dcrankshaw dcrankshaw merged commit f20cfa5 into ucbrise:develop Jan 11, 2018