
Metrics Monitoring: A minimal viable product #348

Merged
merged 36 commits into from
Jan 11, 2018

Conversation

simon-mo
Contributor

This PR sets up the basic structure of monitoring and provides basic functionality:

  • It exports query frontend metrics through a coupled node exporter.
  • It exports model container metrics (a prediction count and three kinds of latencies, in seconds) through a node exporter that runs in the same container but in a separate Python process.
  • It adds a mechanism to spin up a Prometheus monitor and update it when necessary.

The RPC image needs to be rebuilt. A new frontend-exporter image is also created.

The full implementation details are in the commit messages.
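As a rough illustration of the export side described above, here is a minimal sketch using `prometheus_client`. The metric names, values, and helper function are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: export a prediction count plus a latency gauge
# with prometheus_client. All names here are illustrative assumptions.
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()

pred_count = Gauge('model_pred_count', 'Predictions served',
                   registry=registry)
pred_time = Gauge('model_pred_time', 'Last prediction latency (s)',
                  registry=registry)

def record_prediction(latency_s):
    """Update the gauges after one prediction call."""
    pred_count.inc()
    pred_time.set(latency_s)

record_prediction(0.021)

# generate_latest() renders the Prometheus text exposition format; in a
# real exporter, start_http_server(port) would serve it over HTTP.
print(generate_latest(registry).decode().splitlines()[:3])
```

In the actual exporter, `start_http_server` would make these samples scrapeable by the Prometheus monitor.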

<Simon Mo>

Users can now run Clipper as usual and have a metric monitor
spun up, under the condition:
1. There is a Clipper exporter Docker image.

<Simon Mo>
`rpc.py` is modified to add the metric monitor. The metric
exporter now runs alongside the model closure and communicates
via a pipe. The Prometheus client is updated whenever a new replica
is added.
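A stdlib-only sketch of the pipe wiring this describes (the function names are hypothetical; the real logic lives in `rpc.py`):

```python
# Hypothetical sketch: the model process pushes a metric snapshot into a
# multiprocessing.Pipe after every prediction; the exporter drains the
# pipe and keeps only the newest snapshot when Prometheus scrapes.
from multiprocessing import Pipe

def push_metrics(conn, metrics):
    # Called in the RPC loop after each prediction.
    conn.send(dict(metrics))

def collect_latest(conn):
    # Called on each Prometheus scrape: drain everything, keep the last.
    latest = None
    while conn.poll():
        latest = conn.recv()
    return latest

if __name__ == '__main__':
    exporter_conn, model_conn = Pipe()
    metrics = {'model_pred_count': 0}
    for _ in range(3):
        metrics['model_pred_count'] += 1
        push_metrics(model_conn, metrics)
    print(collect_latest(exporter_conn))  # {'model_pred_count': 3}
```

The pipe buffers snapshots between scrapes, so the exporter sees at most one drain loop's worth of backlog per GET.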

`docker_metric_utils.py` holds all the metric-related code that would
otherwise clutter `docker_container_manager.py`. Prometheus
listens on port 1390, a port specified in `container_manager.py` as a
constant.

The requirement for the `prometheus_client` package is added in three
places: `requirements.txt`, `setup.py`, and `RPCDockerfile`.

I still need to work on the integration test, though.

<Simon Mo>
Changed the corresponding files so that the frontend exporter is built:
- `build_docker_image.sh` and `docker_metric_utils.py` agree on the name
'frontend-exporter'
- the Python file and Dockerfile are in place

<Simon Mo>
This is just a simple modification of the basic_query example.
It tests with two replicas to check that the Prometheus client
is monitoring all three node exporters.
@simon-mo simon-mo requested a review from dcrankshaw December 20, 2017 06:02
@simon-mo simon-mo self-assigned this Dec 20, 2017
@AmplabJenkins

Can one of the admins verify this patch?

@dcrankshaw
Contributor

jenkins okay to test

Contributor

@dcrankshaw dcrankshaw left a comment

This looks like a great MVP for monitoring. I just had a few small comments.


# Metric Section
query_frontend_metric_name = "query_frontend_exporter-{}".format(
    random.randint(0, 100000))
Contributor

Let's use the same random integer as the query_frontend_name

Contributor Author

Fixed.

@@ -0,0 +1,69 @@
import requests
Contributor

The containers directory is for stuff related to model containers. Can you create a separate monitoring directory (CLIPPER_ROOT/monitoring) and put this file in it?

Contributor

Can you also put a short README in the monitoring directory that provides instructions on how to access the Prometheus server once it is up?

Contributor Author

Got it! This is a much better idea.

def collect(self):
    curr = None
    while self.pipe_conn.poll():
        curr = self.pipe_conn.recv()
Contributor

If there are multiple items in the pipe, won't we overwrite all except the last item? It seems like curr should be a list you append to?

Contributor Author

The purpose is to get just the last item. Prometheus scrapes the latest metric every 5 seconds, and the metric is only updated on each GET call.

Because the prediction time is variable, in the RPC loop we push the most recent metric dictionary into the pipe after each prediction call. In the metric collector, we just need to get the most recent one.

I changed the name to lastest_metric_dict for better clarity.

if curr:
    for name, val in curr.items():
        try:
            yield GaugeMetricFamily(name, 'help', value=val)
Contributor

What is the "help" string for?

Contributor Author

This is a required argument for this constructor, but it is not visible to the end user.
I changed it to:

yield GaugeMetricFamily(name=name,
                        documentation=name,  # Required argument
                        value=val)

for clarity.

Update the code according to Dan's comments. Moved the front-end
container to a separate directory.
@simon-mo
Contributor Author

simon-mo commented Jan 4, 2018

@dcrankshaw It looks like Jenkins isn't starting. Can you start the check manually?

@dcrankshaw
Contributor

jenkins ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/824/
Test FAILed.

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

Don't merge. I commented out the metrics section to let Jenkins build new Docker images. Resolved

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/825/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/826/
Test FAILed.

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

@dcrankshaw Help please.

It seems `clipper_admin_tests.py` causes the failure. It fails with the same error but in different unit tests that have nothing to do with the code I changed:

  • test_inspect_instance_returns_json_dict: Bind for 0.0.0.0:37967 failed: port is already allocated. 829 log.
  • test_python_closure_deploys_successfully: Error starting userland proxy: listen tcp 0.0.0.0:38198: bind: address already in use. 828 log.
  • test_link_not_registered_model_to_app_fails : Error starting userland proxy: listen tcp 0.0.0.0:39526: bind: address already in use. 822 log (for batch predict).

@dcrankshaw
Contributor

Yeah, there's an issue with our Jenkins tests. It makes the master branch builder super flaky too. I haven't had time to chase it down. Can you file an issue with the above comment?

@simon-mo
Contributor Author

simon-mo commented Jan 5, 2018

Sure! Done.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/830/
Test FAILed.

Contributor

@dcrankshaw dcrankshaw left a comment

This is cool! I got it working and even just playing around with the bare bones Prometheus is pretty cool. I'm excited to start building out our monitoring infrastructure.

Just a couple minor comments and then this is good to go.

@@ -515,13 +515,16 @@ def __init__(self, pipe_child_conn):
        self.pipe_conn = pipe_child_conn

    def collect(self):
        curr = None
        lastest_metric_dict = None
Contributor

s/lastest/latest

Contributor Author

Fixed

@@ -0,0 +1,13 @@
This module is related Clipper's metric monitoring function. For full design, see [Design Doc.](https://docs.google.com/document/d/10whRxCc97gOJl4j2lY6R-v7cI_ZoAMVcNGPG_9oj6iY/edit?usp=sharing)
Contributor

"is related to"

Contributor Author

fixed

@@ -306,6 +312,15 @@ def run(self):

response.send(socket, self.event_history)

pred_metric['model_pred_count'] += 1
Contributor

You're accumulating the counts here, but just sending samples of the latencies. This is why I was confused about how you were dequeuing from the pipe. In order to handle the latencies correctly you should be creating histograms. For this MVP PR, let's just accumulate the count. So delete the *_time metrics.

Contributor Author

Got it. I was thinking about this recently. I deleted the *_time metrics.

If we can finalize (at least for now) all the metrics that need to be collected, I will initialize them in the Prometheus collector process just once [1] and run updates through each dictionary in the pipe, so that every metric record is updated in the metric process (this especially matters for the histogram and summary data types).

[1]: the current process initializes a new gauge every time the metric is updated.
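A possible shape for that "initialize once, update from the pipe" approach, sketched with assumed metric names (not a committed design):

```python
# Hypothetical sketch: create each Prometheus metric once at collector
# start-up, then apply every dictionary drained from the pipe to the
# live metric objects, instead of constructing a fresh GaugeMetricFamily
# on every update.
from multiprocessing import Pipe
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Fixed, pre-declared metric set (names are illustrative assumptions).
METRICS = {
    'model_pred_count': Gauge('model_pred_count', 'Predictions served',
                              registry=registry),
}

def drain_and_update(conn):
    """Apply every metric dict waiting in the pipe to the live metrics."""
    while conn.poll():
        for name, val in conn.recv().items():
            if name in METRICS:
                METRICS[name].set(val)

model_conn, exporter_conn = Pipe()
model_conn.send({'model_pred_count': 3})
model_conn.send({'model_pred_count': 5})
drain_and_update(exporter_conn)
print(registry.get_sample_value('model_pred_count'))  # 5.0
```

Because the metric objects persist across scrapes, the same pattern extends to histograms and summaries, whose internal state must accumulate between updates.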

Contributor

Yeah that's definitely the right way to do it. Let's merge this PR, then you can start on one that tracks the timing metrics with histograms.

yield GaugeMetricFamily(
    name=name,
    documentation=name,  # Required Argument
    value=val)
Contributor

s/lastest/latest

Contributor Author

fixed

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/831/
Test FAILed.

So that it won't clash with other tests in progress
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/832/
Test PASSed.

Contributor

@dcrankshaw dcrankshaw left a comment

LGTM. Fix the merge conflicts then I'll get this merged.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/849/
Test FAILed.

@dcrankshaw
Contributor

jenkins test this please

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Clipper-PRB/850/
Test PASSed.

@dcrankshaw dcrankshaw merged commit f20cfa5 into ucbrise:develop Jan 11, 2018